CoCalc -- Data-Extraction--CDE-API-fixed-width.ipynb

GitHub Repository: YStrano/DataScience_GA
Path: blob/master/lessons/lesson_12/python-notebooks-data-wrangling/Data-Extraction--CDE-API-fixed-width.ipynb
¹⁹⁰⁴ views

Kernel: Python 3

01100170109835SE  DFAME Public Charter                     Alameda County Office of EducaAlameda             986    777  1020   769  918    781  776  775  105    735  93     712  78     738  728  728  3           8           10                    210    840  220    844  190    872  852  851  16     885  24     857  23     804  849  845  111    706  114    733  105    736  725  725  21     772  17     754  14     797  774  773  501    772  519    757  483    757  762  762  1           0           0                     218    717  422    707  448    735  720  721  404    748  373    722  407    753  741  742  69     629  77     627  81     629  628  628     
01100170112607SH  DEnvision Academy for Arts & Technology  Alameda County Office of EducaAlameda             200    643  221    605  226    669  639  639  123    620  104    568  107    635  608  609  2           8           4                     4           4           3                     0           1           2                     60     657  81     641  88     696  665  666  0           2           2                     8           11     569  11     672            1           9           7                     150    630  167    603  180    672  635  636  26     611  52     669  56     647  642  649  19     524  22     431  29     531  495  498     
01100170118489SE  DAspire California College Preparatory AcAlameda County Office of EducaAlameda             158    825  265    790  265    792  802  799  64     833  75     797  82     786  805  803  0           0           0                     2           2           3                     5           5           4                     87     817  181    781  174    790  796  792  0           1           0                     0           0           1                     0           1           1                     87     817  209    775  210    777  790  783  46     774  110    716  106    752  747  741  13     591  18     611  22     494  565  558

In [1]:

from lxml import html as htmlparser
from os import makedirs
from os.path import basename, exists, join, splitext
from shutil import unpack_archive
import csv
import requests
RECORD_LAYOUT_URL = 'http://www.cde.ca.gov/ta/ac/ap/reclayoutApiAvg.asp'
RAW_DATA_ZIP_URL = 'http://www3.cde.ca.gov/researchfiles/api/14avgtx.zip'
RAW_DATA_DIR = join('data', 'schools', 'raw')
RAW_DATA_ZIP_FILENAME = join(RAW_DATA_DIR, basename(RAW_DATA_ZIP_URL))
# the text file has the same name as the zip, just different extension
RAW_DATA_TXT_FILENAME = splitext(RAW_DATA_ZIP_FILENAME)[0] + '.txt'
makedirs(RAW_DATA_DIR, exist_ok=True)

In [2]:

# save and extract the zip file to the raw directory
if not exists(RAW_DATA_ZIP_FILENAME):
    resp = requests.get(RAW_DATA_ZIP_URL)
    with open(RAW_DATA_ZIP_FILENAME, 'wb') as wf:
        wf.write(resp.content)

if not exists(RAW_DATA_TXT_FILENAME):
    unpack_archive(RAW_DATA_ZIP_FILENAME, extract_dir=RAW_DATA_DIR)

In [3]:

# prepare the extracted datafile and directory
EXTRACTED_DATA_DIR = join('data', 'schools', 'extracted')
# the extracted file has same basename but a .csv extension
EXTRACTED_DATA_FILENAME = join(EXTRACTED_DATA_DIR, '{}.csv'.format(splitext(basename(RAW_DATA_TXT_FILENAME))[0]))
makedirs(EXTRACTED_DATA_DIR, exist_ok=True)

Parsing the record layout

Now that we have the data downloaded and ready to unpack and parse, let's visit the Record Layout for the 3-Year Average API Data File get the metadata behind the fixed-width columns.

Here's a screenshot of the page:

The HTML is pretty straightforward: we need the second and fourth columns -- Field Name and Width

<table width="95%" border="1" cellpadding="3" cellspacing="0">
                <tbody><tr align="center" valign="middle">
                  <th nowrap=""><div align="center">Field 
                  #</div></th>
                  <th nowrap=""><div align="center">
                    <p>Field Name </p>
                  </div></th>
                  <th nowrap=""><div align="center">Type</div></th>
                  <th nowrap=""><div align="center">Width</div></th>
                  <th nowrap=""><div align="center">Description</div></th>
                </tr>
                <tr> 
                  <td width="6%" align="left" valign="top"> <div align="center">1 
                  </div></td>
                  <td width="13%" align="left" valign="top">CDS </td>
                  <td width="7%" align="left" valign="top">Character </td>
                  <td width="6%" align="left" valign="top"> <div align="center">14 
                  </div></td>
                  <td valign="top" width="68%">County/District/School code </td>
                </tr>
</tbody></table>

In [4]:

# parse the Record Layout webpage
resp = requests.get(RECORD_LAYOUT_URL)
htmldoc = htmlparser.fromstring(resp.text)
# whatever
xa, xb = 0,0  # these represent the field boundaries
rows = htmldoc.xpath('//tr[td]')
field_defs = [tr.xpath('(td[2] | td[4]/div)/text()') for tr in rows]
for i, (fieldname, fieldlength) in enumerate(field_defs):
    xb = xa + int(fieldlength)
    field_defs[i] = (fieldname.strip(), (xa, xb))
    xa = xb

In [7]:

wf = open(EXTRACTED_DATA_FILENAME, 'w')
cf = csv.DictWriter(wf, fieldnames=[fd[0] for fd in field_defs])
cf.writeheader()
with open(RAW_DATA_TXT_FILENAME, 'r', encoding='ISO-8859-2') as rf:
    for line in rf:
        cf.writerow({fieldname: line[x1:x2].strip() for fieldname, (x1, x2) in field_defs})
wf.close()

In [ ]:

Parsing the record layout

Product

Resources

Company