Path: blob/master/lessons/lesson_12/python-notebooks-data-wrangling/Data-Extraction--CDE-API-fixed-width.ipynb
1904 views
Kernel: Python 3
01100170109835SE DFAME Public Charter Alameda County Office of EducaAlameda 986 777 1020 769 918 781 776 775 105 735 93 712 78 738 728 728 3 8 10 210 840 220 844 190 872 852 851 16 885 24 857 23 804 849 845 111 706 114 733 105 736 725 725 21 772 17 754 14 797 774 773 501 772 519 757 483 757 762 762 1 0 0 218 717 422 707 448 735 720 721 404 748 373 722 407 753 741 742 69 629 77 627 81 629 628 628 01100170112607SH DEnvision Academy for Arts & Technology Alameda County Office of EducaAlameda 200 643 221 605 226 669 639 639 123 620 104 568 107 635 608 609 2 8 4 4 4 3 0 1 2 60 657 81 641 88 696 665 666 0 2 2 8 11 569 11 672 1 9 7 150 630 167 603 180 672 635 636 26 611 52 669 56 647 642 649 19 524 22 431 29 531 495 498 01100170118489SE DAspire California College Preparatory AcAlameda County Office of EducaAlameda 158 825 265 790 265 792 802 799 64 833 75 797 82 786 805 803 0 0 0 2 2 3 5 5 4 87 817 181 781 174 790 796 792 0 1 0 0 0 1 0 1 1 87 817 209 775 210 777 790 783 46 774 110 716 106 752 747 741 13 591 18 611 22 494 565 558
In [1]:
from lxml import html as htmlparser from os import makedirs from os.path import basename, exists, join, splitext from shutil import unpack_archive import csv import requests RECORD_LAYOUT_URL = 'http://www.cde.ca.gov/ta/ac/ap/reclayoutApiAvg.asp' RAW_DATA_ZIP_URL = 'http://www3.cde.ca.gov/researchfiles/api/14avgtx.zip' RAW_DATA_DIR = join('data', 'schools', 'raw') RAW_DATA_ZIP_FILENAME = join(RAW_DATA_DIR, basename(RAW_DATA_ZIP_URL)) # the text file has the same name as the zip, just different extension RAW_DATA_TXT_FILENAME = splitext(RAW_DATA_ZIP_FILENAME)[0] + '.txt' makedirs(RAW_DATA_DIR, exist_ok=True)
In [2]:
# save and extract the zip file to the raw directory if not exists(RAW_DATA_ZIP_FILENAME): resp = requests.get(RAW_DATA_ZIP_URL) with open(RAW_DATA_ZIP_FILENAME, 'wb') as wf: wf.write(resp.content) if not exists(RAW_DATA_TXT_FILENAME): unpack_archive(RAW_DATA_ZIP_FILENAME, extract_dir=RAW_DATA_DIR)
In [3]:
# prepare the extracted datafile and directory EXTRACTED_DATA_DIR = join('data', 'schools', 'extracted') # the extracted file has same basename but a .csv extension EXTRACTED_DATA_FILENAME = join(EXTRACTED_DATA_DIR, '{}.csv'.format(splitext(basename(RAW_DATA_TXT_FILENAME))[0])) makedirs(EXTRACTED_DATA_DIR, exist_ok=True)
Parsing the record layout
Now that we have the data downloaded and ready to unpack and parse, let's visit the Record Layout for the 3-Year Average API Data File get the metadata behind the fixed-width columns.
Here's a screenshot of the page:
The HTML is pretty straightforward: we need the second and fourth columns -- Field Name and Width
<table width="95%" border="1" cellpadding="3" cellspacing="0"> <tbody><tr align="center" valign="middle"> <th nowrap=""><div align="center">Field #</div></th> <th nowrap=""><div align="center"> <p>Field Name </p> </div></th> <th nowrap=""><div align="center">Type</div></th> <th nowrap=""><div align="center">Width</div></th> <th nowrap=""><div align="center">Description</div></th> </tr> <tr> <td width="6%" align="left" valign="top"> <div align="center">1 </div></td> <td width="13%" align="left" valign="top">CDS </td> <td width="7%" align="left" valign="top">Character </td> <td width="6%" align="left" valign="top"> <div align="center">14 </div></td> <td valign="top" width="68%">County/District/School code </td> </tr> </tbody></table>
In [4]:
# parse the Record Layout webpage resp = requests.get(RECORD_LAYOUT_URL) htmldoc = htmlparser.fromstring(resp.text) # whatever xa, xb = 0,0 # these represent the field boundaries rows = htmldoc.xpath('//tr[td]') field_defs = [tr.xpath('(td[2] | td[4]/div)/text()') for tr in rows] for i, (fieldname, fieldlength) in enumerate(field_defs): xb = xa + int(fieldlength) field_defs[i] = (fieldname.strip(), (xa, xb)) xa = xb
In [7]:
wf = open(EXTRACTED_DATA_FILENAME, 'w') cf = csv.DictWriter(wf, fieldnames=[fd[0] for fd in field_defs]) cf.writeheader() with open(RAW_DATA_TXT_FILENAME, 'r', encoding='ISO-8859-2') as rf: for line in rf: cf.writerow({fieldname: line[x1:x2].strip() for fieldname, (x1, x2) in field_defs}) wf.close()
In [ ]: