Book a Demo!
CoCalc Logo Icon
StoreFeaturesDocsShareSupportNewsAboutPoliciesSign UpSign In
YStrano
GitHub Repository: YStrano/DataScience_GA
Path: blob/master/lessons/lesson_12/python-notebooks-data-wrangling/Data-Extraction--CDE-XLS.ipynb
1904 views
Kernel: Python 3

Data extraction -- California Dept. of Education Spreadsheets

A set of examples on how to extract machine-readable data from the raw, official sources. No pandas needed, just requests and regex and xlrd (for Excel spreadsheets)

(in progress)

File system setup

from os import makedirs from os.path import dirname, join DATA_DIR = join('data', 'schools')
import csv import re from xlrd import open_workbook

Parsing and wrangling SAT data file

HEADER_NAMES = ['cds', 'record_type', 'school_name', 'district_name', 'county_name', 'enrollment_9_12', 'number_of_test_takers', 'avg_reading_score', 'avg_math_score', 'avg_writing_score', 'number_scores_gte_1500', 'percent_scores_gte_1500'] source_filename = join(DATA_DIR, 'raw', 'sat14.xls') dest_filename = join(DATA_DIR, 'extracted', 'sat-2014.csv') makedirs(dirname(dest_filename), exist_ok=True) book = open_workbook(source_filename) sheet = book.sheets()[0] first_row_idx = sheet.col_values(0).index('CDS') + 1 datarows = [sheet.row_values(i) for i in range(first_row_idx, sheet.nrows)] with open(dest_filename, 'w') as wf: c = csv.writer(wf) c.writerow(HEADER_NAMES) c.writerows(datarows) print("Wrote", len(datarows), 'rows to', dest_filename)
Wrote 2478 rows to data/schools/extracted/sat-2014.csv

Parsing and wrangling free-and-reduced lunch data file

frpm_headers = ['academic_year', 'county_code', 'district_code', 'school_code', 'county_name', 'district_name', 'school_name', 'is_nslp_provision_2or3', 'charter_school_number', 'charter_funding_type', 'low_grade', 'high_grade', 'enrollment_k12', 'unadjusted_free_meal_count_k12', 'adjusted_free_meal_count_k12', 'adjusted_pct_eligible_free_k12', 'unadjusted_frpm_count_k12', 'adjusted_frpm_count_k12', 'adjusted_pct_eligible_frpm_k12', 'enrollment_5to17', 'unadjusted_free_meal_count_5to17', 'adjusted_free_meal_count_5to17', 'adjusted_pct_eligible_free_5to17', 'unadjusted_frpm_count_5to17', 'adjusted_frpm_count_5to17', 'adjusted_pct_eligible_frpm_5to17', 'calpads_certification_status' ] sheet_name = 'FRPM School-Level Data ' # ahhh...love those trailing spaces... source_filename = join(DATA_DIR, 'raw', 'frpm1314.xls') dest_filename = join(DATA_DIR, 'extracted', 'frpm-2014.csv') makedirs(dirname(dest_filename), exist_ok=True) book = open_workbook(source_filename) sheet = book.sheet_by_name(sheet_name) datarows = [sheet.row_values(i) for i in range(1, sheet.nrows)] with open(dest_filename, 'w') as wf: c = csv.writer(wf) c.writerow(['cds'] + frpm_headers) for d in datarows: # manually create and prepend CDS code # which is the concatenation of county, district, school codes d.insert(0, ''.join(d[1:4])) c.writerow(d) print("Wrote", len(datarows), 'rows to', dest_filename)
Wrote 10361 rows to data/schools/extracted/frpm-2014.csv