Data extraction -- California Dept. of Education Spreadsheets

A set of examples on how to extract machine-readable data from the raw, official sources. No pandas needed, just requests and regex and xlrd (for Excel spreadsheets)

(in progress)

File system setup

In [1]:

from os import makedirs
from os.path import dirname, join
DATA_DIR = join('data', 'schools')

In [2]:

import csv
import re
from xlrd import open_workbook

California SAT score results, 2014

http://www.cde.ca.gov/ds/sp/ai/

http://www.cde.ca.gov/ds/sp/ai/reclayoutsat.asp

http://www.cde.ca.gov/ds/sp/ai/documents/sat14.xls

Parsing and wrangling SAT data file

In [3]:

HEADER_NAMES = ['cds', 'record_type', 'school_name', 'district_name',
                'county_name', 'enrollment_9_12', 'number_of_test_takers',
                'avg_reading_score', 'avg_math_score', 'avg_writing_score',
                'number_scores_gte_1500', 'percent_scores_gte_1500']

source_filename = join(DATA_DIR, 'raw', 'sat14.xls')
dest_filename = join(DATA_DIR, 'extracted', 'sat-2014.csv')
makedirs(dirname(dest_filename), exist_ok=True)

book = open_workbook(source_filename)
sheet = book.sheets()[0]
first_row_idx = sheet.col_values(0).index('CDS') + 1
datarows =  [sheet.row_values(i) for i in range(first_row_idx, sheet.nrows)]
with open(dest_filename, 'w') as wf:
    c = csv.writer(wf)
    c.writerow(HEADER_NAMES)
    c.writerows(datarows)
    print("Wrote", len(datarows), 'rows to', dest_filename)

Out[3]:

Wrote 2478 rows to data/schools/extracted/sat-2014.csv

California student poverty/free-or-reduced-price-lunch data, 2014

TKTK

http://www.cde.ca.gov/ds/sd/sd/filessp.asp

http://www.cde.ca.gov/ds/sd/sd/documents/frpm1314.xls

http://www.cde.ca.gov/ds/sd/sd/filessp1314.asp

Parsing and wrangling free-and-reduced lunch data file

In [4]:

frpm_headers = ['academic_year', 'county_code', 'district_code', 'school_code',
                'county_name', 'district_name', 'school_name', 'is_nslp_provision_2or3',
                'charter_school_number', 'charter_funding_type', 'low_grade', 'high_grade',

                'enrollment_k12', 'unadjusted_free_meal_count_k12', 'adjusted_free_meal_count_k12',
                'adjusted_pct_eligible_free_k12', 'unadjusted_frpm_count_k12', 'adjusted_frpm_count_k12',
                'adjusted_pct_eligible_frpm_k12',

                'enrollment_5to17', 'unadjusted_free_meal_count_5to17', 'adjusted_free_meal_count_5to17',
                'adjusted_pct_eligible_free_5to17', 'unadjusted_frpm_count_5to17', 'adjusted_frpm_count_5to17',
                'adjusted_pct_eligible_frpm_5to17',
                'calpads_certification_status'
               ]

sheet_name = 'FRPM School-Level Data ' # ahhh...love those trailing spaces...
source_filename = join(DATA_DIR, 'raw', 'frpm1314.xls')
dest_filename = join(DATA_DIR, 'extracted', 'frpm-2014.csv')
makedirs(dirname(dest_filename), exist_ok=True)

book = open_workbook(source_filename)
sheet = book.sheet_by_name(sheet_name)
datarows = [sheet.row_values(i) for i in range(1, sheet.nrows)]

with open(dest_filename, 'w') as wf:
    c = csv.writer(wf)
    c.writerow(['cds'] + frpm_headers)
    for d in datarows:
        # manually create and prepend CDS code 
        # which is the concatenation of county, district, school codes
        d.insert(0, ''.join(d[1:4]))
        c.writerow(d)
    print("Wrote", len(datarows), 'rows to', dest_filename)

Out[4]:

Wrote 10361 rows to data/schools/extracted/frpm-2014.csv

In [ ]:

Data extraction -- California Dept. of Education Spreadsheets

File system setup

California SAT score results, 2014

Parsing and wrangling SAT data file

California student poverty/free-or-reduced-price-lunch data, 2014

Parsing and wrangling free-and-reduced lunch data file

Product

Resources

Company