Book a Demo!
CoCalc Logo Icon
StoreFeaturesDocsShareSupportNewsAboutPoliciesSign UpSign In
YStrano
GitHub Repository: YStrano/DataScience_GA
Path: blob/master/lessons/lesson_12/python-notebooks-data-wrangling/Wrangling--Pivot-Tables-with-Pandas.ipynb
1904 views
Kernel: Python 3

Pivot Tables with Pandas: San Francisco Crime Data

Data background

The San Francisco crime data comes from the city Socrata portal:

https://data.sfgov.org/Public-Safety/SFPD-Incidents-from-1-January-2003/tmnf-yvry

The API call to get the data as CSV, for dates between '2006-01-01' and '2016-04-01':

https://data.sfgov.org/resource/cuks-n6tp.csv?$where=date between '2006-01-01' and '2016-04-01'&ParseError: KaTeX parse error: Expected 'EOF', got '&' at position 85: …ddress,x,y,pdid&̲limit=1000000000

The download is more than 200 MB+. A mirror of the data is included in this repo:

data/sf/raw/bulk-data.sfgov.org--20060101-20160401--cuks-n6tp.csv

An important caveat from the SF data portal:

As of July 19, 2015, the PD District boundaries have been updated through a redistricting process. These new boundaries are not reflected in the dataset yet so you cannot compare data from July 19, 2015 onward to official reports from PD with the Police District column. We are working on an update to the dataset to reflect the updated boundaries starting with data entered July 19 onward. Incidents derived from SFPD Crime Incident Reporting system Updated daily, showing data from 1/1/2003 up until two weeks ago from current date. Please note: San Francisco police have implemented a new system for tracking crime. The dataset included here is still coming from the old system, which is in the process of being retired (a multi-year process). Data included here is no longer the official SFPD data. We will migrate to the new system for DataSF in the upcoming months.

%matplotlib inline from pathlib import Path import pandas as pd import matplotlib.pyplot as plt DATA_PATH = Path('data', 'sf', 'raw', 'bulk-data.sfgov.org--20060101-20160401--cuks-n6tp.csv')
df = pd.read_csv(DATA_PATH, parse_dates=['date']) df['year'] = df['date'].dt.year # filter out 2016 since it's not a full year df = df[df['year'] < 2016]
# total number of rows len(df)
1422525
df.head()

Simple pivot chart

http://pbpython.com/pandas-pivot-table-explained.html

Let's do a simple pivot: for every year, aggregate the count of records.

crimeagg = df.pivot_table(index='year', aggfunc=len, values='pdid') # the result crimeagg
year 2006 137853 2007 137639 2008 141311 2009 139860 2010 133525 2011 132699 2012 140858 2013 152811 2014 150159 2015 155810 Name: pdid, dtype: int64
# let's chart it fig, ax = plt.subplots() ax.bar(crimeagg.index, crimeagg.values, align='center') ax.set_xticks(range(2006, 2016)) ax.set_xlim(xmin=2005);
Image in a Jupyter notebook

Pivot by category

A general count of crime is not interesting. Lets dive down into categories

Pivot the dataframe by category, and then by year

catdf = df.pivot_table(index='category', columns='year', values='pdid', aggfunc=len)
catdf.head()
catdf['delta_2015_2010'] = (catdf[2015] - catdf[2010]) / catdf[2010] catdf.sort_values('delta_2015_2010')
# filter the list to types of crime that have had more than a 1000 reported incidents # total for 2010 and 2015 majorcatdf = catdf[catdf[2015] + catdf[2010] > 1000] majorcatdf.sort_values('delta_2015_2010')

Larceny/Thefts

Go back to the original table, now facet by descript

thefts_df = df[df['category'] == 'LARCENY/THEFT'] thefts_pivot = thefts_df.pivot_table(index='descript', columns='year', values='pdid', aggfunc=len)
thefts_pivot['delta_2015_2010'] = (thefts_pivot[2015] - thefts_pivot[2010]) / thefts_pivot[2010]
thefts_pivot.sort_values('delta_2015_2010').head()
# remove noise majorthefts_pivot = thefts_pivot[thefts_pivot[2015] + thefts_pivot[2010] > 100] majorthefts_pivot.sort_values('delta_2015_2010')
autothefts_df = df[df['category'] == 'VEHICLE THEFT'] autothefts_piv = autothefts_df.pivot_table(index='descript', columns='year', values='pdid', aggfunc=len) autothefts_piv['delta_2015_2010'] = (autothefts_piv[2015] - autothefts_piv[2010]) / autothefts_piv[2010] major_autothefts_piv = autothefts_piv[autothefts_piv[2015] + autothefts_piv[2010] > 100]
major_autothefts_piv.sort_values('delta_2015_2010')

By district

v_df = df[df['category'] == 'VEHICLE THEFT'] v_piv = v_df.pivot_table(index='pddistrict', columns='year', values='pdid', aggfunc=len) v_piv['delta_2014_2010'] = (v_piv[2014] - v_piv[2010]) / v_piv[2010]
v_piv.sort_values('delta_2014_2010', ascending=False)