GitHub Repository: YStrano/DataScience_GA
Path: blob/master/lessons/lesson_12/python-notebooks-data-wrangling/Wrangling--Pivot-Tables-with-Pandas.ipynb
²³⁵⁸ views

Kernel: Python 3

Pivot Tables with Pandas: San Francisco Crime Data

Data background

The San Francisco crime data comes from the city Socrata portal:

https://data.sfgov.org/Public-Safety/SFPD-Incidents-from-1-January-2003/tmnf-yvry

The API call to get the data as CSV, for dates between '2006-01-01' and '2016-04-01':

https://data.sfgov.org/resource/cuks-n6tp.csv?$where=date between '2006-01-01' and '2016-04-01'&ParseError: KaTeX parse error: Expected 'EOF', got '&' at position 85: …ddress,x,y,pdid&̲limit=1000000000

The download is more than 200 MB+. A mirror of the data is included in this repo:

data/sf/raw/bulk-data.sfgov.org--20060101-20160401--cuks-n6tp.csv

An important caveat from the SF data portal:

As of July 19, 2015, the PD District boundaries have been updated through a redistricting process. These new boundaries are not reflected in the dataset yet so you cannot compare data from July 19, 2015 onward to official reports from PD with the Police District column. We are working on an update to the dataset to reflect the updated boundaries starting with data entered July 19 onward. Incidents derived from SFPD Crime Incident Reporting system Updated daily, showing data from 1/1/2003 up until two weeks ago from current date. Please note: San Francisco police have implemented a new system for tracking crime. The dataset included here is still coming from the old system, which is in the process of being retired (a multi-year process). Data included here is no longer the official SFPD data. We will migrate to the new system for DataSF in the upcoming months.

In [101]:

%matplotlib inline
from pathlib import Path
import pandas as pd
import matplotlib.pyplot as plt
DATA_PATH = Path('data', 'sf', 'raw', 'bulk-data.sfgov.org--20060101-20160401--cuks-n6tp.csv')

In [102]:

df = pd.read_csv(DATA_PATH, parse_dates=['date'])
df['year'] = df['date'].dt.year
# filter out 2016 since it's not a full year
df = df[df['year'] < 2016]

In [103]:

# total number of rows
len(df)

Out[103]:

1422525

In [104]:

df.head()

Out[104]:

Simple pivot chart

http://pbpython.com/pandas-pivot-table-explained.html

Let's do a simple pivot: for every year, aggregate the count of records.

In [105]:

crimeagg = df.pivot_table(index='year', aggfunc=len, values='pdid')
# the result
crimeagg

Out[105]:

year
  137853
  137639
  141311
  139860
  133525
  132699
  140858
  152811
  150159
  155810
Name: pdid, dtype: int64

In [106]:

# let's chart it
fig, ax = plt.subplots()
ax.bar(crimeagg.index, crimeagg.values, align='center')
ax.set_xticks(range(2006, 2016))
ax.set_xlim(xmin=2005);

Out[106]:

Pivot by category

A general count of crime is not interesting. Lets dive down into categories

Pivot the dataframe by category, and then by year

In [107]:

catdf = df.pivot_table(index='category', columns='year', values='pdid', aggfunc=len)

In [108]:

catdf.head()

Out[108]:

In [109]:

catdf['delta_2015_2010'] = (catdf[2015] - catdf[2010]) / catdf[2010]
catdf.sort_values('delta_2015_2010')

Out[109]:

In [110]:

# filter the list to types of crime that have had more than a 1000 reported incidents
# total for 2010 and 2015
majorcatdf = catdf[catdf[2015] + catdf[2010] > 1000]
majorcatdf.sort_values('delta_2015_2010')

Out[110]:

Larceny/Thefts

Go back to the original table, now facet by descript

In [111]:

thefts_df = df[df['category'] == 'LARCENY/THEFT']
thefts_pivot = thefts_df.pivot_table(index='descript', columns='year', values='pdid', aggfunc=len)

In [112]:

thefts_pivot['delta_2015_2010'] = (thefts_pivot[2015] - thefts_pivot[2010]) / thefts_pivot[2010]

In [113]:

thefts_pivot.sort_values('delta_2015_2010').head()

Out[113]:

In [114]:

# remove noise
majorthefts_pivot = thefts_pivot[thefts_pivot[2015] + thefts_pivot[2010] > 100]
majorthefts_pivot.sort_values('delta_2015_2010')

Out[114]:

Vehicle thefts

http://www.sfgate.com/crime/article/Car-thefts-surging-in-San-Francisco-6516607.php

In [115]:

autothefts_df = df[df['category'] == 'VEHICLE THEFT']
autothefts_piv = autothefts_df.pivot_table(index='descript', columns='year', values='pdid', aggfunc=len)
autothefts_piv['delta_2015_2010'] = (autothefts_piv[2015] - autothefts_piv[2010]) / autothefts_piv[2010]
major_autothefts_piv = autothefts_piv[autothefts_piv[2015] + autothefts_piv[2010] > 100]

In [116]:

major_autothefts_piv.sort_values('delta_2015_2010')

Out[116]:

By district

In [117]:

v_df = df[df['category'] == 'VEHICLE THEFT']
v_piv = v_df.pivot_table(index='pddistrict', columns='year', values='pdid', aggfunc=len)
v_piv['delta_2014_2010'] = (v_piv[2014] - v_piv[2010]) / v_piv[2010]

In [118]:

v_piv.sort_values('delta_2014_2010', ascending=False)

Out[118]:

In [ ]: