GitHub Repository: CloudPak-Outcomes/Outcomes-Projects
Path: blob/main/L4assets/DSandMLOpsAssets/HandsOn/Notebooks/DS Accident data exploration.ipynb
¹⁹²⁸ views

Kernel: Python 3.10

Accident data exploration and cleansing

On CPDaaS: Make sure to first insert a "project token"

Click on the three vertical dots icon in the uper right of the screen, then click on Insert project token

Once inserted, execute the cell.

A project token is only available if you followed the prerequesite instructions to create on in your project.

In [ ]:

import warnings
import pandas as pd
import numpy as np
import math
import time
import os
from ibm_watson_studio_lib import access_project_or_space

import matplotlib.pyplot as plt
# matplotlib.patches lets us create colored patches, which we can use for legends in plots
import matplotlib.patches as mpatches
%matplotlib inline

# Get access to the prohject API for CPD on-premises
if "USER_ID" in os.environ :
    wslib = access_project_or_space()

Get the Chicago data

If you already got the dataset in a previous notebook execution, you can get the final dataset from the project in a later cell.

For more information on finding and accessing open datasets, see:

Youtube Byte-Size Data Science:
- 019-Finding Data: Socrata Catalog (youtube video)
- 020-Socrata Datasets (youtube video)
Companion notebooks:
- 019-SocrataDataCatalog.ipynb
- 020-SocrataDataAccess-Spark.ipynb

You can also find information on the dataset used in this notebook at: Chicago Traffic Crashes - Crashes

In [ ]:

# Library used to read datasets
# https://github.com/xmunoz/sodapy
!pip install sodapy 2>&1 >pipsodapy.txt

from sodapy import Socrata

Get a connection to the city of Chicago public data

See:

The Socrata Open Data API allows you to programmatically access a wealth of open data resources from governments, non-profits, and NGOs around the world.

In [ ]:

# Unauthenticated client only works with public data sets. Note 'None'
# in place of application token, and no username or password:
client = Socrata("data.cityofchicago.org", None)

Retrieve the six months before May 15, 2023

Six months of data is sufficient to get a good idea of the state of accidents in Chicago.
This notebook uses upto May 15 to make it consistent from execution to execution.

In [ ]:

from datetime import date
from dateutil.relativedelta import relativedelta

# If we wanted to do today:
# six_months = (date.today() - relativedelta(months=+6)).strftime('%Y-%m-%d')
# We are using a fix date for future comparisons
six_months = (date(2023,5,15) - relativedelta(months=+6)).strftime('%Y-%m-%d')
where = "crash_date > '{}'".format(six_months)

Note:

If the next cell execution fails, try again.

In [ ]:

# The request may timeout. If so, retry it.
# The looping is required since the API returns at most 10000 records per call.
crashes_df = pd.DataFrame(client.get("85ca-t3if", where=where, limit=10000))
offset = 10000
result = client.get("85ca-t3if", where=where, offset=offset, limit=10000)
while (len(result) > 0) :
    # crashes_df = crashes_df.append(pd.DataFrame(result), sort=True)
    crashes_df = pd.concat([crashes_df, pd.DataFrame(result)], ignore_index=True)
    offset += 10000
    result = client.get("85ca-t3if", where=where, offset=offset, limit=10000)

print("Number of records: {}, number of columns: {}".format(crashes_df.shape[0], crashes_df.shape[1]))

Explore the dataset

You already know from the previous cell that there are 52842 records with 49 columns.

Try the following:

DataFrame.head: display the first few records
DataFrame.dtypes: provides the type of each column
DataFrame.count: Count number of non-NA/null observations.
DataFrame.max: Maximum of the values in the object.
DataFrame.min: Minimum of the values in the object.
DataFrame.nunique: Count number of distinct elements in specified axis.
DataFrame.groupby: Group records by values in a specific column.

Exploration conclusion

There is a lot more that can be done in data exploration depending on how much of the information provided by the records you want to use.

Look at the data: This gives a basic idea of what is in there.
Look at the types in the Pandas dataframe: Reading from Socrata returns "object"s!
Convert some columns: After more analysis, it is better to convert the columns to their appropriate types. This can provide better values in other statistics.
Doing a count of non-null values: tells us that some columns include too few values to be useful
Looking at min/max values: Shows the range of values in each column.
For example, seeing a minimum speed limit of 0 seems suspicious.
Number of unique values in a column: Can identify or justify if a column contains categorical values
Number of each categorical values: How balanced are the values?
The traffic_control_device column has 28729 values set to "NO CONTROLS". That's over 50% of the values!
Seeing all the categorical values can show issues. The posted_speed_limit column includes: 0, 1, 2, 3, 5, 8, 9, 23...
What should be done with those? Aggregate to the closest "standard" value? Ignore them?

This lab uses the latitude and longitude and adds injuries_fatal and injuries_total. The exploration shows that some rows do not include latitude and longitude (479 records). They must be removed.

Get only accidents with longitude/latitude

Remove records without latitude and longitude
Use only a few columns
Convert them to their proper types

In [ ]:

# Has to be a better way to do this...
# Select a few columns
crashes_df = crashes_df[['injuries_fatal','injuries_total','latitude','longitude']]

# convert 'injuries_fatal' and 'injuries_total' to float otherwide, int causes problems.
crashes_df = crashes_df.astype({'injuries_fatal': float, 'injuries_total': float,
                                'latitude': float, 'longitude': float})

crashes_df = crashes_df.dropna() # Remove missing values

# make sure it includes only rows with non-zero longitude and latitude
crashes_df = crashes_df[crashes_df['longitude'] != 0]
crashes_df = crashes_df[crashes_df['latitude'] != 0]

print("Number of crashes: {}".format(crashes_df.shape[0]))

Save the data to the project

This way we can avoid re-reading the data from the Chicago site

In [ ]:

crashes_df.to_csv("ChicagoCrashes.csv", index=False)
res = wslib.upload_file('ChicagoCrashes.csv')
print("File {} uploaded".format(res['name']))

Read the data from the project

If you are returning, you can simply read the local file instead of going back to Chicago

In [ ]:

body = wslib.load_data("ChicagoCrashes.csv")
crashes_df = pd.read_csv(body)
crashes_df.head()

Continue here after getting the final crashes_df

How can you know if the data has a decent distribution?

Latitude and longitude provide location information. This is not the same as X and Y coordinates but considering the relatively small area covered by the Chicago area, you can treat them as equivalent.

You can get a good idea of the distribution through a scatter plot.

For more information on spatial data, look at Byte-Size Data Science:

Divide dataset into accident categories: fatal, non-fatal but with injuries, none of the above

This will give us a better idea of the overall accident picture

In [ ]:

killed_df = crashes_df[crashes_df['injuries_fatal']>0]
injured_df = crashes_df[np.logical_and(crashes_df['injuries_total']>0, crashes_df['injuries_fatal']==0)]
# killed_or_injured_df = killed_df.append(injured_df)
killed_or_injured_df = pd.concat([killed_df, injured_df], ignore_index=True)
nothing_df = crashes_df[np.logical_and(crashes_df['injuries_fatal']==0, crashes_df['injuries_total']==0)]

print("Number of records: {}".format(crashes_df.shape[0]))
print("Number of fatal accidents: {}".format(killed_df.shape[0]))
print("Number of injury accidents: {}".format(injured_df.shape[0]))
print("Number of no-injury accidents: {}".format(nothing_df.shape[0]))

crashes_df.describe()

Scatterplot

Create a visualization of the accidents. Note that this is not a map !

Having a graphical representation of our data can give us some insights on how to proceed.

In [ ]:

# Use calculated values for the plot limits (border)
minlong = crashes_df['longitude'].min(axis=0) - 0.005
maxlong = crashes_df['longitude'].max(axis=0) + 0.005
minlat = crashes_df['latitude'].min(axis=0) - 0.005
maxlat = crashes_df['latitude'].max(axis=0) + 0.005
print("min, max, longitude, latitude: {}, {}, {}, {}".format(minlong,maxlong,minlat,maxlat))

In [ ]:

nb_rows = 1
nb_plots = 2

fig, axes = plt.subplots(nrows=nb_rows, ncols=2)
fig.set_figheight(6)
fig.set_figwidth(18)

axes[0].scatter(crashes_df.longitude, crashes_df.latitude, color='darkseagreen', alpha=0.05, s=2)
axes[0].title.set_text('Motor Vehicle Accidents in Chicago (last six months)')
axes[0].set_xlabel('Longitude', labelpad = 5)
axes[0].set_ylabel('latitude', labelpad = 5)

axes[1].scatter(nothing_df.longitude, nothing_df.latitude, color='blue', alpha=0.04, s=2)
axes[1].scatter(injured_df.longitude, injured_df.latitude, color='yellow', alpha=0.12, s=2)
axes[1].scatter(killed_df.longitude, killed_df.latitude, color='red', alpha=1, s=2)

#create legend
blue_patch = mpatches.Patch( label='car body damage', color='blue')
yellow_patch = mpatches.Patch(color='yellow', label='personal injury')
red_patch = mpatches.Patch(color='red', label='lethal accidents')
axes[1].legend([blue_patch, yellow_patch, red_patch],('car body damage', 'personal injury', 'fatal accidents'), 
           loc='upper left', prop={'size':10})
axes[1].title.set_text('Severity of Motor Vehicle Collisions in Chicago')
axes[1].set_xlabel('Longitude', labelpad = 5)
axes[1].set_ylabel('latitude', labelpad = 5)
plt.show()

Conclusion

You can see that the accidents are well distributed to the point that the scaater plot almost simulate a map of the chicago streets. The resulting data is what you need to move forward.

So much more exploration could have been done. This notebook gives a good feel of what should be done with data before using it.

Author

Jacques Roy is a member of the IBM Enablement for Data and AI