Book a Demo!
CoCalc Logo Icon
StoreFeaturesDocsShareSupportNewsAboutPoliciesSign UpSign In
CloudPak-Outcomes
GitHub Repository: CloudPak-Outcomes/Outcomes-Projects
Path: blob/main/L4assets/DSandMLOpsAssets/HandsOn/Notebooks/DS Accident data exploration.ipynb
1928 views
Kernel: Python 3.10

Accident data exploration and cleansing

On CPDaaS: Make sure to first insert a "project token"

Click on the three vertical dots icon in the uper right of the screen, then click on Insert project token

Once inserted, execute the cell.

A project token is only available if you followed the prerequesite instructions to create on in your project.

import warnings import pandas as pd import numpy as np import math import time import os from ibm_watson_studio_lib import access_project_or_space import matplotlib.pyplot as plt # matplotlib.patches lets us create colored patches, which we can use for legends in plots import matplotlib.patches as mpatches %matplotlib inline # Get access to the prohject API for CPD on-premises if "USER_ID" in os.environ : wslib = access_project_or_space()

Get the Chicago data

If you already got the dataset in a previous notebook execution, you can get the final dataset from the project in a later cell.

For more information on finding and accessing open datasets, see:

You can also find information on the dataset used in this notebook at: Chicago Traffic Crashes - Crashes

# Library used to read datasets # https://github.com/xmunoz/sodapy !pip install sodapy 2>&1 >pipsodapy.txt from sodapy import Socrata

Get a connection to the city of Chicago public data

See:

The Socrata Open Data API allows you to programmatically access a wealth of open data resources from governments, non-profits, and NGOs around the world.

# Unauthenticated client only works with public data sets. Note 'None' # in place of application token, and no username or password: client = Socrata("data.cityofchicago.org", None)

Retrieve the six months before May 15, 2023

Six months of data is sufficient to get a good idea of the state of accidents in Chicago.
This notebook uses upto May 15 to make it consistent from execution to execution.

from datetime import date from dateutil.relativedelta import relativedelta # If we wanted to do today: # six_months = (date.today() - relativedelta(months=+6)).strftime('%Y-%m-%d') # We are using a fix date for future comparisons six_months = (date(2023,5,15) - relativedelta(months=+6)).strftime('%Y-%m-%d') where = "crash_date > '{}'".format(six_months)

Note:

If the next cell execution fails, try again.

# The request may timeout. If so, retry it. # The looping is required since the API returns at most 10000 records per call. crashes_df = pd.DataFrame(client.get("85ca-t3if", where=where, limit=10000)) offset = 10000 result = client.get("85ca-t3if", where=where, offset=offset, limit=10000) while (len(result) > 0) : # crashes_df = crashes_df.append(pd.DataFrame(result), sort=True) crashes_df = pd.concat([crashes_df, pd.DataFrame(result)], ignore_index=True) offset += 10000 result = client.get("85ca-t3if", where=where, offset=offset, limit=10000) print("Number of records: {}, number of columns: {}".format(crashes_df.shape[0], crashes_df.shape[1]))

Explore the dataset

You already know from the previous cell that there are 52842 records with 49 columns.

Try the following:

  • DataFrame.head: display the first few records

  • DataFrame.dtypes: provides the type of each column

  • DataFrame.count: Count number of non-NA/null observations.

  • DataFrame.max: Maximum of the values in the object.

  • DataFrame.min: Minimum of the values in the object.

  • DataFrame.nunique: Count number of distinct elements in specified axis.

  • DataFrame.groupby: Group records by values in a specific column.

See also Byte-Size Data Science:

# Display a few records
# Look at the types of each column
# Convert a few columns # Convert the two datetime columns to the proper type and the speed limit crashes_df['crash_date'] = \ crashes_df['crash_date'].apply(pd.to_datetime, infer_datetime_format=True, errors='coerce') crashes_df['date_police_notified'] = \ crashes_df['date_police_notified'].apply(pd.to_datetime, infer_datetime_format=True, errors='coerce') crashes_df['posted_speed_limit'] = crashes_df['posted_speed_limit'].astype(int)
# How many non-null values in each column? # If the count is low, the column is likely useless
# Look at the smallest values in each column # Keep in mind that most columns are treated as strings. # A conversion to a proper type would be more appropriate
# Look at the largest values in each column
# Look at the number of unique values for columns: # posted_speed_limit, traffic_control_device, weather_condition, and roadway_surface_cond
# How many of each unique value in 'posted_speed_limit' ?
# How many of each unique value in 'traffic_control_device' ?

Exploration conclusion

There is a lot more that can be done in data exploration depending on how much of the information provided by the records you want to use.

  • Look at the data: This gives a basic idea of what is in there.

  • Look at the types in the Pandas dataframe: Reading from Socrata returns "object"s!

  • Convert some columns: After more analysis, it is better to convert the columns to their appropriate types. This can provide better values in other statistics.

  • Doing a count of non-null values: tells us that some columns include too few values to be useful

  • Looking at min/max values: Shows the range of values in each column.
    For example, seeing a minimum speed limit of 0 seems suspicious.

  • Number of unique values in a column: Can identify or justify if a column contains categorical values

  • Number of each categorical values: How balanced are the values?
    The traffic_control_device column has 28729 values set to "NO CONTROLS". That's over 50% of the values!
    Seeing all the categorical values can show issues. The posted_speed_limit column includes: 0, 1, 2, 3, 5, 8, 9, 23...
    What should be done with those? Aggregate to the closest "standard" value? Ignore them?

This lab uses the latitude and longitude and adds injuries_fatal and injuries_total. The exploration shows that some rows do not include latitude and longitude (479 records). They must be removed.

Get only accidents with longitude/latitude

  • Remove records without latitude and longitude

  • Use only a few columns

  • Convert them to their proper types

# Has to be a better way to do this... # Select a few columns crashes_df = crashes_df[['injuries_fatal','injuries_total','latitude','longitude']] # convert 'injuries_fatal' and 'injuries_total' to float otherwide, int causes problems. crashes_df = crashes_df.astype({'injuries_fatal': float, 'injuries_total': float, 'latitude': float, 'longitude': float}) crashes_df = crashes_df.dropna() # Remove missing values # make sure it includes only rows with non-zero longitude and latitude crashes_df = crashes_df[crashes_df['longitude'] != 0] crashes_df = crashes_df[crashes_df['latitude'] != 0] print("Number of crashes: {}".format(crashes_df.shape[0]))

Save the data to the project

This way we can avoid re-reading the data from the Chicago site

crashes_df.to_csv("ChicagoCrashes.csv", index=False) res = wslib.upload_file('ChicagoCrashes.csv') print("File {} uploaded".format(res['name']))

Read the data from the project

If you are returning, you can simply read the local file instead of going back to Chicago

body = wslib.load_data("ChicagoCrashes.csv") crashes_df = pd.read_csv(body) crashes_df.head()

Continue here after getting the final crashes_df

How can you know if the data has a decent distribution?

Latitude and longitude provide location information. This is not the same as X and Y coordinates but considering the relatively small area covered by the Chicago area, you can treat them as equivalent.

You can get a good idea of the distribution through a scatter plot.

For more information on spatial data, look at Byte-Size Data Science:

Divide dataset into accident categories: fatal, non-fatal but with injuries, none of the above

This will give us a better idea of the overall accident picture

killed_df = crashes_df[crashes_df['injuries_fatal']>0] injured_df = crashes_df[np.logical_and(crashes_df['injuries_total']>0, crashes_df['injuries_fatal']==0)] # killed_or_injured_df = killed_df.append(injured_df) killed_or_injured_df = pd.concat([killed_df, injured_df], ignore_index=True) nothing_df = crashes_df[np.logical_and(crashes_df['injuries_fatal']==0, crashes_df['injuries_total']==0)] print("Number of records: {}".format(crashes_df.shape[0])) print("Number of fatal accidents: {}".format(killed_df.shape[0])) print("Number of injury accidents: {}".format(injured_df.shape[0])) print("Number of no-injury accidents: {}".format(nothing_df.shape[0])) crashes_df.describe()

Scatterplot

Create a visualization of the accidents. Note that this is not a map !

Having a graphical representation of our data can give us some insights on how to proceed.

# Use calculated values for the plot limits (border) minlong = crashes_df['longitude'].min(axis=0) - 0.005 maxlong = crashes_df['longitude'].max(axis=0) + 0.005 minlat = crashes_df['latitude'].min(axis=0) - 0.005 maxlat = crashes_df['latitude'].max(axis=0) + 0.005 print("min, max, longitude, latitude: {}, {}, {}, {}".format(minlong,maxlong,minlat,maxlat))
nb_rows = 1 nb_plots = 2 fig, axes = plt.subplots(nrows=nb_rows, ncols=2) fig.set_figheight(6) fig.set_figwidth(18) axes[0].scatter(crashes_df.longitude, crashes_df.latitude, color='darkseagreen', alpha=0.05, s=2) axes[0].title.set_text('Motor Vehicle Accidents in Chicago (last six months)') axes[0].set_xlabel('Longitude', labelpad = 5) axes[0].set_ylabel('latitude', labelpad = 5) axes[1].scatter(nothing_df.longitude, nothing_df.latitude, color='blue', alpha=0.04, s=2) axes[1].scatter(injured_df.longitude, injured_df.latitude, color='yellow', alpha=0.12, s=2) axes[1].scatter(killed_df.longitude, killed_df.latitude, color='red', alpha=1, s=2) #create legend blue_patch = mpatches.Patch( label='car body damage', color='blue') yellow_patch = mpatches.Patch(color='yellow', label='personal injury') red_patch = mpatches.Patch(color='red', label='lethal accidents') axes[1].legend([blue_patch, yellow_patch, red_patch],('car body damage', 'personal injury', 'fatal accidents'), loc='upper left', prop={'size':10}) axes[1].title.set_text('Severity of Motor Vehicle Collisions in Chicago') axes[1].set_xlabel('Longitude', labelpad = 5) axes[1].set_ylabel('latitude', labelpad = 5) plt.show()

Conclusion

You can see that the accidents are well distributed to the point that the scaater plot almost simulate a map of the chicago streets. The resulting data is what you need to move forward.

So much more exploration could have been done. This notebook gives a good feel of what should be done with data before using it.

Author

Jacques Roy is a member of the IBM Enablement for Data and AI

Copyright © 2023. This notebook and its source code are released under the terms of the MIT License.