Accident data exploration and cleansing
On CPDaaS: Make sure to first insert a "project token"
Click on the three vertical dots icon in the upper right of the screen, then click on Insert project token
Once inserted, execute the cell.
A project token is only available if you followed the prerequisite instructions to create one in your project.
Get the Chicago data
If you already got the dataset in a previous notebook execution, you can get the final dataset from the project in a later cell.
For more information on finding and accessing open datasets, see:
Youtube Byte-Size Data Science:
Companion notebooks:
You can also find information on the dataset used in this notebook at: Chicago Traffic Crashes - Crashes
Get a connection to the city of Chicago public data
See:
The Socrata Open Data API allows you to programmatically access a wealth of open data resources from governments, non-profits, and NGOs around the world.
Retrieve the six months before May 15, 2023
Six months of data is sufficient to get a good idea of the state of accidents in Chicago.
This notebook uses data up to May 15, 2023 so that the results stay consistent from one execution to the next.
Note:
If the next cell execution fails, try again.
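A minimal sketch of how the six-month window could be expressed as a Socrata (SoQL) filter. The helper names `months_before` and `crash_window_where` are hypothetical, and the field name `crash_date` is an assumption about the dataset schema. The commented lines show one way the filter might be passed to the `sodapy` client; the dataset identifier there is also an assumption, so check it against the dataset page before relying on it.

```python
from datetime import date, timedelta


def months_before(d: date, months: int) -> date:
    """Return the same day of the month, `months` earlier.

    Only safe when the day exists in the target month (true for the 15th).
    """
    total = d.year * 12 + (d.month - 1) - months
    return date(total // 12, total % 12 + 1, d.day)


def crash_window_where(end: date, months: int = 6) -> str:
    """Build a SoQL `where` clause covering `months` months up to and including `end`."""
    start = months_before(end, months)
    stop = end + timedelta(days=1)  # exclusive upper bound keeps all of `end`
    return (
        f"crash_date >= '{start.isoformat()}T00:00:00' "
        f"AND crash_date < '{stop.isoformat()}T00:00:00'"
    )


where = crash_window_where(date(2023, 5, 15))
print(where)

# Assumed usage with sodapy (requires network access, not run here):
# from sodapy import Socrata
# client = Socrata("data.cityofchicago.org", None)
# rows = client.get("85ca-t3if", where=where, limit=100000)  # dataset id is an assumption
```

Fixing the end date rather than using today's date is what keeps the record count stable between runs.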
Explore the dataset
You already know from the previous cell that there are 52842 records with 49 columns.
Try the following:
DataFrame.head: Display the first few records.
DataFrame.dtypes: Provides the type of each column.
DataFrame.count: Count number of non-NA/null observations.
DataFrame.max: Maximum of the values in the object.
DataFrame.min: Minimum of the values in the object.
DataFrame.nunique: Count number of distinct elements in specified axis.
DataFrame.groupby: Group records by values in a specific column.
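The calls listed above can be tried on a tiny stand-in frame before running them on the full dataset. The `sample_df` values below are invented for illustration; only the column names come from this notebook. The values are deliberately stored as strings to mimic what a Socrata read returns.

```python
import pandas as pd

# Invented sample rows; every value is text, as returned by the API.
sample_df = pd.DataFrame({
    "posted_speed_limit": ["30", "30", "0", "35"],
    "traffic_control_device": ["NO CONTROLS", "TRAFFIC SIGNAL", "NO CONTROLS", None],
    "latitude": ["41.88", None, "41.90", "41.75"],
})

print(sample_df.head())      # first few records
print(sample_df.dtypes)      # text columns, not numbers
print(sample_df.count())     # non-null values per column
print(sample_df.nunique())   # distinct values per column

# Number of records per category (missing categories are skipped by default).
device_counts = sample_df.groupby("traffic_control_device").size()
print(device_counts)

# min/max are only meaningful once the column is numeric.
speed = pd.to_numeric(sample_df["posted_speed_limit"])
print(speed.min(), speed.max())
```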
See also Byte-Size Data Science:
Exploration conclusion
There is a lot more that can be done in data exploration depending on how much of the information provided by the records you want to use.
Look at the data: This gives a basic idea of what is in there.
Look at the types in the Pandas dataframe: Reading from Socrata returns "object"s!
Convert some columns: After more analysis, it is better to convert the columns to their appropriate types. This can provide better values in other statistics.
Doing a count of non-null values: tells us that some columns include too few values to be useful
Looking at min/max values: Shows the range of values in each column.
For example, seeing a minimum speed limit of 0 seems suspicious.
Number of unique values in a column: Can identify or justify if a column contains categorical values
Count of each categorical value: How balanced are the values?
The traffic_control_device column has 28729 values set to "NO CONTROLS". That's over 50% of the values!
Seeing all the categorical values can show issues. The posted_speed_limit column includes: 0, 1, 2, 3, 5, 8, 9, 23...
What should be done with those? Aggregate to the closest "standard" value? Ignore them?
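One possible answer to the question above is to snap each odd value to the closest "standard" limit. This is only a sketch of that option, not what this lab does: the `STANDARD_LIMITS` list and the `snap_to_standard` helper are assumptions made for illustration.

```python
import pandas as pd

# Assumed set of "standard" posted limits, in multiples of 5.
STANDARD_LIMITS = [5, 10, 15, 20, 25, 30, 35, 40, 45, 50, 55, 60, 65, 70]


def snap_to_standard(value, standards=STANDARD_LIMITS):
    """Return the standard limit closest to `value` (ties go to the lower one)."""
    return min(standards, key=lambda s: abs(s - value))


raw_limits = pd.Series([0, 1, 2, 3, 5, 8, 9, 23, 30])
snapped = raw_limits.map(snap_to_standard)
print(pd.DataFrame({"raw": raw_limits, "snapped": snapped}))
```

Whether a limit of 0 should become 5 or be treated as missing is a judgment call that depends on how the column is used later.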
This lab uses the latitude and longitude and adds injuries_fatal and injuries_total. The exploration shows that some rows do not include latitude and longitude (479 records). They must be removed.
Get only accidents with longitude/latitude
Remove records without latitude and longitude
Use only a few columns
Convert them to their proper types
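The three steps above can be sketched on a small invented frame. The four retained columns are the ones this lab names; the extra `weather_condition` column and all values in `raw_df` are made up for illustration.

```python
import pandas as pd

# Invented rows; values are strings, as they are after reading from Socrata.
raw_df = pd.DataFrame({
    "latitude": ["41.88", None, "41.90"],
    "longitude": ["-87.63", None, "-87.70"],
    "injuries_fatal": ["0", "0", "1"],
    "injuries_total": ["2", "0", "1"],
    "weather_condition": ["CLEAR", "RAIN", "CLEAR"],
})

keep = ["latitude", "longitude", "injuries_fatal", "injuries_total"]

# 1. Remove records without latitude and longitude.
# 2. Use only a few columns.
crashes_df = raw_df.dropna(subset=["latitude", "longitude"])[keep].copy()

# 3. Convert them to their proper types (floats for coordinates, integers for counts).
for col in keep:
    crashes_df[col] = pd.to_numeric(crashes_df[col])

print(crashes_df.dtypes)
print(len(raw_df) - len(crashes_df), "record(s) dropped")
```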
Save the data to the project
This way we can avoid re-reading the data from the Chicago site
Read the data from the project
If you are returning, you can simply read the local file instead of going back to Chicago
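The save-then-reload idea can be illustrated with a plain local CSV file. This is a stand-in, assuming a local file system rather than the project storage used in this lab; the file name `crashes.csv` and the sample values are hypothetical.

```python
import os
import tempfile

import pandas as pd

# Invented cleaned data standing in for the final crashes_df.
crashes_df = pd.DataFrame({
    "latitude": [41.88, 41.90],
    "longitude": [-87.63, -87.70],
    "injuries_fatal": [0, 1],
    "injuries_total": [2, 1],
})

# Save once...
path = os.path.join(tempfile.mkdtemp(), "crashes.csv")  # hypothetical file name
crashes_df.to_csv(path, index=False)

# ...and on a later visit, read the file back instead of querying the site again.
restored_df = pd.read_csv(path, float_precision="round_trip")
print(restored_df.dtypes)
```

Because CSV stores numbers as text, the column types are inferred again on read, so it is worth checking `dtypes` after loading.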
Continue here after getting the final crashes_df
How can you know if the data has a decent distribution?
Latitude and longitude provide location information. This is not the same as X and Y coordinates but considering the relatively small area covered by the Chicago area, you can treat them as equivalent.
You can get a good idea of the distribution through a scatter plot.
For more information on spatial data, look at Byte-Size Data Science:
Divide dataset into accident categories: fatal, non-fatal but with injuries, none of the above
This will give us a better idea of the overall accident picture
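A minimal sketch of the three-way split, assuming the cleaned frame has numeric `injuries_fatal` and `injuries_total` columns. The `category` column name, its labels, and the sample rows are illustrative choices, not part of the dataset.

```python
import numpy as np
import pandas as pd

# Invented sample standing in for crashes_df.
crashes_df = pd.DataFrame({
    "injuries_fatal": [0, 1, 0, 0, 0],
    "injuries_total": [2, 1, 0, 0, 3],
})

# Conditions are checked in order, so a fatal accident is never counted as "injuries".
conditions = [
    crashes_df["injuries_fatal"] > 0,
    crashes_df["injuries_total"] > 0,
]
crashes_df["category"] = np.select(conditions, ["fatal", "injuries"], default="none")

category_counts = crashes_df["category"].value_counts()
print(category_counts)
```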
Scatterplot
Create a visualization of the accidents. Note that this is not a map!
Having a graphical representation of our data can give us some insights on how to proceed.
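A sketch of such a scatter plot with matplotlib, treating longitude as x and latitude as y. The sample coordinates, colors, and marker sizes are arbitrary choices for illustration.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; not needed inside a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Invented sample standing in for crashes_df.
crashes_df = pd.DataFrame({
    "latitude": [41.88, 41.90, 41.75, 41.95],
    "longitude": [-87.63, -87.70, -87.60, -87.68],
    "injuries_fatal": [0, 1, 0, 0],
    "injuries_total": [2, 1, 0, 0],
})

fatal = crashes_df["injuries_fatal"] > 0
injured = ~fatal & (crashes_df["injuries_total"] > 0)
neither = ~fatal & ~injured

fig, ax = plt.subplots(figsize=(8, 10))
# Draw the most common category first so the rarer ones stay visible on top.
layers = [
    (neither, "lightgray", "no injuries"),
    (injured, "orange", "injuries"),
    (fatal, "red", "fatal"),
]
for mask, color, label in layers:
    subset = crashes_df[mask]
    ax.scatter(subset["longitude"], subset["latitude"], s=4, c=color, label=label)

ax.set_xlabel("longitude")
ax.set_ylabel("latitude")
ax.legend()
```

With the full dataset, small markers help the street grid show through the density of points.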
Conclusion
You can see that the accidents are well distributed, to the point that the scatter plot almost reproduces a map of the Chicago streets. The resulting data is what you need to move forward.
Much more exploration could have been done. This notebook gives a good feel for what should be done with data before using it.
Author
Jacques Roy is a member of the IBM Enablement for Data and AI
Copyright © 2023. This notebook and its source code are released under the terms of the MIT License.