Finding accident clusters
CPDaaS: Make sure to first insert a "project token"
Click on the three vertical dots icon in the upper right of the screen, then click on Insert project token.
Once inserted, execute the cell.
A project token is only available if you followed the prerequisite instructions to create one in your project.
Read the ChicagoCrashes.csv file
The Chicago crash records were collected in a previous exercise, reduced to the cleansed data required, and stored in a file in the project.
Divide dataset into accident categories: fatal, non-fatal but with injuries, none of the above
We'll need different sets of data depending on whether an accident involved injuries.
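The split described above could be sketched with pandas as follows. This is only an illustration on a tiny synthetic sample: the column names `INJURIES_TOTAL` and `INJURIES_FATAL` are assumptions about the cleansed file, not taken from the notebook.

```python
import pandas as pd

# Tiny synthetic sample; the column names are assumptions, not the notebook's.
crashes = pd.DataFrame({
    "LATITUDE": [41.88, 41.89, 41.90, 41.91],
    "LONGITUDE": [-87.63, -87.64, -87.65, -87.66],
    "INJURIES_TOTAL": [0, 2, 1, 0],
    "INJURIES_FATAL": [0, 0, 1, 0],
})

# Fatal accidents: at least one fatality.
fatal = crashes[crashes["INJURIES_FATAL"] > 0]

# Non-fatal accidents with at least one injury.
injuries_only = crashes[
    (crashes["INJURIES_FATAL"] == 0) & (crashes["INJURIES_TOTAL"] > 0)
]

# Everything else: no fatality and no injury.
no_injuries = crashes[
    (crashes["INJURIES_FATAL"] == 0) & (crashes["INJURIES_TOTAL"] == 0)
]

# Accidents with injuries, fatal or not.
with_injuries = pd.concat([fatal, injuries_only])
```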
Finding clusters of accidents
There are multiple ways to cluster data based on similarities. This notebook limits itself to trying the DBSCAN and the Agglomerative Clustering algorithms. This demonstrates the need to try multiple methods before deciding on a final solution.
For more information on clustering, see:
Byte-Size Data Science Youtube videos:
Byte-Size Data Science accompanying Notebooks:
Find the clusters with DBSCAN
The DBSCAN eps parameter sets the maximum distance between two samples for one to be considered in the neighborhood of the other.
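As a minimal sketch of how eps is used, the following runs scikit-learn's DBSCAN on a handful of synthetic (latitude, longitude) points. The points and the `min_samples=3` value are assumptions chosen for illustration; the notebook's actual parameters are not shown here.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Synthetic (latitude, longitude) points: two tight groups plus one isolated point.
coords = np.array([
    [41.88000, -87.63000],
    [41.88005, -87.63005],
    [41.88010, -87.63000],
    [41.90000, -87.65000],
    [41.90005, -87.65005],
    [41.90010, -87.65010],
    [41.95000, -87.70000],
])

# eps is expressed in the units of the data (degrees here).
# min_samples=3 is an assumed value for this sketch.
model = DBSCAN(eps=0.00015, min_samples=3).fit(coords)
labels = model.labels_

# DBSCAN labels noise points with -1, so exclude that label when counting clusters.
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
n_noise = int(np.sum(labels == -1))
print(f"clusters: {n_clusters}, noise points: {n_noise}")
```

With the default Euclidean metric, eps is compared against differences in raw degrees, so the same eps covers a different ground distance east-west than north-south.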
Distances: In the Chicago area, an eps value of 0.00015 degrees represents roughly:
Horizontal (longitudinal) distance: 40 feet
Vertical (latitudinal) distance: 54 feet
Diagonal distance: 68 feet
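The figures above can be approximated with a short calculation. This sketch assumes a spherical Earth (about 111,320 meters per degree of latitude) and a reference latitude of 41.88 degrees for downtown Chicago; both values are assumptions, so the results are only approximate.

```python
import math

EPS_DEGREES = 0.00015
CHICAGO_LATITUDE = 41.88          # assumed reference latitude, in degrees
METERS_PER_DEGREE_LAT = 111_320   # spherical-Earth approximation
FEET_PER_METER = 3.28084


def eps_in_feet(eps_deg, latitude_deg):
    """Return (east-west, north-south, diagonal) distances in feet for eps_deg degrees."""
    north_south = eps_deg * METERS_PER_DEGREE_LAT * FEET_PER_METER
    # A degree of longitude shrinks with the cosine of the latitude.
    east_west = north_south * math.cos(math.radians(latitude_deg))
    diagonal = math.hypot(east_west, north_south)
    return east_west, north_south, diagonal


east_west, north_south, diagonal = eps_in_feet(EPS_DEGREES, CHICAGO_LATITUDE)
print(f"east-west: {east_west:.1f} ft")
print(f"north-south: {north_south:.1f} ft")
print(f"diagonal: {diagonal:.1f} ft")
```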
Display the cluster centers on a map
DBSCAN conclusion
The DBSCAN algorithm requires a lot of tuning to arrive at a desired solution. The results are difficult to evaluate.
In this notebook example, DBSCAN finds 66 clusters and dismisses 44,914 accidents as noise. Different parameter values result in different numbers of clusters and noise points. The top three clusters are close to each other, around downtown Chicago. You can see this by zooming in on the map and hovering the cursor over the cluster centers. The cluster numbers represent the order of the clusters.
It may be possible to figure out a way to group clusters together to get to a top-5 list. Instead, we'll look at a different approach.
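Counts like these can be derived from the labels DBSCAN assigns. The sketch below uses a hypothetical set of labelled points and takes the mean coordinates of each cluster as its center; DBSCAN does not provide centers itself, so that choice is an assumption of this example.

```python
import pandas as pd

# Hypothetical labelled points; -1 is the label DBSCAN gives to noise.
points = pd.DataFrame({
    "lat": [41.880, 41.881, 41.900, 41.950, 41.901,
            41.902, 41.700, 41.600, 41.882, 41.903],
    "lon": [-87.630, -87.631, -87.650, -87.700, -87.651,
            -87.652, -87.500, -87.400, -87.632, -87.653],
    "cluster": [0, 0, 1, -1, 1, 1, 2, -1, 0, 1],
})

n_noise = int((points["cluster"] == -1).sum())
clustered = points[points["cluster"] != -1]

# One row per cluster: mean position and number of accidents, largest first.
centers = (
    clustered.groupby("cluster")
    .agg(lat=("lat", "mean"), lon=("lon", "mean"), accidents=("lat", "count"))
    .sort_values("accidents", ascending=False)
)
print(f"noise points: {n_noise}")
print(centers)
```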
Find the clusters with hierarchical clustering
For the following algorithm, using all 51,272 accidents is too much for the resources available to the notebook. The notebook restarts the kernel, presumably because it runs out of resources.
Instead, we use the accidents with injuries (fatal or not), which amount to around 7,200 records. The reasoning is that these accidents are more significant and should provide more significant clusters.
If you run out of resources in the next step, change your runtime to a larger one such as: Runtime 22.2 on Python 3.10 XS
First pass: get the hierarchy
The first step is to see how the hierarchy is put together. The result is seen in a dendrogram.
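A minimal sketch of building the hierarchy with SciPy follows. The points are synthetic and the Ward linkage method is an assumption; the notebook's own choice is not shown here.

```python
import numpy as np
from scipy.cluster.hierarchy import dendrogram, linkage

rng = np.random.default_rng(0)

# Synthetic points scattered around three assumed centers.
centers = np.array([[41.88, -87.63], [41.95, -87.70], [41.75, -87.60]])
coords = np.vstack([c + rng.normal(scale=0.002, size=(10, 2)) for c in centers])

# Each row of Z records one merge: the two clusters joined,
# the distance between them, and the size of the new cluster.
Z = linkage(coords, method="ward")

# no_plot=True returns the tree layout as a dict without drawing it.
# Drop no_plot (with matplotlib installed) to display the dendrogram.
tree = dendrogram(Z, no_plot=True)
print(Z.shape, len(tree["leaves"]))
```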
See also:
Display the hierarchy
Comments on the hierarchy
The hierarchy shows how smaller clusters aggregate into larger ones. If we draw a vertical line at any point in the hierarchy, the number of branches it crosses is the number of clusters at that level. At a horizontal value of around 3, we get exactly 5 clusters.
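Drawing a line across the dendrogram corresponds to cutting the hierarchy at a distance threshold. The sketch below shows this with SciPy's `fcluster` on synthetic data in arbitrary units. The threshold of 3.0 is chosen to suit this synthetic data and the Ward linkage is an assumption, so neither should be read as the notebook's actual values.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

# Five well-separated synthetic groups of four points each (arbitrary units).
group_centers = np.array([[0, 0], [10, 0], [0, 10], [10, 10], [5, 5]], dtype=float)
offsets = np.array([[-0.1, -0.1], [-0.1, 0.1], [0.1, -0.1], [0.1, 0.1]])
coords = np.vstack([c + offsets for c in group_centers])

Z = linkage(coords, method="ward")

# Cut at a distance threshold: merges above 3.0 are not applied.
labels_by_distance = fcluster(Z, t=3.0, criterion="distance")

# Alternatively, ask directly for at most five clusters.
labels_by_count = fcluster(Z, t=5, criterion="maxclust")

print(len(np.unique(labels_by_distance)), len(np.unique(labels_by_count)))
```

The `maxclust` criterion is convenient when the number of hotspots is decided in advance, while the distance criterion mirrors reading the cut off the dendrogram.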
Since we decided that we wanted five hotspots, it fits our needs.
Second pass: Get 5 clusters
Earlier, we decided to use 5 hotspots. The following cells retrieve the five clusters and display them on a map.
The visual result allows us to confirm that the clusters are appropriate.
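The second pass could look roughly like the following, using scikit-learn's AgglomerativeClustering with a fixed number of clusters. The data is synthetic, the Ward linkage is an assumption, and using the mean position as a cluster center is a choice made for this sketch.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import AgglomerativeClustering

rng = np.random.default_rng(42)

# Synthetic accident locations scattered around five assumed hotspots.
hotspots = np.array([
    [41.88, -87.63],
    [41.95, -87.70],
    [41.75, -87.60],
    [41.80, -87.75],
    [41.99, -87.66],
])
coords = np.vstack([h + rng.normal(scale=0.003, size=(20, 2)) for h in hotspots])

# Ask for exactly five clusters.
model = AgglomerativeClustering(n_clusters=5, linkage="ward")
labels = model.fit_predict(coords)

# One row per cluster: mean position (used as the center) and accident count.
summary = (
    pd.DataFrame({"lat": coords[:, 0], "lon": coords[:, 1], "cluster": labels})
    .groupby("cluster")
    .agg(lat=("lat", "mean"), lon=("lon", "mean"), accidents=("lat", "count"))
)
print(summary)
```

A table like `summary` holds what a map marker needs: a position for the center and the number of accidents attached to it.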
Hover over the results
If you hover your cursor over a cluster center, you can see the cluster number and the number of accidents attached to it.
Save the cluster information to a file
If you have spent a long time working through the notebook, the wslib.upload operation may fail because the token has expired. To refresh the connection, go back up and execute the cell just before "Read the ChicagoCrashes.csv file".
The cell ends with: wslib = access_project_or_space(params)
This will retrieve a new token and re-create the wslib client.
Clustering conclusion
Many projects may require more than straightforward supervised learning models.
This notebook demonstrates how "full-code" can be used to work with open-source algorithms to get to a solution. It does not pretend to have reached the optimal solution, just a possible one that appears promising.
It also shows the difficulty of choosing an algorithm and evaluating the results. Data science is as much of an art as it is a science. It relies on the expertise and experience of data scientists.
Author
Jacques Roy is a member of the IBM Enablement for Data and AI
Copyright © 2023. This notebook and its source code are released under the terms of the MIT License.