GitHub Repository: CloudPak-Outcomes/Outcomes-Projects
Path: blob/main/L4assets/DSandMLOpsAssets/HandsOn/Notebooks/DS Accident clusters.ipynb
¹⁹²⁸ views

Kernel: Python 3.10

Finding accident clusters

CPDaaS: Make sure to first insert a "project token"

Click on the three vertical dots icon in the uper right of the screen, then click on Insert project token

Once inserted, execute the cell.

A project token is only available if you followed the prerequesite instructions to create one in your project.

In [ ]:

import warnings
import pandas as pd
import numpy as np
import os
from ibm_watson_studio_lib import access_project_or_space

import matplotlib.pyplot as plt
# matplotlib.patches lets us create colored patches, which we can use for legends in plots
import matplotlib.patches as mpatches

%matplotlib inline

# Get access to the prohject API for CPD on-premises
if "USER_ID" in os.environ :
    wslib = access_project_or_space()

In [ ]:

# Install folium for map rendering
!pip install folium 2>&1 >foliumpip.out

import folium

Read the ChicagoCrashes.csv file

The Chicago crashes were collected in a previous exercise and stored in a file in the project after being reduces to the cleansed data required.

In [ ]:

body = wslib.load_data("ChicagoCrashes.csv")
crashes_df = pd.read_csv(body)
crashes_df.head()

Divide dataset into accident categories: fatal, non-fatal but with injuries, none of the above

We'll need different sets of data based on if there are injuries or not.

In [ ]:

killed_df = crashes_df[crashes_df['injuries_fatal']>0]
injured_df = crashes_df[np.logical_and(crashes_df['injuries_total']>0, crashes_df['injuries_fatal']==0)]
# killed_or_injured_df = killed_df.append(injured_df) # "append" has been deprecated in favor of concat
killed_or_injured_df = pd.concat([killed_df, injured_df], ignore_index=True)
nothing_df = crashes_df[np.logical_and(crashes_df['injuries_fatal']==0, crashes_df['injuries_total']==0)]

print("Number of records: {}".format(crashes_df.shape[0]))
print("Number of fatal accidents: {}".format(killed_df.shape[0]))
print("Number of injury accidents: {}".format(injured_df.shape[0]))
print("Number of no-injury accidents: {}".format(nothing_df.shape[0]))

Finding clusters of accidents

There are multiple ways to cluster data based on similarities. This notebook limits itself to trying the DBSCAN and the Agglomerative Clustering algorithms. This demonstrates the need to try multiple methods before deciding on a final solution.

For more information on clustering, see:

Byte-Size Data Science Youtube videos:
- 76 clustering video
- 77 DBSCAN video
Byte-Size Data Science accompanying Notebooks:
- 076 Clustering
- 077 DBSCAN
sklearn.cluster.DBSCAN
sklearn.cluster.AgglomerativeClustering

Find the clusters with DBSCAN

DBSCAN eps parameter is used to identify maximum distance between two samples for one to be considered as in the neighborhood of the other.

Distances: In the Chicago area, the value 0.00015 represents roughly:

Horizontal (longitudinal) distance: 40 feet
Vertical (latitudinal) distance: 54 feet
Diagonal distance: 68 feet

In [ ]:

from sklearn.cluster import DBSCAN
from sklearn import metrics
from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler

In [ ]:

# The eps value of 0.0015 was chosen after multiple trials
# Trying multiple parameter values is essential in finding the best solution 
db = DBSCAN(eps=0.0015, min_samples=50).fit(crashes_df[['latitude','longitude']])

core_samples_mask = np.zeros_like(db.labels_, dtype=bool)
core_samples_mask[db.core_sample_indices_] = True
labels = db.labels_

# Number of clusters in labels, ignoring noise if present.
n_clusters_ = len(set(labels)) - (1 if -1 in labels else 0)
n_noise_ = list(labels).count(-1)

print('Estimated number of clusters: %d' % n_clusters_)
print('Estimated number of noise points: %d' % n_noise_)

Display the cluster centers on a map

In [ ]:

results_df = crashes_df[['latitude','longitude']].copy(deep=True)
results_df['cluster'] = db.labels_
results_df = results_df[results_df['cluster'] > -1].reset_index(drop=True) # remove noise points
results_df = results_df.groupby('cluster').agg(latitude=pd.NamedAgg(column='latitude',aggfunc="mean"), 
                                               longitude = pd.NamedAgg(column='longitude',aggfunc="mean"), 
                                               cnt = pd.NamedAgg(column='cluster', aggfunc='count') ).\
                                           reset_index().sort_values('cnt', ascending=False).reset_index(drop=True)
results_df['cluster'] = results_df.index # lower index bigger cluster
minval=results_df['cnt'].min()
maxval= 1 + results_df['cnt'].max()

In [ ]:

# Use a colormap to colorcode the count of accidents in each cluster
# It's difficult to find a good colormap to use.
# see: https://matplotlib.org/3.3.3/gallery/color/colormap_reference.html
import matplotlib.cm as cm

colors = cm.get_cmap('viridis', maxval)(range(maxval),bytes=True)

rgbcolors = []
for v in colors :
    col = np.floor(v * 255)
    r = int(col[0])
    g = int(col[1])
    b = int(col[2])
    rgbcolors.append('#' + '{0:#08x}'.format(((r * 65536) + (g * 256) + b))[2:])

In [ ]:

# Display the average center of each group
latlong = results_df[['latitude','longitude']].mean(axis=0) # To center the map
chi_map = folium.Map(location=[latlong[0], latlong[1]], zoom_start=10, width="90%", height="90%")

# Since the dataframe is ordered, we can display the top 20 only
for idx, coord in results_df[0:20].iterrows():
    tooltip_content="Cluster: {0}, count: {1}".format(coord['cluster'].astype(int),coord['cnt'].astype(int) )
    folium.Circle(radius=500,
                  location=[coord['latitude'], coord['longitude']],
                  # popup=row.hgroup,
                  color=rgbcolors[coord['cnt'].astype(int) ],
                  tooltip=tooltip_content,
                  fill=True,
                  fill_color=rgbcolors[coord['cnt'].astype(int) ]
    ).add_to(chi_map)
    
chi_map

DBSCAN conclusion

The DBSCAN algorithm requires a lot of tuning to arrive at a desired solution. The results are difficult to evaluate.

In this notebook example, there are 66 clusters and 44914 accidents were dismissed as noise. Different parameter values results in different number of clusters and noise values. The top three clusters are close to each other, around downtowm Chicago. You can see this by zooming in the map and hover the cursor over the cluster centers. The cluster numbers represent theorder of the clusters.

It may be possible to figure out a way to group clusters together to get to a top-5 list. Instead, we'lll look at a different approach.

Find the clusters with hierarchical

for the following algorithm, using all the accidents (51,272) is too much for the resources available in the notebook. The notebook restarts the kernel. I assume it runs out of resources.

Instead, we use the accidents with injuries (fatal or not). Around 7,200 records. The reasoning is that these accidents are more significant and should provide more significant clusters.

If you run out of resources in the next step, change your runtime to a larger one such as: Runtime 22.2.on Python 3.10 XS

In [ ]:

# Import objects assuming the k-means section was skipped
from scipy import ndimage 
from scipy.cluster import hierarchy 
from scipy.cluster.hierarchy import dendrogram
from scipy.spatial import distance_matrix 
from sklearn import manifold 
from sklearn.cluster import AgglomerativeClustering 
#from sklearn.datasets.samples_generator import make_blobs
from sklearn.datasets import make_blobs

First pass: get the hierarchy

The first step is to see how the hierarchy is put together. The result is seen in a dendrogram.

Display the hierarchy

In [ ]:

plt.figure(figsize=(9, 9))  
ret = dendrogram(linkage_matrix, truncate_mode='lastp', p=50, no_plot=False, orientation='left')

Comments on the hierarchy

The hierarchy shows how smaller clusters aggregate into larger ones. If we use a vertical line at any point in the hierarchy, we can see how many clusters would be required. It appears at around the horizontal value of 3, we can get exactly 5 clusters.

Since we decided that we wanted five hotspots, it fits our needs.

Second pass: Get 5 clusters

Earlier, we decided to use 5 hotspots. The following cells retirve the five clusters and display them on a map.

The visual result allows us to confirm that the clusters are appropriate.

In [ ]:

ac = AgglomerativeClustering(n_clusters=5,distance_threshold=None)
clusters = ac.fit(killed_or_injured_df[['longitude','latitude']])
print("n_clusters_: {}\nn_leaves_: {}".format(clusters.n_clusters_, clusters.n_leaves_))
print("n_connected_components_: {}\nn_features_in_: {}".format(clusters.n_connected_components_, clusters.n_features_in_))

In [ ]:

# Add the labels to the accidents and aggregate by cluster
results_df = killed_or_injured_df[['latitude','longitude']].copy(deep=True)
results_df['cluster'] = clusters.labels_
results_df = results_df[results_df['cluster'] > -1].reset_index(drop=True) # remove noise points
results_df = results_df.groupby('cluster').agg(latitude=pd.NamedAgg(column='latitude',aggfunc="mean"), 
                                               longitude = pd.NamedAgg(column='longitude',aggfunc="mean"), 
                                               cnt = pd.NamedAgg(column='cluster', aggfunc='count') ).\
                                           reset_index().sort_values('cnt', ascending=False).reset_index(drop=True)
results_df['cluster'] = results_df.index # lower index bigger cluster
minval=results_df['cnt'].min()
maxval= 1 + results_df['cnt'].max()

In [ ]:

import matplotlib.cm as cm

colors = cm.get_cmap('viridis', maxval)(range(maxval),bytes=True)

rgbcolors = []
for v in colors :
    col = np.floor(v * 255)
    r = int(col[0])
    g = int(col[1])
    b = int(col[2])
    rgbcolors.append('#' + '{0:#08x}'.format(((r * 65536) + (g * 256) + b))[2:])

In [ ]:

# Display the average center of each group
latlong = results_df[['latitude','longitude']].mean(axis=0) # To center the map
chi_map = folium.Map(location=[latlong[0], latlong[1]], zoom_start=10, width="90%", height="90%")

# The dataframe is ordered
for idx, coord in results_df.iterrows():
    tooltip_content="Cluster: {0}, count: {1}".format(coord['cluster'].astype(int),coord['cnt'].astype(int) )
    folium.Circle(radius=500,
                  location=[coord['latitude'], coord['longitude']],
                  # popup=row.hgroup,
                  color=rgbcolors[coord['cnt'].astype(int) ],
                  tooltip=tooltip_content,
                  fill=True,
                  fill_color=rgbcolors[coord['cnt'].astype(int) ]
    ).add_to(chi_map)
    
chi_map

Hover over the results

If you hover your cursor over a cluster center, you can see the cluster number and the nuber of accidents attached to it.

In [ ]:

# Show the cluster information
# We can se the balance in the clusters by looking at the cnt column
results_df.head()

Save the cluster information to a file

If you've spent too long through the notebook, the wslib.upload operation may fail due to the expiration of the token. To refresh the connection, go back up and execute the cell just before "Read the ChicagoCrashes.csv file".

The cell ends with: wslib = access_project_or_space(params)

This will retrieve a new token and re-create the wslib client.

In [ ]:

results_df.to_csv("ClusterRecords.csv", index=False)
res = wslib.upload_file('ClusterRecords.csv')
print("File {} uploaded".format(res['name']))

Clustering conclusion

Many projects may require more thant straightforward supervised learning models.

This notebook demonstrate how "full-code" can be used to work with open-source algorithms to get to a solution. It does not pretend to have gotten the optimal solution, just a possible one that appears promising.

It also shows the difficulty of choosing an algorithm and evaluating the results. Data science is as much of an art as it is a science. It relies on the expertise and experience of data scientists.

Author

Jacques Roy is a member of the IBM Enablement for Data and AI

In [ ]: