Book a Demo!
CoCalc Logo Icon
StoreFeaturesDocsShareSupportNewsAboutPoliciesSign UpSign In
CloudPak-Outcomes
GitHub Repository: CloudPak-Outcomes/Outcomes-Projects
Path: blob/main/L4assets/DSandMLOpsAssets/HandsOn/Notebooks/DS Accident clusters.ipynb
1928 views
Kernel: Python 3.10

Finding accident clusters

CPDaaS: Make sure to first insert a "project token"

Click on the three vertical dots icon in the uper right of the screen, then click on Insert project token

Once inserted, execute the cell.

A project token is only available if you followed the prerequesite instructions to create one in your project.

import warnings import pandas as pd import numpy as np import os from ibm_watson_studio_lib import access_project_or_space import matplotlib.pyplot as plt # matplotlib.patches lets us create colored patches, which we can use for legends in plots import matplotlib.patches as mpatches %matplotlib inline # Get access to the prohject API for CPD on-premises if "USER_ID" in os.environ : wslib = access_project_or_space()
# Install folium for map rendering !pip install folium 2>&1 >foliumpip.out import folium

Read the ChicagoCrashes.csv file

The Chicago crashes were collected in a previous exercise and stored in a file in the project after being reduces to the cleansed data required.

body = wslib.load_data("ChicagoCrashes.csv") crashes_df = pd.read_csv(body) crashes_df.head()

Divide dataset into accident categories: fatal, non-fatal but with injuries, none of the above

We'll need different sets of data based on if there are injuries or not.

killed_df = crashes_df[crashes_df['injuries_fatal']>0] injured_df = crashes_df[np.logical_and(crashes_df['injuries_total']>0, crashes_df['injuries_fatal']==0)] # killed_or_injured_df = killed_df.append(injured_df) # "append" has been deprecated in favor of concat killed_or_injured_df = pd.concat([killed_df, injured_df], ignore_index=True) nothing_df = crashes_df[np.logical_and(crashes_df['injuries_fatal']==0, crashes_df['injuries_total']==0)] print("Number of records: {}".format(crashes_df.shape[0])) print("Number of fatal accidents: {}".format(killed_df.shape[0])) print("Number of injury accidents: {}".format(injured_df.shape[0])) print("Number of no-injury accidents: {}".format(nothing_df.shape[0]))

Finding clusters of accidents

There are multiple ways to cluster data based on similarities. This notebook limits itself to trying the DBSCAN and the Agglomerative Clustering algorithms. This demonstrates the need to try multiple methods before deciding on a final solution.

For more information on clustering, see:

Find the clusters with DBSCAN

DBSCAN eps parameter is used to identify maximum distance between two samples for one to be considered as in the neighborhood of the other.

Distances: In the Chicago area, the value 0.00015 represents roughly:

  • Horizontal (longitudinal) distance: 40 feet

  • Vertical (latitudinal) distance: 54 feet

  • Diagonal distance: 68 feet

from sklearn.cluster import DBSCAN from sklearn import metrics from sklearn.datasets import make_blobs from sklearn.preprocessing import StandardScaler
# The eps value of 0.0015 was chosen after multiple trials # Trying multiple parameter values is essential in finding the best solution db = DBSCAN(eps=0.0015, min_samples=50).fit(crashes_df[['latitude','longitude']]) core_samples_mask = np.zeros_like(db.labels_, dtype=bool) core_samples_mask[db.core_sample_indices_] = True labels = db.labels_ # Number of clusters in labels, ignoring noise if present. n_clusters_ = len(set(labels)) - (1 if -1 in labels else 0) n_noise_ = list(labels).count(-1) print('Estimated number of clusters: %d' % n_clusters_) print('Estimated number of noise points: %d' % n_noise_)

Display the cluster centers on a map

results_df = crashes_df[['latitude','longitude']].copy(deep=True) results_df['cluster'] = db.labels_ results_df = results_df[results_df['cluster'] > -1].reset_index(drop=True) # remove noise points results_df = results_df.groupby('cluster').agg(latitude=pd.NamedAgg(column='latitude',aggfunc="mean"), longitude = pd.NamedAgg(column='longitude',aggfunc="mean"), cnt = pd.NamedAgg(column='cluster', aggfunc='count') ).\ reset_index().sort_values('cnt', ascending=False).reset_index(drop=True) results_df['cluster'] = results_df.index # lower index bigger cluster minval=results_df['cnt'].min() maxval= 1 + results_df['cnt'].max()
# Use a colormap to colorcode the count of accidents in each cluster # It's difficult to find a good colormap to use. # see: https://matplotlib.org/3.3.3/gallery/color/colormap_reference.html import matplotlib.cm as cm colors = cm.get_cmap('viridis', maxval)(range(maxval),bytes=True) rgbcolors = [] for v in colors : col = np.floor(v * 255) r = int(col[0]) g = int(col[1]) b = int(col[2]) rgbcolors.append('#' + '{0:#08x}'.format(((r * 65536) + (g * 256) + b))[2:])
# Display the average center of each group latlong = results_df[['latitude','longitude']].mean(axis=0) # To center the map chi_map = folium.Map(location=[latlong[0], latlong[1]], zoom_start=10, width="90%", height="90%") # Since the dataframe is ordered, we can display the top 20 only for idx, coord in results_df[0:20].iterrows(): tooltip_content="Cluster: {0}, count: {1}".format(coord['cluster'].astype(int),coord['cnt'].astype(int) ) folium.Circle(radius=500, location=[coord['latitude'], coord['longitude']], # popup=row.hgroup, color=rgbcolors[coord['cnt'].astype(int) ], tooltip=tooltip_content, fill=True, fill_color=rgbcolors[coord['cnt'].astype(int) ] ).add_to(chi_map) chi_map

DBSCAN conclusion

The DBSCAN algorithm requires a lot of tuning to arrive at a desired solution. The results are difficult to evaluate.

In this notebook example, there are 66 clusters and 44914 accidents were dismissed as noise. Different parameter values results in different number of clusters and noise values. The top three clusters are close to each other, around downtowm Chicago. You can see this by zooming in the map and hover the cursor over the cluster centers. The cluster numbers represent theorder of the clusters.

It may be possible to figure out a way to group clusters together to get to a top-5 list. Instead, we'lll look at a different approach.

Find the clusters with hierarchical

for the following algorithm, using all the accidents (51,272) is too much for the resources available in the notebook. The notebook restarts the kernel. I assume it runs out of resources.

Instead, we use the accidents with injuries (fatal or not). Around 7,200 records. The reasoning is that these accidents are more significant and should provide more significant clusters.

If you run out of resources in the next step, change your runtime to a larger one such as: Runtime 22.2.on Python 3.10 XS

# Import objects assuming the k-means section was skipped from scipy import ndimage from scipy.cluster import hierarchy from scipy.cluster.hierarchy import dendrogram from scipy.spatial import distance_matrix from sklearn import manifold from sklearn.cluster import AgglomerativeClustering #from sklearn.datasets.samples_generator import make_blobs from sklearn.datasets import make_blobs

First pass: get the hierarchy

The first step is to see how the hierarchy is put together. The result is seen in a dendrogram.

See also:

ac = AgglomerativeClustering(n_clusters=None,distance_threshold=0) clusters = ac.fit(killed_or_injured_df[['longitude','latitude']]) print("n_clusters_: {}\nn_leaves_: {}".format(clusters.n_clusters_, clusters.n_leaves_)) print("n_connected_components_: {}\nn_features_in_: {}".format(clusters.n_connected_components_, clusters.n_features_in_))
# Utility function, courtesy of Robert Uleman from IBM def get_linkage_matrix(model, **kwargs): # Create the counts of samples under each node counts = np.zeros(model.children_.shape[0]) n_samples = len(model.labels_) for i, merge in enumerate(model.children_): current_count = 0 for child_idx in merge: if child_idx < n_samples: current_count += 1 # leaf node else: current_count += counts[child_idx - n_samples] counts[i] = current_count linkage_matrix = np.column_stack([model.children_, model.distances_, counts]).astype(float) return(linkage_matrix)
linkage_matrix = get_linkage_matrix(clusters)

Display the hierarchy

plt.figure(figsize=(9, 9)) ret = dendrogram(linkage_matrix, truncate_mode='lastp', p=50, no_plot=False, orientation='left')

Comments on the hierarchy

The hierarchy shows how smaller clusters aggregate into larger ones. If we use a vertical line at any point in the hierarchy, we can see how many clusters would be required. It appears at around the horizontal value of 3, we can get exactly 5 clusters.

Since we decided that we wanted five hotspots, it fits our needs.

Second pass: Get 5 clusters

Earlier, we decided to use 5 hotspots. The following cells retirve the five clusters and display them on a map.

The visual result allows us to confirm that the clusters are appropriate.

ac = AgglomerativeClustering(n_clusters=5,distance_threshold=None) clusters = ac.fit(killed_or_injured_df[['longitude','latitude']]) print("n_clusters_: {}\nn_leaves_: {}".format(clusters.n_clusters_, clusters.n_leaves_)) print("n_connected_components_: {}\nn_features_in_: {}".format(clusters.n_connected_components_, clusters.n_features_in_))
# Add the labels to the accidents and aggregate by cluster results_df = killed_or_injured_df[['latitude','longitude']].copy(deep=True) results_df['cluster'] = clusters.labels_ results_df = results_df[results_df['cluster'] > -1].reset_index(drop=True) # remove noise points results_df = results_df.groupby('cluster').agg(latitude=pd.NamedAgg(column='latitude',aggfunc="mean"), longitude = pd.NamedAgg(column='longitude',aggfunc="mean"), cnt = pd.NamedAgg(column='cluster', aggfunc='count') ).\ reset_index().sort_values('cnt', ascending=False).reset_index(drop=True) results_df['cluster'] = results_df.index # lower index bigger cluster minval=results_df['cnt'].min() maxval= 1 + results_df['cnt'].max()
import matplotlib.cm as cm colors = cm.get_cmap('viridis', maxval)(range(maxval),bytes=True) rgbcolors = [] for v in colors : col = np.floor(v * 255) r = int(col[0]) g = int(col[1]) b = int(col[2]) rgbcolors.append('#' + '{0:#08x}'.format(((r * 65536) + (g * 256) + b))[2:])
# Display the average center of each group latlong = results_df[['latitude','longitude']].mean(axis=0) # To center the map chi_map = folium.Map(location=[latlong[0], latlong[1]], zoom_start=10, width="90%", height="90%") # The dataframe is ordered for idx, coord in results_df.iterrows(): tooltip_content="Cluster: {0}, count: {1}".format(coord['cluster'].astype(int),coord['cnt'].astype(int) ) folium.Circle(radius=500, location=[coord['latitude'], coord['longitude']], # popup=row.hgroup, color=rgbcolors[coord['cnt'].astype(int) ], tooltip=tooltip_content, fill=True, fill_color=rgbcolors[coord['cnt'].astype(int) ] ).add_to(chi_map) chi_map

Hover over the results

If you hover your cursor over a cluster center, you can see the cluster number and the nuber of accidents attached to it.

# Show the cluster information # We can se the balance in the clusters by looking at the cnt column results_df.head()

Save the cluster information to a file

If you've spent too long through the notebook, the wslib.upload operation may fail due to the expiration of the token. To refresh the connection, go back up and execute the cell just before "Read the ChicagoCrashes.csv file".

The cell ends with: wslib = access_project_or_space(params)

This will retrieve a new token and re-create the wslib client.

results_df.to_csv("ClusterRecords.csv", index=False) res = wslib.upload_file('ClusterRecords.csv') print("File {} uploaded".format(res['name']))

Clustering conclusion

Many projects may require more thant straightforward supervised learning models.

This notebook demonstrate how "full-code" can be used to work with open-source algorithms to get to a solution. It does not pretend to have gotten the optimal solution, just a possible one that appears promising.

It also shows the difficulty of choosing an algorithm and evaluating the results. Data science is as much of an art as it is a science. It relies on the expertise and experience of data scientists.

Author

Jacques Roy is a member of the IBM Enablement for Data and AI

Copyright © 2023. This notebook and its source code are released under the terms of the MIT License.