GitHub Repository: CloudPak-Outcomes/Outcomes-Projects
Path: blob/main/L4assets/DSandMLOpsAssets/HandsOn/Notebooks/DS comparing clusters.ipynb

Kernel: Python 3.10

In [1]:

# @hidden_cell
# The project token is an authorization token that is used to access project resources like data sources, connections, and used by platform APIs.
from project_lib import Project
project = Project(project_id='627da813-ad40-48cf-a59e-974811e89f12', project_access_token='p-2+upcQqH1x6Y9uvOhcCzaGLw==;oisnD8BkwUV8hGfzcFDc1g==:XoDU3iw4mPFSLrQWBDUoiSNdG8ZhDPpywzgAjamLj11HTFkZXqCPcuup8FzA2AG4EAGdcRlxaowkCB8en4MRTfTyKiqkaXr0aQ==')
pc = project.project_context

from ibm_watson_studio_lib import access_project_or_space
wslib = access_project_or_space({'token':'p-2+upcQqH1x6Y9uvOhcCzaGLw==;oisnD8BkwUV8hGfzcFDc1g==:XoDU3iw4mPFSLrQWBDUoiSNdG8ZhDPpywzgAjamLj11HTFkZXqCPcuup8FzA2AG4EAGdcRlxaowkCB8en4MRTfTyKiqkaXr0aQ=='})

Compare SPSS and AgglomerativeClustering accident clusters

CPDaaS: Make sure to first insert a "project token"

Click the three vertical dots icon in the upper right of the screen, then click Insert project token.

Once inserted, execute the cell.

A project token is only available if you followed the prerequisite instructions to create one in your project.

In [2]:

import warnings
import pandas as pd
import numpy as np
import os

from ibm_watson_studio_lib import access_project_or_space

# Get access to the project API for CPD on-premises
if "USER_ID" in os.environ :
    wslib = access_project_or_space()

In [3]:

# Install folium for map rendering
# Redirect stdout to the file first, then send stderr to the same place
!pip install folium >foliumpip.out 2>&1

import folium

Read the cluster files

The columns are renamed so that they match across the three sets.

In [4]:

body = wslib.load_data("ClusterRecords.csv")
sklearn_df = pd.read_csv(body)
sklearn_df.head()

Out[4]:

In [5]:

body = wslib.load_data("SPSSClusters.csv")
spss_df = pd.read_csv(body)

# Order the clusters by count, descending
spss_df.sort_values("Record_Count", ascending=False, ignore_index=True, inplace=True)
spss_df['$XC-autocluster'] = spss_df.index
spss_df = spss_df.rename(columns={"latitude_Mean": "latitude", "longitude_Mean": "longitude",
                         "$XC-autocluster": "cluster", "Record_Count": "cnt"})
spss_df.head()

Out[5]:

In [6]:

body = wslib.load_data("SPSSClustersAll.csv")
spssall_df = pd.read_csv(body)

# Order the clusters by count, descending
spssall_df.sort_values("Record_Count", ascending=False, ignore_index=True, inplace=True)
spssall_df['$XC-autocluster'] = spssall_df.index
spssall_df = spssall_df.rename(columns={"latitude_Mean": "latitude", "longitude_Mean": "longitude",
                         "$XC-autocluster": "cluster", "Record_Count": "cnt"})
spssall_df.head()

Out[6]:
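The two SPSS cells above repeat the same sort, relabel, and rename steps. As a minimal sketch, those steps could be factored into one helper so that each file is relabeled from its own index. The function name `normalize_spss_clusters` and the sample rows are hypothetical and used only for illustration; they are not part of the original notebook or its data.

```python
import pandas as pd

# Hypothetical helper (not in the original notebook): applies the same
# sort / relabel / rename steps used for both SPSS files.
def normalize_spss_clusters(df: pd.DataFrame) -> pd.DataFrame:
    # Order the clusters by record count, largest first
    out = df.sort_values("Record_Count", ascending=False, ignore_index=True)
    # Relabel the clusters 0..n-1 in that order, using the frame's own index
    out["$XC-autocluster"] = out.index
    return out.rename(columns={"latitude_Mean": "latitude",
                               "longitude_Mean": "longitude",
                               "$XC-autocluster": "cluster",
                               "Record_Count": "cnt"})

# Made-up sample rows standing in for the contents of an SPSS export
sample = pd.DataFrame({
    "$XC-autocluster": ["cluster-1", "cluster-2", "cluster-3"],
    "latitude_Mean": [41.75, 41.90, 41.80],
    "longitude_Mean": [-87.60, -87.65, -87.70],
    "Record_Count": [120, 300, 45],
})
normalized = normalize_spss_clusters(sample)
print(normalized)
```

With the notebook's data, the same helper could be applied to the frames read from `SPSSClusters.csv` and `SPSSClustersAll.csv` before plotting.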

Display the cluster centers on a map

In [7]:

# Colors to use: red for the sklearn clusters, blue for the SPSS clusters,
# and green for the SPSS clusters built from the complete dataset
rgbcolors = ["#FF0000", "#0000CC", "#7FFF00"] # red, blue, and green
colornames= ["red", "blue", "green"]
prefix = ["sklearn-", "spss-", "spssall-"]
cluster_grps = [sklearn_df, spss_df, spssall_df]

# Center the map on the mean position of the sklearn cluster centers
latlong = sklearn_df[['latitude','longitude']].mean(axis=0)
chi_map = folium.Map(location=[latlong['latitude'], latlong['longitude']], zoom_start=10, width="90%", height="90%")

# Loop over the three cluster sets, one map layer per set
for ix in range(3) :
    # One FeatureGroup per cluster set so that LayerControl can toggle it
    fg = folium.FeatureGroup(name="{}: {}".format(prefix[ix], colornames[ix]))
    for idx, coord in cluster_grps[ix].iterrows():
        tooltip_content = "Cluster: {0}{1}, count: {2}".format(prefix[ix], int(coord['cluster']), int(coord['cnt']))
        folium.Circle(radius=500,
                  location=[coord['latitude'], coord['longitude']],
                  # popup=row.hgroup,
                  color=rgbcolors[ix],
                  tooltip=tooltip_content,
                  fill=True,
                  fill_color=rgbcolors[ix]
        ).add_to(fg)  # add the circle to its group, not directly to the map
    chi_map.add_child(fg)

folium.LayerControl(position='topleft', collapsed=False).add_to(chi_map)
chi_map

Out[7]:

Comparison conclusion

Creating a list of 5 clusters with SPSS Modeler was much easier than through a notebook. The SPSS clustering was run both on the limited dataset used by sklearn and on the entire dataset. Both cluster sets were created in a few seconds.

Both SPSS cluster sets end up in roughly the same positions. This suggests that limiting the number of input records was a valid choice, although using the complete input set should be more precise.
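The visual impression that two sets of centers sit close together can be checked numerically. The sketch below, which is not part of the original notebook, computes for each center in one set the great-circle (haversine) distance to the nearest center in another set. The function names and the demo coordinates are made up for illustration; the only assumption about the data is the `cluster`, `latitude`, and `longitude` columns used in the cells above.

```python
import numpy as np
import pandas as pd

EARTH_RADIUS_KM = 6371.0  # mean Earth radius

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between points given in degrees."""
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    a = (np.sin((lat2 - lat1) / 2) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * np.arcsin(np.sqrt(a))

def nearest_center_km(a_df, b_df):
    """For each center in a_df, distance to the closest center in b_df."""
    # Broadcast to a (len(a_df), len(b_df)) matrix of pairwise distances
    d = haversine_km(a_df["latitude"].to_numpy()[:, None],
                     a_df["longitude"].to_numpy()[:, None],
                     b_df["latitude"].to_numpy()[None, :],
                     b_df["longitude"].to_numpy()[None, :])
    return pd.Series(d.min(axis=1), index=a_df["cluster"], name="nearest_km")

# Made-up centers, not taken from the lab data
demo_a = pd.DataFrame({"cluster": [0, 1],
                       "latitude": [41.80, 41.90],
                       "longitude": [-87.60, -87.70]})
demo_b = pd.DataFrame({"cluster": [0, 1],
                       "latitude": [41.90, 41.81],
                       "longitude": [-87.70, -87.60]})
nearest = nearest_center_km(demo_a, demo_b)
print(nearest)
```

With the notebook's frames, a call such as `nearest_center_km(spss_df, spssall_df)` would give one distance per cluster; small values relative to the 500 m circle radius used on the map would support the observation above.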

The resulting cluster sets are relatively similar between sklearn and SPSS. It would take considerable effort to determine which one is better. In this lab, we use the clusters from the complete set of accidents.

Author

Jacques Roy is a member of IBM Enablement for Data and AI.
