GitHub Repository: CloudPak-Outcomes/Outcomes-Projects
Path: blob/main/L4assets/DSandMLOpsAssets/HandsOn/Notebooks/DS comparing clusters.ipynb
Kernel: Python 3.10
# @hidden_cell
# The project token is an authorization token that is used to access project
# resources like data sources, connections, and used by platform APIs.
from project_lib import Project
project = Project(project_id='627da813-ad40-48cf-a59e-974811e89f12', project_access_token='p-2+upcQqH1x6Y9uvOhcCzaGLw==;oisnD8BkwUV8hGfzcFDc1g==:XoDU3iw4mPFSLrQWBDUoiSNdG8ZhDPpywzgAjamLj11HTFkZXqCPcuup8FzA2AG4EAGdcRlxaowkCB8en4MRTfTyKiqkaXr0aQ==')
pc = project.project_context

from ibm_watson_studio_lib import access_project_or_space
wslib = access_project_or_space({'token':'p-2+upcQqH1x6Y9uvOhcCzaGLw==;oisnD8BkwUV8hGfzcFDc1g==:XoDU3iw4mPFSLrQWBDUoiSNdG8ZhDPpywzgAjamLj11HTFkZXqCPcuup8FzA2AG4EAGdcRlxaowkCB8en4MRTfTyKiqkaXr0aQ=='})

Compare SPSS and AgglomerativeClustering accident clusters

CPDaaS: Make sure to first insert a "project token"

Click the three vertical dots icon in the upper right of the screen, then click Insert project token.

Once inserted, execute the cell.

A project token is only available if you followed the prerequisite instructions to create one in your project.

import warnings
import pandas as pd
import numpy as np
import os
from ibm_watson_studio_lib import access_project_or_space

# Get access to the project API for CPD on-premises
if "USER_ID" in os.environ:
    wslib = access_project_or_space()

# Install folium for map rendering
!pip install folium 2>&1 >foliumpip.out
import folium

Read the cluster files

The columns are renamed so that they match across the three sets.

body = wslib.load_data("ClusterRecords.csv")
sklearn_df = pd.read_csv(body)
sklearn_df.head()

body = wslib.load_data("SPSSClusters.csv")
spss_df = pd.read_csv(body)
# Order the clusters by count, descending
spss_df.sort_values("Record_Count", ascending=False, ignore_index=True, inplace=True)
# After the sort, the row index is the size rank and becomes the cluster number
spss_df['$XC-autocluster'] = spss_df.index
spss_df = spss_df.rename(columns={"latitude_Mean": "latitude",
                                  "longitude_Mean": "longitude",
                                  "$XC-autocluster": "cluster",
                                  "Record_Count": "cnt"})
spss_df.head()

body = wslib.load_data("SPSSClustersAll.csv")
spssall_df = pd.read_csv(body)
# Order the clusters by count, descending
spssall_df.sort_values("Record_Count", ascending=False, ignore_index=True, inplace=True)
# Use this frame's own index (the original used spss_df.index here)
spssall_df['$XC-autocluster'] = spssall_df.index
spssall_df = spssall_df.rename(columns={"latitude_Mean": "latitude",
                                        "longitude_Mean": "longitude",
                                        "$XC-autocluster": "cluster",
                                        "Record_Count": "cnt"})
spssall_df.head()
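
The sort-and-rename steps above are the same for both SPSS files, so they could be factored into one helper. The following is only a minimal sketch: the function name `normalize_spss_clusters` and the sample values are hypothetical, and only the column names are taken from the cells above.

```python
import pandas as pd

# Hypothetical SPSS-style export; the values are made up for illustration.
raw = pd.DataFrame({
    "$XC-autocluster": ["cluster-3", "cluster-1", "cluster-2"],
    "latitude_Mean": [41.75, 41.88, 41.95],
    "longitude_Mean": [-87.60, -87.63, -87.70],
    "Record_Count": [120, 450, 300],
})


def normalize_spss_clusters(df: pd.DataFrame) -> pd.DataFrame:
    """Rank clusters by size and rename columns to the shared schema."""
    # sort_values returns a new frame, so the input is left untouched
    out = df.sort_values("Record_Count", ascending=False, ignore_index=True)
    # After the sort, the row index is the size rank: 0 is the largest cluster
    out["$XC-autocluster"] = out.index
    return out.rename(columns={"latitude_Mean": "latitude",
                               "longitude_Mean": "longitude",
                               "$XC-autocluster": "cluster",
                               "Record_Count": "cnt"})


clusters = normalize_spss_clusters(raw)
print(clusters)
```

Because the helper numbers each frame from its own index, it avoids accidentally reusing the index of a different frame.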

Display the cluster centers on a map

# Colors to use: red for the sklearn clusters, blue for the spss clusters,
# and green for the spssall clusters
rgbcolors = ["#FF0000", "#0000CC", "#7FFF00"]  # red, blue, and green
colornames = ["red", "blue", "green"]
prefix = ["sklearn-", "spss-", "spssall-"]
cluster_grps = [sklearn_df, spss_df, spssall_df]

# Center the map on the mean position of the sklearn cluster centers
latlong = sklearn_df[['latitude', 'longitude']].mean(axis=0)
chi_map = folium.Map(location=[latlong['latitude'], latlong['longitude']],
                     zoom_start=10, width="90%", height="90%")

# Loop over the three cluster sets
for ix in range(3):
    # One layer per cluster set so that the layer control can toggle it
    fg = folium.FeatureGroup(name="{}: {}".format(prefix[ix], colornames[ix]))
    for idx, coord in cluster_grps[ix].iterrows():
        tooltip_content = "Cluster: {0}{1}, count: {2}".format(
            prefix[ix], int(coord['cluster']), int(coord['cnt']))
        folium.Circle(radius=500,
                      location=[coord['latitude'], coord['longitude']],
                      color=rgbcolors[ix],
                      tooltip=tooltip_content,
                      fill=True,
                      fill_color=rgbcolors[ix]
                      ).add_to(fg)
    chi_map.add_child(fg)

folium.LayerControl(position='topleft', collapsed=False).add_to(chi_map)
chi_map

Comparison conclusion

Creating a set of five clusters with SPSS Modeler was much easier than doing so in a notebook. The SPSS clustering was run both on the limited dataset used by sklearn and on the entire dataset. Both runs completed in a few seconds.

Both SPSS cluster sets end up in roughly the same positions. This suggests that limiting the number of input records was a valid choice, although using the complete input set should be more precise.

The cluster sets produced by sklearn and SPSS are also fairly similar. Determining which one is better would take considerable effort. In this lab, we use the clusters from the complete set of accidents.
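
One rough way to put a number on "similar" is to measure how far each center in one set lies from its nearest center in another set. The sketch below is not part of the original lab: the helper names and the sample coordinates are hypothetical, and it uses the haversine great-circle formula with a mean Earth radius of 6371 km.

```python
from math import radians, sin, cos, asin, sqrt

EARTH_RADIUS_KM = 6371.0  # mean Earth radius, an approximation


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometers between two points in degrees."""
    phi1, phi2 = radians(lat1), radians(lat2)
    dphi = phi2 - phi1
    dlmb = radians(lon2 - lon1)
    a = sin(dphi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * asin(sqrt(a))


def nearest_center_distances(centers_a, centers_b):
    """For each (lat, lon) in centers_a, return (index in centers_b, km)."""
    result = []
    for lat_a, lon_a in centers_a:
        dists = [haversine_km(lat_a, lon_a, lat_b, lon_b)
                 for lat_b, lon_b in centers_b]
        best = min(range(len(dists)), key=dists.__getitem__)
        result.append((best, dists[best]))
    return result


# Made-up coordinates standing in for two sets of cluster centers
sklearn_centers = [(41.88, -87.63), (41.75, -87.60)]
spss_centers = [(41.76, -87.61), (41.89, -87.64)]

matches = nearest_center_distances(sklearn_centers, spss_centers)
for i, (j, dist) in enumerate(matches):
    print(f"sklearn-{i} -> spss-{j}: {dist:.2f} km")
```

With the notebook's data frames, the center lists could be built with something like `list(sklearn_df[['latitude', 'longitude']].itertuples(index=False, name=None))`. Small nearest-center distances would support the visual impression from the map, but they do not say which clustering is better.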

Author

Jacques Roy is a member of IBM Enablement for Data and AI.

Copyright © 2023. This notebook and its source code are released under the terms of the MIT License.