DSCI 100: Introduction to Data Science
Tutorial 10 — Clustering: Class activity
Today, we will be looking at earthquake data from the U.S. Geological Survey.
Each row represents seismograph measurements recorded at different stations. We will be performing k-means clustering to group the measurements based on the depth of the event (in kilometers) and the magnitude of the event, a variable that characterizes its relative size.
The data set earthquake.csv is located in the data folder. Load the data set and call it quake.
We can use the ggmap package to visualize the location of the earthquake activity overlaid on a map of the world.
Now, let's make a scatterplot to look at the relationship between depth and mag (magnitude).
From this visualization (or from what you know about the data set), what is one additional wrangling step we need to take before attempting to perform clustering on this data set?
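One common candidate for such a step, when two variables are measured on very different scales (as depth in kilometers and magnitude are), is standardization. The sketch below is an illustration only: it uses base R and a small synthetic data frame named quake_demo (a made-up stand-in, not the real quake data) to show how both columns can be centered and scaled.

```r
# Synthetic stand-in for the quake data (values are made up for illustration).
set.seed(1)
quake_demo <- data.frame(
  depth = runif(50, min = 0, max = 600),  # kilometers
  mag   = runif(50, min = 4, max = 6.5)   # magnitude
)

# Standardize both columns so each has mean 0 and standard deviation 1.
quake_scaled <- as.data.frame(scale(quake_demo[, c("depth", "mag")]))

# Check the result: column means are (numerically) 0 and standard deviations are 1.
print(round(colMeans(quake_scaled), 10))
print(apply(quake_scaled, 2, sd))
```

Without this step, distances between points would be dominated by depth, simply because its values span hundreds of units while magnitude spans only a few.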
Now, let's use the elbow method to choose the best k! 💪
(That is, the k after which the WSSD, the total within-cluster sum of squared distances, improves by a diminishing amount.)
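The elbow method can be sketched as follows. This is a minimal illustration under stated assumptions: it uses base R's kmeans on a synthetic data frame called quake_demo (made-up values with three artificial groups, not the real quake data), and the range of k from 1 to 9 is an arbitrary choice.

```r
# Synthetic stand-in data with three artificial groups (for illustration only).
set.seed(2022)
quake_demo <- data.frame(
  depth = c(rnorm(40, 20, 5), rnorm(40, 300, 30), rnorm(40, 580, 20)),
  mag   = c(rnorm(40, 4.5, 0.2), rnorm(40, 5.5, 0.2), rnorm(40, 4.8, 0.2))
)

# Standardize both variables before clustering.
quake_scaled <- scale(quake_demo)

# Run k-means for each candidate k and record the total WSSD.
ks <- 1:9
wssd <- sapply(ks, function(k) {
  kmeans(quake_scaled, centers = k, nstart = 10)$tot.withinss
})
elbow <- data.frame(k = ks, total_wssd = wssd)
print(elbow)

# Plot total WSSD against k and look for the "elbow" in the curve.
plot(elbow$k, elbow$total_wssd, type = "b",
     xlab = "k", ylab = "Total WSSD")
```

The nstart = 10 argument restarts the algorithm from several random initializations for each k, which makes it less likely that a poor starting point distorts the curve.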
What is the optimal k? Proceed by clustering with that number of clusters and produce a plot to go along with it. This is our final model.
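A final clustering step might look like the sketch below. It again relies on base R and a synthetic quake_demo data frame (made-up values, not the real quake data), and the value chosen_k = 3 is a placeholder that suits the synthetic data only; it is not meant as the answer for the real data set.

```r
# Synthetic stand-in data with three artificial groups (for illustration only).
set.seed(2022)
quake_demo <- data.frame(
  depth = c(rnorm(40, 20, 5), rnorm(40, 300, 30), rnorm(40, 580, 20)),
  mag   = c(rnorm(40, 4.5, 0.2), rnorm(40, 5.5, 0.2), rnorm(40, 4.8, 0.2))
)
quake_scaled <- scale(quake_demo)

# Placeholder value for the synthetic data; replace with the k from your elbow plot.
chosen_k <- 3
quake_clusters <- kmeans(quake_scaled, centers = chosen_k, nstart = 10)

# Attach the cluster labels to the original (unscaled) data for plotting.
quake_demo$cluster <- factor(quake_clusters$cluster)

# Scatterplot of magnitude against depth, colored by cluster assignment.
plot(quake_demo$depth, quake_demo$mag,
     col = quake_demo$cluster, pch = 19,
     xlab = "Depth (km)", ylab = "Magnitude")

# Number of observations assigned to each cluster.
print(table(quake_demo$cluster))
```

Clustering is done on the standardized values, while the plot uses the original units so the axes remain easy to interpret.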
Now that we have our cluster assignments, we can overlay the earthquakes on top of the map according to their cluster.