GitHub Repository: UBC-DSCI/dsci-100-assets
Path: blob/master/2022-spring/materials/tutorial_activity_clustering/tutorial_activity_clustering.ipynb
Kernel: R

DSCI 100: Introduction to Data Science

Tutorial 10 — Clustering: Class activity

Today, we will be looking at earthquake data from the U.S. Geological Survey.

Each row represents seismograph measurements recorded at a different station. We will use the k-means clustering algorithm to cluster measurements based on the depth of the event (in kilometers) and the magnitude of the event, a variable which characterizes its relative size.

# Install the necessary package for plotting the map.
# Uncomment the line below to install, and then re-comment it once it is installed
# (this only needs to be run once).
# install.packages("ggmap")
# Load in necessary packages
library(ggmap)
library(tidyverse)
library(broom) # importantly, don't forget broom for clustering!
options(repr.matrix.max.rows = 6)

The data set earthquake.csv is located in the data folder. Load the data set and call it quake.

# your code here
fail() # No Answer - remove if you provide an answer
quake
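For reference, one possible answer sketch, assuming the file sits at data/earthquake.csv relative to this notebook:

# read the earthquake data from the data folder and store it as quake
quake <- read_csv("data/earthquake.csv")
quake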

We can use the ggmap package to visualize the location of the earthquake activity overlaid on a map of the world.

options(repr.plot.width = 15)
mapbox <- c(-179.8454, -62.3062, 179.8348, 79.6239)
my_map <- get_map(location = mapbox, source = "stamen", maptype = "toner")
ggmap(my_map) +
    geom_point(data = quake, aes(x = longitude, y = latitude),
               color = "red", size = 3, alpha = 0.5) +
    labs(x = "Longitude", y = "Latitude")

Now, let's make a scatterplot to look at the relationship between depth and mag (magnitude).

options(repr.plot.width = 7)
# your code here
fail() # No Answer - remove if you provide an answer
earthquake_plot
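A minimal sketch of such a plot, assuming the columns are named depth and mag as described above (the object name earthquake_plot comes from the cell scaffold):

# scatterplot of event depth vs. magnitude
earthquake_plot <- quake %>%
    ggplot(aes(x = depth, y = mag)) +
    geom_point(alpha = 0.5) +
    labs(x = "Depth (km)", y = "Magnitude")
earthquake_plot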

From this visualization (or from what you know about the data set), what is one additional wrangling step we need to take before attempting to perform clustering on this data set?

# What wrangling step? ...
# First create a data frame with just the two variables; then scale
# your code here
fail() # No Answer - remove if you provide an answer
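One way this step could look (a sketch; the name quake_scaled is an assumption carried through the sketches below):

# keep only the two clustering variables and standardize them
quake_scaled <- quake %>%
    select(depth, mag) %>%
    mutate(depth = as.numeric(scale(depth)),
           mag = as.numeric(scale(mag)))
quake_scaled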

Now, let's use the elbow method to choose the best k! 💪

(That is, the k after which the WSSD improves by a diminishing amount.)

set.seed(3) # Do not remove
# Remember to use the scaled data frame!
# Try ks = 1 to 9
# Unnest the glanced data frames
# Create the elbow plot
# your code here
fail() # No Answer - remove if you provide an answer
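A sketch of one way to build the elbow plot with broom, assuming the scaled data frame is called quake_scaled (the names elbow_stats and elbow_plot are illustrative):

set.seed(3) # Do not remove
# fit k-means for k = 1 to 9 and collect model summaries with glance()
elbow_stats <- tibble(k = 1:9) %>%
    rowwise() %>%
    mutate(clusts = list(kmeans(quake_scaled, centers = k, nstart = 10))) %>%
    mutate(glanced = list(glance(clusts))) %>%
    unnest(glanced)

# plot total within-cluster sum of squares against k
elbow_plot <- elbow_stats %>%
    ggplot(aes(x = k, y = tot.withinss)) +
    geom_point() +
    geom_line() +
    labs(x = "K", y = "Total within-cluster sum of squares")
elbow_plot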

What is the optimal k? Proceed by clustering with that number of clusters and produce a plot to go along with it. This is our final model.

set.seed(3) # Do not remove
# Do kmeans with the optimal k
# Augment our model with the original data frame
# Plot the clusters -- remember that the new column (with the groupings) is called .cluster
# your code here
fail() # No Answer - remove if you provide an answer
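A sketch of what this final cell might look like, assuming quake_scaled from the wrangling step and using the object name earthquake_clust that the map cell below expects. The value of centers here is only a placeholder; replace it with whatever your elbow plot suggests. The name quake_clustered is illustrative.

set.seed(3) # Do not remove
# fit k-means on the scaled data with the chosen k (placeholder value below)
earthquake_clust <- kmeans(quake_scaled, centers = 3, nstart = 10)

# attach the cluster assignments (.cluster) to the original data frame
quake_clustered <- augment(earthquake_clust, quake)

# plot depth vs. magnitude, coloured by cluster
cluster_plot <- quake_clustered %>%
    ggplot(aes(x = depth, y = mag, colour = .cluster)) +
    geom_point(alpha = 0.5) +
    labs(x = "Depth (km)", y = "Magnitude", colour = "Cluster")
cluster_plot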

Now that we have our cluster assignments, we can overlay the earthquakes on top of the map, coloured according to their cluster.

options(repr.plot.width = 15)
earthquake_with_cluster <- augment(earthquake_clust, quake)
ggmap(my_map) +
    geom_point(data = earthquake_with_cluster,
               aes(x = longitude, y = latitude, colour = .cluster),
               size = 5, alpha = 0.5) +
    labs(x = "Longitude", y = "Latitude", colour = "Cluster")