GitHub Repository: UBC-DSCI/dsci-100-assets
Path: blob/master/2019-spring/materials/worksheet_11/worksheet_11.ipynb
Kernel: R

Worksheet 11 - Clustering

Lecture and Tutorial Learning Goals:

After completing this week's lecture and tutorial work, you will be able to:

  • Describe a case where clustering would be an appropriate tool, and what insight it would provide about the data.

  • Explain the k-means clustering algorithm.

  • Interpret the output of a k-means cluster analysis.

  • Perform k-means clustering in R using the kmeans function.

  • Visualize the output of k-means clustering in R using a scatter plot faceted across each K.

  • Identify when it is necessary to scale variables before clustering, and do this using R.

  • Use the elbow method to choose the number of clusters for k-means.

  • Describe advantages, limitations, and assumptions of the k-means clustering algorithm.

### Run this cell before continuing.
library(tidyverse)
library(testthat)
library(digest)
library(forcats)
library(repr)
library(broom)

Question 0.1

In which of the following scenarios would clustering methods likely be appropriate?

A. Identifying sub-groups of houses according to their house type, value, and geographical location

B. Predicting whether a given user will click on an ad on a website

C. Segmenting customers based on their preferences to target advertising

D. Both A. and B.

E. Both A. and C.

Assign your answer to an object called answer0.0.

# your code here
fail() # No Answer - remove if you provide an answer
answer0.0

test_that('Solution is incorrect', {
    expect_equal(digest(answer0.0), '01a75cb73d67b0f895ff0e61449c7bf8') # we hid the answer to the test here so you can't see it, but we can still run the test
})
print("Success!")

Question 0.2

Which step in the description of the k-means algorithm below is incorrect? (A small runnable demo of kmeans appears after this question.)

  1. Choose the number of clusters

  2. Randomly assign each of the points to one of the clusters

  3. Calculate the position for the cluster centre (centroid) for each of the clusters (this is the middle of the points in the cluster, as measured by straight-line distance)

  4. Re-assign each of the points to the cluster whose centroid is furthest from that point

  5. Repeat steps 1 - 2 until the cluster centroids don't change very much between iterations

Assign your answer to an object called answer0.1.

# your code here
fail() # No Answer - remove if you provide an answer
answer0.1

test_that('Solution is incorrect', {
    expect_equal(digest(answer0.1), 'dbc09cba9fe2583fb01d63c70e1555a8') # we hid the answer to the test here so you can't see it, but we can still run the test
})
print("Success!")
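To see the algorithm in action before moving on, here is a minimal sketch that runs R's built-in kmeans on a small simulated data set (the simulated data and object names here are ours for illustration, not part of the worksheet):

set.seed(1)
toy <- tibble(x = c(rnorm(25, mean = 0), rnorm(25, mean = 5)),
              y = c(rnorm(25, mean = 0), rnorm(25, mean = 5)))
toy_clusters <- kmeans(toy, centers = 2)
toy_clusters$centers  # one centroid per cluster
toy_clusters$size     # number of points assigned to each cluster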

Hoppy craft beer

Craft beer is a strong market in Canada and the US, and it is expanding to other countries as well. If you wanted to get into the craft beer brewing market, you might want to better understand the product landscape. One popular product is hoppy craft beer. Breweries create and label many different kinds of hoppy craft beer, but how many different kinds are there really when you look at the chemical properties instead of the human labels?

We will start to explore this question using a craft beer data set from Kaggle. In this data set, we will use the alcohol by volume (abv column) and the International Bitterness Units (ibu column) as the variables with which to cluster the beers.

Question 1.0

Read in the beers.csv data set and assign it to an object called beer. The data set is located within the worksheet_11 folder.
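As a hedged sketch in the commented-scaffold style used throughout this worksheet (the exact relative path is an assumption; adjust it if your copy of the folder is laid out differently):

# beer <- read_csv("beers.csv")  # assumes beers.csv sits beside this notebook
# head(beer)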

# your code here
fail() # No Answer - remove if you provide an answer
head(beer)

test_that('Solution is incorrect', {
    expect_equal(nrow(beer), 2410)
    expect_equal(ncol(beer), 8)
    expect_true('abv' %in% colnames(beer))
    expect_true('ibu' %in% colnames(beer))
})
print("Success!")

Question 1.1

Let's start by visualizing the variables we are going to use in our cluster analysis as a scatter plot. Name the plot object beer_eda. Remember to follow all the visualization best practices when making this plot, including human-readable labels.

# your code here
fail() # No Answer - remove if you provide an answer
beer_eda

test_that('Solution is incorrect', {
    expect_true(as.character(rlang::get_expr(beer_eda$mapping$x)) %in% c("ibu", "abv"))
    expect_true(as.character(rlang::get_expr(beer_eda$mapping$y)) %in% c("ibu", "abv"))
    expect_that("GeomPoint" %in% class(beer_eda$layers[[1]]$geom), is_true())
    expect_false(beer_eda$labels$x %in% c("ibu", "abv"))
    expect_false(beer_eda$labels$y %in% c("ibu", "abv"))
})
print("Success!")

Question 1.2

We need to clean up this data set a bit. Specifically, we need to remove the rows where ibu is NA and select only the columns we are interested in clustering: ibu and abv. Name the cleaned data clean_beer.

Hint: is.na() will be useful in this filter.
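In the commented-scaffold style used elsewhere in this worksheet, one possible shape of the answer (a sketch, not the definitive solution):

# clean_beer <- beer %>%
#     filter(!is.na(ibu)) %>%
#     select(ibu, abv)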

# your code here
fail() # No Answer - remove if you provide an answer
head(clean_beer)

test_that('Solution is incorrect', {
    expect_equal(nrow(clean_beer), 1405)
    expect_equal(ncol(clean_beer), 2)
    expect_true('abv' %in% colnames(clean_beer))
    expect_that('name' %in% colnames(clean_beer), is_false())
})
print("Success!")

Question 1.3

Given that k-means clustering uses a distance function as part of its algorithm, it is important to scale the variables. Let's do that now using the map_df function (so we can scale all variables at once; this will be really useful for larger data sets!). Also, although it's not necessary (it's also not harmful), let's centre the variables; centring is the default behaviour of R's scale function.

Name the data frame scaled_beer.
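For intuition about what scale does: it subtracts the mean and divides by the standard deviation. A one-line check on a toy vector:

scale(c(1, 2, 3))  # returns -1, 0, 1: centred at 0 with a standard deviation of 1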

# ... <- ... %>%
#     map_df(...)
# your code here
fail() # No Answer - remove if you provide an answer
head(scaled_beer)

test_that('Solution is incorrect', {
    expect_equal(nrow(scaled_beer), 1405)
    expect_equal(ncol(scaled_beer), 2)
    expect_true('abv' %in% colnames(scaled_beer))
    expect_true('ibu' %in% colnames(scaled_beer))
    expect_that('name' %in% colnames(scaled_beer), is_false())
    expect_that(min(scaled_beer$ibu) < 1, is_true())
    expect_that(max(scaled_beer$ibu) < 4, is_true())
    expect_that(min(scaled_beer$abv) < -2, is_true())
    expect_that(max(scaled_beer$abv) < 5, is_true())
})
print("Success!")

Question 1.4

From our exploratory data visualization, 2 clusters seems like it might be a reasonable number. Use the kmeans function with centers = 2 to perform clustering with this choice of K. Name your model object beer_cluster_k2. Given that the k-means algorithm uses random starting points, set the seed to 1234 so that your result is reproducible.

# set.seed(1234)
# ... <- kmeans(..., centers = 2)
# your code here
fail() # No Answer - remove if you provide an answer
beer_cluster_k2

test_that('Solution is incorrect', {
    expect_equal(class(beer_cluster_k2), 'kmeans')
    expect_equal(round(beer_cluster_k2$tot.withinss), 1110)
    expect_equal(nrow(beer_cluster_k2$centers), 2)
})
print("Success!")
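Once beer_cluster_k2 exists, you can inspect the fitted object directly; these are standard components of the object kmeans returns:

# beer_cluster_k2$centers       # the two cluster centroids
# beer_cluster_k2$tot.withinss  # total within-cluster sum of squares
# beer_cluster_k2$cluster       # cluster assignment for each beer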

Question 1.5

Use broom's augment function to get the cluster assignment for each point in a tidy data frame. Name that data frame tidy_beer_cluster_k2.

# ... <- augment(beer_cluster_k2, scaled_beer)
# your code here
fail() # No Answer - remove if you provide an answer
head(tidy_beer_cluster_k2)

test_that('Solution is incorrect', {
    expect_equal(nrow(tidy_beer_cluster_k2), 1405)
    expect_equal(ncol(tidy_beer_cluster_k2), 3)
    expect_true('abv' %in% colnames(tidy_beer_cluster_k2))
    expect_true('ibu' %in% colnames(tidy_beer_cluster_k2))
    expect_true('.cluster' %in% colnames(tidy_beer_cluster_k2))
})
print("Success!")
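As a quick hedged follow-up once the augmented data frame exists, you could count how many beers landed in each cluster:

# tidy_beer_cluster_k2 %>%
#     count(.cluster)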

Question 1.6

Create a scatter plot of ibu versus abv (using the data in tidy_beer_cluster_k2) where the points are coloured by their cluster assignment. Name the plot object tidy_beer_cluster_k2_plot.

# your code here
fail() # No Answer - remove if you provide an answer
tidy_beer_cluster_k2_plot

test_that('Solution is incorrect', {
    expect_true(as.character(rlang::get_expr(tidy_beer_cluster_k2_plot$mapping$x)) %in% c("ibu", "abv"))
    expect_true(as.character(rlang::get_expr(tidy_beer_cluster_k2_plot$mapping$y)) %in% c("ibu", "abv"))
    expect_true(as.character(rlang::get_expr(tidy_beer_cluster_k2_plot$mapping$colour)) == '.cluster')
    expect_that("GeomPoint" %in% class(tidy_beer_cluster_k2_plot$layers[[1]]$geom), is_true())
    expect_false(tidy_beer_cluster_k2_plot$labels$x %in% c("ibu", "abv"))
    expect_false(tidy_beer_cluster_k2_plot$labels$y %in% c("ibu", "abv"))
})
print("Success!")

Question 1.7

We do not know, however, that two clusters (K = 2) is the best choice for this data set. What can we do to choose the best K?

A. Perform cross-validation for a variety of possible K's and choose the one with the smallest total within sum of squares

B. Perform cross-validation for a variety of possible K's and choose the one with the greatest drop in total within sum of squares

C. Perform clustering for a variety of possible K's and choose the one with the smallest total within sum of squares

D. Perform clustering for a variety of possible K's and choose the one with the greatest drop in total within sum of squares

Assign your answer to an object called answer1.7.

# Assign your answer to an object called: answer1.7
# Make sure the correct answer is an uppercase letter.
# Surround your answer with quotation marks.
# Replace the fail() with your answer.
# your code here
fail() # No Answer - remove if you provide an answer
answer1.7

test_that('Solution is incorrect', {
    expect_equal(digest(answer1.7), 'c1f86f7430df7ddb256980ea6a3b57a4') # we hid the answer to the test here so you can't see it, but we can still run the test
})
print("Success!")

Question 1.71

Use broom's glance function to get the model-level statistics for the clustering we just performed, including total within sum of squares. Name the data frame beer_cluster_k2_model_stats.
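For reference, glance on a kmeans object returns a single row of model-level statistics; a hedged sketch of the call:

# beer_cluster_k2_model_stats <- glance(beer_cluster_k2)
# the resulting columns include totss, tot.withinss, betweenss, and iter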

# your code here
fail() # No Answer - remove if you provide an answer
beer_cluster_k2_model_stats

test_that('Solution is incorrect', {
    expect_equal(nrow(beer_cluster_k2_model_stats), 1)
    expect_equal(ncol(beer_cluster_k2_model_stats), 4)
    expect_true('tot.withinss' %in% colnames(beer_cluster_k2_model_stats))
})
print("Success!")

Question 1.8

Let's now choose the best K for this clustering problem. To do this we need to create a data frame (or a tibble; this time it doesn't matter) with a column named k, where we vary K from 1 to 10. Name this data frame beer_clustering.
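In the commented-scaffold style of the earlier questions, one possible shape (a sketch, not the only valid answer):

# beer_clustering <- tibble(k = 1:10)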

# your code here
fail() # No Answer - remove if you provide an answer
beer_clustering

test_that('Solution is incorrect', {
    expect_equal(nrow(beer_clustering), 10)
    expect_equal(ncol(beer_clustering), 1)
    expect_equal(colnames(beer_clustering), 'k')
})
print("Success!")

Question 1.9

Next we use mutate to create a new column in the beer_clustering data frame named models where we use map to apply the kmeans function to our scaled_beer data set for each of the K's.

# ... <- ... %>%
#     mutate(models = map(k, ~kmeans(..., .x)))
# your code here
fail() # No Answer - remove if you provide an answer
glimpse(beer_clustering)

test_that('Solution is incorrect', {
    expect_equal(nrow(beer_clustering), 10)
    expect_equal(ncol(beer_clustering), 2)
    expect_true('k' %in% colnames(beer_clustering))
    expect_true('models' %in% colnames(beer_clustering))
    expect_equal(class(beer_clustering$models[[1]]), 'kmeans')
})
print("Success!")

Question 2.0

Next we use mutate again to create a new column called model_statistics in the beer_clustering data frame, where we use map to apply the glance function to each of our models (in the models column) to get the model-level statistics. This is where we can get the value for total within sum of squares that we use to choose K.

# ... <- ... %>%
#     mutate(... = map(models, ...))
# your code here
fail() # No Answer - remove if you provide an answer
glimpse(beer_clustering)

test_that('Solution is incorrect', {
    expect_equal(nrow(beer_clustering), 10)
    expect_equal(ncol(beer_clustering), 3)
    expect_true('k' %in% colnames(beer_clustering))
    expect_true('models' %in% colnames(beer_clustering))
    expect_true('model_statistics' %in% colnames(beer_clustering))
    expect_equal(class(beer_clustering$models[[1]]), 'kmeans')
    expect_true('data.frame' %in% class(beer_clustering$model_statistics[[1]]))
})
print("Success!")

Question 2.1

Now we use the unnest function to expand the data frames in the model_statistics column so that we can access the values for total within sum of squares as a column. Name the modified data frame beer_clustering_unnested.

# ... <- ... %>%
#     unnest(model_statistics)
# your code here
fail() # No Answer - remove if you provide an answer
glimpse(beer_clustering_unnested)

test_that('Solution is incorrect', {
    expect_equal(nrow(beer_clustering_unnested), 10)
    expect_equal(ncol(beer_clustering_unnested), 6)
    expect_true('k' %in% colnames(beer_clustering_unnested))
    expect_true('models' %in% colnames(beer_clustering_unnested))
    expect_false('model_statistics' %in% colnames(beer_clustering_unnested))
    expect_equal(class(beer_clustering_unnested$models[[1]]), 'kmeans')
    expect_true('tot.withinss' %in% colnames(beer_clustering_unnested))
})
print("Success!")

Question 2.2

Now that we have the values for total within sum of squares for each model in a column (tot.withinss), we can use it to create a line plot of total within sum of squares versus K (so that we can choose the best K). Total within sum of squares should be on the y-axis and K should be on the x-axis. Create this plot and name the plot object choose_beer_k.
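A hedged sketch of one way to build this plot (the axis labels are suggestions; pick any human-readable wording):

# choose_beer_k <- ggplot(beer_clustering_unnested, aes(x = k, y = tot.withinss)) +
#     geom_line() +
#     xlab("K") +
#     ylab("Total within-cluster sum of squares")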

# your code here
fail() # No Answer - remove if you provide an answer
choose_beer_k

test_that('Solution is incorrect', {
    expect_true(as.character(rlang::get_expr(choose_beer_k$mapping$x)) %in% c("k"))
    expect_true(as.character(rlang::get_expr(choose_beer_k$mapping$y)) %in% c("tot.withinss"))
    expect_that("GeomLine" %in% class(choose_beer_k$layers[[1]]$geom), is_true())
})
print("Success!")

Question 2.3

From the plot above, which K should we choose? Save your answer (as a whole number) to a variable named answer2.3.

# your code here
fail() # No Answer - remove if you provide an answer
answer2.3

test_that('Solution is incorrect', {
    expect_equal(digest(as.numeric(answer2.3)), 'db8e490a925a60e62212cefc7674ca02')
})
print("Success!")

Question 2.4 (optional - not graded)

In your own words, explain why we chose the K we chose above.

YOUR ANSWER HERE

Question 2.5 (optional - not graded)

What can we conclude from our analysis? How many different types of hoppy craft beer are there in this data set using the two variables we have? Do you think our analysis might change if we added additional variables? Why/why not?

YOUR ANSWER HERE

Question 2.6 (optional - not graded)

Visually verify that 2 clusters is the "best" choice for K for this analysis. Do this by plotting the cluster assignments for the points for each K (each should be its own scatter plot).
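One hedged way to approach this: augment each model with the scaled data, unnest, and facet the scatter plot on K (the assignments object name is ours for illustration):

# assignments <- beer_clustering %>%
#     mutate(augmented = map(models, augment, scaled_beer)) %>%
#     unnest(augmented)
# ggplot(assignments, aes(x = abv, y = ibu, colour = .cluster)) +
#     geom_point() +
#     facet_wrap(~k)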

# your code here
fail() # No Answer - remove if you provide an answer