GitHub Repository: UBC-DSCI/dsci-100-assets
Path: blob/master/2021-summer/materials/worksheet_10/worksheet_10.ipynb

Kernel: R

Worksheet 10 - Clustering

Lecture and Tutorial Learning Goals:

After completing this week's lecture and tutorial work, you will be able to:

Describe a case where clustering would be an appropriate tool, and what insight it would bring from the data.
Explain the k-means clustering algorithm.
Interpret the output of a k-means cluster analysis.
Perform k-means clustering in R using the kmeans function.
Visualize the output of k-means clustering in R using a coloured scatter plot.
Identify when it is necessary to scale variables before clustering and do this using R.
Use the elbow method to choose the number of clusters for k-means.
Describe advantages, limitations and assumptions of the k-means clustering algorithm.

In [ ]:

### Run this cell before continuing.
library(tidyverse)
library(forcats)
library(repr)
library(broom)
options(repr.matrix.max.rows = 6)
source('tests_worksheet_10.R')
source("cleanup_worksheet_10.R")

Question 0.0 Multiple Choice:
{points: 1}

In which of the following scenarios would clustering methods likely be appropriate?

A. Identifying sub-groups of houses according to their house type, value, and geographical location

B. Predicting whether a given user will click on an ad on a website

C. Segmenting customers based on their preferences to target advertising

D. Both A. and B.

E. Both A. and C.

Assign your answer to an object called answer0.0. Make sure your answer is an uppercase letter and is surrounded by quotation marks (e.g. "F").

In [ ]:

# your code here
fail() # No Answer - remove if you provide an answer

In [ ]:

test_0.0()

Question 0.1 Multiple Choice:
{points: 1}

Which step in the description of the k-means algorithm below is incorrect?

Choose the number of clusters
Randomly assign each of the points to one of the clusters
Calculate the position for the cluster centre (centroid) for each of the clusters (this is the middle of the points in the cluster, as measured by straight-line distance)
Re-assign each of the points to the cluster whose centroid is furthest from that point
Repeat steps 1 - 2 until the cluster centroids don't change very much between iterations

Assign your answer to an object called answer0.1. Your answer should be a single numerical character surrounded by quotes.

In [ ]:

# your code here
fail() # No Answer - remove if you provide an answer

In [ ]:

test_0.1()

Hoppy Craft Beer

Craft beer is a strong market in Canada and the US, and is expanding to other countries as well. If you wanted to get into the craft beer brewing market, you might want to better understand the product landscape. One popular craft beer product is hopped craft beer. Breweries create/label many different kinds of hopped craft beer, but how many different kinds of hopped craft beer are there really when you look at the chemical properties instead of the human labels?

We will start to explore this question using a craft beer data set from Kaggle. In this data set, we will use the alcohol content by volume (abv column) and the International Bittering Units (ibu column) as variables to try to cluster the beers. The abv variable ranges from 0 (indicating no alcohol) to 1 (pure alcohol), and the ibu variable quantifies the bitterness of the beer (higher values indicate higher bitterness).

Question 1.0
{points: 1}

Read in the beers.csv data using read_csv() and assign it to an object called beer. The data is located within the worksheet_10/data/ folder.

Assign your dataframe answer to an object called beer.

In [ ]:

# your code here
fail() # No Answer - remove if you provide an answer
beer

In [ ]:

test_1.0()

Question 1.1
{points: 1}

Let's start by visualizing the variables we are going to use in our cluster analysis as a scatter plot. Put ibu on the horizontal axis, and abv on the vertical axis. Name the plot object beer_eda.

Remember to follow the best visualization practices, including adding human-readable labels to your plot.

In [ ]:

# your code here
fail() # No Answer - remove if you provide an answer
beer_eda

In [ ]:

test_1.1()

Question 1.2
{points: 1}

We need to clean this data a bit. Specifically, we need to remove the rows where ibu is NA, and select only the columns we are interested in clustering, which are ibu and abv.

Assign your answer to an object named clean_beer.

In [ ]:

# your code here
fail() # No Answer - remove if you provide an answer
clean_beer

In [ ]:

test_1.2()

Question 1.3.1 Multiple Choice:
{points: 1}

Why do we need to scale the variables when using k-means clustering?

A. k-means uses the Euclidean distance to compute how similar data points are to each cluster center

B. k-means is an iterative algorithm

C. Some variables might be more important for prediction than others

D. To make sure their mean is 0

Assign your answer to an object named answer1.3.1. Make sure your answer is an uppercase letter and is surrounded by quotation marks (e.g. "F").

In [ ]:

# your code here
fail() # No Answer - remove if you provide an answer

In [ ]:

test_1.3.1()

Question 1.3.2
{points: 1}

Let's do that scaling now. Recall that we used a recipe for scaling when doing classification and regression. This is because we needed to be able to split train and test data, compute a standardization on just training data, and apply the standardization to both train and test data.

But in clustering, there is no train/test split. So let's use the much simpler scale function in R. scale takes in a column of a dataframe and outputs the standardized version of it. We can therefore apply scale to all variables in the cleaned data frame using the map_df function.

Note: you could still use a recipe to do this, using prep/bake appropriately. But scale is much simpler.
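
To illustrate the idea of standardizing every column, here is a minimal sketch on a small made-up data frame (the `toy` data and its column names are hypothetical, not part of the beer data). Wrapping `scale` in `as.numeric` is an assumption added here so that each column comes back as a plain numeric vector rather than a one-column matrix; the worksheet scaffolding may not need it.

```r
library(tibble)
library(purrr)

# Hypothetical toy data: two columns measured on very different scales
toy <- tibble(height_cm = c(150, 160, 170, 180, 190),
              weight_kg = c(50, 60, 70, 80, 90))

# Apply scale() to each column; as.numeric() drops the matrix structure
# that scale() returns, leaving one numeric vector per column
toy_scaled <- map_df(toy, ~ as.numeric(scale(.x)))

toy_scaled
```

After standardizing, each column should have a mean of (approximately) 0 and a standard deviation of 1, so neither variable dominates the distance calculation simply because of its units.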

Assign your answer to an object named scaled_beer. Use the scaffolding provided.

In [ ]:

# ... <- ... %>% 
#    map_df(...)

# your code here
fail() # No Answer - remove if you provide an answer
scaled_beer

In [ ]:

test_1.3.2()

Question 1.4
{points: 1}

From our exploratory data visualization, 2 seems like a reasonable number of clusters. Use the kmeans function with centers = 2 to perform clustering with this choice of $k$.

Assign your model to an object named beer_cluster_k2. Note that since k-means uses a random initialization, we need to set the seed again; don't change the value!
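
As a small illustration of why the seed matters, the sketch below fits k-means twice on made-up data (the `toy` data frame and the seed value 2021 are arbitrary choices for this example only). Resetting the seed before each call should give the same random starting point, and therefore the same cluster assignments.

```r
# Hypothetical toy data: two loosely separated groups of points
toy <- data.frame(x = c(1.0, 1.2, 0.8, 5.0, 5.2, 4.8),
                  y = c(1.0, 0.9, 1.1, 5.0, 5.1, 4.9))

set.seed(2021)  # arbitrary seed, used only in this illustration
fit_a <- kmeans(toy, centers = 2)

set.seed(2021)  # same seed again, so the random initialization repeats
fit_b <- kmeans(toy, centers = 2)

# Compare the cluster assignments from the two fits
identical(fit_a$cluster, fit_b$cluster)
```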

In [ ]:

# DON'T CHANGE THE SEED VALUE!
set.seed(1234)

# ... <- kmeans(..., centers = 2)
# your code here
fail() # No Answer - remove if you provide an answer
beer_cluster_k2

In [ ]:

test_1.4()

Question 1.5
{points: 1}

Use the augment function from the broom package to get the cluster assignment for each point in the scaled_beer data frame.

Assign your answer to an object named tidy_beer_cluster_k2.
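
For reference, here is a rough sketch of what `augment` returns for a k-means fit, using made-up data rather than the beer data (the `toy` data frame, `toy_fit`, and the seed are hypothetical). The result should be the original data with an extra `.cluster` column holding each point's assigned cluster.

```r
library(broom)

# Hypothetical toy data: two loosely separated groups of points
toy <- data.frame(x = c(1.0, 1.2, 0.8, 5.0, 5.2, 4.8),
                  y = c(1.0, 0.9, 1.1, 5.0, 5.1, 4.9))

set.seed(2021)  # arbitrary seed for this illustration
toy_fit <- kmeans(toy, centers = 2)

# augment() takes the fitted model and the data it was fitted on,
# and adds the cluster assignment as a .cluster column
toy_augmented <- augment(toy_fit, toy)
toy_augmented
```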

In [ ]:

# ... <- augment(..., ...)
# your code here
fail() # No Answer - remove if you provide an answer
tidy_beer_cluster_k2

In [ ]:

test_1.5()

Question 1.6
{points: 1}

Create a scatter plot of abv on the y-axis versus ibu on the x-axis (using the data in tidy_beer_cluster_k2) where the points are labelled by their cluster assignment. Name the plot object tidy_beer_cluster_k2_plot.

In [ ]:

# your code here
fail() # No Answer - remove if you provide an answer
tidy_beer_cluster_k2_plot

In [ ]:

test_1.6()

Question 1.7.1 Multiple Choice:
{points: 1}

We do not know, however, that two clusters ($k$ = 2) is the best choice for this data set. What can we do to choose the best $k$?

A. Perform cross-validation for a variety of possible values of $k$. Choose the one where the within-cluster sum of squares distance starts to decrease less.

B. Perform cross-validation for a variety of possible values of $k$. Choose the one where the within-cluster sum of squares distance starts to decrease more.

C. Perform clustering for a variety of possible values of $k$. Choose the one where the within-cluster sum of squares distance starts to decrease less.

D. Perform clustering for a variety of possible values of $k$. Choose the one where the within-cluster sum of squares distance starts to decrease more.

Assign your answer to an object called answer1.7.1. Make sure your answer is an uppercase letter and is surrounded by quotation marks (e.g. "F").

In [ ]:

# your code here
fail() # No Answer - remove if you provide an answer

In [ ]:

test_1.7.1()

Question 1.7.2
{points: 1}

Use the glance function from the broom package to get the model-level statistics for the clustering we just performed, including total within-cluster sum of squares.

Assign your answer to an object named beer_cluster_k2_model_stats.
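
As a sketch of what to expect from `glance`, the example below uses made-up data (the `toy` data frame, `toy_fit`, and the seed are hypothetical). For a k-means fit, `glance` returns a one-row data frame of model-level statistics, including `totss`, `tot.withinss`, `betweenss`, and `iter`.

```r
library(broom)

# Hypothetical toy data: two loosely separated groups of points
toy <- data.frame(x = c(1.0, 1.2, 0.8, 5.0, 5.2, 4.8),
                  y = c(1.0, 0.9, 1.1, 5.0, 5.1, 4.9))

set.seed(2021)  # arbitrary seed for this illustration
toy_fit <- kmeans(toy, centers = 2)

# One row of summary statistics for the whole model
toy_stats <- glance(toy_fit)
toy_stats
```

The total sum of squares splits into a within-cluster part and a between-cluster part, so `totss` should equal `tot.withinss + betweenss` up to rounding.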

In [ ]:

# your code here
fail() # No Answer - remove if you provide an answer
beer_cluster_k2_model_stats

In [ ]:

test_1.7.2()

Question 1.8
{points: 1}

Let's now choose the best $k$ for this clustering problem. To do this we need to create a tibble with a column named k, where k has values 1 to 10.

Assign your answer to an object named beer_ks.

In [ ]:

# your code here
fail() # No Answer - remove if you provide an answer
beer_ks

In [ ]:

test_1.8()

Question 1.9
{points: 1}

Next we use mutate to create a new column named models in beer_ks, where we use map to apply the kmeans function to our scaled_beer data set for each value of $k$.

This is a more complicated use of the map function than we have seen previously in the course, because we need to iterate over the different values of $k$, which is the second argument to the kmeans function. In the past, we have used map only to iterate over values of the first argument of a function. Since that is the default, we could simply write map(data_frame, function_name). This won't work here; we need to provide our data frame as the first argument to the kmeans function. You might want to refer back to the section of the textbook that explains this before completing this question: K-means in R

This will give us a data frame with two columns, the first being k, which holds the values of $k$ we mapped (i.e., iterated) over. The second will be models, which holds the k-means model fit for each value of $k$ we mapped over.

This second column is a new type of column that we have not yet encountered in this course. It is called a list column. It can contain more complex objects, like models and even data frames (as we will see in a later question). In Jupyter it is easier to preview and understand this more complex data frame using the print function as opposed to calling the data frame itself as we usually do. This is a current limitation of Jupyter's rendering of R's output and will hopefully be fixed in the future.
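
To see the pattern of iterating over a second argument in isolation, here is a small sketch that deliberately uses `round` instead of `kmeans` (the `digits_tbl` tibble and its column names are hypothetical). The `~` creates a small anonymous function, and `.x` stands for the current value being iterated over, which we place in the second argument position.

```r
library(tibble)
library(dplyr)
library(purrr)

# Iterate over the *second* argument of round() (the number of digits),
# keeping the first argument (pi) fixed
digits_tbl <- tibble(digits = 1:3) %>%
    mutate(rounded = map(digits, ~ round(pi, .x)))

# rounded is a list column: each entry is its own R object
print(digits_tbl)
```

Because `map` always returns a list, the new column is a list column even though each entry here is just a single number.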

Assign your answer to an object named beer_clustering.

In [ ]:

set.seed(1234) # DO NOT REMOVE
# ... <- ... %>%
    # mutate(models = map(..., ~kmeans(scaled_beer, .x)))

# your code here
fail() # No Answer - remove if you provide an answer
print(beer_clustering)

In [ ]:

test_1.9()

Question 2.0
{points: 1}

Next we use mutate again to create a new column called model_statistics where we use map to apply the glance function to each of our models (in the models column) to get the model-level statistics (this is where we can get the value for total within sum of squares that we use to choose $k$).

Here, because we are iterating over the first argument to the glance function (which is the models column), we can use the simpler syntax for map as we did earlier in the course.

Assign your answer to an object named beer_model_stats.

In [ ]:

# ... <- ... %>% 
    # mutate(... = map(models, ...))

# your code here
fail() # No Answer - remove if you provide an answer
print(beer_model_stats)

In [ ]:

test_2.0()

Here when we create our third column, called model_statistics, we can see it is another list column! This time it contains data frames instead of models! Run the cell below to see how you can look at the data frame that is stored as the first element of the model_statistics column (model where we used $k$ = 1):

In [ ]:

beer_model_stats %>% 
    slice(1) %>% 
    pull(model_statistics)

Question 2.1
{points: 1}

Now we use the unnest function to expand the data frames in the model_statistics column so that we can access the values for total within sum of squares as a column.

Assign your answer to an object named beer_clustering_unnested.
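
As a rough illustration of what `unnest` does to a list column of data frames, consider this made-up example (the `nested` tibble and its column names are hypothetical). Each small data frame in the list column is expanded so that its columns become ordinary columns of the outer data frame.

```r
library(tibble)
library(dplyr)
library(tidyr)

# Hypothetical tibble with a list column holding one-row data frames
nested <- tibble(
    group = c("a", "b"),
    stats = list(tibble(mean = 1.5, n = 2),
                 tibble(mean = 4.0, n = 3))
)

# Expand the data frames stored in the stats column
unnested <- nested %>% unnest(stats)
print(unnested)
```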

In [ ]:

# ... <- ... %>% unnest(model_statistics)
# your code here
fail() # No Answer - remove if you provide an answer
print(beer_clustering_unnested)

In [ ]:

test_2.1()

Question 2.2
{points: 1}

We now have the values for total within-cluster sum of squares for each model in a column (tot.withinss). Let's use it to create a line plot with points of total within-cluster sum of squares versus k, so that we can choose the best number of clusters to use.

Assign your plot to an object called choose_beer_k. Total within-cluster sum of squares should be on the y-axis and k should be on the x-axis. Remember to follow the best visualization practices, including adding human-readable labels to your plot.

In [ ]:

options(repr.plot.width = 8, repr.plot.height = 7)

# your code here
fail() # No Answer - remove if you provide an answer
choose_beer_k

In [ ]:

test_2.2()

Question 2.3
{points: 1}

From the plot above, which $k$ should we choose?

Assign your answer to an object called answer2.3. Make sure your answer is a single numerical character surrounded by quotation marks.

In [ ]:

# your code here
fail() # No Answer - remove if you provide an answer

In [ ]:

test_2.3()

Question 2.4
{points: 1}

Why did we choose the $k$ we chose above?

A. It had the greatest total within-cluster sum of squares

B. It had the smallest total within-cluster sum of squares

C. Increasing $k$ further than this only decreased the total within-cluster sum of squares a small amount

D. Increasing $k$ further than this only increased the total within-cluster sum of squares a small amount

Assign your answer to an object called answer2.4. Make sure your answer is an uppercase letter and is surrounded by quotation marks (e.g. "F").

In [ ]:

# your code here
fail() # No Answer - remove if you provide an answer

In [ ]:

test_2.4()

Question 2.5 Multiple Choice:
{points: 1}

What can we conclude from our analysis? How many different types of hoppy craft beer are there in this data set using the two variables we have?

A. 1

B. 2

C. 3

D. 4

Assign your answer to an object called answer2.5. Make sure your answer is an uppercase letter and is surrounded by quotation marks (e.g. "F").

In [ ]:

# your code here
fail() # No Answer - remove if you provide an answer

In [ ]:

test_2.5()

Question 2.6 True or false:
{points: 1}

Our analysis might change if we added additional variables, true or false?

Assign your answer to an object called answer2.6. Make sure your answer is written in lowercase and is surrounded by quotation marks (e.g. "true" or "false").

In [ ]:

# your code here
fail() # No Answer - remove if you provide an answer

In [ ]:

test_2.6()

In [ ]:

source("cleanup_worksheet_10.R")
