Worksheet 10 - Clustering
Lecture and Tutorial Learning Goals:
After completing this week's lecture and tutorial work, you will be able to:
Describe a case where clustering would be an appropriate tool, and what insight it would bring from the data.
Explain the k-means clustering algorithm.
Interpret the output of a k-means cluster analysis.
Perform k-means clustering in R using kmeans.
Visualize the output of k-means clustering in R using a coloured scatter plot.
Identify when it is necessary to scale variables before clustering, and do this using R.
Use the elbow method to choose the number of clusters for k-means.
Describe the advantages, limitations and assumptions of the k-means clustering algorithm.
Question 0.0 Multiple Choice:
{points: 1}
In which of the following scenarios would clustering methods likely be appropriate?
A. Identifying sub-groups of houses according to their house type, value, and geographical location
B. Predicting whether a given user will click on an ad on a website
C. Segmenting customers based on their preferences to target advertising
D. Both A. and B.
E. Both A. and C.
Assign your answer to an object called answer0.0. Your answer should be a single character surrounded by quotes.
Question 0.1 Multiple Choice:
{points: 1}
Which step in the description of the k-means algorithm below is incorrect?
Choose the number of clusters
Randomly assign each of the points to one of the clusters
Calculate the position for the cluster centre (centroid) for each of the clusters (this is the middle of the points in the cluster, as measured by straight-line distance)
Re-assign each of the points to the cluster whose centroid is furthest from that point
Repeat steps 1 - 2 until the cluster centroids don't change very much between iterations
Assign your answer to an object called answer0.1. Your answer should be a single character surrounded by quotes.
Hoppy Craft Beer
Craft beer is a strong market in Canada and the US, and is expanding to other countries as well. If you wanted to get into the craft beer brewing market, you might want to better understand the product landscape. One popular craft beer product is hopped craft beer. Breweries create/label many different kinds of hopped craft beer, but how many different kinds of hopped craft beer are there really when you look at the chemical properties instead of the human labels?
We will start to explore this question using a craft beer data set from Kaggle. In this data set, we will use the alcohol by volume (abv column) and the International Bitterness Units (ibu column) as variables to try to cluster the beers.
Question 1.0
{points: 1}
Read in the beers.csv data and assign it to an object called beer. The data is located within the worksheet_10/data/ folder.
Assign your data frame to an object called beer.
Question 1.1
{points: 1}
Let's start by visualizing the variables we are going to use in our cluster analysis as a scatter plot. Name the plot object beer_eda.
Remember to follow all the visualization best practices when making this plot, including human-readable labels.
Question 1.2
{points: 1}
We need to clean this data a bit. Specifically, we need to remove the rows where ibu is NA and select only the columns we are interested in clustering, which are ibu and abv.
Assign your answer to an object named clean_beer.
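As a minimal sketch of this kind of cleaning, the example below uses a small made-up data frame (toy_beer and its values are hypothetical, not the Kaggle data). It also shows why a missing value has to be detected with is.na rather than with ==. The same idea carries over to dplyr's filter and select.

```r
# Toy data frame standing in for the beer data (values are made up).
toy_beer <- data.frame(
  name = c("a", "b", "c", "d"),
  abv  = c(0.050, 0.066, 0.071, 0.090),
  ibu  = c(20, NA, 60, NA)
)

# Comparing with == never yields TRUE for a missing value: the result is NA.
NA == NA

# is.na() is the reliable way to detect missing values.
is.na(toy_beer$ibu)

# Keep rows with a recorded ibu, and only the two clustering columns.
toy_clean <- toy_beer[!is.na(toy_beer$ibu), c("ibu", "abv")]
toy_clean
```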
Question 1.3.1
{points: 1}
Why do we need to scale the variables when using k-means clustering?
A. k-means uses the Euclidean distance to compute how similar data points are to each cluster centre
B. k-means is an iterative algorithm
C. Some variables might be more important for prediction than others
D. To make sure their mean is 0
Assign your answer to an object named answer1.3.1. Make sure your answer is a single character surrounded by quotes.
Question 1.3.2
{points: 1}
Let's do that now using the map_df function to apply the scaling to all variables at once. Although centring is not necessary (and not harmful either), let's centre the variables as well.
Assign your answer to an object named scaled_beer. Use the scaffolding provided. Note that centring is included in the default behaviour of R's scale function.
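To see what the default behaviour of scale does, here is a small sketch on made-up numbers (the toy data frame is hypothetical). It calls base R's scale on the whole data frame rather than using map_df, so it illustrates the effect rather than the worksheet's scaffolding: after scaling, each column has mean 0 and standard deviation 1.

```r
# Made-up values on very different scales, loosely mimicking ibu and abv.
toy <- data.frame(
  ibu = c(10, 20, 35, 60, 90),
  abv = c(0.040, 0.050, 0.055, 0.070, 0.095)
)

# scale() centres each column (subtracts its mean) and then divides by its
# standard deviation; both steps are on by default.
toy_scaled <- as.data.frame(scale(toy))

colMeans(toy_scaled)       # numerically zero for every column
apply(toy_scaled, 2, sd)   # one for every column
```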
Question 1.4
{points: 1}
From our exploratory data visualization, 2 seems like a reasonable number of clusters. Use the kmeans function with centers = 2 to perform clustering with this choice of K.
Assign your model to an object named beer_cluster_k2. Note that since k-means uses a random initialization, we need to set the seed again; don't change the value!
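The following sketch shows the general shape of a kmeans call on simulated data. The data frame toy, the object toy_km and the seed value are all assumptions made for illustration; the seed in particular is arbitrary and is not the value used in the worksheet.

```r
# Arbitrary seed for this sketch only (not the worksheet's seed).
set.seed(1234)

# Two simulated groups of 20 points each, centred near (0, 0) and (5, 5).
toy <- data.frame(
  x = c(rnorm(20, mean = 0), rnorm(20, mean = 5)),
  y = c(rnorm(20, mean = 0), rnorm(20, mean = 5))
)

# Fit k-means with two clusters; the starting centres are chosen at random,
# which is why the seed is set beforehand.
toy_km <- kmeans(toy, centers = 2)

toy_km$centers         # one row per cluster centre
table(toy_km$cluster)  # number of points assigned to each cluster
toy_km$tot.withinss    # total within-cluster sum of squares
```

Re-running the block from the set.seed line reproduces the same clustering, which is what makes the result checkable.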
Question 1.5
{points: 1}
Use the augment function from the broom package to get the cluster assignment for each point in the scaled_beer data frame.
Assign your answer to an object named tidy_beer_cluster_k2.
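As a rough illustration of what augment returns for a k-means model, the sketch below fits a model to simulated data (toy, toy_km and the seed are hypothetical) and attaches the assignments. For kmeans objects, broom adds a .cluster column to the data passed in.

```r
library(broom)

set.seed(1234)  # arbitrary seed for this sketch
toy <- data.frame(
  x = c(rnorm(20, mean = 0), rnorm(20, mean = 5)),
  y = c(rnorm(20, mean = 0), rnorm(20, mean = 5))
)
toy_km <- kmeans(toy, centers = 2)

# augment() takes the model and the data it was fit on, and returns the
# data with one extra column, .cluster, holding each point's assignment.
toy_augmented <- augment(toy_km, toy)
head(toy_augmented)
```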
Question 1.6
{points: 1}
Create a scatter plot of abv on the y-axis versus ibu on the x-axis (using the data in tidy_beer_cluster_k2) where the points are coloured by their cluster assignment. Name the plot object tidy_beer_cluster_k2_plot.
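One possible way to colour points by cluster with ggplot2 is sketched below on simulated data. The data frame toy, its columns, the axis labels and the seed are assumptions for illustration, not the beer data.

```r
library(ggplot2)

set.seed(1234)  # arbitrary seed for this sketch
toy <- data.frame(
  x = c(rnorm(20, mean = 0), rnorm(20, mean = 5)),
  y = c(rnorm(20, mean = 0), rnorm(20, mean = 5))
)

# Store the assignment as a factor so ggplot2 uses a discrete colour scale.
toy$cluster <- factor(kmeans(toy, centers = 2)$cluster)

toy_plot <- ggplot(toy, aes(x = x, y = y, colour = cluster)) +
  geom_point() +
  labs(x = "Simulated feature x", y = "Simulated feature y", colour = "Cluster")
toy_plot
```

Converting the cluster number to a factor matters here: a numeric column would be mapped to a continuous colour gradient instead of distinct colours.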
Question 1.7.1 Multiple Choice:
{points: 1}
We do not know, however, that two clusters (K = 2) is the best choice for this data set. What can we do to choose the best K?
A. Perform cross-validation for a variety of possible Ks. Choose the one where within-cluster sum of squares distance starts to decrease less.
B. Perform cross-validation for a variety of possible Ks. Choose the one where the within-cluster sum of squares distance starts to decrease more.
C. Perform clustering for a variety of possible Ks. Choose the one where within-cluster sum of squares distance starts to decrease less.
D. Perform clustering for a variety of possible Ks. Choose the one where the within-cluster sum of squares distance starts to decrease more.
Assign your answer to an object called answer1.7.1. Make sure it is a single character surrounded by quotes.
Question 1.7.2
{points: 1}
Use the glance function from the broom package to get the model-level statistics for the clustering we just performed, including the total within-cluster sum of squares.
Assign your answer to an object named beer_cluster_k2_model_stats.
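A short sketch of glance on a k-means model follows, again using simulated data (toy, toy_km and the seed are hypothetical). For kmeans objects, glance returns a one-row summary whose columns include tot.withinss.

```r
library(broom)

set.seed(1234)  # arbitrary seed for this sketch
toy <- data.frame(
  x = c(rnorm(20, mean = 0), rnorm(20, mean = 5)),
  y = c(rnorm(20, mean = 0), rnorm(20, mean = 5))
)
toy_km <- kmeans(toy, centers = 2)

# glance() summarises the whole model in a single row.
toy_stats <- glance(toy_km)
toy_stats

# The same number is stored on the model object itself.
toy_km$tot.withinss
```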
Question 1.8
{points: 1}
Let's now choose the best K for this clustering problem. To do this we need to create a data frame (or a tibble, this time it doesn't matter) with a column named k, where we vary K from 1 to 10.
Assign your answer to an object named beer_clustering.
Question 1.9
{points: 1}
Next we use mutate to create a new column in the beer_clustering data frame named models, where we use map to apply the kmeans function to our scaled_beer data set for each value of K.
Question 2.0
{points: 1}
Next we use mutate again to create a new column called model_statistics in the beer_clustering data frame, where we use map to apply the glance function to each of our models (in the models column) to get the model-level statistics (this is where we can get the value for the total within-cluster sum of squares that we use to choose K).
Question 2.1
{points: 1}
Now we use the unnest function to expand the data frames in the model_statistics column so that we can access the values for the total within-cluster sum of squares as a column.
Assign your answer to an object named beer_clustering_unnested.
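The map, glance and unnest steps described over the last few questions can be sketched end to end on simulated data. Everything named toy or toy_clustering below is hypothetical, the seed is arbitrary, and K only runs from 1 to 5 to keep the output short.

```r
library(dplyr)
library(purrr)
library(tidyr)
library(broom)

set.seed(1234)  # arbitrary seed for this sketch
toy <- data.frame(
  x = c(rnorm(20, mean = 0), rnorm(20, mean = 5)),
  y = c(rnorm(20, mean = 0), rnorm(20, mean = 5))
)

toy_clustering <- tibble(k = 1:5) %>%
  # One fitted model per value of k, stored in a list-column.
  mutate(models = map(k, ~ kmeans(toy, centers = .x))) %>%
  # One single-row summary per model, also a list-column.
  mutate(model_statistics = map(models, glance)) %>%
  # Expand the summaries so tot.withinss becomes an ordinary column.
  unnest(model_statistics)

select(toy_clustering, k, tot.withinss)
```

Keeping the fitted models in a list-column means each row carries both the value of K and everything computed from it, so nothing has to be matched up by position later.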
Question 2.2
{points: 1}
We now have the values for total within-cluster sum of squares for each model in a column (tot.withinss). Let's use it to create a line plot of total within-cluster sum of squares versus K, so that we can choose the best number of clusters to use.
Assign your plot to an object called choose_beer_k. Total within-cluster sum of squares should be on the y-axis and K should be on the x-axis. Remember to do all the steps needed to make an effective visualization.
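For reference, an elbow plot of this kind might be built as in the sketch below. It uses simulated data and computes the sums of squares with sapply rather than the list-column approach, purely to keep the example short. The names toy, wss and elbow_plot, the seed, and the choice of nstart = 10 are assumptions for illustration.

```r
library(ggplot2)

set.seed(1234)  # arbitrary seed for this sketch
toy <- data.frame(
  x = c(rnorm(20, mean = 0), rnorm(20, mean = 5)),
  y = c(rnorm(20, mean = 0), rnorm(20, mean = 5))
)

# Total within-cluster sum of squares for K = 1, ..., 6.
# nstart = 10 tries several random starts and keeps the best fit.
wss <- data.frame(
  k = 1:6,
  tot_withinss = sapply(1:6, function(k) {
    kmeans(toy, centers = k, nstart = 10)$tot.withinss
  })
)

elbow_plot <- ggplot(wss, aes(x = k, y = tot_withinss)) +
  geom_line() +
  geom_point() +
  scale_x_continuous(breaks = wss$k) +
  labs(x = "Number of clusters (K)",
       y = "Total within-cluster sum of squares")
elbow_plot
```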
Question 2.3
{points: 1}
From the plot above, which K should we choose?
Assign your answer to an object called answer2.3. Make sure your answer is a single character surrounded by quotation marks.
Question 2.4
(optional - not graded)
In your own words, explain why we chose the K we chose above.
YOUR ANSWER HERE
Question 2.5
(optional - not graded)
What can we conclude from our analysis? How many different types of hoppy craft beer are there in this data set using the two variables we have? Do you think our analysis might change if we added additional variables? Why/why not?
YOUR ANSWER HERE
Question 2.6
(optional - not graded)
Visually verify that 2 clusters is the "best" choice for K for this analysis. Do this by plotting the cluster assignments for the points for each K (each should be its own scatter plot).
YOUR ANSWER HERE