GitHub Repository: UBC-DSCI/dsci-100-assets
Path: blob/master/2022-spring/materials/tutorial_clustering/tutorial_clustering.ipynb
Kernel: R

Tutorial 10 - Clustering

Lecture and Tutorial Learning Goals:

After completing this week's lecture and tutorial work, you will be able to:

  • Describe a case where clustering would be an appropriate tool, and what insight it would bring from the data.

  • Explain the k-means clustering algorithm.

  • Interpret the output of a k-means cluster analysis.

  • Perform k-means clustering in R using the kmeans function.

  • Visualize the output of k-means clustering in R using a coloured scatter plot.

  • Identify when it is necessary to scale variables before clustering, and do this using R.

  • Use the elbow method to choose the number of clusters for k-means.

  • Describe the advantages, limitations, and assumptions of the k-means clustering algorithm.

### Run this cell before continuing.
library(tidyverse)
library(repr)
library(GGally)
library(broom)
options(repr.matrix.max.rows = 6)
source('tests.R')
source("cleanup.R")

1. Pokemon

We will be working with the Pokemon dataset from Kaggle, which can be found here. This dataset compiles statistics on 721 Pokemon. The information in this dataset includes Pokemon name, type, health points, attack strength, defensive strength, speed points, etc. These are values that apply to a Pokemon's abilities (higher values are better). We are interested in seeing whether there are any sub-groups/clusters of Pokemon based on these statistics, and if so, how many sub-groups/clusters there are.

Source: https://media.giphy.com/media/3oEduV4SOS9mmmIOkw/giphy.gif

Question 1.0
{points: 1}

Use read_csv to load pokemon.csv from the data/ folder.

Assign your answer to an object called pm_data.

# your code here
fail() # No Answer - remove if you provide an answer
pm_data
test_1.0()

Question 1.1
{points: 1}

Create a matrix of plots using ggpairs, choosing columns 5 to 11 (or equivalently, columns Total to Speed) from pm_data. First use the select function to extract columns "Total":"Speed", and then pass the resulting dataframe to ggpairs to plot.

Assign your answer to an object called pm_pairs.
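Hint: if you are unsure of the general pattern, here is a minimal sketch; some_data, some_pairs, and the column range are placeholders, not the actual objects you need.

# sketch only: select a range of columns, then pass the result to ggpairs
some_pairs <- some_data %>%
    select("ColA":"ColZ") %>%   # hypothetical column range
    ggpairs()
some_pairs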

# your code here
fail() # No Answer - remove if you provide an answer
pm_pairs
test_1.1()

Question 1.2
{points: 1}

Select the columns Speed and Defense, creating a new dataframe with only those columns.

Assign your answer to an object named km_data.

# your code here
fail() # No Answer - remove if you provide an answer
km_data
test_1.2()

Question 1.3
{points: 1}

Make a scatterplot to visualize the relationship between Speed and Defense of the Pokemon. Put the Speed variable on the x-axis, and the Defense variable on the y-axis.

Assign your plot to an object called pm_scatter. Don't forget to do everything needed to make an effective visualization.
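Hint: a minimal sketch of the basic scatterplot pattern, using placeholder names (some_data, VarOnX, and VarOnY are not the actual objects or columns).

# sketch only: basic scatterplot with human-readable axis labels
some_scatter <- ggplot(some_data, aes(x = VarOnX, y = VarOnY)) +
    geom_point(alpha = 0.5) +
    labs(x = "Variable on the x-axis", y = "Variable on the y-axis")
some_scatter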

# your code here
fail() # No Answer - remove if you provide an answer
pm_scatter
test_1.3()

Question 1.4.1
{points: 3}

We are going to cluster the Pokemon based on their Speed and Defense. Will it matter much for our clustering if we scale our variables? Is there any argument against scaling here?

DOUBLE CLICK TO EDIT THIS CELL AND REPLACE THIS TEXT WITH YOUR ANSWER.

Question 1.4.2
{points: 1}

Now, let's use the kmeans function to cluster the Pokemon based on their Speed and Defense variables. For this question, use k = 4. As good practice, let's standardize the data first using scale. Name the standardized data scaled_km_data.

Assign your answer to an object called pokemon_clusters.

Note: We set the random seed here because kmeans initializes observations to random clusters.
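Hint: a minimal sketch of the scale-then-cluster pattern; all object names here (some_data, scaled_data, some_clusters) are placeholders.

# sketch only: standardize the variables, then run k-means with a chosen number of centres
scaled_data <- as_tibble(scale(some_data))          # scale() returns a matrix, so convert back to a tibble
some_clusters <- kmeans(scaled_data, centers = 4)   # centers is the number of clusters, k
some_clusters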

#DON'T CHANGE THE SEED VALUE BELOW!
set.seed(2019)
# your code here
fail() # No Answer - remove if you provide an answer
pokemon_clusters
test_1.4.2()

Question 1.5
{points: 1}

Let's visualize the clusters we built in pokemon_clusters. For this we can use the broom package.

"The broom package takes the messy output of built-in functions in R, such as lm, nls, or t.test, and turns them into tidy data frames." - Broom Package

Your tasks:

  1. Use the augment function to create a data frame with the cluster assignments for each data point from k-means (it should have the columns Speed, Defense, and .cluster); a sketch of this pattern is given below.

  2. Create a scatter plot of Speed (x-axis) vs Defense (y-axis) with the points coloured by their cluster assignment.

Name this plot answer1.5.
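Hint: a minimal sketch of the augment-then-plot pattern; the object and column names are placeholders, not the actual answer.

# sketch only: attach a .cluster label to each row, then colour the points by it
clustered_data <- augment(some_kmeans_object, some_original_data)
cluster_plot <- ggplot(clustered_data, aes(x = VarOnX, y = VarOnY, colour = .cluster)) +
    geom_point() +
    labs(colour = "Cluster")
cluster_plot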

# your code here
fail() # No Answer - remove if you provide an answer
answer1.5
test_1.5()

Question 1.6
{points: 3}

Below you can see multiple initializations of k-means with different seeds for K = 4. Can you explain what is happening and how we can mitigate this in the kmeans function?

DOUBLE CLICK TO EDIT THIS CELL AND REPLACE THIS TEXT WITH YOUR ANSWER.

Question 1.7
{points: 1}

We know that choosing k is an important step of the process. We can do this by examining how the total within-cluster sum of squares changes as we vary k, using a plot that we call an elbow plot.

For this exercise, you will calculate the total within-cluster sum of squares for k = 1 through k = 10 (a sketch of the overall pipeline follows the glance example below):

  1. following good practice, make sure you are using the standardized data (scaled_km_data)

  2. create a tibble with the k values

  3. create a new column poke_clusts by applying kmeans to each value of k (set nstart to be 10)

  4. create a new column glanced by applying glance to each of the results

  5. remove the poke_clusts column

  6. unnest the results of glance

Assign your answer to a tibble object named elbow_stats. It should have the columns k, totss, tot.withinss, betweenss, and iter.

Remember, to access the total within-cluster sum of squares, you can use the glance function, also from the broom package:

glance(pokemon_clusters)
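Hint: a minimal sketch of one way to express the numbered steps above using map from purrr; scaled_data and some_elbow_stats are placeholder names.

# sketch only: run kmeans for each k, summarise each fit with glance, then unnest
some_elbow_stats <- tibble(k = 1:10) %>%
    mutate(poke_clusts = map(k, ~ kmeans(scaled_data, centers = .x, nstart = 10)),
           glanced = map(poke_clusts, glance)) %>%
    select(-poke_clusts) %>%
    unnest(glanced)
some_elbow_stats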
set.seed(2020) # DO NOT REMOVE
# your code here
fail() # No Answer - remove if you provide an answer
elbow_stats
test_1.7()

Question 1.8
{points: 1}

Create the elbow plot. Put the within-cluster sum of squares on the y-axis, and the number of clusters on the x-axis.

Assign your plot to an object called elbow_plot.
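Hint: a minimal sketch of the general elbow-plot pattern, with placeholder names.

# sketch only: number of clusters on the x-axis, total within-cluster sum of squares on the y-axis
some_elbow_plot <- ggplot(some_elbow_stats, aes(x = k, y = tot.withinss)) +
    geom_point() +
    geom_line() +
    labs(x = "Number of clusters (k)", y = "Total within-cluster sum of squares")
some_elbow_plot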

# your code here
fail() # No Answer - remove if you provide an answer
elbow_plot
test_1.8()

Question 1.9
{points: 3}

Based on the elbow plot above, what value of k do you choose? Explain why.

DOUBLE CLICK TO EDIT THIS CELL AND REPLACE THIS TEXT WITH YOUR ANSWER.

Question 1.10
{points: 3}

Using the value that you chose for k, run the k-means algorithm with nstart = 10, and assign your answer to an object called pokemon_final_kmeans.

Augment the data with the final cluster labels and assign your answer to an object called pokemon_final_clusters.

Finally, create a plot called pokemon_final_clusters_plot to visualize the clusters. Include a title, colour the points by cluster, and make sure your axes are human-readable.

set.seed(2019) # DO NOT REMOVE
# your code here
fail() # No Answer - remove if you provide an answer

Question 1.11
{points: 3}

Using Speed and Defense, we found some number of clusters in our data. However, we have more information in our dataset that might be useful for clustering. Let's incorporate all of the numeric variables into our k-means model. Again, use nstart = 10.

Your tasks:

  1. Select only the numeric columns from the full data set pm_data; for example, do not include the # or Generation columns. Assign your answer to an object called pm_multi. (A sketch of steps 1 and 2 appears after this list.)

  2. Standardize the columns in pm_multi using scale.

  3. From K = 1 to K = 10, calculate the total within-cluster sum of squares. Set nstart to be 10. Assign your answer to an object called pm_multi_elbow_stats.

  4. Use the elbow plot method to determine the number of clusters. Assign your answer to an object called pm_multi_elbow_plot.

  5. Train a k-means model with the number of clusters determined in above. Assign your answer to an object called multi_kmeans.

  6. Print the cluster means for the trained model.
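Hint: a minimal sketch of steps 1 and 2 only; the columns dropped here are an assumption you should check against pm_data, and some_multi is a placeholder name.

# sketch only: keep numeric columns, drop numeric columns that are not ability statistics, then standardize
some_multi <- pm_data %>%
    select(where(is.numeric)) %>%
    select(-any_of(c("#", "Generation")))   # hypothetical: adjust to the columns you decide to exclude
some_multi_scaled <- as_tibble(scale(some_multi))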

#DON'T CHANGE THIS SEED VALUE
set.seed(2019)
# your code here
fail() # No Answer - remove if you provide an answer

Question 1.12
{points: 3}

Visualizing these clusters is not a simple task given the high dimensionality of the model. But does the cluster means output help? Justify your reasoning.

DOUBLE CLICK TO EDIT THIS CELL AND REPLACE THIS TEXT WITH YOUR ANSWER.

2. Tourism Reviews

Source: https://media.giphy.com/media/xUNd9IsOQ4BSZPfnLG/giphy.gif

The Ministry of Land, Infrastructure, Transport and Tourism of Japan is interested in knowing the types of tourists that visit East Asia. They know the majority of their visitors come from this region and would like to stay competitive there to keep growing the tourism industry. For this, they have hired us to perform a segmentation of the tourists. A dataset scraped from TripAdvisor is provided to you.

This dataset contains the following variables:

  • User ID : Unique user id

  • Category 1 : Average user feedback on art galleries

  • Category 2 : Average user feedback on dance clubs

  • Category 3 : Average user feedback on juice bars

  • Category 4 : Average user feedback on restaurants

  • Category 5 : Average user feedback on museums

  • Category 6 : Average user feedback on resorts

  • Category 7 : Average user feedback on parks/picnic spots

  • Category 8 : Average user feedback on beaches

  • Category 9 : Average user feedback on theaters

  • Category 10 : Average user feedback on religious institutions

Question 2.0
{points: 3}

Load the data set from https://archive.ics.uci.edu/ml/machine-learning-databases/00484/tripadvisor_review.csv and clean it so that only the Category # columns are in the data frame (i.e., remove the User ID column).

Assign your answer to an object called clean_reviews.
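Hint: a minimal sketch of the general pattern; read_csv can read directly from a URL, and this assumes the id column is named User ID exactly as described above.

# sketch only: read the CSV straight from the URL, then drop the user id column
some_reviews <- read_csv("https://archive.ics.uci.edu/ml/machine-learning-databases/00484/tripadvisor_review.csv") %>%
    select(-`User ID`)
some_reviews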

# your code here
fail() # No Answer - remove if you provide an answer
test_that('Did not create an object called clean_reviews', {
    expect_true(exists("clean_reviews"))
})
# The remainder of the tests were intentionally hidden so that you can practice deciding
# when you have the correct answer.

Question 2.1
{points: 3}

Perform k-means and vary k from 1 to 10 to identify the optimal number of clusters. Use nstart = 100. Assign your answer to a tibble object called elbow_stats that has the columns k, totss, tot.withinss, betweenss, and iter.

Afterwards, create an elbow plot to help you choose k. Assign your answer to an object called tourism_elbow_plot.

#DON'T CHANGE THIS SEED VALUE
set.seed(2019)
# your code here
fail() # No Answer - remove if you provide an answer
test_that('Did not create an object called elbow_stats', {
    expect_true(exists('elbow_stats'))
})
test_that('Did not create a plot called tourism_elbow_plot', {
    expect_true(exists('tourism_elbow_plot'))
})
# The remainder of the tests were intentionally hidden so that you can practice deciding
# when you have the correct answer.

Question 2.2
{points: 3}

From the elbow plot above, which k should you choose? Explain why you chose that k.

DOUBLE CLICK TO EDIT THIS CELL AND REPLACE THIS TEXT WITH YOUR ANSWER.

Question 2.3
{points: 3}

Run kmeans again, with the optimal k, and assign your answer to an object called reviews_clusters. Use nstart = 100. Then, use the augment function to get the cluster assignments for each point. Name the data frame cluster_assignments.

#DON'T CHANGE THIS SEED VALUE
set.seed(2019)
# your code here
fail() # No Answer - remove if you provide an answer

For the following two questions, use the plot below as a reference.

The visualization below is a density plot; you can think of it as a smoothed version of a histogram. Density plots are more effective than histograms for comparing multiple distributions. What we are looking for with these visualizations is which variables have different distributions between the different clusters.

options(repr.plot.height = 8, repr.plot.width = 15)
cluster_assignments %>%
    pivot_longer(cols = -.cluster, names_to = 'category', values_to = 'value') %>%
    ggplot(aes(value, fill = .cluster)) +
    geom_density(alpha = 0.4, colour = 'white') +
    facet_wrap(~ category, scales = 'free') +
    theme_minimal() +
    theme(text = element_text(size = 20))

Question 2.4 Multiple Choice:
{points: 1}

From the plots above, which categories might we hypothesize are driving the clustering (i.e., are useful for distinguishing between the types of tourists)? The categories are listed again below.

  • Category 1 : Average user feedback on art galleries

  • Category 2 : Average user feedback on dance clubs

  • Category 3 : Average user feedback on juice bars

  • Category 4 : Average user feedback on restaurants

  • Category 5 : Average user feedback on museums

  • Category 6 : Average user feedback on resorts

  • Category 7 : Average user feedback on parks/picnic spots

  • Category 8 : Average user feedback on beaches

  • Category 9 : Average user feedback on theaters

  • Category 10 : Average user feedback on religious institutions

A. 10, 3, 5, 6, 7

B. 10, 3, 5, 6, 1

C. 10, 3, 4, 6, 7

D. 10, 2, 5, 6, 7

Assign your answer to an object called answer2.4. Make sure your answer is an uppercase letter and is surrounded by quotation marks (e.g. "F").

# your code here
fail() # No Answer - remove if you provide an answer
answer2.4
test_that('Did not create an object called answer2.4', {
    expect_true(exists('answer2.4'))
})
# The remainder of the tests were intentionally hidden so that you can practice deciding
# when you have the correct answer.

Question 2.5
{points: 3}

Discuss one disadvantage of not being able to visualize the clusters when dealing with multidimensional data.

DOUBLE CLICK TO EDIT THIS CELL AND REPLACE THIS TEXT WITH YOUR ANSWER.

source("cleanup.R")