
Kernel: R

Tutorial 11 - Clustering

Lecture and Tutorial Learning Goals:

After completing this week's lecture and tutorial work, you will be able to:

Describe a case where clustering would be an appropriate tool, and what insight it would bring from the data.
Explain the k-means clustering algorithm.
Interpret the output of a k-means cluster analysis.
Perform k-means clustering in R using kmeans.
Visualize the output of k-means clustering in R using a scatter plot faceted across each axis.
Identify when it is necessary to scale variables before clustering, and do this using R.
Use the elbow method to choose the number of clusters for k-means.
Describe advantages, limitations and assumptions of the k-means clustering algorithm.

In [ ]:

### Run this cell before continuing.
library(tidyverse)
library(testthat)
library(digest)
library(repr)
library(GGally)
library(broom)

1. Pokemon

We will be working with the Pokemon dataset from Kaggle, which can be found here. This dataset compiles the statistics on 721 Pokemon. The information in this dataset includes Pokemon name, type, health points, attack strength, defensive strength, speed points, etc. We are interested in seeing whether there are any sub-groups/clusters of Pokemon based on these statistics and, if so, how many there are.

Source: https://media.giphy.com/media/3oEduV4SOS9mmmIOkw/giphy.gif

Question 1.0

Load the pokemon.csv dataset and assign it to an object called pm_data.

In [ ]:

# your code here
fail() # No Answer - remove if you provide an answer
head(pm_data)

In [ ]:

test_that('Solution is incorrect', {
    expect_equal(nrow(pm_data), 800)
    expect_equal(ncol(pm_data), 13)
    expect_true('Name' %in% colnames(pm_data))
    expect_true('HP' %in% colnames(pm_data))
    expect_true('Attack' %in% colnames(pm_data))
    expect_true('Defense' %in% colnames(pm_data))
    })
print("Success!")

Question 1.1

Create a matrix of plots using ggpairs, choosing columns 5 to 11 (or equivalently, Total to Speed) from pm_data. There are several ways to do this. The most familiar is to use the select function to give a range of column names:

data %>% select(start_column_name:end_column_name)

Another is to pass the column numbers directly to the ggpairs function:

ggpairs(name_of_dataset, columns = c(number:number))

Assign your answer to an object called pm_pairs.
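
As an illustration of the two approaches just described, here is a minimal sketch on R's built-in iris data set rather than the Pokemon data. The object names example_pairs and example_pairs_select are made up for this example, and the column range 1:4 applies to iris only.

```r
library(tidyverse)
library(GGally)

# Approach 1: pass column numbers straight to ggpairs().
# In iris, columns 1 to 4 are the numeric measurements.
example_pairs <- ggpairs(iris, columns = 1:4)

# Approach 2: select a range of column names first, then plot everything that is left.
example_pairs_select <- iris %>%
    select(Sepal.Length:Petal.Width) %>%
    ggpairs()

example_pairs
```

Both objects describe the same four-by-four matrix of plots; which form you use is a matter of taste.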

In [ ]:

# your code here
fail() # No Answer - remove if you provide an answer
pm_pairs

In [ ]:

test_that('Solution is correct', {
    expect_equal(nrow(pm_pairs$data), 800)
    expect_equal(pm_pairs$yAxisLabels %in% (c("Total", "HP", "Attack", "Defense", "Sp. Atk", "Sp. Def", "Speed")), c(TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE))
    expect_true('ggmatrix' %in% c(class(pm_pairs)))
    })
print("Success!")

Question 1.2

Make a scatterplot of Speed (x-axis) vs Defense (y-axis) to visualize the relationship between these two Pokemon statistics.

Assign your plot to an object called pm_scatter. Also don't forget to label your axes.

In [ ]:

# your code here
fail() # No Answer - remove if you provide an answer
pm_scatter

In [ ]:

test_that('Solution is correct', {
    expect_true("Speed" == rlang::get_expr(pm_scatter$mapping$x))
    expect_true("Defense" == rlang::get_expr(pm_scatter$mapping$y))
    expect_true("GeomPoint" %in% class(pm_scatter$layers[[1]]$geom))
    })
print("Success!")

Question 1.3

Select the columns: Speed and Defense. Create a new data set called km_data with those columns.

In [ ]:

# your code here
fail() # No Answer - remove if you provide an answer
head(km_data)

In [ ]:

test_that('Solution is correct', {
    expect_equal(ncol(km_data), 2)
    expect_equal(nrow(km_data), 800)
    expect_equal(colnames(km_data), c('Speed', 'Defense'))
    })
print("Success!")

Question 1.4.0

Now, we are going to cluster the Pokemon based on their "Speed" and "Defense" variables. Do we need to scale our variables before clustering? Explain why or why not.

YOUR ANSWER HERE

Question 1.4.1

Now, let's use the kmeans() function to cluster the Pokemon based on their "Speed" and "Defense" variables. For this question, use K = 4.

Note: kmeans() starts from randomly chosen initial cluster centers, so set the random number generator seed to 2019 before calling it to make your result reproducible.

Assign your answer to an object called pokemon_clusters.
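
The general pattern of seeding and then calling kmeans() can be sketched as follows. This uses two columns of the built-in iris data set and K = 3 purely as a stand-in; example_data and example_clusters are names invented for this sketch.

```r
# Stand-in data: two numeric columns from the built-in iris data set (not the Pokemon data).
example_data <- iris[, c("Petal.Length", "Petal.Width")]

# Fix the seed so the random starting point, and hence the result, is reproducible.
set.seed(2019)
example_clusters <- kmeans(example_data, centers = 3)

# One row per cluster and one column per variable.
example_clusters$centers
# Number of observations assigned to each cluster.
example_clusters$size
```

Re-running both the set.seed() line and the kmeans() line gives the same cluster assignments each time.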

In [ ]:

# your code here
fail() # No Answer - remove if you provide an answer
pokemon_clusters

In [ ]:

test_that('Solution is correct', {
    expect_equal(ncol(pokemon_clusters$centers), 2)
    expect_equal(nrow(pokemon_clusters$centers), 4)
    expect_equal(colnames(pokemon_clusters$centers), c('Speed', 'Defense'))
    expect_equal(class(pokemon_clusters), 'kmeans')
    })
print("Success!")

Question 1.5

Let's visualize the clusters we built in pokemon_clusters. For this we can use the broom package.

"The broom package takes the messy output of built-in functions in R, such as lm, nls, or t.test, and turns them into tidy data frames." - Broom Package

Your tasks:

Use the augment function to create a data frame with the cluster assignments for each data point from kmeans (it should have the columns Speed, Defense and .cluster).
Create a scatter plot of Speed (x-axis) vs Defense (y-axis) with the points coloured by their cluster assignment. Name this plot answer1.5.
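
The two tasks above can be sketched on stand-in data as follows. This is only an illustration on the built-in iris data set; the names example_data, example_clusters, example_assignments and example_plot are hypothetical, and the axis labels apply to iris only.

```r
library(tidyverse)
library(broom)

# Stand-in data and model (not the Pokemon data).
example_data <- iris %>% select(Petal.Length, Petal.Width)
set.seed(2019)
example_clusters <- kmeans(example_data, centers = 3)

# augment() returns the original rows plus a .cluster column holding each point's assignment.
example_assignments <- augment(example_clusters, example_data)

# Colour the points by cluster assignment.
example_plot <- ggplot(example_assignments,
                       aes(x = Petal.Length, y = Petal.Width, colour = .cluster)) +
    geom_point() +
    labs(x = "Petal length (cm)", y = "Petal width (cm)", colour = "Cluster")
example_plot
```

Because .cluster is a factor, ggplot gives each cluster its own discrete colour rather than a continuous gradient.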

In [ ]:

# your code here
fail() # No Answer - remove if you provide an answer
answer1.5

In [ ]:

test_that('Solution is correct', {
    expect_true("Speed" == rlang::get_expr(answer1.5$mapping$x))
    expect_true("Defense" == rlang::get_expr(answer1.5$mapping$y))
    expect_true(".cluster" == rlang::get_expr(answer1.5$mapping$colour))
    expect_true("GeomPoint" %in% class(answer1.5$layers[[1]]$geom))
    })
print("Success!")

Question 1.6

Below you can see multiple initializations of k-means with different seeds for K = 4. Can you explain what happens and how we can control this in the kmeans function?

YOUR ANSWER HERE

Question 1.7

We know that choosing a K is an important step of the process. We can do this using the total within-cluster sum of squares and seeing how this changes as we change K on a plot (which we call an elbow plot).

For this exercise, from K = 1 to K = 10, calculate the total within-cluster sum of squares. Set nstart to be 10.

We expect the output of this question to be a data frame with the columns k, totss, tot.withinss, betweenss, and iter. Assign your answer to an object called elbow_stats.

Remember, to access the total within-cluster sum of squares, you can use the glance function, which is also from the broom package:

In [ ]:

glance(pokemon_clusters)
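
One possible way to collect glance() output for several values of K is to fit one model per K inside a tibble and then unnest the results. The sketch below does this on the built-in iris data set with K from 1 to 6; example_data, example_elbow, model and glanced are names made up for this illustration, and other approaches (for example a loop) would work just as well.

```r
library(tidyverse)
library(broom)

# Stand-in data (not the Pokemon data).
example_data <- iris %>% select(Petal.Length, Petal.Width)
set.seed(2019)

example_elbow <- tibble(k = 1:6) %>%
    # Fit one k-means model per value of k, each with 10 random restarts.
    mutate(model = map(k, ~ kmeans(example_data, centers = .x, nstart = 10)),
           # Summarise each model as a one-row data frame.
           glanced = map(model, glance)) %>%
    select(-model) %>%
    # Spread the one-row summaries out into ordinary columns.
    unnest(glanced)
example_elbow
```

The result has one row per K, with the columns k, totss, tot.withinss, betweenss and iter.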

In [ ]:

# your code here
fail() # No Answer - remove if you provide an answer
head(elbow_stats)

In [ ]:

test_that('Solution is correct', {
    expect_equal(nrow(elbow_stats), 10)
    expect_equal(sum(c('k', 'tot.withinss') %in% colnames(elbow_stats)), 2)
    })
print("Success!")

Question 1.8

Now go ahead and plot the elbow plot.

Assign your plot to an object called elbow_plot
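
For reference, an elbow plot is a line plot of the total within-cluster sum of squares against K. The following sketch builds one on the built-in iris data set; example_data, example_elbow and example_elbow_plot are hypothetical names, and the range of K (1 to 6) is chosen only to keep the example short.

```r
library(ggplot2)

# Stand-in data (not the Pokemon data).
example_data <- iris[, c("Petal.Length", "Petal.Width")]
set.seed(2019)

# Total within-cluster sum of squares for each candidate K.
example_elbow <- data.frame(k = 1:6)
example_elbow$tot.withinss <- sapply(example_elbow$k, function(k) {
    kmeans(example_data, centers = k, nstart = 10)$tot.withinss
})

example_elbow_plot <- ggplot(example_elbow, aes(x = k, y = tot.withinss)) +
    geom_point() +
    geom_line() +
    scale_x_continuous(breaks = 1:6) +
    labs(x = "Number of clusters (K)", y = "Total within-cluster sum of squares")
example_elbow_plot
```

The "elbow" is the value of K after which the curve flattens, meaning extra clusters reduce the sum of squares only slightly.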

In [ ]:

# Plot the within group sum of squares on the y-axis.  
# Plot the number of clusters on the x-axis.

# your code here
fail() # No Answer - remove if you provide an answer
elbow_plot

Question 1.9 Multiple Choice:

Based on the elbow plot above, what value of k do you choose? Explain why.

YOUR ANSWER HERE

Question 1.10

Using the value that you chose for k, perform the k-means algorithm and create a plot to visualize the clusters. Again, set nstart to be 10, and set the seed to be 2019.

In [ ]:

# your code here
fail() # No Answer - remove if you provide an answer

Question 1.11

Using Speed and Defense, we could find 3 clusters of Pokemon. However, our dataset contains more information that we are throwing away. Let's incorporate all of the numeric variables into our kmeans model. Again, use nstart = 10 and set the seed to be 2019.

Your tasks:

Select the numeric values only. Do not include the # or Generation columns (they are not pokemon statistics).
Use the elbow plot method to determine the number of clusters.
Train a k-means model with the number of clusters determined in (2).

This time we won't be able to visualize it, but on Thursday we will learn an algorithm that allows us to visualize multivariate clustering.
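
As a sketch of the first task, numeric columns can be picked out by type rather than by name. The example below uses the built-in iris data set as a stand-in and also shows an optional standardization step with scale(); whether that step is appropriate for the Pokemon data is for you to decide. The names example_numeric and example_scaled are invented for this illustration.

```r
library(tidyverse)

# iris is a stand-in: four numeric columns plus the Species factor.
example_numeric <- iris %>% select_if(is.numeric)

# Optional step: center each column at 0 and rescale it to standard deviation 1,
# so that no variable dominates the distance calculation purely because of its units.
example_scaled <- example_numeric %>%
    scale() %>%
    as.data.frame()

colnames(example_scaled)
```

Columns that are numeric but are not measurements, such as ID numbers, still need to be dropped by name with select().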

In [ ]:

# your code here
fail() # No Answer - remove if you provide an answer

In [ ]:

# your code here
fail() # No Answer - remove if you provide an answer

Question 1.12

As mentioned before, visualizing these clusters as they are is not possible given the high dimensionality of the model. Does the cluster means output help? Justify your reasoning.

YOUR ANSWER HERE

2. Tourism Reviews

Source: https://media.giphy.com/media/xUNd9IsOQ4BSZPfnLG/giphy.gif

The Ministry of Land, Infrastructure, Transport and Tourism of Japan is interested in knowing the type of tourists that visit East Asia. They know the majority of their visitors come from this region and would like to stay competitive in the region to keep growing the tourism industry. For this, they have hired us to perform segmentation of the tourists. A dataset scraped from TripAdvisor is provided to you.

This dataset contains the following variables:

User ID : Unique user id
Category 1 : Average user feedback on art galleries
Category 2 : Average user feedback on dance clubs
Category 3 : Average user feedback on juice bars
Category 4 : Average user feedback on restaurants
Category 5 : Average user feedback on museums
Category 6 : Average user feedback on resorts
Category 7 : Average user feedback on parks/picnic spots
Category 8 : Average user feedback on beaches
Category 9 : Average user feedback on theaters
Category 10 : Average user feedback on religious institutions

Question 2.0

Load the data set (which lives at the URL: https://archive.ics.uci.edu/ml/machine-learning-databases/00484/tripadvisor_review.csv) and clean it so that only the columns for the Categories are in the data frame (i.e., remove the User ID column).
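
The load-and-clean pattern can be sketched on a tiny inline CSV so that it runs without network access. The values below are made up for illustration, and the column names (`User ID`, `Category 1`, `Category 2`) are an assumption about the real file's header; check the actual column names after loading. The names example_csv and example_reviews are hypothetical.

```r
library(tidyverse)

# A tiny made-up CSV standing in for the real file; the values are illustrative only.
example_csv <- "User ID,Category 1,Category 2\nUser 1,0.93,1.80\nUser 2,1.02,2.20\n"

# I() tells read_csv() to treat the string as literal data rather than a file path.
# For the real data, a URL string can be passed to read_csv() instead.
example_reviews <- read_csv(I(example_csv)) %>%
    # Column names containing spaces must be wrapped in backticks.
    select(-`User ID`)
example_reviews
```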

In [ ]:

# your code here
fail() # No Answer - remove if you provide an answer

Question 2.1

Perform k-means and vary K from 1 to 10 to identify the optimal number of clusters. Create an elbow plot to help you choose K. At all steps use nstart = 100 and do not forget to set a seed.

In [ ]:

# your code here
fail() # No Answer - remove if you provide an answer

Question 2.2

From the elbow plot above, which K should you choose? Explain why you chose that K.

YOUR ANSWER HERE

Question 2.3

Run kmeans again with the optimal K (don't forget to set nstart and a seed), and then use the augment function to get the cluster assignments for each point. Name the data frame cluster_assignments.

In [ ]:

# your code here
fail() # No Answer - remove if you provide an answer
head(cluster_assignments)

Use the plot below as a reference for the next two questions.

In [ ]:

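# Set the size of the plot output.
options(repr.plot.height = 6, repr.plot.width = 8)
cluster_assignments %>%
    # Reshape to long format: one row per (observation, category) pair, keeping .cluster.
    gather(key = 'category', value = 'value', -.cluster) %>%
    ggplot(aes(value, fill = .cluster)) +
        # One density curve per cluster, with a separate panel for each category.
        geom_density(alpha = 0.4, colour = 'white') +
        facet_wrap(~ category, scales = 'free') +
        theme_minimal()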

Question 2.4

From the plots above, which categories might we hypothesize are driving the clustering (i.e., are useful for distinguishing between types of tourists)? Explain why you think so for each category. The categories are listed below.

Category 1 : Average user feedback on art galleries
Category 2 : Average user feedback on dance clubs
Category 3 : Average user feedback on juice bars
Category 4 : Average user feedback on restaurants
Category 5 : Average user feedback on museums
Category 6 : Average user feedback on resorts
Category 7 : Average user feedback on parks/picnic spots
Category 8 : Average user feedback on beaches
Category 9 : Average user feedback on theaters
Category 10 : Average user feedback on religious institutions

YOUR ANSWER HERE

Question 2.5

Discuss one disadvantage of not being able to visualize the clusters when dealing with multidimensional data.

YOUR ANSWER HERE
