UBC-DSCI
GitHub Repository: UBC-DSCI/dsci-100-assets
Path: blob/master/2019-spring/materials/worksheet_12/worksheet_12.ipynb
Kernel: R

Visualizing high dimensional data

One of the problems that we encounter when doing data science is the high dimensionality of our datasets. We have multiple variables that we wish to look at at the same time to understand the processes and patterns in the data. However, we can only see up to 3 dimensions (and really only 2 well), so even with a dataset like the iris dataset, which contains only 4 variables aside from the species labels, we can't fully visualize what's going on. One solution is to visualize the data across multiple 2D plots (as you did in the last tutorial), but that can be sub-optimal. Is there nothing else we can do? There is! In general, the broad category of algorithms used for this and other problems of high dimensionality is called dimensionality reduction. One of the best-known and most widely used techniques is t-SNE.
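To make the "multiple 2D plots" approach concrete, here is a minimal sketch using R's built-in iris data (an illustrative choice; the worksheet itself works with the travel reviews). With only 4 numeric variables there are already 6 distinct pairs to look at, and that count grows quickly as variables are added. The names iris_numeric, n_variables and n_pairs are made up for this example.

```r
# Illustration only: the built-in iris data, not the worksheet's reviews data.
iris_numeric <- iris[, 1:4]   # the 4 measurement columns (no species label)

# Number of distinct 2D scatterplots needed to see every pair of variables
n_variables <- ncol(iris_numeric)
n_pairs <- choose(n_variables, 2)
n_pairs

# Base R draws all pairwise scatterplots in one grid, coloured by species
pairs(iris_numeric, col = iris$Species)
```

Each panel shows only two variables at a time, so any structure that depends on three or four variables together is hard to see this way.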

In this worksheet, we will demonstrate how you can do t-SNE in R to visualize the high-dimensional data sets you were working with in your clustering tutorial. We do not expect you to be able to explain how this algorithm works, but we do expect you to learn and remember the following:

  • what t-sne is used for (the problem it is solving)

  • how to use R to perform t-sne

Additional videos/readings (beyond the scope of this course):

Travel Reviews

We will be working with the travel reviews from East Asia again. During the tutorial we found 2 to be the ideal number of clusters. However, we couldn't visualize how our clusters looked. In this exercise we will apply t-SNE to the reviews dataset to be able to visualize our clusters. The code supplied below replicates the analysis we completed in the tutorial.

library(tidyverse)
library(broom)
library(testthat)
library(repr)

reviews <- read_csv('data/tripadvisor_review.csv')
clean_reviews <- reviews %>%
    select(-`User ID`)
head(clean_reviews)

set.seed(2019)
reviews_clusters <- kmeans(clean_reviews, centers = 2, nstart = 100)
reviews_clusters

Question 1.0

Install and load the Rtsne package.

# your code here
fail() # No Answer - remove if you provide an answer

Question 1.1

When doing dimensionality reduction, one of the things we need to decide is how many dimensions we want to reduce to. In this case we will be reducing to 2 dimensions, although with t-SNE it is possible to reduce to as many as 3 dimensions.

Your tasks:

  • Set the seed to be 2019.

  • Apply t-SNE to the clean_reviews dataset, reducing to 2 dimensions.

  • Assign the result to an object named tsne_reviews.

The function you will need to do this is called Rtsne. You will need to give it at least these 3 arguments:

  • X (the data set you want to perform dimensionality reduction on)

  • dims the number of dimensions you want to reduce the data set to (here 2)

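  • check_duplicates (this specifies whether Rtsne checks for duplicate rows in the data set. The default is TRUE, in which case Rtsne stops with an error if it finds any. The Rtsne documentation recommends making sure no duplicates are present and then setting this to FALSE; in this worksheet, set it to FALSE so the check is skipped).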

  • note - a random number process is used in the algorithm, and so you should set a seed to make your analysis reproducible
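As a hedged illustration of how these arguments fit together, the sketch below runs Rtsne on the built-in iris measurements rather than on the reviews data. The names iris_numeric, has_duplicates and iris_tsne are made up for this example, and the requireNamespace() guard is an assumption added so the code still runs on a machine where Rtsne has not been installed yet.

```r
# Illustration only: the built-in iris data, not the worksheet's reviews data.
iris_numeric <- as.matrix(iris[, 1:4])

# iris contains at least one repeated row, so the default duplicate check
# (check_duplicates = TRUE) would make Rtsne stop with an error here.
has_duplicates <- any(duplicated(iris_numeric))
has_duplicates

if (requireNamespace("Rtsne", quietly = TRUE)) {
  set.seed(2019)  # t-SNE uses random numbers, so fix the seed first
  iris_tsne <- Rtsne::Rtsne(X = iris_numeric, dims = 2,
                            check_duplicates = FALSE)
  # The result is a list; Y holds one row per observation and one column
  # per requested dimension.
  print(dim(iris_tsne$Y))
}
```

Setting the seed immediately before the call means that rerunning the cell gives the same layout.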

# ... <- Rtsne(X = ..., dims = ..., check_duplicates = ...)
# your code here
fail() # No Answer - remove if you provide an answer
tsne_reviews

test_that('Solution is incorrect', {
    expect_equal(tsne_reviews$N, 980)
    expect_equal(ncol(tsne_reviews$Y), 2)
})
print("Success!")

Question 1.2

tsne_reviews contains a matrix with the lower dimensionality representation of the original points. This matrix is stored in the Y element of the object returned from the Rtsne function. Your task is to extract this data from the more complex object so you can easily use it to get the cluster assignments. Name this new object tsne_results.

# ... <- ...$Y
# your code here
fail() # No Answer - remove if you provide an answer
head(tsne_results)

test_that('Solution is incorrect', {
    expect_equal(nrow(tsne_results), 980)
    expect_equal(ncol(tsne_results), 2)
})
print("Success!")

Question 1.3

We can use the augment function (from the broom package) to retrieve the kmeans cluster assignments given to you in the first snippets of code. Extract these from the reviews_clusters object (which was the object returned from the kmeans function) and assign these to an object called tsne_clusters.
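To show what augment does with a kmeans result, here is a small sketch on the built-in iris measurements (an illustrative data set, with made-up object names iris_kmeans and iris_clustered). The base R fallback in the else branch is an assumption added so the example still runs without broom; it is not part of the worksheet.

```r
# Illustration only: k-means on the built-in iris measurements.
iris_numeric <- iris[, 1:4]
set.seed(2019)
iris_kmeans <- kmeans(iris_numeric, centers = 3, nstart = 10)

if (requireNamespace("broom", quietly = TRUE)) {
  # augment() returns the supplied data with an extra .cluster column
  iris_clustered <- broom::augment(iris_kmeans, iris_numeric)
} else {
  # Base R fallback (assumption): attach the cluster labels by hand
  iris_clustered <- cbind(iris_numeric,
                          .cluster = factor(iris_kmeans$cluster))
}
head(iris_clustered)
```

The second argument to augment only needs to have one row per clustered observation, which is why a reduced-dimension version of the data can be passed in place of the original columns.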

# ... <- ...(..., tsne_results)
# your code here
fail() # No Answer - remove if you provide an answer
head(tsne_clusters)

test_that('Solution is incorrect', {
    expect_equal(nrow(tsne_clusters), 980)
    expect_equal(ncol(tsne_clusters), 3)
    expect_equal(length(unique(tsne_clusters$.cluster)), 2)
})
print("Success!")

Question 1.4

Plot the data points (columns X1 and X2) from your tsne_clusters object (this is a reduced dimensionality representation of the original data set). Color the data points by their assigned cluster (.cluster column). Assign the plot to an object named tsne_plot.
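The general shape of such a plot can be sketched on a tiny hand-made data frame. Everything in toy_points is hypothetical and only mimics the column names X1, X2 and .cluster; the axis labels are also an illustrative choice. The requireNamespace() guard is an assumption so the code runs even where ggplot2 is missing.

```r
# Hypothetical toy data that mimics the column names of tsne_clusters
toy_points <- data.frame(
  X1 = c(-2.1, -1.8, -2.4, 1.9, 2.2, 2.5),
  X2 = c(0.5, 0.9, 0.2, -0.4, -0.8, -0.1),
  .cluster = factor(c(1, 1, 1, 2, 2, 2))
)

if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  # Map the two reduced dimensions to x and y, and the cluster to colour
  toy_plot <- ggplot(toy_points, aes(x = X1, y = X2, colour = .cluster)) +
    geom_point() +
    labs(x = "t-SNE dimension 1", y = "t-SNE dimension 2", colour = "Cluster")
  print(toy_plot)
}
```

Storing the cluster as a factor gives a discrete colour legend rather than a continuous colour scale.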

# your code here
fail() # No Answer - remove if you provide an answer
tsne_plot
test_that('Solution is correct', {
    expect_true("X1" %in% c(rlang::get_expr(tsne_plot$mapping$x), rlang::get_expr(tsne_plot$mapping$y)))
    expect_true("X2" %in% c(rlang::get_expr(tsne_plot$mapping$x), rlang::get_expr(tsne_plot$mapping$y)))
    expect_equal(as.character(rlang::get_expr(tsne_plot$mapping$colour)), ".cluster")
    # expect_true() replaces the deprecated expect_that(..., is_true()) form
    expect_true("GeomPoint" %in% class(tsne_plot$layers[[1]]$geom))
})
print("Success!")

Question 1.4.1 (Optional)

Another argument you can set when doing t-SNE is perplexity. Distances between nearest neighbours are part of the t-SNE algorithm. You can loosely think of perplexity as a way of specifying the number of nearest neighbours to use. A small value for perplexity (e.g., 5) corresponds to a smaller number of nearest neighbours (paying more attention to local structure), whereas a larger value corresponds to a larger number of nearest neighbours (paying more attention to the global structure). The default for R's Rtsne function is 30.
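A minimal sketch of comparing perplexity values, again on the built-in iris data rather than the reviews. The values 5 and 40 and the names perplexities and embeddings are illustrative choices. One practical detail worth knowing: Rtsne stops with an error when 3 * perplexity is larger than the number of rows minus 1, so the usable range depends on the size of the data set (with the 980 reviews, a perplexity of 150 is within that limit).

```r
# Illustration only: iris has 150 rows, so 3 * perplexity must not exceed 149.
iris_numeric <- as.matrix(iris[, 1:4])
perplexities <- c(5, 40)

# Guard against values that Rtsne would reject for a data set of this size
stopifnot(all(3 * perplexities <= nrow(iris_numeric) - 1))

if (requireNamespace("Rtsne", quietly = TRUE)) {
  embeddings <- lapply(perplexities, function(p) {
    set.seed(2019)  # same seed for each run, so only perplexity changes
    Rtsne::Rtsne(X = iris_numeric, dims = 2, perplexity = p,
                 check_duplicates = FALSE)$Y
  })
  names(embeddings) <- paste0("perplexity_", perplexities)
  # Each embedding has one row per observation and two columns
  print(sapply(embeddings, dim))
}
```

Each matrix in embeddings can then be plotted separately to compare how the layout changes with perplexity.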

Try running t-sne again and visualizing it twice more, once with perplexity = 3 and once with perplexity = 150. Set the seed to be 2019.

# your code here
fail() # No Answer - remove if you provide an answer

Question 1.4.2

How do we choose perplexity (or other such values) for t-sne? Typically we try a variety of values and look at the resultant visualization and choose the one that gives the best clustering visually.

For the 3 perplexities we tried, which one do you think is the best? Explain why.

YOUR ANSWER HERE

Question 1.5 (Optional - not graded)

Now that you can visualize the cluster assignments, does this change your interpretation of the kmeans results you obtained in the tutorial? Does it weaken or strengthen your opinion that this data can be separated into two clusters?

YOUR ANSWER HERE

Question 1.6 (Optional - not graded)

Use t-sne to visualize the pokemon data set in two dimensions, overlaying the k-means cluster assignments on the visualization.

# your code here
fail() # No Answer - remove if you provide an answer

Getting R + Jupyter working for you outside of our course

At some point after the exam, you will lose access to the JupyterHub server where you have been doing your course work. If you want to continue to use R + Jupyter (for other courses at UBC, or for your work after UBC) you have two options:

  1. a server solution

  2. a local installation solution

We will point you to how you can do both, and provide guidance on how to take a copy of your homework from our Canvas JupyterHub server.

1. a server solution

  • As a student at UBC, you have access to another JupyterHub that you can access using your UBC CWL: https://ubc.syzygy.ca/

  • If you have a Google account, you have access to another JupyterHub that does not depend on you being a UBC student (i.e., having a valid CWL): https://cybera.syzygy.ca/

2. a local installation solution

  • Depending on your device, you may be able to install Jupyter and R on it. Below we provide instructions for the 3 major operating systems (Mac, Windows and Linux)

A. Install Jupyter + R on Mac

A.1. Install R

Go to https://cran.r-project.org/bin/macosx/ and download the latest version of R for Mac (Should look something like this: R-3.5.1.pkg). Open the file and follow the installer instructions.

A.2. Install Jupyter

Head to https://www.anaconda.com/download/#macos and download the Anaconda version for Mac OS. Follow the instructions on that page to run the installer. Anaconda is a software bundle that installs Jupyter, Python, and a few other tools. We recommend using it because it makes installation much simpler than other methods.

A.3. Install the IR kernel

The IR kernel lets you connect R with Jupyter. To install it, follow the instructions below:

  • Open terminal (how to video), and type R to open R.

  • Type the following commands into R:

install.packages('IRkernel', repos = 'http://cran.us.r-project.org')

IRkernel::installspec()

IRkernel::installspec(user = FALSE)

  • Next, exit R by typing: q() (type n when prompted to not save the workspace)

A.4 Test if it works:

  • Open terminal and type: jupyter notebook

The above command should open a web browser, with Jupyter's home inside it. Try creating a new R notebook and running some simple R code (e.g., print("hello!")) to test that it works.
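If no R option appears when creating a new notebook, one way to investigate (on any of the three operating systems) is to ask Jupyter which kernels it knows about. The sketch below does this from inside R; jupyter_path and jupyter_found are made-up names, and the check assumes the jupyter command is on your PATH, which may not hold for every Anaconda setup.

```r
# Look for the jupyter command on the PATH; Sys.which() returns "" if missing
jupyter_path <- unname(Sys.which("jupyter"))
jupyter_found <- nzchar(jupyter_path)

if (jupyter_found) {
  # Lists the installed kernels; IRkernel registers itself as "ir" by default
  system2("jupyter", c("kernelspec", "list"))
} else {
  message("jupyter was not found on the PATH; check the Anaconda installation.")
}
```

If "ir" is missing from the list, rerunning IRkernel::installspec() from the same R installation is a reasonable next step.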

B. Install Jupyter + R on Windows

B.1. Install R

Go to https://cran.r-project.org/bin/windows/base/ and download the latest version of R for Windows (Should look something like this: Download R 3.5.1 for Windows). Open the file and follow the installer instructions.

B.2. Install Jupyter

Head to https://www.anaconda.com/download/#windows and download the Anaconda version for Windows with Python 3.6. After the download has finished, run the installer selecting the following options:

  • On the Advanced Installation Options page, check both boxes

  • For all other pages, use the default options

B.3. Install the IR kernel

The IR kernel lets you connect R with Jupyter. To install it, follow the instructions below:

  • Open R (double click on Desktop Icon or open it using the START menu)

  • Type the following commands into R:

install.packages('IRkernel', repos = 'http://cran.us.r-project.org')

IRkernel::installspec()

IRkernel::installspec(user = FALSE)

  • Next, exit/close R

B.4 Test if it works:

  • Open Windows Command Prompt (select the Start button, type cmd, and click Command Prompt from the list) and type: jupyter notebook

The above command should open a web browser, with Jupyter's home inside it. Try creating a new R notebook and running some simple R code (e.g., print("hello!")) to test that it works.

C. Install Jupyter + R on Linux

C.1. Install R

Open /etc/apt/sources.list and add the following line to the end of the file (choose the correct one for your version of Ubuntu):

  • for Ubuntu 18.04.1 (Bionic Beaver) add: deb https://cloud.r-project.org/bin/linux/ubuntu bionic-cran35/

  • for Ubuntu 17.10.1 (Artful Aardvark) add: deb https://cloud.r-project.org/bin/linux/ubuntu artful/

  • for Ubuntu 16.04.4 (Xenial Xerus) add: deb https://cloud.r-project.org/bin/linux/ubuntu xenial/

  • for Ubuntu 14.04.5 (Trusty Tahr) add: deb https://cloud.r-project.org/bin/linux/ubuntu trusty/

Next, add the key ID for the CRAN network:

sudo apt-key adv --keyserver keyserver.ubuntu.com --recv-keys E084DAB9

Then, update the repository:

sudo apt-get update

Then, install the R binaries:

sudo apt-get install r-base

Finally, install the r-base-dev package:

sudo apt-get install r-base-dev

After installation, in terminal type the following to ask for the version:

R --version

You should see something like this if you were successful (the Platform line will differ depending on your machine):

R version 3.5.0 (2018-04-23) -- "Joy in Playing"
Copyright (C) 2018 The R Foundation for Statistical Computing
Platform: x86_64-apple-darwin15.6.0 (64-bit)

R is free software and comes with ABSOLUTELY NO WARRANTY.
You are welcome to redistribute it under the terms of the
GNU General Public License versions 2 or 3.
For more information about these matters see
http://www.gnu.org/licenses/.

C.2. Install Jupyter

Head to https://www.anaconda.com/download/#linux and download the Anaconda version for Linux with Python 3.6. Follow the instructions on that page to run the installer. Anaconda is a software bundle that installs Jupyter, Python, and a few other tools. We recommend using it because it makes installation much simpler than other methods.

C.3. Install the IR kernel

The IR kernel lets you connect R with Jupyter. To install it, follow the instructions below:

  • Open terminal (how to video), and type R to open R.

  • Type the following commands into R:

install.packages('IRkernel', repos = 'http://cran.us.r-project.org')

IRkernel::installspec()

IRkernel::installspec(user = FALSE)

  • Next, exit R by typing: q() (type n when prompted to not save the workspace)

C.4 Test if it works:

  • Open terminal and type: jupyter notebook

The above command should open a web browser, with Jupyter's home inside it. Try creating a new R notebook and running some simple R code (e.g., print("hello!")) to test that it works.

Getting your files off of the Canvas JupyterHub

  1. On the Canvas JupyterHub open a terminal and type: tar -zcvf dsci-100.tar.gz dsci-100

  2. Go to the JupyterHub Control Panel/Home and click the box beside the file dsci-100.tar.gz, and then click "Download"

  3. Log onto either the https://ubc.syzygy.ca/ (CWL authentication) or https://cybera.syzygy.ca/ (Google authentication) servers

  4. Open a terminal and type: tar -zxvf dsci-100.tar.gz

You should have your files now available on either of those servers above.

tar also works in the terminal on Mac and Linux, so you can use the same strategy to make the files available locally. However, you will need Jupyter installed to view them.
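If tar is not available in your terminal, R's own tar() and untar() functions (from the utils package that ships with R) offer a similar workflow. The sketch below is an illustration under assumptions: it builds a throwaway dsci-100 folder in a temporary directory rather than touching your real course files, and notes.txt and the restored folder are hypothetical names.

```r
# Build a throwaway folder to stand in for the real dsci-100 directory
work_dir <- tempfile("tar_demo_")
dir.create(file.path(work_dir, "dsci-100"), recursive = TRUE)
writeLines("hello", file.path(work_dir, "dsci-100", "notes.txt"))

# Work inside the temporary directory, remembering where we started
old_wd <- setwd(work_dir)

# Compress the folder (the R counterpart of: tar -zcvf dsci-100.tar.gz dsci-100)
tar("dsci-100.tar.gz", files = "dsci-100", compression = "gzip",
    tar = "internal")

# Unpack it somewhere else (the R counterpart of: tar -zxvf dsci-100.tar.gz)
restore_dir <- file.path(work_dir, "restored")
dir.create(restore_dir)
untar("dsci-100.tar.gz", exdir = restore_dir)

setwd(old_wd)

# Check that the file survived the round trip
restored_ok <- file.exists(file.path(restore_dir, "dsci-100", "notes.txt"))
restored_ok
```

Using tar = "internal" asks R to use its built-in implementation when creating the archive, so that step does not depend on a system tar program being installed.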