Last worksheet!
We will cover 2 things today:
Visualizing high dimensional data
One of the problems we encounter in data science is the high dimensionality of our datasets. We often have many variables that we wish to look at together to understand the processes and patterns in the data. However, we can only see up to 3 dimensions (and really only 2 well), so even for a dataset like the iris dataset, which contains only 4 variables aside from the species labels, we can't fully visualize what's going on. One solution is to spread the visualization across multiple 2D plots (as you did in the last tutorial), but that can be sub-optimal. Is there nothing else we can do? Fortunately, there are alternatives. The broad category of algorithms used for this and other problems of high dimensionality is called dimensionality reduction. One of the best-known and most widely used techniques is t-SNE.
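To make the "multiple 2D plots" approach concrete, here is a minimal base R sketch using the built-in iris data. The object names (`measurements`, `n_panels`) are our own choices for illustration and are not part of the worksheet.

```r
# The four numeric iris measurements (the fifth column is the species label).
measurements <- iris[, 1:4]

# Every unordered pair of variables needs its own 2D scatter plot.
n_panels <- choose(ncol(measurements), 2)
n_panels

# Scatterplot matrix: each pair of variables is drawn twice, once above and
# once below the diagonal. Points are coloured by species.
pairs(measurements, col = as.integer(iris$Species), pch = 19)
```

The number of distinct pairs grows quickly as variables are added, which is one reason a single low-dimensional view can be easier to read.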
In this worksheet, we will demonstrate how to do t-SNE in R to visualize the high-dimensional data sets you were working with in your clustering tutorial. We do not expect you to be able to explain how this algorithm works, but we do expect you to learn and remember the following:
what t-SNE is used for (the problem it is solving)
how to use R to perform t-SNE
Additional videos/readings (beyond the scope of this course):
An illustrated introduction to the t-SNE algorithm - note - this is in Python!
Travel Reviews
We will be working with the travel reviews from East Asia again. During the tutorial we found that 2 was the ideal number of clusters. However, we couldn't visualize how our clusters looked. In this exercise we will apply t-SNE to the reviews dataset so that we can visualize our clusters. The code supplied below replicates the analysis we completed in the tutorial.
Question 1.0
Install and load the Rtsne package.
Question 1.1
When doing dimensionality reduction, one of the things we need to decide is how many dimensions we want to reduce to. In this case we will be reducing to 2 dimensions, although with t-SNE it is also possible to reduce to 3 dimensions.
Your tasks:
Set the seed to be 2019.
Apply t-SNE to the clean_reviews dataset, reducing to 2 dimensions. Assign the result to an object named tsne_reviews.
The function you will need to do this is called Rtsne. You will need to give at least these 3 arguments:
X (the data set you want to perform dimensionality reduction on)
dims (the number of dimensions you want to reduce the data set to, here 2)
check_duplicates (whether Rtsne should check for duplicate rows in the data set. The default is TRUE; the recommended practice is to make sure there are no duplicates present and then set it to FALSE)
note - a random number process is used in the algorithm, and so you should set a seed to make your analysis reproducible
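As an illustration of these three arguments on a different data set, here is a minimal sketch that runs t-SNE on the built-in iris measurements. It assumes the Rtsne package is already installed; the object names (`iris_unique`, `iris_matrix`, `tsne_iris`) are our own and do not appear in the worksheet.

```r
library(Rtsne)

# Keep one copy of each distinct set of measurements, so that no duplicate
# rows remain before we switch the duplicate check off.
iris_unique <- iris[!duplicated(iris[, 1:4]), ]
iris_matrix <- as.matrix(iris_unique[, 1:4])

# t-SNE uses random numbers, so fix the seed to make the result reproducible.
set.seed(2019)

tsne_iris <- Rtsne(X = iris_matrix, dims = 2, check_duplicates = FALSE)

# The result is a list; its Y element has one row per observation and
# one column per requested dimension.
dim(tsne_iris$Y)
```

Removing duplicates ourselves first is what makes it safe to set check_duplicates to FALSE here.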
Question 1.2
tsne_reviews contains a matrix with the lower dimensionality representation of the original points. This is contained inside the Y element of the object returned from the Rtsne function. Your task is to extract this data from the more complex object so you can easily use it to get the cluster assignments. Name this new object tsne_results.
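If pulling one piece out of a larger object is unfamiliar, the sketch below shows the idea on a small stand-in. `mock_tsne` is a hypothetical list we build by hand so that it has a matrix stored under the name Y, similar in shape to what Rtsne returns; the numbers in it are made up.

```r
# Hypothetical stand-in for an Rtsne result: a named list whose Y element is
# a matrix with one row per observation (5 rows, 2 columns here).
mock_tsne <- list(
  N = 5,
  Y = matrix(c(1.2, -0.4, 3.1, 0.7, -2.0,
               0.5,  2.2, -1.3, 0.0,  1.8), ncol = 2)
)

# The $ operator pulls a single named element out of a list.
mock_results <- mock_tsne$Y

is.matrix(mock_results)
dim(mock_results)
```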
Question 1.3
We can use the augment function (from the broom package) to retrieve the kmeans cluster assignments given to you in the first snippets of code. Extract these from the reviews_clusters object (which was the object returned from the kmeans function) and assign these to an object called tsne_clusters.
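To see what augment gives back for a kmeans result, here is a small sketch on the built-in iris measurements rather than the reviews data. It assumes the broom package is installed; `iris_kmeans` and `iris_augmented` are names we made up for this example.

```r
library(broom)

measurements <- iris[, 1:4]

# k-means starts from randomly chosen centres, so fix the seed.
set.seed(2019)
iris_kmeans <- kmeans(measurements, centers = 2)

# augment() returns the data we pass in, with an extra .cluster column
# holding the cluster assigned to each row.
iris_augmented <- augment(iris_kmeans, measurements)

head(iris_augmented)
```

The second argument decides which columns come back alongside .cluster, so any data with one row per clustered observation can be passed in.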
Question 1.4
Plot the data points (columns X1 and X2) from your tsne_clusters object (this is a reduced dimensionality representation of the original data set). Color the data points by their assigned cluster (.cluster column). Assign the plot to an object named tsne_plot.
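The sketch below shows one way such a plot could look, assuming you are plotting with ggplot2 (the worksheet does not name a plotting package, so this is our assumption). `toy_clusters` is a hypothetical six-row data frame with made-up values that only borrows the column names X1, X2 and .cluster.

```r
library(ggplot2)

# Hypothetical toy data that mimics the column names used in the question.
toy_clusters <- data.frame(
  X1 = c(-2.1, -1.8, -2.4, 1.9, 2.3, 2.0),
  X2 = c(0.4, -0.2, 0.1, -0.3, 0.5, 0.0),
  .cluster = factor(c(1, 1, 1, 2, 2, 2))
)

# Scatter plot of the two reduced dimensions, coloured by cluster.
toy_plot <- ggplot(toy_clusters, aes(x = X1, y = X2, colour = .cluster)) +
  geom_point() +
  labs(x = "t-SNE dimension 1", y = "t-SNE dimension 2", colour = "Cluster")

toy_plot
```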
Question 1.4.1 (Optional)
Another argument you can set when doing t-SNE is perplexity. Distances between nearest neighbours are part of the t-SNE algorithm. You can loosely think of perplexity as a way of specifying the number of nearest neighbours to use. A small value for perplexity (e.g., 5) corresponds to a smaller number of nearest neighbours (paying more attention to local structure), whereas a larger value corresponds to a larger number of nearest neighbours (paying more attention to the global structure). The default for R's Rtsne function is 30.
Try running t-SNE again and visualizing it twice more, once with perplexity = 3 and once with perplexity = 150. Set the seed to be 2019.
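Here is a sketch of the same kind of comparison on the built-in iris data, assuming the Rtsne package is installed. The helper `run_tsne` is a hypothetical function written for this example. We use 3 and 40 rather than 150 because iris has only about 150 rows, and Rtsne stops with an error when the perplexity is too large for the number of rows.

```r
library(Rtsne)

# De-duplicate the measurements so the duplicate check can be switched off.
iris_unique <- iris[!duplicated(iris[, 1:4]), ]
iris_matrix <- as.matrix(iris_unique[, 1:4])

# Hypothetical helper: run t-SNE at a given perplexity with a fixed seed
# and return only the 2-column matrix of reduced coordinates.
run_tsne <- function(perplexity) {
  set.seed(2019)
  Rtsne(X = iris_matrix, dims = 2, perplexity = perplexity,
        check_duplicates = FALSE)$Y
}

embedding_low <- run_tsne(3)
embedding_high <- run_tsne(40)

# Draw the two results side by side, coloured by species.
old_par <- par(mfrow = c(1, 2))
plot(embedding_low, col = as.integer(iris_unique$Species), pch = 19,
     main = "perplexity = 3", xlab = "Dimension 1", ylab = "Dimension 2")
plot(embedding_high, col = as.integer(iris_unique$Species), pch = 19,
     main = "perplexity = 40", xlab = "Dimension 1", ylab = "Dimension 2")
par(old_par)
```

Resetting the seed inside the helper means the only thing that differs between the two runs is the perplexity.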
Question 1.4.2
How do we choose perplexity (or other such values) for t-SNE? Typically we try a variety of values, look at the resulting visualizations, and choose the one that gives the best clustering visually.
For the 3 perplexities we tried, which one do you think is the best? Explain why.
YOUR ANSWER HERE
Question 1.5 (Optional - not graded)
Now that you can visualize the cluster assignments, does this change your interpretation of the kmeans results you obtained in the tutorial? Does it weaken or strengthen your opinion that this data can be separated into two clusters?
YOUR ANSWER HERE
Question 1.6 (Optional - not graded)
Use t-SNE to visualize the Pokemon data set in two dimensions, overlaying the k-means cluster assignments on the visualization.
Getting R + Jupyter working for you outside of our course
At some point after the exam, you will lose access to the JupyterHub server where you have been doing your course work. If you want to continue to use R + Jupyter (for other courses at UBC, or for your work after UBC) you have two options:
a server solution
a local installation solution
We will point you to how you can do both, and provide guidance on taking a copy of your homework from our Canvas JupyterHub server.
1. a server solution
As a student at UBC, you have access to another JupyterHub that you can access using your UBC CWL: https://ubc.syzygy.ca/
If you have a Google account, you have access to another JupyterHub that does not depend on you being a UBC student (i.e., having a valid CWL): https://cybera.syzygy.ca/
2. a local installation solution
Depending on your device, you may be able to install Jupyter and R on it. Below we provide instructions for the 3 major operating systems (Mac, Windows, and Linux).
A. Install Jupyter + R on Mac
A.1. Install R
Go to https://cran.r-project.org/bin/macosx/ and download the latest version of R for Mac (Should look something like this: R-3.5.1.pkg). Open the file and follow the installer instructions.
A.2. Install Jupyter
Head to https://www.anaconda.com/download/#macos and download the Anaconda version for Mac OS. Follow the instructions on that page to run the installer. Anaconda is a software bundle that installs Jupyter, Python, and a few other tools. We recommend using it because it makes installation much simpler than other methods.
A.3. Install the IR kernel
The IR kernel lets you connect R with Jupyter. To install it, follow the instructions below:
Open terminal (how to video), and type R to open R.
Type the following commands into R:
install.packages('IRkernel', repos = 'http://cran.us.r-project.org')
IRkernel::installspec()
IRkernel::installspec(user = FALSE)
Next, exit R by typing:
q()
(type n when prompted, so that the workspace is not saved)
A.4 Test if it works:
Open terminal and type:
jupyter notebook
The above command should open a web browser, with Jupyter's home inside it. Try creating a new R notebook and running some simple R code (e.g., print("hello!")) to test that it works.
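For example, a first test cell in the new R notebook might look like the sketch below. The variable name `irkernel_installed` is our own; the last check only reports whether the IRkernel package can be found in your R library, not whether Jupyter is configured correctly.

```r
# Confirms that the cell is really being run by R.
print("hello!")

# Shows which version of R the notebook is using.
R.version.string

# TRUE if the IRkernel package can be found in this R library.
irkernel_installed <- requireNamespace("IRkernel", quietly = TRUE)
irkernel_installed
```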
B. Install Jupyter + R on Windows
B.1. Install R
Go to https://cran.r-project.org/bin/windows/base/ and download the latest version of R for Windows (Should look something like this: Download R 3.5.1 for Windows). Open the file and follow the installer instructions.
B.2. Install Jupyter
Head to https://www.anaconda.com/download/#windows and download the Anaconda version for Windows with Python 3.6. After the download has finished, run the installer selecting the following options:
On the Advanced Installation Options page, check both boxes (see image below)
For all other pages, use the default options
B.3. Install the IR kernel
The IR kernel lets you connect R with Jupyter. To install it, follow the instructions below:
Open R (double click on Desktop Icon or open it using the START menu)
Type the following commands into R:
install.packages('IRkernel', repos = 'http://cran.us.r-project.org')
IRkernel::installspec()
IRkernel::installspec(user = FALSE)
Next, exit/close R
B.4 Test if it works:
Open Windows Command Prompt (select the Start button, type cmd, and click Command Prompt from the list) and type:
jupyter notebook
The above command should open a web browser, with Jupyter's home inside it. Try creating a new R notebook and running some simple R code (e.g., print("hello!")) to test that it works.
C. Install Jupyter + R on Linux
C.1. Install R
Open /etc/apt/sources.list and add the following line to the end of the file (choose the correct one for your version of Ubuntu):
for Ubuntu 18.04.1 (Bionic Beaver) add:
deb https://cloud.r-project.org/bin/linux/ubuntu bionic-cran35/
for Ubuntu 17.10.1 (Artful Aardvark) add:
deb https://cloud.r-project.org/bin/linux/ubuntu artful/
for Ubuntu 16.04.4 (Xenial Xerus) add:
deb https://cloud.r-project.org/bin/linux/ubuntu xenial/
for Ubuntu 14.04.5 (Trusty Tahr) add:
deb https://cloud.r-project.org/bin/linux/ubuntu trusty/
Next, add the key ID for the CRAN network:
Then, update the repository:
Then, install the R binaries:
Finally, install the r-base-dev package:
After installation, in terminal type the following to ask for the version:
You should see something like this if you were successful:
C.2. Install Jupyter
Head to https://www.anaconda.com/download/#linux and download the Anaconda version for Linux with Python 3.6. Follow the instructions on that page to run the installer. Anaconda is a software bundle that installs Jupyter, Python, and a few other tools. We recommend using it because it makes installation much simpler than other methods.
C.3. Install the IR kernel
The IR kernel lets you connect R with Jupyter. To install it, follow the instructions below:
Open terminal (how to video), and type R to open R.
Type the following commands into R:
install.packages('IRkernel', repos = 'http://cran.us.r-project.org')
IRkernel::installspec()
IRkernel::installspec(user = FALSE)
Next, exit R by typing:
q()
(type n when prompted, so that the workspace is not saved)
C.4 Test if it works:
Open terminal and type:
jupyter notebook
The above command should open a web browser, with Jupyter's home inside it. Try creating a new R notebook and running some simple R code (e.g., print("hello!")) to test that it works.
Getting your files off of the Canvas JupyterHub
On the Canvas JupyterHub open a terminal and type:
tar -zcvf dsci-100.tar.gz dsci-100
Go to the JupyterHub Control Panel/Home and click the box beside the file dsci-100.tar.gz, and then click "Download".
Log onto either the https://ubc.syzygy.ca/ (CWL authentication) or https://cybera.syzygy.ca/ (Google authentication) servers.
Upload dsci-100.tar.gz there, then open a terminal and type:
tar -zxvf dsci-100.tar.gz
You should have your files now available on either of those servers above.
On Mac or Linux, tar also works in the terminal, so you can use this strategy to have the files available locally. However, you will need Jupyter installed to view them.
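If you would rather stay inside R (for example on a machine without a tar command), base R's tar() and untar() functions can do a similar job. The sketch below is only an illustration under our own assumptions: it builds a throwaway dsci-100 folder in a temporary directory, archives it, and extracts it again into a folder we call restored. The placeholder file notes.txt is made up for the demo.

```r
# Work in a temporary directory so nothing in your real files is touched.
work_dir <- tempfile("tar-demo-")
dir.create(work_dir)
old_wd <- setwd(work_dir)

# A stand-in for the course folder, containing one placeholder file.
dir.create("dsci-100")
writeLines("worksheet placeholder", file.path("dsci-100", "notes.txt"))

# Create a gzip-compressed archive of the folder.
tar("dsci-100.tar.gz", files = "dsci-100", compression = "gzip")

# Extract the archive into a separate folder and list what came out.
dir.create("restored")
untar("dsci-100.tar.gz", exdir = "restored")
restored_files <- list.files("restored", recursive = TRUE)
restored_files

# Return to the directory we started in.
setwd(old_wd)
```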