GitHub Repository: UBC-DSCI/dsci-100-assets
Path: blob/master/2019-spring/slides/12_high_dim_viz_and_wrap_up.ipynb
Kernel: R

DSCI 100 - Introduction to Data Science

Lecture 12 - Visualizing high dimensional data & Data Science wrap-up

2019-04-04

Output of K means multivariate clustering from tutorial

K-means clustering with 3 clusters of sizes 123, 389, 288

Cluster means:
     Total       HP    Attack   Defense   Sp. Atk   Sp. Def    Speed
1 622.5691 88.91057 117.72358 100.65854 116.33333 101.86179 97.08130
2 472.9666 77.19280  85.30077  80.95373  77.54499  79.02057 72.95373
3 303.8958 50.14931  53.95486  52.78472  47.85417  49.49306 49.65972

Clustering vector:
  [1] 3 2 2 1 3 2 2 1 1 3 2 2 1 3 3 2 3 3 2 2 3 3 2 1 3 2 3 2 3 2 3 2 3 2 3 3 2
 [38] 3 3 2 3 2 3 2 3 2 3 2 3 2 2 3 2 3 2 3 2 3 2 3 2 3 2 3 1 3 3 2 3 2 2 1 3 2
 [75] 2 3 2 2 3 2 3 2 2 2 2 3 2 1 3 2 3 3 2 3 2 3 2 3 2 3 2 2 1 3 3 2 3 2 3 2 3
[112] 2 3 2 2 2 3 3 2 3 2 2 2 2 1 3 2 3 2 3 2 2 2 2 2 2 2 1 2 3 2 1 2 3 3 2 2 2
[149] 2 3 2 3 2 2 1 2 1 1 1 3 2 1 1 1 1 1 3 2 2 3 2 2 3 2 2 3 2 3 2 3 2 3 2 2 3
[186] 2 3 3 3 3 2 3 2 3 3 2 1 2 3 2 2 2 3 3 2 3 3 2 2 3 2 2 2 2 2 2 3 2 2 3 2 2
[223] 2 2 1 3 2 2 2 1 2 2 1 2 3 2 3 2 3 2 3 3 2 3 2 2 3 2 1 2 3 2 2 2 3 3 2 3 3
[260] 3 2 2 1 1 1 3 2 1 1 1 1 1 3 2 2 1 3 2 2 1 3 2 2 1 3 2 3 2 3 3 2 3 3 3 3 2
[297] 3 3 2 3 2 3 2 3 3 2 1 3 2 3 2 3 2 1 3 2 3 3 3 2 3 2 3 3 3 3 3 2 3 2 3 2 2
[334] 1 3 2 2 3 2 1 2 2 2 2 2 3 2 3 2 1 2 2 3 2 1 2 3 2 3 3 3 2 3 2 3 2 1 2 2 2
[371] 2 3 2 3 2 3 2 3 2 3 2 3 2 2 2 3 2 1 3 2 2 2 2 1 3 3 2 1 3 2 2 3 2 2 2 3 3
[408] 2 1 1 3 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 3 2 2 3 2 2 3 2 2 3 3 2
[445] 3 2 3 3 3 3 2 3 2 3 2 3 2 3 2 2 2 2 3 2 2 3 2 3 2 3 2 2 3 2 3 2 1 2 2 3 2
[482] 3 3 2 3 2 3 3 3 2 2 3 2 1 1 2 3 2 1 3 2 3 2 3 2 2 3 2 3 3 2 1 2 2 2 2 2 2
[519] 2 2 2 2 2 2 2 2 2 1 2 2 2 2 2 2 2 2 2 1 1 1 1 1 1 1 1 1 1 2 1 1 1 1 1 1 3
[556] 2 2 3 2 2 3 2 2 3 2 3 3 2 3 2 3 2 3 2 3 2 3 2 3 3 2 3 2 3 2 2 3 2 3 2 2 2
[593] 3 2 2 3 3 2 2 2 3 3 2 3 3 2 3 2 3 2 2 3 3 2 3 2 2 2 3 2 3 2 2 3 2 3 2 2 1
[630] 3 2 3 2 3 2 3 2 2 3 3 2 3 2 3 2 2 3 2 2 3 2 3 2 3 2 2 3 2 3 2 3 2 2 3 2 2
[667] 3 2 3 3 2 3 2 2 3 2 2 3 2 2 3 2 2 3 2 3 2 2 3 2 3 2 2 2 3 2 1 3 1 1 1 1 1
[704] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 3 2 2 3 2 2 3 2 2 3 2 3 3 2 3 3 2 3 2 3 3 1
[741] 3 2 3 2 2 3 2 2 3 2 2 2 3 2 3 2 3 2 3 2 3 2 3 2 3 2 3 2 3 2 2 2 2 2 3 2 1
[778] 2 3 2 3 3 3 3 2 2 2 2 3 2 3 2 1 1 1 1 1 1 1 1

Within cluster sum of squares by cluster:
[1]  908309.6 2188797.5 1152053.0
 (between_SS / total_SS =  73.1 %)

Available components:
[1] "cluster"      "centers"      "totss"        "withinss"     "tot.withinss"
[6] "betweenss"    "size"         "iter"         "ifault"

Output of K means multivariate clustering from tutorial

  • we can look at the total within-cluster sum of squares (but this is really only useful for comparing models fit to the same data)

  • we can look at the ratio of between-cluster sum of squares / total sum of squares, reported as a percentage

    • if very small, then there are no discernible clusters

    • if 100%, then every point sits exactly on its cluster centre (e.g., each point is its own cluster)
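As a minimal sketch of where these two numbers come from, the code below reads them off a `kmeans` object. It assumes the built-in `iris` measurements as stand-in data (not the tutorial's data set), and the choices of seed, `nstart`, and the range of `k` are illustrative only.

```r
# Assumption: the built-in iris data stands in for the tutorial's data set.
set.seed(2019)
features <- scale(iris[, 1:4])

fit <- kmeans(features, centers = 3, nstart = 10)

# The percentage that kmeans prints as "between_SS / total_SS"
ratio_pct <- 100 * fit$betweenss / fit$totss
ratio_pct

# Total within-cluster sum of squares for several values of k;
# the numbers are only meaningful relative to each other.
ks <- 2:6
total_withinss <- sapply(ks, function(k) {
  kmeans(features, centers = k, nstart = 10)$tot.withinss
})
data.frame(k = ks, total_withinss = total_withinss)
```

The total sum of squares splits into the within and between parts (`totss = tot.withinss + betweenss`), which is why the ratio always lies between 0% and 100%.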

Neither of these is very intuitive (at least to me)

What is intuitive?

Visualization! A picture says 1000 words!

t-SNE

  • a popular dimensionality reduction algorithm useful for visualizing multi-dimensional data sets

  • t-SNE gives no "model" that can be applied to new observations (it only works to visualize the data you currently have)

  • see links in worksheet for more details about the specifics of the algorithm if you are interested
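A rough sketch of what running t-SNE in R can look like is given below. It assumes the `Rtsne` package (the slides do not say which implementation the worksheet uses) and the built-in `iris` data as a stand-in; the perplexity value and the base-graphics plot are illustrative choices.

```r
# Assumption: the Rtsne package is installed; the slides do not name a package.
library(Rtsne)

set.seed(2019)  # t-SNE is stochastic, so fix the seed for reproducibility

# Rtsne stops on duplicate rows by default, so drop them first
keep <- !duplicated(iris[, 1:4])
features <- as.matrix(iris[keep, 1:4])
species <- iris$Species[keep]

tsne_fit <- Rtsne(features, dims = 2, perplexity = 30)

# One row of 2-D coordinates per observation
embedding <- tsne_fit$Y

plot(embedding, col = species, pch = 19,
     xlab = "t-SNE 1", ylab = "t-SNE 2")
```

The result holds coordinates only for the rows that were passed in, which matches the point above: there is nothing to call on new data, so adding observations means re-running the algorithm.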

t-SNE visualization of gene expression data from cells in a region of the brain

  • each data point in this picture corresponds to a single brain cell for which we have the expression level measurements for thousands of genes.

source: Cembrowski, M.S., Wang, L., Lemire, A., DiLisio, S.F., Copeland, M., Clements, J., Spruston, N. The subiculum is a patchwork of discrete subregions. eLife 7, doi:10.7554/eLife.37701, 2018.

t-SNE visualization of hand-written digits data set overlaid with class identification

  • each data point is an image of a handwritten digit for which we have 784 pixel values
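To make "784 pixel values" concrete, here is a small sketch of how one image relates to one row of the data. It assumes 28 x 28 greyscale images stored row by row (784 = 28 × 28; the slide does not state the image dimensions) and uses random numbers in place of real pixel values.

```r
# Assumptions: 28 x 28 images stored row by row; random values replace real pixels.
set.seed(1)
pixels <- runif(784)  # one image, flattened to a vector of 784 values

# Reshape the flat vector back into an image-shaped matrix
img <- matrix(pixels, nrow = 28, ncol = 28, byrow = TRUE)
dim(img)

# A data set of n images is then an n x 784 matrix: one row per image,
# one column per pixel, so each image is a point in 784 dimensions.
images <- matrix(runif(5 * 784), nrow = 5, ncol = 784)
dim(images)
```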

COURSE EVALUATIONS!

Data Science wrap-up

In January, we started with this GIF

And we laid out these goals and this path:

High-level goals of this course:

  1. Learn how to use reproducible tools (Jupyter + R) to do data analysis

  2. Learn how to solve 3 common problems in Data Science

Problems we will focus on:

  1. Predict a class/category for a new observation/measurement (e.g., cancerous or benign tumour)

  2. Predict a value for a new observation/measurement (e.g., 10 km race time for a 35 year old with a BMI of 25).

  3. Find previously unknown/unlabelled subgroups in your data (e.g., products commonly bought together on Amazon)
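The three problems above can be sketched in a few lines each. This is only an illustration under stated assumptions: it uses the built-in `iris` and `mtcars` data rather than the course data sets, `knn()` from the `class` package for classification, `lm()` for regression, and `kmeans()` for clustering; the course materials may use different functions.

```r
# Assumptions: built-in data sets and the class package stand in for course tools.
library(class)  # provides knn()

set.seed(2019)

# 1. Classification: predict a category for new observations
train_idx <- sample(nrow(iris), 100)
predicted_class <- knn(train = iris[train_idx, 1:4],
                       test = iris[-train_idx, 1:4],
                       cl = iris$Species[train_idx],
                       k = 5)
accuracy <- mean(predicted_class == iris$Species[-train_idx])

# 2. Regression: predict a numeric value for a new observation
mpg_fit <- lm(mpg ~ wt, data = mtcars)
predicted_mpg <- predict(mpg_fit, newdata = data.frame(wt = 3))

# 3. Clustering: find subgroups without using any labels
clusters <- kmeans(scale(iris[, 1:4]), centers = 3, nstart = 10)$cluster
table(clusters)
```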

Another way to think of what we did in this course:

source: R for Data Science by Grolemund & Wickham

Where to from here

  • you learned a lot in this course!

  • many of you are asking for more Data Science (yeah!)

Thank you, and it's been a blast!