GitHub Repository: UBC-DSCI/dsci-100-assets
Path: blob/master/2019-spring/materials/tutorial_01/tutorial_01.ipynb

Kernel: R

Tutorial 1: Introduction to Data Science

Any place you see ..., you must fill in the function, variable, or data to complete the code. Replace fail() with your completed code and run the cell!

In [ ]:

### Run this cell before continuing. 
library(testthat)
library(digest)
library(repr)

Revision Question

Match the following definitions with the corresponding functions used in R:

Definitions

A. Reads the most common types of flat file data, comma separated values and tab separated values, respectively.

B. Keeps only the variables you mention.

C. Applies linear filtering to a univariate time series or to each series separately of a multivariate time series.

D. Executes the transformations iteratively so that later transformations can use the columns created by earlier transformations.

E. Declares the input data frame for a graphic and to specify the set of plot aesthetics intended to be common throughout all subsequent layers unless specifically overridden.

F. Returns the first six rows or values of a vector, matrix, table, data frame or function.

Functions

1. ggplot
2. select
3. head
4. read_csv
5. mutate
6. filter

For every definition, create an object named with the definition's letter and assign it the number of the corresponding function in the list of functions. For example: B <- 1

In [ ]:

# Assign your answer to a letter: A, B, C, D, E, F
# Make sure the correct answer is a number from 1-6 
# Replace the fail() with your answer. 

# your code here
fail() # No Answer - remove if you provide an answer

In [ ]:

test_that('Solution is incorrect', {
    expect_equal(digest(A), 'dbc09cba9fe2583fb01d63c70e1555a8') # we hid the answer to the test here so you can't see it, but we can still run the test
    expect_equal(digest(B), 'db8e490a925a60e62212cefc7674ca02') # we hid the answer to the test here so you can't see it, but we can still run the test
    expect_equal(digest(C), '0aee9b78301d7ec8998971363be87c03') # we hid the answer to the test here so you can't see it, but we can still run the test
    expect_equal(digest(D), '5e338704a8e069ebd8b38ca71991cf94') # we hid the answer to the test here so you can't see it, but we can still run the test
    expect_equal(digest(E), '6717f2823d3202449301145073ab8719') # we hid the answer to the test here so you can't see it, but we can still run the test
    expect_equal(digest(F), 'e5b57f323c7b3719bbaaf9f96b260d39') # we hid the answer to the test here so you can't see it, but we can still run the test
    
})
print("Success!")
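The tests in this tutorial compare a hash of your answer against a stored hash, which is how the answer stays hidden. A minimal sketch of the idea follows; the answer "Z" and all object names are hypothetical and only serve the illustration.

```r
library(digest)

# Hypothetical quiz answer, used only for this illustration.
correct_answer <- "Z"

# Only the hash of the correct answer needs to be stored in the test.
stored_hash <- digest(correct_answer)

# A guess passes only if its hash matches the stored hash exactly.
right_guess <- identical(digest("Z"), stored_hash)
wrong_guess <- identical(digest("z"), stored_hash)  # letter case matters

# The type matters too: the double 2 and the integer 2L are serialized
# differently, so their hashes differ.
same_type_hash <- identical(digest(2), digest(2L))

print(c(right_guess, wrong_guess, same_type_hash))
```

This is why the instructions ask for a specific form of answer (an uppercase letter in quotation marks, or a plain number): an answer that looks equivalent but has a different type or case will not match the stored hash.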

1. Vickers and Vertosick Exercise

We hope you haven't forgotten about them just yet! As you might recall from lecture, Vickers and Vertosick were the researchers who wanted to study different factors affecting the race performance of recreational runners. They assembled a data set that includes the age, sex, and BMI of runners, comparing it with their timed performance (how long it took them to complete either 5 or 10 km runs).

We will continue our analysis of their data and practice what you learnt during lecture. The goal for today is to produce a plot of BMI against the time (in hours) it took participants over the age of 30 to run 10 km. To do this, we will need to do the following:

use filter to subset the rows where age is greater than 30
use select to subset the bmi and km10_time_seconds columns
use mutate to convert 10 km race time from seconds (km10_time_seconds) to hours
use ggplot to create our plot of BMI and race time in hours
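As a rough sketch of how these four steps chain together, here is the same pattern applied to a small made-up data set. The `plants` data frame, its columns, and the threshold are all hypothetical and unrelated to the marathon data.

```r
library(dplyr)
library(ggplot2)

# Hypothetical toy data, unrelated to the marathon data set.
plants <- tibble(
  species   = c("fern", "oak", "pine", "moss", "birch"),
  age_years = c(2, 40, 25, 1, 12),
  height_cm = c(30, 1500, 1200, 2, 600)
)

older_plants <- plants %>%
  filter(age_years > 10) %>%           # keep rows above a threshold
  select(age_years, height_cm) %>%     # keep only the columns to plot
  mutate(height_m = height_cm / 100)   # convert units in a new column

# Scatter plot with human-readable axis labels.
plant_plot <- ggplot(older_plants, aes(x = age_years, y = height_m)) +
  geom_point() +
  xlab("Age (years)") +
  ylab("Height (m)")

older_plants
```

Each step receives the data frame produced by the step before it, so the order of the steps matters.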

Hints for success: Try going through all the steps on your own, but don't forget to talk to others (classmates, TAs, Instructor) if you need help getting unstuck. Work with different functions and if something doesn't work out, read the error message or use the help() function. Since there are a lot of steps to working and modifying data, feel free to look back at worksheet_01.

Question 1.1 Multiple Choice:

After reading the text above (and remembering that filter lets us choose rows that have values at, above, or below a threshold), what column do you think we will be using for our threshold when we filter?

A. bmi

B. sex

C. age

D. km10_time_seconds

Assign your answer to an object called answer1. Make sure to write the uppercase letter for the answer you have chosen.

In [ ]:

# Assign your answer to an object called: answer1
# Make sure the correct answer is an uppercase letter. 
# Surround your answer with quotation marks.
# Replace the fail() with your answer. 

# your code here
fail() # No Answer - remove if you provide an answer

In [ ]:

test_that('Solution is incorrect', {
    expect_equal(digest(answer1), '475bf9280aab63a82af60791302736f6') # we hid the answer to the test here so you can't see it, but we can still run the test
    
})
print("Success!")

Question 1.2 True or False:

We will be selecting the columns age and km10_time_seconds to plot.

Assign your answer to an object called answer2. Make sure to write in all lower-case.

In [ ]:

# Assign your answer to an object called: answer2
# Make sure the correct answer is written in lower-case (true / false)
# Surround your answer with quotation marks.
# Replace the fail() with your answer. 

# your code here
fail() # No Answer - remove if you provide an answer

In [ ]:

test_that('Solution is incorrect', {
    expect_equal(digest(answer2), 'd2a90307aac5ae8d0ef58e2fe730d38b') # we hid the answer to the test here so you can't see it, but we can still run the test
    
})
print("Success!")

Question 1.3 Multiple Choice:

Select the answer with the correct order of functions that we will use to wrangle our data into a useable form for the plot we want to create.

A. filter, select, mutate

B. mutate, ggplot, select

C. mutate, read_csv, select

D. filter, aes, ggplot

Assign your answer to an object called answer3. Make sure to write the uppercase letter for the answer you have chosen.

In [ ]:

# Assign your answer to an object called: answer3
# Make sure the correct answer is an uppercase letter. 
# Surround your answer with quotation marks.
# Replace the fail() with your answer.

# your code here
fail() # No Answer - remove if you provide an answer

In [ ]:

test_that('Solution is incorrect', {
    expect_equal(digest(answer3), '75f1160e72554f4270c809f041c7a776') # we hid the answer to the test here so you can't see it, but we can still run the test
    
})
print("Success!")

Question 1.4

To work on the cells below, load the package "tidyverse".

In [ ]:

# Replace the fail() with your line of code. 
# If you have difficulty with loading this package: 
# Go back to worksheet_01 and read over Section 5 (Packages)

# your code here
fail() # No Answer - remove if you provide an answer

In [ ]:

test_that('Solution is incorrect', {
    expect_true("package:tidyverse" %in% search())
})
print("Success!")

Question 1.5

With the proper package running, you can now load the data - replace fail() with the correct function. Assign your data to marathon_small.

In [ ]:

# marathon_small <- ...("marathon_small.csv")
# Take the line above and fill in the ...
# Once finished, copy and replace the fail(). 
# As shown in the first line, remember to name your answer: marathon_small

# your code here
fail() # No Answer - remove if you provide an answer
head(marathon_small)

In [ ]:

test_that('Solution is incorrect', {
    expect_equal(digest(sum(marathon_small$age)), 'eb82e1da6f1c13c4b76267e194a01953') # we hid the answer to the test here so you can't see it, but we can still run the test
    
})
print("Success!")

Question 1.6

Filter the data so that it only includes participants over the age of 30.

Next, select the columns we wish to plot, so that your data frame has only the columns needed for the plot.

Hint: bmi is already given to you. What else do we want to plot?

In [ ]:

# marathon_age <- ... %>% 
# filter(... > 30) %>%
# ...(bmi, ...)
# marathon_age

# Take the code above and fill in the ...
# Once finished, copy and replace the fail(). 
# As shown in the first line, remember to name your answer: marathon_age

# your code here
fail() # No Answer - remove if you provide an answer
head(marathon_age)

In [ ]:

test_that('Solution is incorrect', {
    expect_equal(digest(sum(marathon_age$bmi)), 'e032afe3bf85d19224006a8a74acea85')
    expect_equal(digest(as.integer(sum(marathon_age$km10_time_seconds, na.rm = TRUE))), 'c231508afddcd2e1b870f15641c9bac3')
    expect_equal(digest(nrow(marathon_age)), 'fda388229c2b0d97156970a7fda5a528')
    expect_equal(digest(ncol(marathon_age)), 'c01f179e4b57ab8bd9de309e6d576c48') # we hid the answer to the test here so you can't see it, but we can still run the test
    
})
print("Success!")

Question 1.7

Mutate the data frame to create a new column called: km10_time_hours.

Note: we will be selecting once again which specific columns we want to include in our data frame.

In [ ]:

# marathon_mutate <- ... %>%
# mutate(km10_time_hours = .../...) %>%
# select(bmi, km10_time_hours)
# marathon_mutate

# Take the code above and fill in the ...
# Once finished, copy and replace the fail(). 
# As shown in the first line, remember to name your answer: marathon_mutate

# your code here
fail() # No Answer - remove if you provide an answer
head(marathon_mutate)

In [ ]:

test_that('Solution is incorrect', {
    expect_equal(digest(sum(marathon_mutate$bmi)), 'e032afe3bf85d19224006a8a74acea85')
    expect_equal(digest(sum(marathon_mutate$km10_time_hours, na.rm = TRUE)), '4dd096aa726576848a16663e5ad0b682')
    expect_equal(digest(nrow(marathon_mutate)), 'fda388229c2b0d97156970a7fda5a528')
    expect_equal(digest(ncol(marathon_mutate)), 'c01f179e4b57ab8bd9de309e6d576c48') # we hid the answer to the test here so you can't see it, but we can still run the test
    
})
print("Success!")

Question 1.8

Lastly, generate a scatter plot. Be smart in choosing your axes. If you have trouble remembering the code to create a graph, go to worksheet_01 and read over Graphing. Assign your plot to an object called marathon_plot. Label your axes in a human-readable way (do not leave them as the default column names).

In [ ]:

# code to set-up plot size
library(repr)
options(repr.plot.width=4, repr.plot.height=3)

In [ ]:

# marathon_plot <- marathon_mutate %>%
#   ...(aes(x = ..., y = ...)) + 
#   ..._point() + 
#   xlab(...) + 
#   ...(...)

# Take the code above and fill in the ...
# Once finished, copy and replace the fail(). 
# As shown in the first line, remember to name your plot: marathon_plot


# your code here
fail() # No Answer - remove if you provide an answer

In [ ]:

# Run this cell to see what your scatterplot looks like!
# Or delete this cell and add marathon_plot under your code in the cell above. 

marathon_plot

In [ ]:

test_that('Solution is incorrect', {
    expect_true("bmi" %in% c(rlang::get_expr(marathon_plot$mapping$x), rlang::get_expr(marathon_plot$layers[[1]]$mapping$x)))
    expect_true("km10_time_hours" %in% c(rlang::get_expr(marathon_plot$mapping$y), rlang::get_expr(marathon_plot$layers[[1]]$mapping$y)))
    expect_true("GeomPoint" %in% class(marathon_plot$layers[[1]]$geom))
    })
print("Success!")
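The plot test inspects the plot object itself rather than the picture it draws. A rough sketch of what it looks at, assuming a hypothetical toy data frame with columns `a` and `b` that are not part of this tutorial:

```r
library(ggplot2)

# Hypothetical toy data and column names, for illustration only.
toy <- data.frame(a = c(1, 2, 3), b = c(2, 4, 6))
toy_plot <- ggplot(toy, aes(x = a, y = b)) + geom_point()

# The plot's mapping stores the column expressions that were given to aes().
x_column <- rlang::as_name(rlang::get_expr(toy_plot$mapping$x))
y_column <- rlang::as_name(rlang::get_expr(toy_plot$mapping$y))

# Each layer records its geom; geom_point() adds a GeomPoint layer.
is_scatter <- inherits(toy_plot$layers[[1]]$geom, "GeomPoint")

print(c(x_column, y_column))
print(is_scatter)
```

So the test passes when the expected columns are mapped to the expected axes and the first layer is a point layer, whatever axis labels or colours you choose.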

Question 1.9

Do you see any pattern in the relationship between BMI and 10 km race time?

YOUR ANSWER HERE

Question 1.10

Now explore the relationship between the age of all runners and the time taken to complete the 10 km run (in hours again). Do this by creating a scatter plot (similar to the one in Question 1.8).

There is a lot missing from the cell below (no hints were given). Try looking at earlier questions in this tutorial or worksheet_01 to get you started.

In [ ]:

# your code here
fail() # No Answer - remove if you provide an answer

Question 1.11

Do you see any pattern in the relationship you explored in Question 1.10? Explain in written English.

YOUR ANSWER HERE

2. Bike Sharing

Climate change, and solutions to mitigate it, are currently on the tongues and minds of many people. One healthy and environmentally friendly transportation alternative that has recently been gaining popularity is bike sharing. Apart from their extensive real-world applications in improving health and creating more climate-friendly transit, bike sharing systems generate data that makes them great for research. In contrast to bus and subway transit systems, bike share transit systems precisely document where a trip starts, where it ends, and how long it lasts, for each individual using the system. This level of individual traceability may allow for better detection of mobility patterns in cities, as well as the potential detection of important events.

Today, we will be analyzing data obtained from Capital Bikeshare (data source), a bike sharing system from Washington, DC. The temperature data (in units of degrees Celsius) has been normalized from the original range so that all values are between 0 and 1 (a common data processing technique helpful for some machine/statistical learning tools). Our goal is to figure out the relationship between temperature and the number of people renting bikes during the Spring (March 20th - June 21st).
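To make the idea of normalizing to the range 0 to 1 concrete, here is a minimal sketch using min-max scaling on made-up temperatures. This is an assumption about the general technique: the tutorial does not state the exact formula or bounds used for the bike share data, and the values and object names below are hypothetical.

```r
# Hypothetical temperatures in degrees Celsius, for illustration only.
temp_celsius <- c(-8, 0, 15.5, 39)

# Min-max scaling: subtract the minimum, then divide by the range.
# Assumption: the bike share temperatures were scaled in a similar way;
# the exact formula and bounds are not given in this tutorial.
normalize <- function(x) (x - min(x)) / (max(x) - min(x))
temp_normalized <- normalize(temp_celsius)

# The scaling can be reversed when the original bounds are known.
temp_range <- max(temp_celsius) - min(temp_celsius)
temp_restored <- temp_normalized * temp_range + min(temp_celsius)

print(temp_normalized)
```

After scaling, the smallest value becomes 0 and the largest becomes 1, while the ordering of the values and their relative spacing are preserved, so a scatter plot keeps the same shape.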

Question 2.1 Multiple Choice:

In comparison to bikes, why aren't other modes of transportation as useful when it comes to acquiring data?

A. Not as fast.

B. Documentation isn't as precise.

C. Not as environmentally friendly.

D. Bus drivers don't cooperate.

Assign your answer to an object called: answer2.1. Make sure the correct answer is an uppercase letter.

In [ ]:

# Assign your answer to an object called: answer2.1
# Make sure the correct answer is an uppercase letter. 
# Surround your answer with quotation marks.
# Replace the fail() with your answer.

# your code here
fail() # No Answer - remove if you provide an answer

In [ ]:

test_that('Solution is incorrect', {
    expect_equal(digest(answer2.1), '3a5505c06543876fe45598b5e5e5195d') # we hid the answer to the test here so you can't see it, but we can still run the test
    
})
print("Success!")

Question 2.2 Multiple Choice:

What are the units for the normalized temperature?

A. Kelvin

B. Fahrenheit

C. Celsius

Assign your answer to an object called: answer2.2. Make sure the correct answer is an uppercase letter.

In [ ]:

# Assign your answer to an object called: answer2.2
# Make sure the correct answer is an uppercase letter. 
# Surround your answer with quotation marks.
# Replace the fail() with your answer.

# your code here
fail() # No Answer - remove if you provide an answer

In [ ]:

test_that('Solution is incorrect', {
    expect_equal(digest(answer2.2), '475bf9280aab63a82af60791302736f6') # we hid the answer to the test here so you can't see it, but we can still run the test
    
})
print("Success!")

Question 2.3

Since we already have tidyverse loaded and ready to use, the first step is to read our new data. Add in the missing function and symbol to complete the cell below. Make sure to assign your answer to bike_data.

In [ ]:

# bike_data ... ...("bike_share.csv")

# Take the code above and fill in the ...
# Once finished, copy and replace fail(). 
# As shown in the first line, remember to name your answer: bike_data

# your code here
fail() # No Answer - remove if you provide an answer
head(bike_data)

In [ ]:

test_that('Solution is incorrect', {
    expect_equal(digest(ncol(bike_data)), '234a2a5581872457b9fe1187d1616b13')
    expect_equal(digest(nrow(bike_data)), '53d432e634be55f2bdb507cec34513e4')
    expect_equal(digest(sum(bike_data$temperature)), '8ff06198150ebcf625936e2f6ce48e1e')
    
})
print("Success!")

Question 2.4

Mutate the data so that you have a new column called total_users. This column should be the sum of casual_users and registered_users.

In [ ]:

# bike_mutate <- bike_data %>%
#    ...
#    bike_mutate

# Take the code above and fill in the ...
# Once finished, copy and replace fail().
# As shown in the first line, remember to name your answer: bike_mutate

# your code here
fail() # No Answer - remove if you provide an answer
head(bike_mutate)

In [ ]:

test_that('Solution is incorrect', {
    expect_equal(digest(as.integer(sum(bike_mutate$total_users))), 'ca696c077151dc0a05b3e3862ab38f52')
    expect_equal(digest(nrow(bike_mutate)), '53d432e634be55f2bdb507cec34513e4')
    expect_equal(digest(ncol(bike_mutate)), 'dd4ad37ee474732a009111e3456e7ed7') # we hid the answer to the test here so you can't see it, but we can still run the test
    
})
print("Success!")

Question 2.5

Filter the data so that it only includes rentals made during Spring. Name your answer bike_filter.

In [ ]:

# bike_filter <- ... %>%
#   ...(... == "Spring")
#   bike_filter

# Take the code above and fill in the ...
# Once finished, copy and replace the fail(). 
# As shown in the first line, remember to name your answer: bike_filter

# your code here
fail() # No Answer - remove if you provide an answer
head(bike_filter)

In [ ]:

test_that('Solution is incorrect', {
    expect_equal(digest(as.integer(sum(bike_filter$total_users))), '051a1e8b9293438bbc0cb8ed6fa4e959')
    expect_equal(digest(nrow(bike_filter)), 'f50e683f6447aa4f0cbaaf6862f27934')
    expect_equal(digest(ncol(bike_filter)), 'dd4ad37ee474732a009111e3456e7ed7') # we hid the answer to the test here so you can't see it, but we can still run the test
    
})
print("Success!")

Question 2.6

Select data from the columns that we wish to plot.

Hint: if you have forgotten, scroll up and re-read the introduction to this exercise. Name your answer bike_select.

In [ ]:

# bike_select <- bike_filter %>%
#    ... 
#    bike_select

# Take the code above and fill in the ...
# Once finished, copy and replace the fail(). 
# As shown in the first line, remember to name your answer: bike_select

# your code here
fail() # No Answer - remove if you provide an answer
head(bike_select)

In [ ]:

test_that('Solution is incorrect', {
    expect_equal(digest(sum(bike_select$temperature)), '8aa81dfff0feb96a14f894cd25b8de79')
    expect_equal(digest(as.integer(sum(bike_select$total_users))), '051a1e8b9293438bbc0cb8ed6fa4e959')
    expect_equal(digest(nrow(bike_select)), 'f50e683f6447aa4f0cbaaf6862f27934')
    expect_equal(digest(ncol(bike_select)), 'c01f179e4b57ab8bd9de309e6d576c48') # we hid the answer to the test here so you can't see it, but we can still run the test
    
})
print("Success!")

Question 2.7

Plot the data as a scatter plot.

There is a lot missing from the cell below (no hints were given). Try completing this on your own before looking at Exercise 1 of this tutorial or worksheet_01. Assign your plot to an object called bike_plot_spring.

Hint: what do you think should be the x-axis / y-axis? Don't forget to label your axes!

In [ ]:

# Replace the fail() with your line of code (answer). 
# As shown in the first line, remember to name your plot: bike_plot_spring
# Make sure to use xlab() and ylab() to label your axes. 

# your code here
fail() # No Answer - remove if you provide an answer
bike_plot_spring

In [ ]:

test_that('Solution is incorrect', {
    expect_true("temperature" %in% c(rlang::get_expr(bike_plot_spring$mapping$x), rlang::get_expr(bike_plot_spring$layers[[1]]$mapping$x)))
    expect_true("total_users" %in% c(rlang::get_expr(bike_plot_spring$mapping$y), rlang::get_expr(bike_plot_spring$layers[[1]]$mapping$y)))
    expect_true("GeomPoint" %in% class(bike_plot_spring$layers[[1]]$geom))
    })
print("Success!")

Question 2.8

In one sentence, describe the trend of your scatterplot of the data plotted above for the spring season.

YOUR ANSWER HERE

3. Bike Sharing Continued...

We are going to continue working with this informative data set, but with a modification from Exercise 2. This part of the tutorial tests your understanding of how functions work and your ability to fill in code correctly to get the right output. No hints will be provided, so you won't be seeing any more ... placeholders. The number of questions with autograding and tests has also been intentionally decreased.

Unlike Exercise 2, we now want to figure out the relationship between temperature and the number of people renting bikes during Fall (September 22nd - December 21st).

Try completing this Exercise from start to finish without any outside help. If you are struggling with a particular question, look at Exercise 2 for assistance.

Question 3.1 Multiple Choice:

What column is going to be filtered in Exercise 3?

A. casual_users

B. season

C. temperature

D. total_users

Assign your answer to an object called answer_filter. Make sure to write a capital letter for the answer you have chosen.

In [ ]:

# Assign your answer to an object called: answer_filter
# Make sure the correct answer is an uppercase letter. 
# Surround your answer with quotation marks.
# Replace fail() with your answer. 

# your code here
fail() # No Answer - remove if you provide an answer

In [ ]:

test_that('Solution is incorrect', {
    expect_equal(digest(answer_filter), '3a5505c06543876fe45598b5e5e5195d') # we hid the answer to the test here so you can't see it, but we can still run the test
    
})
print("Success!")

Question 3.2

Remember, you already have tidyverse loaded and you already read in the data. The next step is to mutate the data such that we have information on all the users. Make sure to save your answer to an object called bike_mutated.

Depending on what you find efficient and easy, use pipe operators or multiple lines of code when needed.
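As a small sketch of the two styles, the example below computes the same result once with pipe operators and once with intermediate objects. The `orders` data frame and its columns are hypothetical and unrelated to the bike share data.

```r
library(dplyr)

# Hypothetical toy data, for illustration only.
orders <- tibble(
  item     = c("tea", "coffee", "juice"),
  price    = c(3, 4, 5),
  quantity = c(2, 1, 4)
)

# Style 1: a single pipeline.
piped <- orders %>%
  mutate(total = price * quantity) %>%
  filter(total > 5)

# Style 2: the same steps, saving each intermediate result.
with_total <- mutate(orders, total = price * quantity)
stepwise   <- filter(with_total, total > 5)

# Both styles give the same data frame.
same_result <- isTRUE(all.equal(piped, stepwise))
print(same_result)
```

The pipeline avoids naming intermediate objects, while the step-by-step version makes it easy to inspect each stage with head().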

In [ ]:

# Remember to name your answer: bike_mutated
# Replace fail() with your line of code. 

# your code here
fail() # No Answer - remove if you provide an answer
head(bike_mutated)

Question 3.3

Filter the data so that it only includes rentals made during Fall. Next, select the columns we wish to plot. Name your answer bike_selected.

In [ ]:

# Remember to name your answer: bike_selected
# Replace fail() with your line of code. 

# your code here
fail() # No Answer - remove if you provide an answer
head(bike_selected)

Question 3.4

Plot the data as a scatter plot. Assign your plot to an object called bike_plot_fall.

In [ ]:

# Replace the fail() with your line of code (answer). 
# As shown in the first line, remember to name your plot: bike_plot_fall
# Label your x-axis: Temperature (Celsius)
# Label your y-axis: Total Users (Casual and Registered)

# your code here
fail() # No Answer - remove if you provide an answer
bike_plot_fall

Question 3.5

In one sentence, describe the trend of your scatterplot for the fall season.

YOUR ANSWER HERE

Question 3.6

Looking at the scatterplots for the spring and the fall seasons, what difference(s) do you see? Based on these two plots, what might you recommend to this company to increase their users?

YOUR ANSWER HERE
