GitHub Repository: UBC-DSCI/dsci-100-assets
Path: blob/master/2021-fall/materials/tutorial_01/tutorial_01.ipynb

Kernel: R

Tutorial 1: Introduction to Data Science

Lecture and Tutorial Learning Goals:

After completing this week's lecture and tutorial work, you will be able to:

- use a Jupyter notebook to execute provided R code
- edit code and markdown cells in a Jupyter notebook
- create new code and markdown cells in a Jupyter notebook
- load the tidyverse library into R
- create new variables and objects in R using the assignment symbol
- use the help and documentation tools in R
- match the names of the following functions from the tidyverse library to their documentation descriptions:
  - read_csv
  - select
  - mutate
  - filter
  - ggplot
  - aes

Any place you see ..., you must fill in the function, variable, or data to complete the code. Replace fail() with your completed code and run the cell!

Reminder: All autograded questions (i.e., questions with tests) are worth 1 point and all hidden test and manually graded questions are worth 3 points.

In [ ]:

### Run this cell before continuing. 
library(repr)
options(repr.matrix.max.rows = 6)
source("tests_tutorial_01.R")
source("cleanup_tutorial_01.R")

Revision Question Match the following definitions with the corresponding functions used in R:
{points: 1}

Definitions

A. Reads the most common types of flat file data, comma separated values.

B. Keeps only the variables you mention.

C. Keeps only rows with entries satisfying some logical condition that you specify.

D. Adds a new variable to a data frame as a function of the old columns.

E. Declares the input data frame for a graphic and specifies the set of plot aesthetics intended to be common throughout all subsequent layers unless specifically overridden.

Functions

1. ggplot
2. select
3. filter
4. read_csv
5. mutate

For each definition, assign the integer corresponding to the correct function to the letter object associated with the definition. For example:

B <- 1

Assign your answers to the objects A, B, C, D, and E. Your answers should each be a single integer.

In [ ]:

# Replace the fail() with your answer. 

# your code here
fail() # No Answer - remove if you provide an answer

In [ ]:

test_revision()

1. Vickers and Vertosick Exercise

We hope you haven't forgotten about them just yet! As you might recall from lecture, Vickers and Vertosick were the researchers who wanted to study the factors affecting the race performance of recreational runners. They assembled a data set that includes the age, sex, and Body Mass Index (BMI) of runners, and compared these with their timed performance (how long it took them to complete either a 5 km or a 10 km run).

We will be continuing our analysis of their data to practice what you learnt during the previous lecture. The goal for today, however, is to produce a plot of BMI against the time (in minutes) it took for participants under the age of 35 to run 5 kilometres. To do this, we will need to complete the following steps:

1. use filter to extract the rows where age is less than 35
2. use select to extract the bmi and km5_time_seconds columns
3. use mutate to convert 5 km race time from seconds (km5_time_seconds) to minutes
4. use ggplot to create our plot of BMI (x-axis) and race time in minutes (y-axis)
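
The four steps above can be sketched on a small, made-up data frame. This is only an illustration under stated assumptions: the `runners` data frame and its columns (`years`, `weight_kg`, `time_seconds`) are hypothetical and are not the marathon data, so the sketch does not replace the exercise below.

```r
library(dplyr)
library(ggplot2)

# Hypothetical toy data; not the marathon data set used in this tutorial
runners <- data.frame(
    years = c(22, 41, 30, 58, 27),
    weight_kg = c(61, 80, 72, 77, 66),
    time_seconds = c(1500, 1980, 1620, 2400, 1740)
)

# 1. filter: keep only the rows that satisfy a logical condition
runners_young <- filter(runners, years < 35)

# 2. select: keep only the columns needed later
runners_cols <- select(runners_young, weight_kg, time_seconds)

# 3. mutate: add a new column computed from an existing one
runners_min <- mutate(runners_cols, time_minutes = time_seconds / 60)

# 4. ggplot: scatter plot with human-readable axis labels
runners_plot <- ggplot(runners_min, aes(x = weight_kg, y = time_minutes)) +
    geom_point() +
    xlab("Weight (kg)") +
    ylab("Race time (minutes)")
```

Each step stores its result under a new name, so every intermediate data frame can be inspected before moving on to the next step.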

Tips for success: Try going through all of the steps on your own, but don't forget to discuss with others (classmates, TAs, or an instructor) if you get stuck. If something is wrong and you can't spot the issue, be sure to read the error message carefully. Since there are a lot of steps involved in working with data and modifying it, feel free to look back at worksheet_01 for assistance.

Question 1.1 Multiple Choice
{points: 1}

After reading the text above (and remembering that filter lets us choose rows that have values at, above, or below a threshold), what column do you think we will be using for our threshold when we filter?

A. age

B. km5_time_seconds

C. bmi

D. sex

Assign your answer to an object called answer1.1. Make sure to write the uppercase letter for the answer you have chosen and surround the letter with quotes.

In [ ]:

# Make sure the correct answer is an uppercase letter. 
# Surround your answer with quotation marks.
# Replace the fail() with your answer. 

# your code here
fail() # No Answer - remove if you provide an answer

In [ ]:

test_1.1()

Question 1.2 True or False
{points: 1}

We will be selecting the columns age and km5_time_seconds to plot. True or false?

Assign your answer (of either "true" or "false") to an object called answer1.2. Make sure to write in all lower-case and surround your answer with quotes.

In [ ]:

# Make sure the correct answer is written in lower-case ("true" / "false")
# Surround your answer with quotation marks.
# Replace the fail() with your answer. 

# your code here
fail() # No Answer - remove if you provide an answer

In [ ]:

test_1.2()

Question 1.3 Multiple Choice
{points: 1}

Select the answer with the correct order of functions that we will use to wrangle our data into a useable form for the plot we want to create.

A. mutate, select, filter

B. select, filter, aes

C. filter, select, mutate

D. filter, select, aes

E. select, filter, mutate

Assign your answer to an object called answer1.3. Make sure your answer is an uppercase letter and is surrounded by quotation marks (e.g. "F").

In [ ]:

# Replace the fail() with your answer.

# your code here
fail() # No Answer - remove if you provide an answer

In [ ]:

test_1.3()

Question 1.4
{points: 1}

To work on the cells below, load the tidyverse package. If you have difficulty with loading this package, revisit worksheet_01 and read over Section 5 (Packages).

In [ ]:

# Replace the fail() with your line of code. 

# your code here
fail() # No Answer - remove if you provide an answer

In [ ]:

test_1.4()

Question 1.5
{points: 1}

With the proper package loaded, you can now read in the data.

Replace fail() with the correct function. Assign your data to an object called marathon_small.

In [ ]:

# ... <- ...("marathon_small.csv")

# your code here
fail() # No Answer - remove if you provide an answer
marathon_small

In [ ]:

test_1.5()

Question 1.6
{points: 1}

Filter and select the data (marathon_small) so that the result includes only participants under the age of 35 and contains only the columns needed for the plot.

Hint: bmi is already given to you. What else do we want to plot?

Name the result of filtering marathon_age, and name the result of selecting marathon_select.

In [ ]:

# ... <- filter(marathon_small, ... < 35)
# ... <- ...(marathon_age, bmi, ...)

# your code here
fail() # No Answer - remove if you provide an answer
marathon_age
marathon_select

In [ ]:

test_1.6()

Question 1.7
{points: 1}

Mutate the data frame (marathon_select) to create a new column called: km5_time_minutes.

Note: we will be selecting once again the specific columns we want to include in our data frame.

Name the result after creating the new column marathon_mutate, and name the result after selecting the columns used for plotting marathon_exact.

In [ ]:

# ... <- mutate(marathon_select, km5_time_minutes = ... / ...) 
# ... <- select(..., ..., km5_time_minutes)


# your code here
fail() # No Answer - remove if you provide an answer
marathon_mutate
marathon_exact

In [ ]:

test_1.7()

Question 1.8
{points: 1}

Lastly, generate a scatter plot. Assign your plot to an object called marathon_plot.

Ensure that your axis labels are human-readable (do not leave them as default column names).

In [ ]:

# run this cell 
# code to set-up plot size
library(repr)
options(repr.plot.width = 8, repr.plot.height = 8)

In [ ]:

#... <- ...(marathon_exact, aes(x = ..., y = ...)) + 
#   ..._point() + 
#   xlab(...) + 
#   ...(...)


# your code here
fail() # No Answer - remove if you provide an answer

In [ ]:

# Run this cell to see what your scatterplot looks like!
marathon_plot

In [ ]:

test_1.8()

Note: the warning message above tells us the number of rows that had missing data in the data set, and that these rows were not plotted. When you see something like this, you should stop and think, do I expect missing rows in my data? Sometimes the answer is yes, sometimes it is no. It depends on the data set, and you as the Data Scientist must know the answer to this. How would you determine the answer? By talking to those who collected the data and/or researching where the data came from, for example.
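
One way to see how many rows a plot would drop is to count the missing values (NA) in the plotted columns before plotting. The sketch below uses a small hypothetical data frame (`toy` and its values are made up for illustration), not the marathon data.

```r
# Hypothetical toy data with some missing values (NA)
toy <- data.frame(
    bmi = c(21.5, NA, 24.1, 27.3),
    time_minutes = c(25, 31, NA, 29)
)

# Count the missing values in each column
missing_per_column <- colSums(is.na(toy))
missing_per_column

# Count the rows with at least one missing value;
# a scatter plot of these two columns cannot draw such rows
rows_with_missing <- sum(!complete.cases(toy))
rows_with_missing
```

Comparing a count like this with the number reported in the warning is a quick check that the dropped rows are the ones you expect.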

Question 1.9
{points: 3}

Which option below best describes the plot above?

A. For runners under the age of 35, there is no relationship at all between BMI and the time it takes to complete a 5 km race.

B. For runners under 35, we see that as BMI increases the time it takes to complete a 5 km race increases. This suggests that there is a positive relationship between these two variables for runners under 35 in this data set.

C. For runners under 35, we see that as BMI increases the time it takes to complete a 5 km race decreases. This suggests that there is a negative relationship between these two variables for runners under 35 in this data set.

Assign your answer to an object called answer1.9. Make sure your answer is an uppercase letter and is surrounded by quotation marks (e.g. "F").

In [ ]:

# Replace the fail() with your answer. 

# your code here
fail() # No Answer - remove if you provide an answer

In [ ]:

# Here we check whether you have the correct object name(s). However,
# all other tests were intentionally hidden so that you can practice deciding 
# when you have the correct answer.
test_that("Did not create an object named answer1.9", {
    expect_true(exists("answer1.9")) 
})

Question 1.10
{points: 1}

Now explore the relationship between the age of all runners and the time taken to complete the 5 km run (in minutes again). Using the original marathon_small data frame, mutate the km5_time_seconds column such that it is in minutes. Save the resulting data frame to an object called marathon_small_mins.

Next, create a scatter plot (similar to the one in Question 1.8) but this time have age on the x-axis. Assign your plot to an object called age_vs_time.

There is a lot missing from the cell below (no hints were given). Try working on it on your own before looking at earlier questions in this tutorial or worksheet_01.

Don't forget to label your axes! Where appropriate, axis labels should also include units (for example, the axis that maps to the column age should have the unit "years").

In [ ]:

# your code here
fail() # No Answer - remove if you provide an answer
age_vs_time

In [ ]:

test_1.10()

Question 1.11
{points: 3}

In the plot above, we can see a positive relationship between age and time taken to complete a 5 km run. Is this relationship strong (points are close together) or weak (points are more widely scattered)?

Assign your answer (either "weak" or "strong") to an object called answer1.11. Make sure to write in all lower-case and surround your answer with quotes.
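
The strength of a linear relationship can also be summarized numerically. The sketch below assumes Pearson's correlation coefficient (base R's `cor`) as that summary, which this tutorial does not ask you to compute. The vectors are made up to contrast tightly clustered points with widely scattered ones and are not taken from the marathon data.

```r
# Made-up data: one x vector paired with two different y patterns
x <- c(1, 2, 3, 4, 5, 6)
y_tight <- c(2.1, 3.9, 6.2, 7.8, 10.1, 12.0)  # points lie close to a line
y_loose <- c(5, 1, 9, 3, 12, 6)               # points are widely scattered

# Pearson correlation: values near 1 or -1 indicate a strong linear
# relationship; values near 0 indicate a weak one
cor_tight <- cor(x, y_tight)
cor_loose <- cor(x, y_loose)
cor_tight
cor_loose
```

Both correlations are positive here, but the tightly clustered points give a value much closer to 1 than the scattered ones.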

In [ ]:

# your code here
fail() # No Answer - remove if you provide an answer

In [ ]:

# Here we check whether you have the correct object name(s). However,
# all other tests were intentionally hidden so that you can practice deciding 
# when you have the correct answer.
test_that("Did not create an object named answer1.11", {
    expect_true(exists("answer1.11")) 
})

2. Bike-Sharing

Climate change, and solutions to mitigate it, is currently on the tongues and minds of many people. One healthy and environmentally friendly transportation alternative that has been recently gaining popularity is bike-sharing. Apart from their extensive real-world applications in improving health and creating more climate-friendly transit, the data generated by these bike-sharing systems makes them great for research. In contrast to bus and subway transit systems, bike-share transit systems precisely document where a trip starts, where it ends, and how long it lasts, for each individual using the system. This level of individual traceability may allow for better detection of mobility patterns in cities and possible detection of important events.

Today, we will be analyzing data obtained from Capital Bikeshare, a bike-sharing system from Washington, DC. The temperature data (in units of degrees Celsius) has been normalized from the original range so that all values fall between 0 and 1 (a common data processing technique helpful for some machine/statistical learning tools). Our goal is to determine if there is a relationship between temperature and the number of people renting bikes during the Spring (March 20th - June 21st).
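
As an illustration of what normalizing values to the range 0 to 1 can look like, here is a minimal sketch assuming min-max scaling. The text above does not state which formula was applied to the bike-share temperatures, so both the formula and the `temp_celsius` values are assumptions made for this example.

```r
# Made-up temperatures in degrees Celsius
temp_celsius <- c(-8, 3, 12, 25, 39)

# Min-max scaling: subtract the minimum, then divide by the range
temp_normalized <- (temp_celsius - min(temp_celsius)) /
    (max(temp_celsius) - min(temp_celsius))
temp_normalized
```

Under this scaling the smallest value maps to 0, the largest maps to 1, and the values keep their original order.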

Question 2.1 Multiple Choice
{points: 1}

In comparison to bike-sharing systems, why aren't other modes of transportation as useful when it comes to acquiring data?

A. Not as fast.

B. Documentation isn't as precise.

C. Not as environmentally friendly.

D. Bus drivers don't cooperate.

Assign your answer to an object called: answer2.1. Make sure your answer is an uppercase letter and is surrounded by quotation marks (e.g. "F").

In [ ]:

# Replace the fail() with your answer.

# your code here
fail() # No Answer - remove if you provide an answer

In [ ]:

test_2.1()

Question 2.2 Multiple Choice
{points: 1}

What are the units for the normalized temperature?

A. Kelvin

B. Fahrenheit

C. Celsius

Assign your answer to an object called: answer2.2. Make sure your answer is an uppercase letter and is surrounded by quotation marks (e.g. "F").

In [ ]:

# Replace the fail() with your answer.

# your code here
fail() # No Answer - remove if you provide an answer

In [ ]:

test_2.2()

Question 2.3
{points: 1}

Since we already have tidyverse loaded and ready to use, the first step is to read our new data. Add in the missing function and symbol to complete the cell below. Make sure to assign your answer to bike_data.

In [ ]:

#... <- ...("bike_share.csv")

# your code here
fail() # No Answer - remove if you provide an answer
bike_data

In [ ]:

test_2.3()

Question 2.4
{points: 1}

Mutate the data such that you have a new column called total_users. This column should be the sum of the casual_users and the registered_users columns. Assign your answer to an object called bike_mutate.

In [ ]:

#... <- ...(bike_data, ...)

# your code here
fail() # No Answer - remove if you provide an answer
bike_mutate

In [ ]:

test_2.4()

Question 2.5
{points: 1}

Filter the data so that it includes only rentals made during Spring. Name your answer bike_filter.
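
Filtering on a text column works the same way as filtering on a number, except that the condition compares against a quoted value. A minimal sketch on hypothetical data (the `orders` data frame and its columns are made up and are not part of the bike-share data):

```r
library(dplyr)

# Hypothetical toy data; not the bike-share data set
orders <- data.frame(
    status = c("shipped", "pending", "shipped", "cancelled"),
    items = c(3, 1, 2, 5)
)

# == tests for equality; the text must match exactly, including case
shipped_orders <- filter(orders, status == "shipped")
shipped_orders
```

Note the double equals sign: a single `=` inside filter does not test for equality.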

In [ ]:

#... <- ...(bike_mutate, ... == "Spring")

# your code here
fail() # No Answer - remove if you provide an answer
bike_filter

In [ ]:

test_2.5()

Question 2.6
{points: 3}

Select the columns from the data that we wish to plot. Name your answer bike_select.

Hint: if you have forgotten, scroll up and re-read the introduction to this exercise.

In [ ]:

#... <- select(...)

# your code here
fail() # No Answer - remove if you provide an answer
bike_select

In [ ]:

# Here we check whether you have the correct object name(s). However,
# all other tests were intentionally hidden so that you can practice deciding 
# when you have the correct answer.
test_that("Did not create an object named bike_select", {
    expect_true(exists("bike_select")) 
})

Question 2.7
{points: 3}

Plot the data as a scatter plot.

There is a lot missing from the cell below (no hints were given). Try completing this on your own before looking back at any previous exercises. Assign your plot to an object called bike_plot_spring.

Hint: what do you think should be the x-axis / y-axis? Don't forget to label your axes! Where appropriate, axis labels should also include units (for example, the axis mapped to the temperature column should have the units "normalized degrees Celsius").

In [ ]:

# your code here
fail() # No Answer - remove if you provide an answer
bike_plot_spring

In [ ]:

# Here we check whether you have the correct object name(s). However,
# all other tests were intentionally hidden so that you can practice deciding 
# when you have the correct answer.
test_that("Did not create a plot named bike_plot_spring", {
    expect_true(exists("bike_plot_spring")) 
})

Question 2.8
{points: 3}

In 1-2 sentences, describe whether there is a relationship between the variables observed in the scatterplot of the data for the spring season. Comment on the direction and the strength of the relationship (if there is one), and how the variables change with respect to each other (if they do).

DOUBLE CLICK TO EDIT THIS CELL AND REPLACE THIS TEXT WITH YOUR ANSWER.

3. Bike-Sharing Continued

For this exercise, we are going to continue working with the Capital Bikeshare dataset. This part of the tutorial will focus on your understanding of how the functions work and test your ability to write code without hints. Note that we have also intentionally decreased the number of auto-graded questions for the remainder of the tutorial.

Unlike the previous exercise, we now want to determine if there is a relationship between temperature and the number of people renting bikes during Fall (September 22nd - December 21st).

Try completing this Exercise from start to finish without any outside help. If you are struggling with a particular question, look at Exercise 2 for assistance.

Question 3.1 Multiple Choice
{points: 1}

Which column is going to be filtered during this exercise?

A. casual_users

B. season

C. temperature

D. total_users

Assign your answer to an object called answer3.1. Make sure your answer is an uppercase letter and is surrounded by quotation marks (e.g. "F").

In [ ]:

# Replace fail() with your answer. 

# your code here
fail() # No Answer - remove if you provide an answer

In [ ]:

test_3.1()

Question 3.2
{points: 3}

Recall that the tidyverse package has been loaded and the data has already been read. The next step is to mutate the data such that we have information on all the users. Make sure to save your answer to an object called bike_mutated, and make sure to create a column called total_users.

In [ ]:

# Replace fail() with your line of code. 

# your code here
fail() # No Answer - remove if you provide an answer
bike_mutated

In [ ]:

# Here we check whether you have the correct object name(s). However,
# all other tests were intentionally hidden so that you can practice deciding 
# when you have the correct answer.
test_that("Did not create an object named bike_mutated", {
    expect_true(exists("bike_mutated"))
    expect_true("total_users" %in% colnames(bike_mutated)) 
})

Question 3.3
{points: 3}

Filter the data so that it includes only rentals made during Fall, and assign this data frame to an object called bike_filtered. Next, select the columns we wish to plot. Name your answer bike_selected.

In [ ]:

# Replace fail() with your line of code. 

# your code here
fail() # No Answer - remove if you provide an answer
bike_selected

In [ ]:

# Here we check whether you have the correct object name(s). However,
# all other tests were intentionally hidden so that you can practice deciding 
# when you have the correct answer.
test_that("Did not create an object named bike_filtered", {
    expect_true(exists("bike_filtered")) 
})
test_that("Did not create an object named bike_selected", {
    expect_true(exists("bike_selected")) 
})

Question 3.4
{points: 3}

Plot the data as a scatter plot. Label your x-axis: Temperature (normalized degrees Celsius) and your y-axis: Total Users (Casual and Registered). Assign your plot to an object called bike_plot_fall.

In [ ]:

# Replace the fail() with your line of code (answer). 

# your code here
fail() # No Answer - remove if you provide an answer
bike_plot_fall

In [ ]:

# Here we check whether you have the correct object name(s). However,
# all other tests were intentionally hidden so that you can practice deciding 
# when you have the correct answer.
test_that("Did not create an object named bike_plot_fall", {
    expect_true(exists("bike_plot_fall")) 
})

Question 3.5
{points: 3}

In one sentence, describe whether there is a relationship observed in the scatter plot for the fall season, and if so, the direction of that relationship.

DOUBLE CLICK TO EDIT THIS CELL AND REPLACE THIS TEXT WITH YOUR ANSWER.

Question 3.6
{points: 3}

Looking at the scatter plots for the spring and the fall seasons, what difference(s) do you see? Based on these two plots, what might you recommend to this company to increase their users?

DOUBLE CLICK TO EDIT THIS CELL AND REPLACE THIS TEXT WITH YOUR ANSWER.

In [ ]:

source("cleanup_tutorial_01.R")
