UBC-DSCI
GitHub Repository: UBC-DSCI/dsci-100-assets
Path: blob/master/2019-fall/materials/tutorial_01/tutorial_01.ipynb
Kernel: R

Tutorial 1: Introduction to Data Science

Any place you see ..., you must fill in the function, variable, or data to complete the code. Replace fail() with your completed code and run the cell!

Reminder: All autograded questions (i.e., questions with visible tests) are worth 1 point, and all hidden-test and manually graded questions are worth 3 points.

### Run this cell before continuing.
library(repr)
source("tests_tutorial_01.R")

Revision Question Match the following definitions with the corresponding functions used in R:
{points: 1}

Definitions

A. Reads the most common types of flat file data, comma separated values and tab separated values, respectively.

B. Keeps only the variables you mention.

C. Keeps only rows with entries satisfying some logical condition that you specify.

D. Adds a new variable to a data frame as a function of the old columns.

E. Declares the input data frame for a graphic and specifies the set of plot aesthetics intended to be common throughout all subsequent layers unless specifically overridden.

F. Returns (by default) the first six rows or values of a vector, matrix, table, data frame or function.

Functions

  1. ggplot

  2. select

  3. head

  4. read_csv

  5. mutate

  6. filter

For every description, create an object using the letter associated with the definition and assign it to the corresponding number from the list of functions. For example: B <- 1

# Assign your answer to a letter: A, B, C, D, E, F
# Make sure the correct answer is a number from 1-6
# Replace the fail() with your answer.
# your code here
fail() # No Answer - remove if you provide an answer
test_revision()

1. Vickers and Vertosick Exercise

We hope you haven't forgotten about them just yet! As you might recall from lecture, Vickers and Vertosick were the researchers that wanted to study different factors affecting race performance of recreational runners. They assembled a data set that includes the age, sex, and BMI of runners, comparing it with their timed performance (how long it took them to complete either 5 or 10 km runs).

We will be continuing our analysis on their data to practice what you learnt during lecture. The goal for today, however, is to produce a plot of BMI against the time (in minutes) it took for participants under the age of 35 to run 5 kilometres. To do this we will need to do the following:

  1. use filter to extract the rows where age is less than 35

  2. use select to extract the bmi and km5_time_seconds columns

  3. use mutate to convert 5 km race time from seconds (km5_time_seconds) to minutes

  4. use ggplot to create our plot of BMI (x-axis) and race time in minutes (y-axis)

Hints for success: Try going through all the steps on your own, but don't forget to talk to others (classmates, TAs, Instructor) if you need help getting unstuck. Work with different functions and if something doesn't work out, read the error message or use the help() function. Since there are a lot of steps to working and modifying data, feel free to look back at worksheet_01.
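The four steps listed above can be sketched on a small made-up data frame. The `runners` table, its column names (`years`, `weight_kg`, `time_s`) and its values are hypothetical and are not the marathon data; the sketch only illustrates how the output of one function becomes the input of the next.

```r
library(dplyr)
library(ggplot2)

# Hypothetical toy data (NOT the marathon data set)
runners <- data.frame(
  years     = c(22, 41, 30, 55),
  weight_kg = c(60, 82, 71, 77),
  time_s    = c(1500, 1800, 1650, 2100)
)

# 1. filter: keep only the rows that satisfy a condition
young <- filter(runners, years < 35)

# 2. select: keep only the columns needed later
young_cols <- select(young, weight_kg, time_s)

# 3. mutate: add a new column computed from an existing one
young_min <- mutate(young_cols, time_min = time_s / 60)

# 4. ggplot: scatter plot with human-readable axis labels
toy_plot <- ggplot(young_min, aes(x = weight_kg, y = time_min)) +
  geom_point() +
  xlab("Weight (kg)") +
  ylab("Time (minutes)")
toy_plot
```

Each intermediate result is stored under its own name, so you can call `head()` on any of them to check that a step did what you expected.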

Question 1.1 Multiple Choice
{points: 1}

After reading the text above (and remembering that filter lets us choose rows that have values at, above, or below a threshold), what column do you think we will be using for our threshold when we filter?

A. age

B. km5_time_seconds

C. bmi

D. sex

Assign your answer to an object called answer1. Make sure to write the uppercase letter for the answer you have chosen and surround the letter with quotes.

# Assign your answer to an object called: answer1
# Make sure the correct answer is an uppercase letter.
# Surround your answer with quotation marks.
# Replace the fail() with your answer.
# your code here
fail() # No Answer - remove if you provide an answer
test_1.1()

Question 1.2 True or False
{points: 1}

We will be selecting the columns age and km5_time_seconds to plot. True or false?

Assign your answer (either true or false) to an object called answer2. Make sure to write it in all lower-case and surround the word with quotes.

# Assign your answer to an object called: answer2
# Make sure the correct answer is written in lower-case (true / false)
# Surround your answer with quotation marks.
# Replace the fail() with your answer.
# your code here
fail() # No Answer - remove if you provide an answer
test_1.2()

Question 1.3 Multiple Choice
{points: 1}

Select the answer with the correct order of functions that we will use to wrangle our data into a useable form for the plot we want to create.

A. mutate, ggplot, select

B. mutate, read_csv, select

C. filter, select, mutate

D. filter, aes, ggplot

Assign your answer to an object called answer3. Make sure to write the uppercase letter for the answer you have chosen and surround the letter with quotes.

# Assign your answer to an object called: answer3
# Make sure the correct answer is an uppercase letter.
# Surround your answer with quotation marks.
# Replace the fail() with your answer.
# your code here
fail() # No Answer - remove if you provide an answer
test_1.3()

Question 1.4
{points: 1}

To work on the cells below, load the package "tidyverse".

# Replace the fail() with your line of code.
# If you have difficulty with loading this package:
# Go back to Worksheet1 and read over Section 5 (Packages)
# your code here
fail() # No Answer - remove if you provide an answer
test_1.4()

Question 1.5
{points: 1}

With the proper package running, you can now load the data.
Replace fail() with the correct function. Assign your data to an object called marathon_small.

# marathon_small <- ...("marathon_small.csv")
# Take the line above and fill in the ...
# Once finished, copy and replace the fail().
# As shown in the first line, remember to name your answer: marathon_small
# your code here
fail() # No Answer - remove if you provide an answer
head(marathon_small)
test_1.5()

Question 1.6
{points: 1}

Filter and select the data so that only participants under the age of 35 are included and your data frame contains only the columns needed for the plot.

Hint: bmi is already given to you. What else do we want to plot?

# marathon_age <- filter(marathon_small, ... < 35)
# marathon_select <- ...(marathon_age, bmi, ...)
# Take the code above and fill in the ...
# Once finished, copy and replace the fail().
# As shown in the second line, remember to name your answer: marathon_select
# your code here
fail() # No Answer - remove if you provide an answer
head(marathon_age)
head(marathon_select)
test_1.6()

Question 1.7
{points: 1}

Mutate the data frame to create a new column called: km5_time_minutes.

Note: we will be selecting once again the specific columns we want to include in our data frame.

# marathon_mutate <- mutate(marathon_select, km5_time_minutes = .../...)
# marathon_exact <- select(..., ..., km5_time_minutes)
# Take the code above and fill in the ...
# Once finished, copy and replace the fail().
# As shown in the second line, remember to name your answer: marathon_exact
# your code here
fail() # No Answer - remove if you provide an answer
head(marathon_mutate)
head(marathon_exact)
test_1.7()

Question 1.8
{points: 1}

Lastly, generate a scatter plot. Assign your plot to an object called marathon_plot.
Label your axes in a human readable way (do not leave them as default column names).

# code to set-up plot size
library(repr)
options(repr.plot.width=3, repr.plot.height=3)

# marathon_plot <- ...(marathon_exact, aes(x = ..., y = ...)) +
#     ..._point() +
#     xlab(...) +
#     ...(...)
# Take the code above and fill in the ...
# Once finished, copy and replace the fail().
# As shown in the first line, remember to name your plot: marathon_plot
# your code here
fail() # No Answer - remove if you provide an answer

# Run this cell to see what your scatterplot looks like!
# Or delete this cell and add marathon_plot under your code in the cell above.
marathon_plot
test_1.8()

Note: the warning above simply tells us how many rows in the data set had missing data and were therefore not plotted. When you see something like this, you should stop and think: do I expect missing rows in my data? Sometimes the answer is yes, sometimes it is no. It depends on the data set, and you as the data scientist need to know the answer. How do you find out? By talking to those who collected the data and/or researching where the data came from, for example.
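One way to see how many rows a plot would drop is to count the missing values before plotting. The following is a minimal sketch on a hypothetical data frame (`toy`, `x` and `y` are made-up names, not columns from the marathon data):

```r
# Hypothetical data containing missing values (NA)
toy <- data.frame(
  x = c(1.2, NA, 3.4, 4.1),
  y = c(10, 12, NA, 15)
)

# Number of missing values in each column
colSums(is.na(toy))

# Number of rows where at least one of the columns is missing;
# a scatter plot of x against y cannot draw these rows
n_incomplete <- sum(!complete.cases(toy))
n_incomplete
```

Counting only tells you how much is missing. Whether that amount is expected is still a question about where the data came from.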

Question 1.9
{points: 3}

Which of the sentences below best describe the plot above? One or more may be correct.

A. There is no relationship between BMI and the time it takes runners under the age of 35 to complete a 5 km race.

B. For runners under 35, we see that as BMI increases so does the time it takes to complete a 5 km race. This suggests that there is a positive relationship between these two variables for runners under 35 in this data set.

C. For runners under 35, we see that as BMI increases the time it takes to complete a 5 km race decreases. This suggests that there is a negative relationship between these two variables for runners under 35 in this data set.

D. For runners under 35, we see that as BMI decreases the time it takes to complete a 5 km race increases. This suggests that there is a negative relationship between these two variables for runners under 35 in this data set.

Assign your answer to an object called answer1.9. Make sure to write the uppercase letter for the answer you have chosen and surround the letter with quotes.

# Assign your answer to an object called: answer1.9
# Make sure the correct answer is an uppercase letter.
# Surround your answer with quotation marks.
# Replace the fail() with your answer.
# your code here
fail() # No Answer - remove if you provide an answer

# The tests were intentionally hidden so that you can practice deciding
# when you have the correct answer.

Question 1.10
{points: 1}

Now explore the relationship between the age of all runners and the time taken to complete the 5 km run (in minutes again). Using the original marathon_small data frame, mutate the km5_time_seconds column so that it is in minutes. Next, create a scatter plot (similar to the one in Question 1.8) but this time put age on the x-axis. Assign your plot to an object called age_vs_time.

There is a lot missing from the cell below (no hints were given). Try working on it on your own before looking at earlier questions in this tutorial or worksheet_01.

# your code here
fail() # No Answer - remove if you provide an answer
age_vs_time
test_1.10()

Question 1.11
{points: 3}

In the plot above we see a positive relationship between age and time taken to complete a 5 km run. Is this positive relationship strong (points closely follow a line/path) or weak (points are more widely scattered)?

Assign your answer (either strong or weak) to an object called answer1.11. Make sure to write it in all lower-case and surround the word with quotes.

# your code here
fail() # No Answer - remove if you provide an answer

# The tests were intentionally hidden so that you can practice deciding
# when you have the correct answer.

2. Bike Sharing

Climate change, and solutions to mitigate it, are currently on the tongues and minds of many people. One healthy and environmentally friendly transportation alternative that has recently been gaining popularity is bike sharing. Apart from their extensive real-world applications in improving health and creating more climate-friendly transit, the data generated by these bike sharing systems makes them great for research. In contrast to bus and subway transit systems, bikeshare transit systems precisely document where a trip starts, where it ends and how long it lasts, for each individual using the system. This level of individual traceability may allow for better detection of mobility patterns in cities, as well as the potential detection of important events.

Today, we will be analyzing data obtained from Capital Bikeshare, a bike sharing system in Washington, DC. The temperature data (in units of degrees Celsius) has been normalized from the original range so that all values lie between 0 and 1 (a common data processing technique that is helpful for some machine/statistical learning tools). Our goal is to figure out the relationship between temperature and the number of people renting bikes during the Spring (March 20th - June 21st).
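Rescaling values into the range 0 to 1 is often done with min-max normalization. The sketch below assumes that method purely for illustration; this tutorial does not state exactly how the bike share temperatures were normalized, and the `temps_c` values are made up.

```r
# Hypothetical temperatures in degrees Celsius (not from bike_share.csv)
temps_c <- c(-8, 0, 15.5, 39)

# Min-max normalization: subtract the minimum, then divide by the range,
# so the smallest value becomes 0 and the largest becomes 1
normalized <- (temps_c - min(temps_c)) / (max(temps_c) - min(temps_c))

range(normalized)
```

The ordering of the values is unchanged by this rescaling, so a scatter plot of the normalized values has the same shape as one of the original values; only the axis scale differs.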

Question 2.1 Multiple Choice
{points: 1}

In comparison to bikes, why aren't other modes of transportation as useful when it comes to acquiring data?

A. Not as fast.

B. Documentation isn't as precise.

C. Not as environmentally friendly.

D. Bus drivers don't cooperate.

Assign your answer to an object called: answer2.1. Make sure the correct answer is an uppercase letter.

# Assign your answer to an object called: answer2.1
# Make sure the correct answer is an uppercase letter.
# Surround your answer with quotation marks.
# Replace the fail() with your answer.
# your code here
fail() # No Answer - remove if you provide an answer
test_2.1()

Question 2.2 Multiple Choice
{points: 1}

What are the units for the normalized temperature?

A. Kelvin

B. Fahrenheit

C. Celsius

Assign your answer to an object called: answer2.2. Make sure the correct answer is an uppercase letter.

# Assign your answer to an object called: answer2.2
# Make sure the correct answer is an uppercase letter.
# Surround your answer with quotation marks.
# Replace the fail() with your answer.
# your code here
fail() # No Answer - remove if you provide an answer
test_2.2()

Question 2.3
{points: 1}

Since we already have tidyverse loaded and ready to use, the first step is to read our new data. Add in the missing function and symbol to complete the cell below. Make sure to assign your answer to bike_data.

# bike_data ... ...("bike_share.csv")
# Take the code above and fill in the ...
# Once finished, copy and replace fail().
# As shown in the hint, remember to name your answer: bike_data
# your code here
fail() # No Answer - remove if you provide an answer
head(bike_data)
test_2.3()

Question 2.4
{points: 1}

Mutate the data such that you have a new column called total_users. This column would be the sum of the casual_users and the registered_users.

# bike_mutate <- ...(bike_data, ...)
# Take the code above and fill in the ...
# Once finished, copy and replace fail().
# As shown in the hint, remember to name your answer: bike_mutate
# your code here
fail() # No Answer - remove if you provide an answer
head(bike_mutate)
test_2.4()
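As a reminder of how a new column can be built from two existing ones, here is a minimal sketch on a hypothetical data frame (`orders`, `online` and `in_store` are made-up names, not columns from the bike share data):

```r
library(dplyr)

# Hypothetical toy data
orders <- data.frame(
  online   = c(3, 10, 7),
  in_store = c(5, 2, 8)
)

# mutate() keeps the existing columns and appends the new one,
# computed row by row from the columns named on the right-hand side
orders_total <- mutate(orders, total = online + in_store)
orders_total
```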

Question 2.5
{points: 1}

Filter the data so that it includes only rentals made during Spring. Name your answer bike_filter.

# bike_filter <- ...(bike_mutate, ... == "Spring")
# Take the code above and fill in the ...
# Once finished, copy and replace the fail().
# As shown in the hint, remember to name your answer: bike_filter
# your code here
fail() # No Answer - remove if you provide an answer
head(bike_filter)
test_2.5()
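Filtering on a text column works the same way as filtering on a number, except that the comparison uses `==` with a quoted value. This is a minimal sketch on a hypothetical data frame (`visits`, `weekday` and `count` are made-up names):

```r
library(dplyr)

# Hypothetical toy data with a text column
visits <- data.frame(
  weekday = c("Mon", "Sat", "Mon", "Sun"),
  count   = c(12, 30, 9, 25),
  stringsAsFactors = FALSE
)

# Keep only the rows whose weekday is exactly "Mon";
# the match is case-sensitive, so "mon" would not match
monday_visits <- filter(visits, weekday == "Mon")
monday_visits
```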

Question 2.6
{points: 3}

Select data from the columns that we wish to plot.

Hint: if you have forgotten, scroll up and re-read the introduction to this exercise. Name your answer bike_select.

# bike_select <- select(...)
# Take the code above and fill in the ...
# Once finished, copy and replace the fail().
# As shown in the hint, remember to name your answer: bike_select
# your code here
fail() # No Answer - remove if you provide an answer
head(bike_select)

# The tests were intentionally hidden so that you can practice deciding
# when you have the correct answer.

Question 2.7
{points: 3}

Plot the data as a scatter plot.

There is a lot missing from the cell below (no hints were given). Try completing this on your own before looking at Exercise 1 of this tutorial or worksheet_01. Assign your plot to an object called bike_plot_spring.

Hint: what do you think should be the x-axis / y-axis? Don't forget to label your axes!

# Replace the fail() with your line of code (answer).
# As shown in the first line, remember to name your plot: bike_plot_spring
# Make sure to use xlab() and ylab() to label your axes.
# your code here
fail() # No Answer - remove if you provide an answer
bike_plot_spring

# The tests were intentionally hidden so that you can practice deciding
# when you have the correct answer.

Question 2.8
{points: 3}

In 1-2 sentences, describe whether there is a relationship between the variables observed in the scatterplot of the data for the spring season. Comment on the direction and the strength of the relationship (if there is one), and how the variables change with respect to each other (if they do).

YOUR ANSWER HERE

3. Bike Sharing Continued...

We are going to continue working with this informative data set but modify it from Exercise 2. This part of the tutorial will focus on your understanding of how functions work and test your practice of correctly filling in code to get the right output. No hints will be provided so you won't be seeing ... any more. The number of questions with autograding and tests has also been intentionally decreased.

Unlike Exercise 2, we now want to figure out the relationship between temperature and the number of people renting bikes during Fall (September 22nd - December 21st).

Try completing this Exercise from start to finish without any outside help. If you are struggling with a particular question, look at Exercise 2 for assistance.

Question 3.1 Multiple Choice
{points: 1}

What column is going to be filtered in Exercise 3?

A. casual_users

B. season

C. temperature

D. total_users

Assign your answer to an object called answer_filter. Make sure to write a capital letter for the answer you have chosen.

# Assign your answer to an object called: answer_filter
# Make sure the correct answer is an uppercase letter.
# Surround your answer with quotation marks.
# Replace fail() with your answer.
# your code here
fail() # No Answer - remove if you provide an answer
test_3.1()

Question 3.2
{points: 3}

Remember, you already have tidyverse loaded and you already read in the data. The next step is to mutate the data such that we have information on all the users. Make sure to save your answer to an object called bike_mutated.

# Remember to name your answer: bike_mutated
# Replace fail() with your line of code.
# your code here
fail() # No Answer - remove if you provide an answer
head(bike_mutated)

# The tests were intentionally hidden so that you can practice deciding
# when you have the correct answer.

Question 3.3
{points: 3}

Filter the data so that it includes only rentals made during Fall, and assign this data frame to an object called bike_filtered. Next, select the columns we wish to plot. Name your answer bike_selected.

# Remember to name your answer: bike_selected
# Replace fail() with your line of code.
# your code here
fail() # No Answer - remove if you provide an answer
head(bike_selected)

# The tests were intentionally hidden so that you can practice deciding
# when you have the correct answer.

Question 3.4
{points: 3}

Plot the data as a scatter plot. Assign your plot to an object called bike_plot_fall.

# Replace the fail() with your line of code (answer).
# As shown in the first line, remember to name your plot: bike_plot_fall
# Label your x-axis: Temperature (Celsius)
# Label your y-axis: Total Users (Casual and Registered)
# your code here
fail() # No Answer - remove if you provide an answer
bike_plot_fall

# The tests were intentionally hidden so that you can practice deciding
# when you have the correct answer.

Question 3.5
{points: 3}

In one sentence, describe whether there is a relationship observed in the scatter plot for the fall season, and if so, the direction of that relationship.

YOUR ANSWER HERE

Question 3.6
{points: 3}

Looking at the scatter plots for the spring and the fall seasons, what difference(s) do you see? Based on these two plots, what might you recommend to this company to increase their users?

YOUR ANSWER HERE