Tutorial 1: Introduction to Data Science
Lecture and Tutorial Learning Goals:
After completing this week's lecture and tutorial work, you will be able to:
use a Jupyter notebook to execute provided R code
edit code and markdown cells in a Jupyter notebook
create new code and markdown cells in a Jupyter notebook
load the tidyverse library into R
create new variables and objects in R using the assignment symbol
use the help and documentation tools in R
match the names of the following functions from the tidyverse library to their documentation descriptions:
read_csv
select
mutate
filter
ggplot
aes
Any place you see ..., you must fill in the function, variable, or data to complete the code. Replace fail() with your completed code and run the cell!
Reminder: All autograded questions (i.e., questions with tests) are worth 1 point and all hidden test and manually graded questions are worth 3 points.
Revision Question
Match the following definitions with the corresponding functions used in R:
{points: 1}
Definitions
A. Reads the most common types of flat file data, comma separated values.
B. Keeps only the variables you mention.
C. Keeps only rows with entries satisfying some logical condition that you specify.
D. Adds a new variable to a data frame as a function of the old columns.
E. Declares the input data frame for a graphic and specifies the set of plot aesthetics intended to be common throughout all subsequent layers unless specifically overridden.
Functions
1. ggplot
2. select
3. filter
4. read_csv
5. mutate
For each definition, assign the integer corresponding to the correct function to the letter object associated with the definition. Assign your answers to the objects A, B, C, D, and E. Your answers should each be a single integer.
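As a reminder of the assignment syntax, here is a minimal sketch. The object name G and the value 6 are hypothetical: they do not correspond to any definition or function in this question.

```r
# Hypothetical object, used only to show the assignment symbol (<-).
G <- 6

# An answer object should hold a single number.
length(G)      # how many values are stored in G
is.numeric(G)  # whether G holds a number
```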
1. Vickers and Vertosick Exercise
We hope you haven't forgotten about them just yet! As you might recall from lecture, Vickers and Vertosick were the researchers that wanted to study different factors affecting race performance of recreational runners. They assembled a data set that includes the age, sex, and Body Mass Index (BMI) of runners, comparing it with their timed performance (how long it took them to complete either 5 or 10 km runs).
We will be continuing our analysis of their data to practice what you learnt during the previous lecture. The goal for today, however, is to produce a plot of BMI against the time (in minutes) it took for participants under the age of 35 to run 5 kilometres. To do this, we will need to complete the following steps:
use filter to extract the rows where age is less than 35
use select to extract the bmi and km5_time_seconds columns
use mutate to convert 5 km race time from seconds (km5_time_seconds) to minutes
use ggplot to create our plot of BMI (x-axis) and race time in minutes (y-axis)
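The same four-step pattern can be sketched on a small made-up data set. Everything here is hypothetical (the toy_walks data frame, its column names, and the cutoff of 40); it only shows how the functions chain together and is not the code for the questions below.

```r
library(tidyverse)

# Made-up data, not the Vickers and Vertosick data set.
toy_walks <- tibble(
  age = c(22, 41, 30, 58),
  steps_per_min = c(110, 95, 120, 88),
  walk_time_seconds = c(600, 900, 540, 1020)
)

# 1. filter keeps the rows that satisfy a logical condition.
toy_young <- filter(toy_walks, age < 40)

# 2. select keeps only the named columns.
toy_cols <- select(toy_young, steps_per_min, walk_time_seconds)

# 3. mutate adds a column computed from an existing one.
toy_minutes <- mutate(toy_cols, walk_time_minutes = walk_time_seconds / 60)

# 4. ggplot declares the data and the aesthetics; geom_point draws the points.
toy_plot <- ggplot(toy_minutes, aes(x = steps_per_min, y = walk_time_minutes)) +
  geom_point() +
  xlab("Cadence (steps per minute)") +
  ylab("Walk time (minutes)")
```

Saving each intermediate result under its own name makes it easy to inspect the data after every step.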
Tips for success: Try going through all of the steps on your own, but don't forget to discuss with others (classmates, TAs, or an instructor) if you get stuck. If something is wrong and you can't spot the issue, be sure to read the error message carefully. Since there are a lot of steps involved in working with data and modifying it, feel free to look back at worksheet_01 for assistance.
Question 1.1 Multiple Choice
{points: 1}
After reading the text above (and remembering that filter lets us choose rows that have values at, above, or below a threshold), what column do you think we will be using for our threshold when we filter?
A. age
B. km5_time_seconds
C. bmi
D. sex
Assign your answer to an object called answer1.1. Make sure to write the uppercase letter for the answer you have chosen and surround the letter with quotes.
Question 1.2 True or False
{points: 1}
We will be selecting the columns age and km5_time_seconds to plot. True or false?
Assign your answer (either "true" or "false") to an object called answer1.2. Make sure to write in all lower-case and surround your answer with quotes.
Question 1.3 Multiple Choice
{points: 1}
Select the answer with the correct order of functions that we will use to wrangle our data into a useable form for the plot we want to create.
A. mutate, select, filter
B. select, filter, aes
C. filter, select, mutate
D. filter, select, aes
E. select, filter, mutate
Assign your answer to an object called answer1.3. Make sure your answer is an uppercase letter and is surrounded by quotation marks (e.g. "F").
Question 1.4
{points: 1}
To work on the cells below, load the tidyverse package. If you have difficulty with loading this package, revisit worksheet_01 and read over Section 5 (Packages).
Question 1.5
{points: 1}
With the proper package loaded, you can now read in the data.
Replace fail() with the correct function. Assign your data to an object called marathon_small.
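To see what reading comma-separated data produces, here is a small sketch that parses literal text instead of a file. The toy_scores object and its contents are made up, and the file path used in this tutorial is not shown here; wrapping a string in I() tells read_csv to treat it as data, not as a path.

```r
library(tidyverse)

# Literal CSV text (hypothetical values); the first row supplies the column names.
toy_scores <- read_csv(I("name,score\nana,3\nben,5\n"))

# Print the resulting data frame to check the columns and their types.
toy_scores
```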
Question 1.6
{points: 1}
Use filter and select on the data (marathon_small) so that it includes only participants under the age of 35 and contains only the columns needed for the plot.
Hint: bmi is already given to you. What else do we want to plot?
Name the result of filtering marathon_age, and name the result of selecting marathon_select.
Question 1.7
{points: 1}
Mutate the data frame (marathon_select) to create a new column called km5_time_minutes.
Note: we will once again be selecting the specific columns we want to include in our data frame.
Name the result after creating the new column marathon_mutate, and name the result after selecting the columns used for plotting marathon_exact.
Question 1.8
{points: 1}
Lastly, generate a scatter plot. Assign your plot to an object called marathon_plot.
Ensure that your axis labels are human-readable (do not leave them as default column names).
Note: the warning message above tells us the number of rows that had missing data in the data set, and that these rows were not plotted. When you see something like this, you should stop and think, do I expect missing rows in my data? Sometimes the answer is yes, sometimes it is no. It depends on the data set, and you as the Data Scientist must know the answer to this. How would you determine the answer? By talking to those who collected the data and/or researching where the data came from, for example.
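One way to check how many rows are affected before plotting is to count the missing values yourself. This is a minimal sketch on made-up data; the toy_runs data frame and its values are hypothetical.

```r
# Made-up data with two missing entries (NA), one in each column.
toy_runs <- data.frame(
  bmi = c(21.5, NA, 24.1, 26.3),
  time_minutes = c(25, 31, NA, 28)
)

# Count the missing values in a single column.
missing_bmi <- sum(is.na(toy_runs$bmi))

# Count the rows that have a missing value in any column.
rows_with_missing <- sum(!complete.cases(toy_runs))
```

Comparing a count like rows_with_missing with the number reported in the warning is a quick sanity check.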
Question 1.9
{points: 3}
Which option below best describes the plot above?
A. For runners under the age of 35, there is no relationship at all between BMI and the time it takes to complete a 5 km race.
B. For runners under 35, we see that as BMI increases the time it takes to complete a 5 km race increases. This suggests that there is a positive relationship between these two variables for runners under 35 in this data set.
C. For runners under 35, we see that as BMI increases the time it takes to complete a 5 km race decreases. This suggests that there is a negative relationship between these two variables for runners under 35 in this data set.
Assign your answer to an object called answer1.9. Make sure your answer is an uppercase letter and is surrounded by quotation marks (e.g. "F").
Question 1.10
{points: 1}
Now explore the relationship between the age of all runners and the time taken to complete the 5 km run (in minutes again). Using the original marathon_small data frame, mutate the km5_time_seconds column such that it is in minutes. Save the resulting data frame to an object called marathon_small_mins.
Next, create a scatter plot (similar to the one in Question 1.9) but this time have age on the x-axis. Assign your plot to an object called age_vs_time.
There is a lot missing from the cell below (no hints were given). Try working on it on your own before looking at earlier questions in this tutorial or worksheet_01.
Don't forget to label your axes! Where appropriate, axis labels should also include units (for example, the axis that maps to the column age should have the unit "years").
Question 1.11
{points: 3}
In the plot above, we can see a positive relationship between age and time taken to complete a 5 km run. Is this relationship strong (points are close together) or weak (points are more widely scattered)?
Assign your answer (either "weak" or "strong") to an object called answer1.11. Make sure to write in all lower-case and surround your answer with quotes.
2. Bike-Sharing
Climate change, and solutions to mitigate it, are currently on the tongues and minds of many people. One healthy and environmentally friendly transportation alternative that has recently been gaining popularity is bike-sharing. Apart from their extensive real-world applications in improving health and creating more climate-friendly transit, the data generated by these bike-sharing systems makes them great for research. In contrast to bus and subway transit systems, bike-share transit systems precisely document where a trip starts, where it ends, and how long it lasts, for each individual using the system. This level of individual traceability may allow for better detection of mobility patterns in cities and possible detection of important events.
Today, we will be analyzing data obtained from Capital Bikeshare, a bike-sharing system from Washington, DC. The temperature data (in units of degrees Celsius) has been normalized from the original range so that all values fall between 0 and 1 (a common data processing technique helpful for some machine/statistical learning tools). Our goal is to determine if there is a relationship between temperature and the number of people renting bikes during the Spring (March 20th - June 21st).
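To get a feel for what normalizing to the range 0 to 1 does, here is a sketch of one common approach, min-max scaling. This is an assumption for illustration: the text above does not say which formula was applied to the Capital Bikeshare temperatures, and the values and function name below are made up.

```r
# Made-up temperatures in degrees Celsius.
temps_celsius <- c(-8, 2, 15.5, 39)

# Min-max scaling: subtract the minimum, then divide by the range,
# so the smallest value becomes 0 and the largest becomes 1.
normalize_min_max <- function(x) {
  (x - min(x)) / (max(x) - min(x))
}

temps_normalized <- normalize_min_max(temps_celsius)
temps_normalized
```

The order of the values is preserved, so a scatter plot keeps the same shape; only the axis scale changes.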
Question 2.1 Multiple Choice
{points: 1}
In comparison to bike-sharing systems, why aren't other modes of transportation as useful when it comes to acquiring data?
A. Not as fast.
B. Documentation isn't as precise.
C. Not as environmentally friendly.
D. Bus drivers don't cooperate.
Assign your answer to an object called answer2.1. Make sure your answer is an uppercase letter and is surrounded by quotation marks (e.g. "F").
Question 2.2 Multiple Choice
{points: 1}
What are the units for the normalized temperature?
A. Kelvin
B. Fahrenheit
C. Celsius
Assign your answer to an object called answer2.2. Make sure your answer is an uppercase letter and is surrounded by quotation marks (e.g. "F").
Question 2.3
{points: 1}
Since we already have tidyverse loaded and ready to use, the first step is to read our new data. Add in the missing function and symbol to complete the cell below. Make sure to assign your answer to bike_data.
Question 2.4
{points: 1}
Mutate the data such that you have a new column called total_users. This column should be the sum of the casual_users and registered_users columns. Assign your answer to an object called bike_mutate.
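Adding two columns together inside mutate works row by row. Here is a minimal sketch on made-up data; the toy_library data frame and its column names are hypothetical.

```r
library(tidyverse)

# Made-up daily counts of two kinds of visits.
toy_library <- tibble(
  walk_in_visits = c(12, 30, 7),
  member_visits = c(88, 140, 65)
)

# The new column is computed from the two existing columns, one row at a time.
toy_totals <- mutate(toy_library, total_visits = walk_in_visits + member_visits)
toy_totals
```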
Question 2.5
{points: 1}
Filter the data so that it includes only rentals made during Spring. Name your answer bike_filter.
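Filtering on a column that holds categories uses == to test for equality. The sketch below uses made-up data with hypothetical column names and values. How the season is stored in bike_data is not shown here, so look at its values before writing your condition.

```r
library(tidyverse)

# Made-up data; day_type holds text categories.
toy_visits <- tibble(
  day_type = c("weekday", "weekend", "weekday", "weekend"),
  visits = c(120, 310, 95, 280)
)

# == tests each row for equality; a text value goes in quotes.
toy_weekend <- filter(toy_visits, day_type == "weekend")
toy_weekend
```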
Question 2.6
{points: 3}
Select the columns from the data that we wish to plot. Name your answer bike_select.
Hint: if you have forgotten, scroll up and re-read the introduction to this exercise.
Question 2.7
{points: 3}
Plot the data as a scatter plot.
There is a lot missing from the cell below (no hints were given). Try completing this on your own before looking back at any previous exercises. Assign your plot to an object called bike_plot_spring.
Hint: what do you think should be the x-axis / y-axis? Don't forget to label your axes! Where appropriate, axis labels should also include units (for example, the axis mapped to the temperature column should have the units "normalized degrees Celsius").
Question 2.8
{points: 3}
In 1-2 sentences, describe whether there is a relationship between the variables observed in the scatterplot of the data for the spring season. Comment on the direction and the strength of the relationship (if there is one), and how the variables change with respect to each other (if they do).
DOUBLE CLICK TO EDIT THIS CELL AND REPLACE THIS TEXT WITH YOUR ANSWER.
3. Bike-Sharing Continued
For this exercise, we are going to continue working with the Capital Bikeshare dataset. This part of the tutorial will focus on your understanding of how the functions work and test your ability to write code without hints. Note that we have also intentionally decreased the number of auto-graded questions for the remainder of the tutorial.
Unlike the previous exercise, we now want to determine if there is a relationship between temperature and the number of people renting bikes during Fall (September 22nd - December 21st).
Try completing this Exercise from start to finish without any outside help. If you are struggling with a particular question, look at Exercise 2 for assistance.
Question 3.1 Multiple Choice
{points: 1}
Which column is going to be filtered during this exercise?
A. casual_users
B. season
C. temperature
D. total_users
Assign your answer to an object called answer3.1. Make sure your answer is an uppercase letter and is surrounded by quotation marks (e.g. "F").
Question 3.2
{points: 3}
Recall that the tidyverse package has been loaded and the data has already been read. The next step is to mutate the data such that we have information on all the users. Make sure to save your answer to an object called bike_mutated, and make sure to create a column called total_users.
Question 3.3
{points: 3}
Filter the data so that it includes only rentals made during Fall, and assign this data frame to an object called bike_filtered. Next, select the columns we wish to plot. Name your answer bike_selected.
Question 3.4
{points: 3}
Plot the data as a scatter plot. Label your x-axis: Temperature (Celsius) and your y-axis: Total Users (Casual and Registered). Assign your plot to an object called bike_plot_fall.
Question 3.5
{points: 3}
In one sentence, describe whether there is a relationship observed in the scatter plot for the fall season, and if so, the direction of that relationship.
DOUBLE CLICK TO EDIT THIS CELL AND REPLACE THIS TEXT WITH YOUR ANSWER.
Question 3.6
{points: 3}
Looking at the scatter plots for the spring and the fall seasons, what difference(s) do you see? Based on these two plots, what might you recommend to this company to increase their users?
DOUBLE CLICK TO EDIT THIS CELL AND REPLACE THIS TEXT WITH YOUR ANSWER.