Tutorial 1: Introduction to Data Science
Any place you see ..., you must fill in the function, variable, or data to complete the code. Replace fail() with your completed code and run the cell!
Reminder: All autograded questions (i.e., questions with tests) are worth 1 point and all hidden test and manually graded questions are worth 3 points.
Revision Question Match the following definitions with the corresponding functions used in R:
{points: 1}
Definitions
A. Reads the most common types of flat file data, comma separated values and tab separated values, respectively.
B. Keeps only the variables you mention.
C. Keeps only rows with entries satisfying some logical condition that you specify.
D. Adds a new variable to a data frame as a function of the old columns.
E. Declares the input data frame for a graphic and specifies the set of plot aesthetics intended to be common throughout all subsequent layers unless specifically overridden.
F. Returns the first six rows or values of a vector, matrix, table, data frame or function.
Functions
1. ggplot
2. select
3. head
4. read_csv
5. mutate
6. filter
For every description, create an object using the letter associated with the definition and assign it to the corresponding number from the list of functions. For example: B <- 1
1. Vickers and Vertosick Exercise
We hope you haven't forgotten about them just yet! As you might recall from lecture, Vickers and Vertosick were the researchers who wanted to study different factors affecting the race performance of recreational runners. They assembled a data set that includes the age, sex, and BMI of runners, comparing it with their timed performance (how long it took them to complete either 5 or 10 km runs).
We will be continuing our analysis on their data to practice what you learnt during lecture. The goal for today, however, is to produce a plot of BMI against the time (in minutes) it took for participants under the age of 35 to run 5 kilometres. To do this we will need to do the following:
- use filter to extract the rows where age is less than 35
- use select to extract the bmi and km5_time_seconds columns
- use mutate to convert 5 km race time from seconds (km5_time_seconds) to minutes
- use ggplot to create our plot of BMI (x-axis) and race time in minutes (y-axis)
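The four steps above can be sketched on a small made-up data frame. The values in toy_runners are invented for illustration; only the column names age, bmi, and km5_time_seconds come from the text, and the real data set will differ. Treat this as one possible shape of the workflow rather than the expected solution.

```r
library(dplyr)
library(ggplot2)

# Hypothetical toy data; the real data set has many more rows and columns.
toy_runners <- data.frame(
  age = c(25, 41, 33, 29, 52),
  bmi = c(21.5, 24.0, 23.1, 26.3, 22.8),
  km5_time_seconds = c(1500, 1800, 1650, 1920, 2100)
)

# 1. Keep only the rows where age is less than 35.
toy_filtered <- filter(toy_runners, age < 35)

# 2. Keep only the two columns needed for the plot.
toy_selected <- select(toy_filtered, bmi, km5_time_seconds)

# 3. Add a column with the race time converted from seconds to minutes.
toy_minutes <- mutate(toy_selected, km5_time_minutes = km5_time_seconds / 60)

# 4. Scatter plot of BMI (x-axis) against race time in minutes (y-axis),
#    with human-readable axis labels.
toy_plot <- ggplot(toy_minutes, aes(x = bmi, y = km5_time_minutes)) +
  geom_point() +
  xlab("Body mass index") +
  ylab("5 km race time (minutes)")
toy_plot
```

Saving each intermediate result under its own name makes it easy to inspect the data frame after every step.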
Hints for success: Try going through all the steps on your own, but don't forget to talk to others (classmates, TAs, Instructor) if you need help getting unstuck. Work with different functions and if something doesn't work out, read the error message or use the help() function. Since there are a lot of steps to working and modifying data, feel free to look back at worksheet_01.
Question 1.1 Multiple Choice
{points: 1}
After reading the text above (and remembering that filter lets us choose rows that have values at, above, or below a threshold), what column do you think we will be using for our threshold when we filter?
A. age
B. km5_time_seconds
C. bmi
D. sex
Assign your answer to an object called answer1. Make sure to write the uppercase letter for the answer you have chosen and surround the letter with quotes.
Question 1.2 True or False
{points: 1}
We will be selecting the columns age and km5_time_seconds to plot. True or false?
Assign your answer (of either true or false) to an object called answer2. Make sure to write in all lower-case and surround the word with quotes.
Question 1.3 Multiple Choice
{points: 1}
Select the answer with the correct order of functions that we will use to wrangle our data into a useable form for the plot we want to create.
A. mutate, ggplot, select
B. mutate, read_csv, select
C. filter, select, mutate
D. filter, aes, ggplot
Assign your answer to an object called answer3. Make sure to write the uppercase letter for the answer you have chosen and surround the letter with quotes.
Question 1.4
{points: 1}
To work on the cells below, load the package "tidyverse".
Question 1.5
{points: 1}
With the proper package loaded, you can now load the data.
Replace fail() with the correct function. Assign your data to an object called marathon_small.
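As a hedged illustration of how read_csv turns a comma-separated file into a data frame, the sketch below writes a tiny made-up file to a temporary location and reads it back. The file contents and the names toy_path and toy_data are hypothetical; the tutorial's actual data file and its path are not shown here.

```r
library(readr)

# Write a small, made-up CSV file to a temporary path (hypothetical data).
toy_path <- tempfile(fileext = ".csv")
writeLines(
  c("age,bmi,km5_time_seconds",
    "25,21.5,1500",
    "41,24.0,1800"),
  toy_path
)

# read_csv parses the comma-separated file into a data frame (a tibble),
# using the first line as the column names.
toy_data <- read_csv(toy_path)
toy_data
```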
Question 1.6
{points: 1}
Filter and select the data so that it includes only participants under the age of 35 and your data frame contains only the columns needed for the plot.
Hint: bmi is already given to you. What else do we want to plot?
Question 1.7
{points: 1}
Mutate the data frame to create a new column called km5_time_minutes.
Note: we will be selecting once again the specific columns we want to include in our data frame.
Question 1.8
{points: 1}
Lastly, generate a scatter plot. Assign your plot to an object called marathon_plot.
Label your axes in a human readable way (do not leave them as default column names).
Note: the warning above simply tells us the number of rows that had missing data in the data set, and that these rows were not plotted. When you see something like this, you should stop and think: do I expect missing rows in my data? Sometimes the answer is yes, sometimes it is no. It depends on the data set, and you as the Data Scientist need to know the answer to this. How do you know the answer? By talking to those who collected the data and/or researching where the data came from, for example.
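One way to see where such a warning comes from is to count the missing values (NA in R) in the columns being plotted. The data frame toy_with_na below is made up for illustration and is not the tutorial's data.

```r
# Hypothetical data with two incomplete rows.
toy_with_na <- data.frame(
  bmi = c(21.5, NA, 23.1, 26.3),
  km5_time_minutes = c(25.0, 30.0, NA, 32.0)
)

# Number of missing values in each column.
colSums(is.na(toy_with_na))

# Number of rows with a missing value in at least one column.
# A scatter plot of these two columns cannot draw such rows.
n_incomplete <- sum(!complete.cases(toy_with_na))
n_incomplete
```

Comparing this count with the number reported in the warning is a quick check that the dropped rows are the ones you expect.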
Question 1.9
{points: 3}
Which sentences below best describe the plot above? One or more may be correct.
A. There is no relationship between BMI and the time it takes runners under the age of 35 to complete a 5 km race.
B. For runners under 35, we see that as BMI increases so does the time it takes to complete a 5 km race. This suggests that there is a positive relationship between these two variables for runners under 35 in this data set.
C. For runners under 35, we see that as BMI increases the time it takes to complete a 5 km race decreases. This suggests that there is a negative relationship between these two variables for runners under 35 in this data set.
D. For runners under 35, we see that as BMI decreases the time it takes to complete a 5 km race increases. This suggests that there is a negative relationship between these two variables for runners under 35 in this data set.
Assign your answer to an object called answer1.9. Make sure to write the uppercase letter for the answer you have chosen and surround the letter with quotes.
Question 1.10
{points: 1}
Now explore the relationship between the age of all runners and the time taken to complete the 5 km run (in minutes again). Using the original marathon_small data frame, mutate the km5_time_seconds column such that it is in minutes. Next, create a scatter plot (similar to the one in Question 1.8) but this time have age on the x-axis. Assign your answer to an object called age_vs_time.
There is a lot missing from the cell below (no hints were given). Try working on it on your own before looking at earlier questions in this tutorial or worksheet_01.
Question 1.11
{points: 3}
In the plot above, we see a positive relationship between age and time taken to complete a 5 km run. Is this positive relationship strong (points closely follow a line/path) or weak (points are more widely scattered)?
Assign your answer (of either strong or weak) to an object called answer1.11. Make sure to write in all lower-case and surround the word with quotes.
2. Bike Sharing
Climate change, and solutions to mitigate it, are currently on the tongues and minds of many people. One healthy and environmentally friendly transportation alternative that has recently been gaining popularity is bike sharing. Apart from their extensive real-world applications in improving health and creating more climate-friendly transit, the data being generated by these bike sharing systems makes them great for research. In contrast to bus and subway transit systems, bikeshare transit systems precisely document where a trip starts, where it ends, and how long it lasts, for each individual using the system. This level of individual traceability may allow for better detection of mobility patterns in cities, as well as the potential detection of important events.
Today, we will be analyzing data obtained from Capital Bikeshare, a bike sharing system from Washington, DC. The temperature data (in units of degrees Celsius) has been normalized from the original range so that all values are between 0 and 1 (a common data processing technique helpful for some machine/statistical learning tools). Our goal is to figure out the relationship between temperature and the number of people renting bikes during the Spring (March 20th - June 21st).
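A common way to rescale values into the range 0 to 1 is min-max normalization, x' = (x - min) / (max - min). Whether this exact formula was used for the Capital Bikeshare temperatures is an assumption here, and the temperatures below are invented for illustration.

```r
# Hypothetical temperatures in degrees Celsius.
temp_celsius <- c(-8, 5, 15.5, 27, 39)

# Min-max normalization: the smallest value maps to 0 and the largest to 1,
# and every other value lands in between in proportion to its position.
temp_normalized <- (temp_celsius - min(temp_celsius)) /
  (max(temp_celsius) - min(temp_celsius))
temp_normalized
```

The ordering of the values is preserved, so a warmer day still has a larger normalized temperature than a cooler one.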
Question 2.1 Multiple Choice
{points: 1}
In comparison to bikes, why aren't other modes of transportation as useful when it comes to acquiring data?
A. Not as fast.
B. Documentation isn't as precise.
C. Not as environmentally friendly.
D. Bus drivers don't cooperate.
Assign your answer to an object called answer2.1. Make sure the correct answer is an uppercase letter.
Question 2.2 Multiple Choice
{points: 1}
What are the units for the normalized temperature?
A. Kelvin
B. Fahrenheit
C. Celsius
Assign your answer to an object called answer2.2. Make sure the correct answer is an uppercase letter.
Question 2.3
{points: 1}
Since we already have tidyverse loaded and ready to use, the first step is to read our new data. Add in the missing function and symbol to complete the cell below. Make sure to assign your answer to bike_data.
Question 2.4
{points: 1}
Mutate the data such that you have a new column called total_users. This column should be the sum of casual_users and registered_users.
Question 2.5
{points: 1}
Filter the data to include only rentals that were made during Spring. Name your answer bike_filter.
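Filtering on a category works like filtering on a numeric threshold, except that the condition tests for equality with ==. The sketch below assumes the season column stores text labels such as "Spring"; the actual encoding in the bike data may differ, and toy_seasons is a made-up data frame.

```r
library(dplyr)

# Hypothetical data; the season labels are assumed to be stored as text.
toy_seasons <- data.frame(
  season = c("Spring", "Fall", "Spring", "Winter"),
  total_users = c(100, 145, 190, 60)
)

# == tests for equality, so only rows whose season is "Spring" are kept.
toy_spring <- filter(toy_seasons, season == "Spring")
toy_spring
```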
Question 2.6
{points: 3}
Select data from the columns that we wish to plot.
Hint: if you have forgotten, scroll up and re-read the introduction to this exercise. Name your answer bike_select.
Question 2.7
{points: 3}
Plot the data as a scatter plot.
There is a lot missing from the cell below (no hints were given). Try completing this on your own before looking at Exercise 1 of this tutorial or worksheet_01. Assign your plot to an object called bike_plot_spring.
Hint: what do you think should be the x-axis / y-axis? Don't forget to label your axes!
Question 2.8
{points: 3}
In 1-2 sentences, describe whether there is a relationship between the variables observed in the scatterplot of the data for the spring season. Comment on the direction and the strength of the relationship (if there is one), and how the variables change with respect to each other (if they do).
YOUR ANSWER HERE
3. Bike Sharing Continued...
We are going to continue working with this informative data set but modify it from Exercise 2. This part of the tutorial will focus on your understanding of how functions work and test your practice of correctly filling in code to get the right output. No hints will be provided, so you won't be seeing ... any more. The number of questions with autograding and tests has also been intentionally decreased.
Unlike Exercise 2, now we want to figure out the relationship between temperature and the number of people renting bikes during Fall (September 22nd - December 21st).
Try completing this Exercise from start to finish without any outside help. If you are struggling with a particular question, look at Exercise 2 for assistance.
Question 3.1 Multiple Choice
{points: 1}
What column is going to be filtered in Exercise 3?
A. casual_users
B. season
C. temperature
D. total_users
Assign your answer to an object called answer_filter. Make sure to write a capital letter for the answer you have chosen.
Question 3.2
{points: 3}
Remember, you already have tidyverse loaded and you have already read in the data. The next step is to mutate the data such that we have information on all the users. Make sure to save your answer to an object called bike_mutated.
Question 3.3
{points: 3}
Filter the data to include only rentals that were made during Fall, and assign this data frame to an object called bike_filtered. Next, select the columns we wish to plot. Name your answer bike_selected.
Question 3.4
{points: 3}
Plot the data as a scatter plot. Assign your plot to an object called bike_plot_fall.
Question 3.5
{points: 3}
In one sentence, describe whether there is a relationship observed in the scatter plot for the fall season, and if so, the direction of that relationship.
YOUR ANSWER HERE
Question 3.6
{points: 3}
Looking at the scatter plots for the spring and the fall seasons, what difference(s) do you see? Based on these two plots, what might you recommend to this company to increase their users?
YOUR ANSWER HERE