Tutorial 1: Introduction to Data Science
Any place you see ..., you must fill in the function, variable, or data to complete the code. Replace fail() with your completed code and run the cell!
Revision Question Match the following definitions with the corresponding functions used in R:
Definitions
A. Reads the most common types of flat file data, comma separated values and tab separated values, respectively.
B. Keeps only the variables you mention.
C. Applies linear filtering to a univariate time series or to each series separately of a multivariate time series.
D. Executes the transformations iteratively so that later transformations can use the columns created by earlier transformations.
E. Declares the input data frame for a graphic and to specify the set of plot aesthetics intended to be common throughout all subsequent layers unless specifically overridden.
F. Returns the first six rows or values of a vector, matrix, table, data frame or function.
Functions
1. ggplot
2. select
3. head
4. read_csv
5. mutate
6. filter
For every description, create an object using the letter associated with the definition and assign it to the corresponding number from the list of functions. For example: B <- 1
1. Vickers and Vertosick Exercise
We hope you haven't forgotten about them just yet! As you might recall from lecture, Vickers and Vertosick were the researchers who wanted to study different factors affecting the race performance of recreational runners. They assembled a data set that includes the age, sex, and BMI of runners, and compared these with their timed performance (how long it took them to complete either a 5 km or a 10 km run).
We will continue our analysis of their data and practice what you learned during lecture. The goal for today is to produce a plot of BMI against the time (in hours) it took participants over the age of 30 to run 10 km. To do this we will need to do the following:
use filter to subset the rows where age is greater than 30
use select to subset the bmi and km10_time_seconds columns
use mutate to convert 10 km race time from seconds (km10_time_seconds) to hours
use ggplot to create our plot of BMI and race time in hours
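The four steps above can be sketched on a small made-up data frame. This is only an illustration of the pattern, not the solution to the cells below: the object names toy_runners, toy_over_30 and toy_plot and all of the values are invented, and only the column names come from this tutorial.

```r
library(tidyverse)

# Hypothetical toy data: the values are invented for illustration only
toy_runners <- tibble(
  age = c(25, 34, 41, 29, 52),
  sex = c("female", "male", "female", "male", "female"),
  bmi = c(21.5, 24.0, 26.3, 22.8, 23.1),
  km10_time_seconds = c(3000, 3600, 4200, 3300, 3900)
)

toy_over_30 <- toy_runners %>%
  filter(age > 30) %>%                                # keep rows where age is above 30
  select(bmi, km10_time_seconds) %>%                  # keep only the columns to plot
  mutate(km10_time_hours = km10_time_seconds / 3600)  # 3600 seconds in one hour

# Scatter plot with human-readable axis labels
toy_plot <- ggplot(toy_over_30, aes(x = bmi, y = km10_time_hours)) +
  geom_point() +
  xlab("Body mass index") +
  ylab("10 km race time (hours)")

toy_plot
```

Each step takes a data frame and returns a new one, which is why the steps can be chained with the pipe operator.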
Hints for success: Try going through all the steps on your own, but don't forget to talk to others (classmates, TAs, Instructor) if you need help getting unstuck. Work with different functions and if something doesn't work out, read the error message or use the help() function. Since there are a lot of steps to working with and modifying data, feel free to look back at worksheet_01.
Question 1.1 Multiple Choice:
After reading the text above (and remembering that filter lets us choose rows that have values at, above, or below a threshold), what column do you think we will be using for our threshold when we filter?
A. bmi
B. sex
C. age
D. km10_time_seconds
Assign your answer to an object called answer1. Make sure to write the uppercase letter for the answer you have chosen.
Question 1.2 True or False:
We will be selecting the columns age and km10_time_seconds to plot.
Assign your answer to an object called answer2. Make sure to write in all lower-case.
Question 1.3 Multiple Choice:
Select the answer with the correct order of functions that we will use to wrangle our data into a useable form for the plot we want to create.
A. filter, select, mutate
B. mutate, ggplot, select
C. mutate, read_csv, select
D. filter, aes, ggplot
Assign your answer to an object called answer3. Make sure to write the uppercase letter for the answer you have chosen.
Question 1.4
To work on the cells below, load the package "tidyverse".
Question 1.5
With the proper package running, you can now load the data. Replace fail() with the correct function. Assign your data to marathon_small.
Question 1.6
Filter and select the data so that only participants over the age of 30 are included and your data frame has only the columns needed for the plot.
Next, select the columns we wish to plot.
Hint: bmi is already given to you. What else do we want to plot?
Question 1.7
Mutate the data frame to create a new column called km10_time_hours.
Note: we will once again select the specific columns we want to include in our data frame.
Question 1.8
Lastly, generate a scatter plot. Be smart in choosing your axes. If you have trouble remembering the code to create a graph, go to worksheet_01 and read over Graphing. Assign your plot to an object called marathon_plot. Label your axes in a human-readable way (do not leave them as the default column names).
Question 1.9
Do you see any pattern in the relationship between BMI and 10 km race time?
YOUR ANSWER HERE
Question 1.10
Now explore the relationship between the age of all runners and the time taken to complete the 10 km run (in hours again). Do this by creating a scatter plot (similar to the one in Question 1.8).
There is a lot missing from the cell below (no hints were given). Try looking at earlier questions in this tutorial or worksheet_01 to get you started.
Question 1.11
Do you see any pattern in the relationship you explored in Question 1.10? Explain in written English.
YOUR ANSWER HERE
2. Bike Sharing
Climate change, and solutions to mitigate it, are currently on the tongues and minds of many people. One healthy and environmentally friendly transportation alternative that has recently been gaining popularity is bike sharing. Apart from their extensive real-world applications in improving health and creating more climate-friendly transit, these bike sharing systems generate data that makes them great for research. In contrast to bus and subway transit systems, bikeshare transit systems precisely document where a trip starts, where it ends, and how long it lasts, for each individual using the system. This level of individual traceability may allow for better detection of mobility patterns in cities, as well as the potential detection of important events.
Today, we will be analyzing data obtained from Capital Bikeshare (data source), a bike sharing system in Washington, DC. The temperature data (in units of degrees Celsius) has been normalized from the original range so that all values lie between 0 and 1 (a common data processing technique that is helpful for some machine/statistical learning tools). Our goal is to figure out the relationship between temperature and the number of people renting bikes during the Spring (March 20th - June 21st).
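The text above does not say which formula was used to normalize the temperatures. One common way to squeeze values into the range 0 to 1 is min-max scaling, sketched here under that assumption. The function min_max_normalize and the temperature values are hypothetical and are not part of the bike sharing data.

```r
# Hypothetical temperatures in degrees Celsius (values invented for illustration)
temps_celsius <- c(-2, 5, 12, 20, 31)

# Min-max scaling: subtract the lower bound, then divide by the width of the range.
# By default the bounds are the smallest and largest observed values.
min_max_normalize <- function(x, lower = min(x), upper = max(x)) {
  (x - lower) / (upper - lower)
}

temps_normalized <- min_max_normalize(temps_celsius)
temps_normalized
```

With this scaling the coldest value maps to 0 and the warmest to 1, and the ordering of the observations is unchanged, so a scatter plot against normalized temperature has the same shape as one against the original temperatures.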
Question 2.1 Multiple Choice:
In comparison to bikes, why aren't other modes of transportation as useful when it comes to acquiring data?
A. Not as fast.
B. Documentation isn't as precise.
C. Not as environmentally friendly.
D. Bus drivers don't cooperate.
Assign your answer to an object called answer2.1. Make sure your answer is an uppercase letter.
Question 2.2 Multiple Choice:
What are the units for the normalized temperature?
A. Kelvin
B. Fahrenheit
C. Celsius
Assign your answer to an object called answer2.2. Make sure your answer is an uppercase letter.
Question 2.3
Since we already have tidyverse loaded and ready to use, the first step is to read our new data. Add in the missing function and symbol to complete the cell below. Make sure to assign your answer to bike_data.
Question 2.4
Mutate the data so that you have a new column called total_users. This column should be the sum of casual_users and registered_users.
Question 2.5
Filter the data to include only rentals that were made during Spring. Name your answer bike_filter.
Question 2.6
Select the columns that we wish to plot.
Hint: if you have forgotten, scroll up and re-read the introduction to this exercise. Name your answer bike_select.
Question 2.7
Plot the data as a scatter plot.
There is a lot missing from the cell below (no hints were given). Try completing this on your own before looking at Exercise 1 of this tutorial or worksheet_01. Assign your plot to an object called bike_plot_spring.
Hint: what do you think should be the x-axis / y-axis? Don't forget to label your axes!
Question 2.8
In one sentence, describe the trend of your scatterplot of the data plotted above for the spring season.
YOUR ANSWER HERE
3. Bike Sharing Continued...
We are going to continue working with this informative data set, but modify it differently from Exercise 2. This part of the tutorial focuses on your understanding of how functions work and on testing your ability to fill in code correctly to get the right output. No hints will be provided, so you won't see any more ... placeholders. The number of questions with autograding and tests has also been intentionally decreased.
Unlike Exercise 2, we now want to figure out the relationship between temperature and the number of people renting bikes during Fall (September 22nd - December 21st).
Try completing this Exercise from start to finish without any outside help. If you are struggling with a particular question, look at Exercise 2 for assistance.
Question 3.1 Multiple Choice:
What column is going to be filtered in Exercise 3?
A. casual_users
B. season
C. temperature
D. total_users
Assign your answer to an object called answer_filter. Make sure to write a capital letter for the answer you have chosen.
Question 3.2
Remember, you already have tidyverse loaded and you have already read in the data. The next step is to mutate the data so that we have information on all the users. Make sure to save your answer to an object called bike_mutated.
Depending on what you find efficient and easy, use pipe operators or multiple lines of code when needed.
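As a rough illustration of the two styles, the sketch below runs the same three steps once with intermediate objects and once with pipes. The tibble toy_bikes, its values, the object names, and the assumption that season is stored as text labels such as "Summer" are all invented for this example. Only the column names come from this tutorial.

```r
library(tidyverse)

# Hypothetical toy data: values and season labels are assumptions for illustration
toy_bikes <- tibble(
  season = c("Spring", "Summer", "Summer", "Fall"),
  temperature = c(0.45, 0.70, 0.65, 0.30),
  casual_users = c(120, 200, 80, 15),
  registered_users = c(900, 1500, 700, 400)
)

# Style 1: multiple lines of code, saving each intermediate result
toy_step1 <- mutate(toy_bikes, total_users = casual_users + registered_users)
toy_step2 <- filter(toy_step1, season == "Summer")
toy_stepwise <- select(toy_step2, temperature, total_users)

# Style 2: the same steps chained with the pipe operator
toy_piped <- toy_bikes %>%
  mutate(total_users = casual_users + registered_users) %>%
  filter(season == "Summer") %>%
  select(temperature, total_users)

# Check that both styles produce the same data frame
all.equal(toy_stepwise, toy_piped)
```

The piped version avoids naming intermediate objects, while the stepwise version makes it easy to inspect the data after each step. Either is fine as long as the final object has the requested name.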
Question 3.3
Filter the data to include only rentals that were made during Fall. Next, select the columns we wish to plot. Name your answer bike_selected.
Question 3.4
Plot the data as a scatter plot. Assign your plot to an object called bike_plot_fall.
Question 3.5
In one sentence, describe the trend of your scatterplot for the fall season.
YOUR ANSWER HERE
Question 3.6
Looking at the scatterplots for the spring and the fall seasons, what difference(s) do you see? Based on these two plots, what might you recommend to this company to increase its number of users?
YOUR ANSWER HERE