Worksheet 1: Introduction to Data Science
Welcome to DSCI 100: Introduction to Data Science!
Each week you will complete a lecture assignment like this one. Before we get started, there are some administrative details.
You can't learn technical subjects without hands-on practice. The weekly lecture worksheets and tutorials are an important part of the course. The lecture worksheets will automatically be collected at the start of the weekly tutorial. Conversely, the tutorial assignments will automatically be collected at the start of the weekly lecture. This is set up so that you are only working on one thing at a time. Attendance in lectures and tutorials is required. There will be participatory activities in both the lecture and tutorial to help support your learning.
Collaborating on lecture worksheets and tutorial assignments is more than okay -- it's encouraged! You should rarely be stuck for more than a few minutes on questions in lecture or tutorial, so ask a neighbor, TA or an instructor for help (explaining things is beneficial, too -- the best way to solidify your knowledge of a subject is to explain it). Please don't just share answers, though. Everyone must submit a copy of their own work.
You can read more about course policies on the course website.
Lecture and Tutorial Learning Goals:
After completing this week's lecture and tutorial work, you will be able to:
use a Jupyter notebook to execute provided R code
edit code and markdown cells in a Jupyter notebook
create new code and markdown cells in a Jupyter notebook
load the tidyverse library into R
create new variables and objects in R using the assignment symbol
use the help and documentation tools in R
match the names of the following functions from the tidyverse library to their documentation descriptions:
read_csv
select
mutate
filter
ggplot
aes
chain together two functions using the pipe operator, %>%
In this first worksheet you will also learn how to test the answers you write in this worksheet to assess if you answered questions correctly before your assignment is collected.
This worksheet covers parts of Chapter 1 of the online textbook. You should read this chapter before attempting this worksheet.
1. Jupyter notebooks
This webpage is called a Jupyter notebook. A notebook is a place to write programs and view their results.
1.1. Text cells
In a notebook, each rectangle containing text or code is called a cell.
Text cells (like this one) can be edited by double-clicking on them. They're written in a simple format called Markdown to add formatting and section headings. You don't need to learn Markdown, but you might want to.
After you edit a text cell, click the "run cell" button at the top that looks like ▶| to confirm any changes. (Try not to delete the instructions of the lab.)
Question 1.1.1. This paragraph is in its own text cell. Try editing it so that this sentence is the last sentence in the paragraph, and then click the "run cell" ▶| button. This sentence, for example, should be deleted. So should this one.
1.2. Code cells
Other cells contain code in the R language. Running a code cell will execute all of the code it contains.
To run the code in a cell, first click on that cell to activate it. It'll be highlighted with a little green or blue rectangle. Next, either press Run ▶| or hold down the shift key and press return or enter.
Try running the next cell:
The above code cell contains a single line of code, but cells can also contain multiple lines of code. When you run a cell, the lines of code are executed in the order in which they appear. Every print expression prints a line. Run the next cell and notice the order of the output.
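As a minimal sketch of a multi-line cell (the messages here are made up for illustration), each print call below produces one line of output, in the order written:

```r
# Two print expressions in one cell: they run top to bottom.
print("First this line is printed,")
print("and then this one.")
```

Swapping the two lines swaps the order of the output.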
Question 1.2.1. Change the cell above so that it prints out:
Hint: If you're stuck for more than a few minutes, try talking to a neighbor or a TA. That's a good idea for any worksheet or tutorial problem.
1.3. Writing Jupyter notebooks
You can use Jupyter notebooks for your own projects or documents. When you make your own notebook, you'll need to create your own cells for text and code.
To add a cell, click the + button in the menu bar. It'll start out as a code cell. You can change it to a text cell by clicking inside it so it's highlighted, clicking the drop-down box next to the restart (⟳) button in the menu bar, and choosing "Markdown".
Question 1.3.1. Add a code cell below this one. Write code in it that prints out:
Run your cell to verify that it works.
Question 1.3.2. Add a text/Markdown cell below this one. Write the text "A whole new Markdown cell" in it.
1.4. Errors
R is a language, and like natural human languages, it has rules. It differs from natural language in two important ways:
The rules are simple. You can learn most of them in a few weeks and gain reasonable proficiency with the language in a semester.
The rules are rigid. If you're proficient in a natural language, you can understand a non-proficient speaker, glossing over small mistakes. A computer running R code is not smart enough to do that.
Whenever you write code, you'll make mistakes (everyone who writes code does, even your course instructor!). When you run a code cell that has errors, R will sometimes produce error messages to tell you what you did wrong.
Errors are okay; even experienced programmers make many errors. When you make an error, you just have to find the source of the problem, fix it, and move on.
We have made an error in the next cell. Run it and see what happens.
There's a lot of terminology in programming languages, but you don't need to know it all in order to program effectively. If you see a cryptic message like this, you can often get by without deciphering it. (Of course, if you're frustrated, ask a neighbor or a TA for help.)
Try to fix the code above so that you can run the cell and see the intended message instead of an error.
1.5. The Kernel
The kernel is a program that executes the code inside your notebook and outputs the results. In the top right of your window, you can see a circle that indicates the status of your kernel. If the circle is empty (⚪), the kernel is idle and ready to execute code. If the circle is filled in (⚫), the kernel is busy running some code.
You may run into problems where your kernel is stuck for an excessive amount of time, your notebook is very slow and unresponsive, or your kernel loses its connection. If this happens, try the following steps:
At the top of your screen, click Kernel, then Interrupt.
If that doesn't help, click Kernel, then Restart. If you do this, you will have to run your code cells from the start of your notebook up until where you paused your work.
If that doesn't help, restart your server. First, save your work by clicking File at the top left of your screen, then Save and Checkpoint. Next, click Control Panel at the top right. Choose Stop My Server to shut it down, then My Server to start it back up. Then, navigate back to the notebook you were working on.
1.6. Submitting your work
All lecture worksheets and tutorials assignments in the course will be distributed as notebooks like this one. You will complete your work in this notebook and at the due date we will copy this notebook and grade that copy. For lecture worksheets we will use a system called nbgrader that checks your work. For tutorial assignments we will use a combination of nbgrader and manual grading of your work.
2. Numbers
Quantitative information arises everywhere in data science. In addition to representing commands to print out lines, our R code can represent numbers and methods of combining numbers. The expression 3.2500 evaluates to the number 3.25. (Run the cell and see.)
Notice that we didn't have to print. When you run a notebook cell, Jupyter helpfully prints out that value for you.
Above, you should see that the three numbers (2, 3, and 4) are printed out. In R, simply inputting numbers and running the cell will display all the numbers that you listed. Even though we don't need to use print, we will continue to do so in several places in these worksheets so that we are very clear with our intentions.
2.1. Arithmetic
The line in the next cell subtracts. Its value is what you'd expect. Run it.
Same with the cell below. Run it.
Many basic arithmetic operations are built in to R. This webpage describes all the arithmetic operators used in the course. You can refer back to this webpage as you need throughout the term.
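A short sketch of a few of R's built-in arithmetic operators, using made-up numbers; each line is an expression whose value Jupyter would display:

```r
3 - 2    # subtraction
2 * 3    # multiplication
7 / 2    # division (the result can have decimals)
2 ^ 3    # exponentiation
7 %% 2   # remainder after division
```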
3. Names
In natural language, we have terminology that lets us quickly reference very complicated concepts. We don't say, "That's a large mammal with brown fur and sharp teeth!" Instead, we just say, "Bear!"
Similarly, an effective strategy for writing code is to define names for data as we compute it, like a lawyer would define terms for complex ideas at the start of a legal document to simplify the rest of the writing.
In R, we do this with objects. An object has a name on the left side of an <- sign and an expression to be evaluated on the right.
When you run that cell, R first evaluates the first line. It computes the value of the expression 3 * 2 + 4, which is the number 10. Then it gives that value the name answer. At that point, the code in the cell is done running.
After you run that cell, the value 10 is bound to the name answer:
We can name our objects anything we'd like. Above we called it answer, but we could have named it value, data or anything else we desired. A good rule of thumb is to name it something that has meaning to a human as it relates to what we are trying to accomplish with our R code.
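The assignment described above can be sketched as follows; the second line simply evaluates the name so that its value is displayed:

```r
# Evaluate the right-hand side, then bind the result to the name `answer`.
answer <- 3 * 2 + 4

# Typing the name on its own line displays the stored value.
answer
```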
Question 3.1. Add a new code cell. Try creating another object using <- 3 * 2 + 4 with a name different from answer.
A common pattern in Jupyter notebooks is to assign a value to a name and then immediately evaluate the name in the last line in the cell so that the value is displayed as output.
Another common pattern is that a series of lines in a single cell will build up a complex computation in stages, naming the intermediate results.
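A small sketch of building up a computation in stages. The quantities and names below are hypothetical, chosen only to illustrate the pattern of naming intermediate results:

```r
# Name each intermediate value so that later lines read like a sentence.
hours_per_day <- 24
days_per_week <- 7
hours_per_week <- hours_per_day * days_per_week

# Evaluate the final name on the last line to display it.
hours_per_week
```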
Names in R can have letters (upper- and lower-case letters are both okay and count as different letters), underscores, periods, and numbers. The first character can't be a number (otherwise a name might look like a number) or an underscore. And names can't contain spaces, since spaces are used to separate pieces of code from each other.
Other than those rules, what you name something doesn't matter to R. For example, the next cell does the same thing as the above cell, except everything has a different name:
However, names are very important for making your code readable to yourself and others. The cell above is shorter, but it's totally useless without an explanation of what it does.
There is also cultural style associated with different programming languages. In the modern R style, object names should use only lowercase letters, numbers, and _. Underscores (_) are typically used to separate words within a name (e.g., answer_one).
3.1. Comments
Below you see lines like this in code cells:
That is called a comment. It doesn't make anything happen in R; R ignores anything on a line after a #. Instead, it's there to communicate something about the code to you, the human reader. Comments are extremely useful and can help increase how readable our code is.
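As an illustration (the variable names are hypothetical), comments can sit on their own line or after code on the same line:

```r
# This whole line is a comment, so R skips it.
radius <- 2            # everything after the # is ignored too
area <- pi * radius^2  # pi is a constant built into R
area
```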
Question 3.2. Assign the name seconds_in_an_hour to the number of seconds in an hour. You should do this in two steps. First, calculate the number of seconds in a minute and assign that number the name seconds_in_a_minute. Next, calculate the number of seconds in an hour and assign that number the name seconds_in_an_hour.
Hint: there are 60 seconds in a minute and 60 minutes in an hour.
3.2. Checking your code
Now that you know how to name things, you can start using the built-in tests to check whether your work is correct. To do this, you will need to run the cell below to set things up. In future worksheets and tutorial assignments you will see this cell at the very top of the notebook:
Below is an example of a test cell for Question 3.2 above (it assesses whether you have assigned seconds_in_an_hour correctly). If you haven't, this test will tell you the correct answer. Try not to change the contents of the test cells. Resist the urge to just copy the answer, and instead try to adjust your expression. (Sometimes the tests will give hints about what went wrong...)
For this first question we'll provide you the solution:
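One way to write the two-step calculation from Question 3.2, as a sketch that follows the hint (60 seconds in a minute, 60 minutes in an hour):

```r
# Step 1: seconds in one minute.
seconds_in_a_minute <- 60

# Step 2: an hour is 60 minutes, so reuse the value from step 1.
seconds_in_an_hour <- seconds_in_a_minute * 60
seconds_in_an_hour
```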
4. Calling functions
The most common way to combine or manipulate values in R is by calling functions. R comes with many built-in functions that perform common operations.
We used a function, print(), at the beginning of this notebook when we printed text from a code cell. Here we'll demonstrate using another function, toupper(), that converts text to uppercase:
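A minimal sketch of a function call; the input text and the name shouted are made up for illustration:

```r
# Call toupper() with one argument (the text) and name the result.
shouted <- toupper("data science is fun")
shouted
```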
Question 4.1. Use the function tolower to change all the words in the following movie title to lower case text: "The House with a Clock in Its Walls" and assign the lower case text the name title.
4.1. Multiple arguments
Some functions take multiple arguments, separated by commas. For example, the built-in max function returns the largest of the arguments passed to it.
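For instance, with three hypothetical numbers (not necessarily the ones used in the original worksheet cell), a call with multiple arguments looks like this:

```r
# Three arguments, separated by commas; max() returns the largest one.
largest <- max(2, 8, 5)
largest
```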
Question 4.2. Use the min function to find the minimum value of the numbers in the cell above.
Assign the value to an object called smallest.
5. Packages
R has many built-in functions, but we can also use functions that are stored within packages created by other R users. We are going to use a package, called tidyverse, to load, modify and plot data. This package has already been installed for you. Later in the course you will learn how to install packages so you are free to bring in other tools as you need them for your data analysis.
To use the functions from a package you first need to load it using the library function. This needs to be done once per notebook (and a good rule of thumb is to do this at the very top of your notebook so it is easy to see what packages your R code depends on).
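Loading the package is a single line. This sketch assumes tidyverse is already installed, as the worksheet states it is on the course server:

```r
# Load tidyverse once, near the top of the notebook.
# Startup messages listing the attached packages are expected.
library(tidyverse)
```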
Question 5.1. Use the library function to load the rvest R package.
We will use this package next week to scrape data from the web!
6. Looking for help
Help Files
No one, not even an experienced professional programmer, remembers what every function does or every possible function argument/option. So both experienced and new programmers (like you!) need to look things up, A LOT! One of the most efficient places to look for help on how a function works is the R help files. Let’s say we wanted to pull up the help file for the max() function. We can do this by typing a question mark in front of the function we want to know more about:
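A small sketch of looking up help from code. In a notebook you would normally just run ?max; the help() call below is the equivalent long form, and args() is an additional base R function (not mentioned in this worksheet) that shows only the argument list:

```r
# `?max` is shorthand for help("max"); both find the same help page.
# Storing the result lets us inspect it; evaluating it displays the page.
max_help <- help("max")
class(max_help)

# A quicker look: just the function's arguments.
args(max)
```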
At the very top of the file, you will see the function itself and the package it is in (in this case, it is base). Next is a description of what the function does. You’ll find that the most helpful sections on this page are “Usage”, “Arguments” and "Examples".
Usage gives you an idea of how you would use the function when coding--what the syntax would be and how the function itself is structured.
Arguments tells you the different parts that can be added to the function to make it more simple or more complicated. Often the “Usage” and “Arguments” sections don’t provide you with step by step instructions, because there are so many different ways that a person can incorporate a function into their code. Instead, they provide users with a general understanding as to what the function could do and parts that could be added. At the end of the day, the user must interpret the help file and figure out how best to use the functions and which parts are most important to include for their particular task.
The Examples section is often the most useful part of the help file as it shows how a function could be used with real data. It provides a skeleton code that the users can work off of.
Beyond the R help files there are many resources that you can use to find help. Stack Overflow, an online forum, is a great place to go and ask questions such as how to perform a complicated task in R or why a specific error message is popping up. Oftentimes, a previous user will have already asked your question of interest and received helpful advice from fellow R users.
Question 6.1. Use ?read_csv and read the Description section to answer the multiple choice question below. To answer the question, assign the letter associated with the correct answer to a variable in the code cell below:
Which statement below is accurate?
A. read_csv2() uses ; for separators, instead of ,
B. read_delim is a special case of the read_csv function.
C. These functions are useful for reading binary files, such as Excel spreadsheets.
D. European countries commonly use : as the decimal separator.
Answer in the cell below using the uppercase letter associated with your answer. Place your answer between quotation marks ("") and assign it to an object called answer.
7. Exercise
Now that we have learned a little about Jupyter notebooks and R, let's load a real dataset into R and explore it. As we do this we will learn more about key data loading, wrangling and visualization functions in R.
Data about runners!
Researchers Vickers and Vertosick performed a study in 2016 that aimed to identify what factors affect race performance of recreational runners so that they could build better models to predict 5 km, 10 km and marathon race times. Such models can help runners by suggesting changes they could make to modifiable factors, such as training, to help them improve race time. Unmodifiable factors in the model, such as age or sex, allow for fair comparisons to be made between different runners.
Vickers and Vertosick reasoned that their study is important because all previous research done to predict race times has focused on data from elite athletes. This biased data set means that the models generated from it do not necessarily do a good job predicting race times for recreational runners (whose data was not in the dataset that created the models). Additionally, previous research focused on reporting/measuring factors that require special expertise or equipment that are not freely available to recreational runners. This means that recreational runners may not be able to put their characteristics/measurements for these factors in the race time prediction models and so they will not be able to obtain an accurate prediction, or a prediction at all (in the case of some models).
To make a better model, Vickers and Vertosick performed a large survey. They put their survey on the news website Slate.com attached to a news story about race time prediction. They were able to obtain 2,497 responses. The survey included questions that allowed them to collect a data set that included:
age,
sex,
body mass index (BMI),
whether they are an endurance runner or speed demon,
what type of shoes they wear,
what type of training they do,
race time for 2-3 races they completed in the last 6 months,
self-rated fitness for each race,
and race difficulty for each race.
Let's now use this data to explore a question we might be interested in - is there a relationship between 5 km race time and body mass index (BMI) for women runners (if there is, then it might be a useful factor to include in a race time prediction model for these runners). We will answer this question by visualizing the data as a scatter plot using R. To accomplish this, we will need to do the following things in R:
load the data set into R
subset the data we are interested in visualizing from the loaded dataset
create a new column to get the unit of time in minutes instead of seconds
create a scatter plot using this modified data
Question 7.1 Which of the following will you not find included in Vickers and Vertosick's data set?
A. age
B. body mass index
C. self-rated fitness for each race
D. what each runner ate before the race
Assign your answer to an object called answer7.1.
Question 7.2 True or False:
The researchers compiled this data so that they could build better models to predict marathon race times.
Assign your answer to an object called answer7.2.
Question 7.3 What kind of graph will we be creating? Choose the correct answer from the options below.
A. Bar Graph
B. Pie Chart
C. Scatter Plot
D. Box Plot
Assign your answer to an object called answer7.3.
Let's get started with our first step - loading the data set. The data set we are loading is called marathon_small.csv and it contains a subset of the data from the study described above. The file is in the same directory/folder as the file for this notebook. It is a comma separated file (meaning the columns are separated by the , character). We often refer to these files as .csv files.
We can use the read_csv function to do this. Below is an example of reading a .csv file that is in the same directory/folder as the file for the notebook that would be reading it in:
Note - the quotes around the filename are important and you will get an error if you forget them.
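To make the pattern concrete without touching the worksheet's own data, the sketch below first writes a tiny made-up file to a temporary folder and then reads it back. The file name example_file.csv echoes the one used in Question 7.5; its contents and the object names are hypothetical. read_csv comes from the readr package, which library(tidyverse) also loads.

```r
library(readr)  # also loaded by library(tidyverse)

# Create a small, made-up comma-separated file to read.
example_path <- file.path(tempdir(), "example_file.csv")
writeLines(c("name,score", "ana,90", "ben,85"), example_path)

# The file path is text, so it must be in quotes (here it is stored in a variable).
example_data <- read_csv(example_path)
example_data
```

The first row of the file (name,score) becomes the column names, and the remaining rows become the data.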
Question 7.4 Use the read_csv() function to load the data from the marathon_small.csv file into R. Save the data to an object called marathon_small. If you need additional help try ?read_csv and/or ask your neighbours or the Instructional team for help.
The pink output under the code cell above tells you a bit about what happened when read_csv read the data into R. It tells you that 5 columns were created (names: age, bmi, km5_time_seconds, km10_time_seconds and sex) as well as the type of the data in those columns (e.g., number-type or text-type), specifically:
col_double means that the data in this column is a number-type, specifically real numbers (meaning that these values can contain decimals)
col_integer means that the data in this column is a number-type, specifically integers (whole numbers)
col_character means that the data in this column contains text (e.g., letters or words)
Question 7.5 From the list below, which is a valid way to store a data frame object read in from read_csv to an object in R?
A. data -> read_csv("example_file.csv")
B. data <- read_csv("example_file.csv")
C. data <- read_csv"example_file.csv"
D. data <- read_csv(example_file.csv)
Answer in the cell below using the uppercase letter associated with your answer. Place your answer between quotation marks ("") and assign it to an object called answer7.5.
Data frames
We can look at the structure of the data frame using the function head().
head() returns the first 6 elements of a vector or the first 6 rows of a data frame.
By default, read_csv treats the first row of a data file as the header and uses it to label the columns. Therefore, the first row contains descriptive names while the rows below contain the actual data.
This only shows us a small portion of the data set. You can look at the entire data set by simply running a cell with marathon_small (the data frame name) written in it, but the output can be very long and is rarely necessary to look at.
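To see the difference between a preview and the full data, here is a sketch that uses mtcars, a data frame that ships with R, as a stand-in for the worksheet's data:

```r
# head() shows only the first 6 rows of the data frame.
preview <- head(mtcars)
preview

# The full data frame has more rows than the preview.
nrow(preview)
nrow(mtcars)
```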
Question 7.6 To find out how many rows there really are, use the function nrow(). Replace the fail() with your line of code. Assign the number of rows to the object number_rows.
Filter
One of the most useful functions of tidyverse is filter(). With this function, it is possible to keep only the observations (rows) that meet a condition on their entries in one or more columns.
For example, if we had a data set (named data) that looked like this:
we could use the first line of the code in the image below to filter for rows where the colour has the value of "blue". The second line of code below would let us filter for rows where the size has a value greater than 20.
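A sketch of both filters, using a small made-up data frame as a stand-in for the example data. The column names colour, size and shape come from the text; the values and the result names are hypothetical. filter() comes from the dplyr package, which library(tidyverse) also loads.

```r
library(dplyr)  # also loaded by library(tidyverse)

# Hypothetical stand-in for the example data set.
data <- data.frame(
  colour = c("blue", "red", "blue", "green"),
  size   = c(10, 25, 30, 15),
  shape  = c("circle", "square", "triangle", "circle")
)

# Keep only rows where colour is "blue" (note == for comparison).
blue_rows <- filter(data, colour == "blue")

# Keep only rows where size is greater than 20.
large_rows <- filter(data, size > 20)

blue_rows
large_rows
```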
Question 7.7 Use the function filter() to subset your data frame marathon_small so it only contains survey data from females. Assign your new filtered data frame to an object called marathon_filtered. Replace the fail() with your line of code.
Select
The select() function allows you to zoom in and focus on specific parts of the data. It is particularly helpful when working with extremely large datasets. More specifically, the function allows you to separate one or more columns from your dataset and transfer them into their own data frame.
Remembering our example data:
For example, we can use the function select() to choose columns of interest (here colour and shape).
and we would get this smaller data set back:
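With a made-up stand-in for the example data (the column names come from the text, the values are hypothetical), selecting colour and shape can be sketched as:

```r
library(dplyr)  # also loaded by library(tidyverse)

# Hypothetical stand-in for the example data set.
data <- data.frame(
  colour = c("blue", "red", "blue", "green"),
  size   = c(10, 25, 30, 15),
  shape  = c("circle", "square", "triangle", "circle")
)

# Keep only the colour and shape columns, in that order; all rows remain.
colour_shape <- select(data, colour, shape)
colour_shape
```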
Question 7.8 Use the function select to choose the columns bmi and km5_time_seconds from marathon_filtered. Assign your new data frame to an object called marathon_female.
Replace the fail() with your line of code. Make sure you select bmi first and then km5_time_seconds!
Pipe Operators: %>%
The pipe operator allows you to chain together different functions: it takes the output of one statement and makes it the input of the next statement. Having a chain of processing functions is known as a pipeline.
For example, we can combine filter and select into one command:
blue_data <- filter(data, colour == "blue") %>% select(colour, size)
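To run that command end to end, the sketch below supplies a made-up stand-in for data (hypothetical values) and also shows an equivalent form in which the data frame itself is piped into filter():

```r
library(dplyr)  # provides filter(), select() and %>%; also loaded by library(tidyverse)

# Hypothetical stand-in for the example data set.
data <- data.frame(
  colour = c("blue", "red", "blue", "green"),
  size   = c(10, 25, 30, 15),
  shape  = c("circle", "square", "triangle", "circle")
)

# The output of filter() becomes the first argument of select().
blue_data <- filter(data, colour == "blue") %>%
  select(colour, size)

# Equivalent: start the pipeline from the data frame.
blue_data_piped <- data %>%
  filter(colour == "blue") %>%
  select(colour, size)

blue_data
```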
Since we want to specifically plot data of female participants, we need to first filter the sex column using the function filter(). Below, you can see how this function and the pipe operator (%>%) are used. Then we need to select the column variables that we wish to look at. Since we want to plot BMI against the time it took to run 5 km, we must select bmi and km5_time_seconds accordingly. For this, we need to use the function select().
The following cell shows you how we can chain together filter and select for the marathon dataframe.
Question 7.9 Why do we only write marathon_small (original data frame) for the function: filter()?
A. Because select does not require the original data frame as an argument.
B. Because the pipe operator uses the data frame in the first line as the data frame for all subsequent lines.
C. Because the pipe operator uses the output of the first function as the input of the second function.
Answer in the cell below using the uppercase letter associated with your answer. Place your answer between quotation marks ("") and assign it to an object called answer7.9.
Question 7.10 What are the units of the time taken to complete a run of 5 km?
Hint: scroll up and look at the introduction to this exercise.
Question 7.11 What are the units for time (e.g., seconds, minutes, hours) that we would like to use when plotting BMI against time taken to run 5 km? Hint: scroll up and look at the introduction to this exercise.
Mutate
The function mutate() is used to add columns to an existing dataset, where the new column is usually a function of one or more of the existing columns.
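A sketch of mutate() on a made-up data frame; the names races, distance_m and distance_km are hypothetical. The new column is computed from an existing one, row by row:

```r
library(dplyr)  # also loaded by library(tidyverse)

# Hypothetical data: race distances in metres.
races <- data.frame(
  race       = c("5 km", "10 km", "marathon"),
  distance_m = c(5000, 10000, 42195)
)

# Add a column in kilometres; the existing columns are kept.
races <- mutate(races, distance_km = distance_m / 1000)
races
```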
Question 7.12 Add a new column to our marathon_female dataset called km5_time_minutes that is equal to km5_time_seconds/60.
Graphing
ggplot is a function that works using layers of code. Every time you want to see something new added to your plot, you must add a new layer, with each layer being separated by the “+” symbol. The first function we use in this line of code is the ggplot function. Here, we indicate the arguments that apply to all layers of the plot. The second function we use is geom_point(). This function indicates that we wish to produce a scatterplot and the way we wish to display the data within this scatterplot.
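The two layers described above can be sketched on a small made-up data frame (the names toy, height_cm and weight_kg and their values are hypothetical). ggplot() and aes() set up what applies to every layer, and geom_point() adds the points:

```r
library(ggplot2)  # also loaded by library(tidyverse)

# Hypothetical data for illustration.
toy <- data.frame(
  height_cm = c(150, 160, 170, 180),
  weight_kg = c(50, 58, 66, 80)
)

# Layer 1: the data and which columns map to the x and y axes.
# Layer 2 (after the +): draw each row as a point.
toy_plot <- ggplot(toy, aes(x = height_cm, y = weight_kg)) +
  geom_point()

toy_plot
```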
Let's plot a scatterplot with bmi on the x axis and km5_time_minutes on the y axis.
Question 7.13 Looking at the graph above, choose the statement below that best reflects what we see.
A. There may be a positive trend/relationship between 5 km run time and body mass index; as the value for body mass index increases, so does the time it takes to run 5 km.
B. There may be a negative trend/relationship between 5 km run time and body mass index; as the value for body mass index increases, the time it takes to run 5 km decreases.
C. There appears to be no trend/relationship between 5 km run time and body mass index; as the value for body mass index increases we see neither an increase nor a decrease in the time it takes to run 5 km.
Assign your answer to an object called answer7.13.
The code we listed above for graphics barely scratches the surface of what ggplot, and R as a whole, are capable of. Not only are there far more choices about the kinds of plots available, but there are many, many options for customizing the look and feel of each graph. You can choose the font, the font size, the colors, the style of the axes, etc.
Let’s dig a little deeper into just a couple of options that you can add to any of your graphs to make them look a little better. For example, you can change the text of the x-axis label or the y-axis label by using xlab("") or ylab(""). Let’s do that for the scatterplot to make the labels easier to read.
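As a sketch on made-up data (the names toy, height_cm and weight_kg and the label text are hypothetical), axis labels are added as two more layers:

```r
library(ggplot2)  # also loaded by library(tidyverse)

# Hypothetical data for illustration.
toy <- data.frame(
  height_cm = c(150, 160, 170, 180),
  weight_kg = c(50, 58, 66, 80)
)

# xlab() and ylab() replace the default labels (the raw column names).
labelled_plot <- ggplot(toy, aes(x = height_cm, y = weight_kg)) +
  geom_point() +
  xlab("Height (cm)") +
  ylab("Weight (kg)")

labelled_plot
```

Including the units in the label text saves readers from having to guess them.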
Attributions
UC Berkeley Data 8 Public Materials