Tutorial 2: Introduction to Reading Data
Any place you see `...`, you must fill in the function, variable, or data to complete the code. Replace `fail()` with your completed code and run the cell!
1. Happiness Report
As you might remember from worksheet_02, we practiced loading data from the Sustainable Development Solutions Network's World Happiness Report. That data was the output of their analysis that calculated each country's happiness score and how much each variable contributed to it. In this tutorial, we are going to look at the data at an earlier stage of the study - the aggregated/averaged values (per country and per year) for many different social and health aspects that the researchers anticipated might contribute to happiness (Table 2.1 from this Excel spreadsheet).
The goal for today is to produce a plot of 2017 positive affect against healthy life expectancy at birth, where we plot healthy life expectancy at birth on the x-axis and positive affect on the y-axis. For this study, positive affect was defined as the average of three positive affect measures, specifically happiness, laugh and enjoyment. We would also like to convert the positive affect score from a scale of 0 - 1 to a scale from 0 - 10.
- use `filter` to subset the rows where the year is equal to 2017
- use `select` to subset the "Healthy life expectancy at birth" and "Positive affect" columns
- use `mutate` to convert the "Positive affect" score from a scale of 0 - 1 to a scale from 0 - 10
- use `ggplot` to create our plot of "Healthy life expectancy at birth" (x-axis) and "Positive affect" (y-axis), as sketched below
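The general shape of that pipeline is sketched here. This is not the graded answer: `happy_df` and the dot-separated column names are placeholders for the objects and cleaned names you will build in the questions that follow.

```r
library(tidyverse)

# A sketch of the filter/select/mutate/ggplot pipeline; happy_df and the
# dot-separated column names are placeholders for the objects and cleaned
# names you will build below.
happy_df %>%
  filter(year == 2017) %>%
  select(Healthy.life.expectancy.at.birth, Positive.affect) %>%
  mutate(Positive.affect = Positive.affect * 10) %>%
  ggplot(aes(x = Healthy.life.expectancy.at.birth, y = Positive.affect)) +
  geom_point() +
  xlab("Healthy life expectancy at birth (years)") +
  ylab("Positive affect (scale of 0 - 10)")
```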
Hints for success: Try going through all the steps on your own, but don't forget to talk to others (classmates, TAs, instructor) if you need help getting unstuck. Work with different functions, and if something doesn't work out, read the error message or use the `help()` function. Since there are a lot of steps to working with and modifying data, feel free to look back at worksheet_02.
Question 1.1 Multiple Choice:
{points: 1}
What is the maximum value for the Positive affect score (in the original data file that you read into R)?
A. 100
B. 10
C. 1
D. 0.1
E. 5
Assign your answer to an object called `answer1`.
Question 1.2 Multiple Choice:
{points: 1}
Which column's values will be used to filter the data?
A. countries
B. generosity
C. positive affect
D. year
Assign your answer to an object called `answer2`.
Question 1.3.0
{points: 1}
Use the appropriate `read_*` function to read in the `WHR2018Chapter2OnlineData` file (look in the `tutorial_02` directory to ensure you use the correct relative path to read it in) and name it `happy_df_csv`.
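If you are unsure of the shape of the call, a minimal sketch follows; the relative path is an assumption, so adjust it to match where the file actually sits in the `tutorial_02` directory.

```r
library(tidyverse)

# The relative path is an assumption -- check where the .csv actually
# sits in the tutorial_02 directory and adjust accordingly.
happy_df_csv <- read_csv("WHR2018Chapter2OnlineData.csv")
```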
Question 1.3.1
{points: 1}
Above, you loaded the data from a file we already downloaded and converted to a `.csv` for you. But you can also use the `readxl` R package to load Excel files directly into R. Given that the data we loaded above (`WHR2018Chapter2OnlineData.csv`) was originally sourced from an Excel file on the web, let's now read that Excel file directly into R using the `read_excel` function from that package. This Excel file has multiple sheets; the data we want is on the first one.

Note: `read_excel` does not support putting a URL as the file path argument, so we need to first download the file and write it to disk using R's `download.file` function, and then we can read that saved Excel file into R using `read_excel`.
To answer the question, fill in the blanks in the code below. If you are unsure, try reading the documentation for the new functions and ask others for help!
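For reference, the filled-in cell follows this general download-then-read pattern. The URL below is a placeholder (substitute the spreadsheet's real web address), and the object name `happy_df` is an assumption that matches what Question 1.3.2 expects.

```r
library(readxl)

# Placeholder URL -- substitute the spreadsheet's real web address.
url <- "https://example.com/WHR2018Chapter2OnlineData.xls"

# read_excel() cannot read from a URL, so download the file to disk first...
download.file(url, destfile = "WHR2018Chapter2OnlineData.xls")

# ...then read the first sheet (the one we want) from the saved file.
happy_df <- read_excel("WHR2018Chapter2OnlineData.xls", sheet = 1)
```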
Look at the column names - they contain spaces! This is not a best practice and will make it difficult to use our tidyverse functions. Run the cell below to use the `make.names` function, which will replace all the spaces with a `.` so we don't have this problem. The `colnames` function is also needed to access the data frame's column names.
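That cell likely amounts to a single line, something like this (assuming the data frame from the previous question is named `happy_df`):

```r
# Overwrite the column names with syntactically valid versions;
# make.names() replaces spaces and other illegal characters with ".".
colnames(happy_df) <- make.names(colnames(happy_df))
```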
Question 1.3.2
{points: 1}
Using the scaffolding given in the cell below, filter, select, and mutate the `happy_df` data frame as needed to get it ready to create our desired scatterplot. Use `select` to subset the "Healthy life expectancy at birth" and "Positive affect" columns. Name this new data frame `reduced_happy_df`. Replace the `...` to complete your answer.
Question 1.4
{points: 3}
Using the modified data set, `reduced_happy_df`, generate the scatterplot described above and make sure to label the axes in proper written English.

Assign your answer to an object called `happy_plot`.
Question 1.5
{points: 2}
In one or two sentences, describe what you see in the scatterplot above. Does there appear to be a correlation between healthy life expectancy at birth and positive affect?
DOUBLE CLICK TO EDIT THIS CELL AND REPLACE THIS TEXT WITH YOUR ANSWER.
Question 1.6
{points: 3}
Choose any other column in the data set to plot against healthy life expectancy at birth. Create the scatter plot in the cell below (and remember to create proper English axis labels) with healthy life expectancy at birth on the x-axis. Name the plot `happy_plot_2`.
Question 1.7
{points: 3}
In a sentence or two, describe what you see in the scatterplot above. Does there appear to be a correlation between healthy life expectancy at birth and the other variable you plotted?
DOUBLE CLICK TO EDIT THIS CELL AND REPLACE THIS TEXT WITH YOUR ANSWER.
2. Whistler Snow
Skiing and snowboarding are huge in British Columbia. Some of the most perfect slopes are found close to home. In fact, just two hours north of Vancouver, we reach the mountain town of Whistler. With cold weather and plenty of snowfall, Whistler is the ideal destination for winter sports fanatics.

One thing skiers and snowboarders want is fresh snow! When are they most likely to find this? In the `data` directory we have two year-long data sets from Environment Canada from the Whistler Roundhouse Station (on Whistler mountain). This weather station is located 1,835 m above sea level.

To answer the question of when skiers and snowboarders are most likely to find fresh snow (and when you should plan your next ski trip to Whistler), you will create a line plot where the date is on the x-axis and the total snow per day in centimetres (the column named `Total Snow cm` in the data file) is on the y-axis. Given that we have data for 2 years (2017 & 2018), we will create two plots, one for each year, and see if there is any consistent trend we can observe across the two years.
Question 2.1 Multiple Choice:
{points: 1}
What are we going to plot on the y-axis?
A. total precipitation per day in centimetres
B. total snow on the ground in centimetres
C. total snow per day in centimetres
D. total rain per day in centimetres
Assign your answer to an object called `answer2.1`.
Question 2.2.0
{points: 1}
Read in the file named `eng-daily-01012018-12312018.csv` from the `data` directory. Preview the file to choose the correct `read_*` function and argument values to get the data into R. Name it `whistler_2018`.
Note: You'll see a lot of entries of the form `NA`. This is the symbol R uses to denote missing data. Interestingly, you can do math and make comparisons with `NA`: for example, `NA + 1 = NA`, `NA * 3 = NA`, `NA > 3 = NA`. Most operations on `NA` return `NA`. This may seem a bit weird, but it makes things much simpler in R since it removes the need to write any special code to handle missing data!
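You can verify this behaviour in any R session; the last line below also previews the `na.rm` argument, which tells summary functions to drop `NA` values before computing:

```r
NA + 1    # NA
NA * 3    # NA
NA > 3    # NA

mean(c(1, NA, 3))                 # NA -- the missing value propagates
mean(c(1, NA, 3), na.rm = TRUE)   # 2  -- drop NAs before averaging
```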
Question 2.2.1
{points: 1}
Looking at the column names of the `whistler_2018` data frame, you can see we have the whitespace-in-column-names problem again. Use `make.names` to remove them.
Question 2.3
{points: 1}
Create a line plot where the date is on the x-axis and the total snow per day in centimetres is on the y-axis by filling in the `...` in the code below. Before you create your plot, use `make.names` to remove the whitespace from all the column names. Do not modify the column names further or in any other way (otherwise the autograding won't work for this question).

Ensure you create axis labels that are formatted in proper English. Assign your plot to an object called `whistler_2018_plot`.
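If you get stuck, the answer has roughly the following shape. The cleaned column names used here (`Date.Time`, `Total.Snow.cm`) are assumptions about what the headers become after `make.names`; check `colnames(whistler_2018)` for the real ones.

```r
# Clean the column names first, as the question requires.
colnames(whistler_2018) <- make.names(colnames(whistler_2018))

# If the date column was read in as character, convert it with as.Date()
# before plotting.
whistler_2018_plot <- whistler_2018 %>%
  ggplot(aes(x = Date.Time, y = Total.Snow.cm)) +
  geom_line() +
  xlab("Date") +
  ylab("Total snow per day (cm)")

whistler_2018_plot
```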
Question 2.4
{points: 2}
Looking at the line plot above for 2018, of the months when it snowed, which two months had the most fresh snow?
DOUBLE CLICK TO EDIT THIS CELL AND REPLACE THIS TEXT WITH YOUR ANSWER.
Question 2.5
{points: 3}
Repeat the data loading and plot creation using the file `eng-daily-01012017-12312017.csv` to visualize the same data for the year 2017. Name the plot `whistler_2017_plot`.
Question 2.6
{points: 3}
Looking at the line plot above for 2017, of the months when it snowed, which months had the most fresh snow? Are these the same as in 2018? (Hint: you might want to add a code cell where you plot the two plots right after each other so you can easily compare them in one screen view.) Is there any advantage to looking at 2 years' worth of data?
DOUBLE CLICK TO EDIT THIS CELL AND REPLACE THIS TEXT WITH YOUR ANSWER.
3. Reading from a Database
In worksheet_02, you'll recall that we opened a database stored in a `.db` file. This involved a lot more effort than just opening a `.csv`, `.tsv`, or any of the other plain-text / Excel formats. To be honest, it was a bit of a pain to use a database in that setting, since we had to use `dbplyr` to translate tidyverse-like commands (`filter`, `select`, `head`, etc.) into SQL commands that the database understands. We didn't run into this problem in the worksheet, but not all tidyverse commands can currently be translated for SQLite databases. For example, with an SQLite database we can compute a mean, but we can't easily compute a median.
Why should we bother with databases at all then?
Databases become really useful in a large-scale setting:

- they enable storing large datasets across multiple computers with automatic redundancy and backups
- they enable multiple users to access them simultaneously and remotely without conflicts and errors
- they provide mechanisms for ensuring data integrity and validating input
- they provide security to keep data safe

For example, there are around 4 billion Google searches conducted daily as of 2019. Can you imagine if Google stored all of the data from those queries in a single `.csv` file!? Chaos would ensue.
To reap the real benefits of databases, we'll need to move to a more fully-powered one: PostgreSQL. We'll begin by loading the `DBI` and `dbplyr` packages that R uses to talk to databases, as well as the `RPostgres` package that provides the interface between these packages and PostgreSQL databases (note the similarity to the `RSQLite` package from worksheet_02).
Investigating Trends in Crowdfunding
Kickstarter is an online crowd-funding site where people can post projects they want to do, but don't have the financial resources needed to complete them by themselves. Others can pledge money to the project, and the pledges are fulfilled if the funding reaches a goal amount. It's been around for a few years now; in this section, we'll investigate how the amount of funding successful projects get has changed over time.
Question 3.0
{points: 2}
Databases are often stored remotely (i.e., not on your computer or on this JupyterHub). Your first task is to load the Kickstarter data from a PostgreSQL database stored remotely on the UBC statistics network.
URL: r7k3-mds1.stat.ubc.ca
Username: dsci100
Password: dsci100
Database Name: kickstarter
Table Name: projects
We've provided the code to do this below. Replace each `...` with one of the 5 items above.
Note 1: Due to the UBC firewall, to get this to work you'll need to be connected to the UBC network or use the UBC VPN.
Note 2: As this database will be used by the entire class, you will only have read access (no write permissions).
Assign the resulting database connection object to `connection` and the project table data to `project_data`.
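A sketch of what the completed cell might look like, using the connection details listed above (the exact scaffolding in your notebook may differ):

```r
library(tidyverse)
library(DBI)
library(dbplyr)
library(RPostgres)

# Connect using the credentials listed above.
connection <- dbConnect(RPostgres::Postgres(),
                        host = "r7k3-mds1.stat.ubc.ca",
                        user = "dsci100",
                        password = "dsci100",
                        dbname = "kickstarter")

# tbl() creates a lazy reference to the "projects" table; no rows are
# pulled from the server until we ask for them.
project_data <- tbl(connection, "projects")
```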
We can now use the `colnames` function to see what columns are available in the `project_data` table.
Question 3.1
{points: 1}
If we want to compare pledged and goal amounts of funding over time for successful projects, which three columns should we `select` from the table?
A. `id`, `slug`, `pledged`
B. `pledged`, `goal`, `deadline`
C. `pledged`, `usd_pledged`, `location_id`
D. `currency`, `state`, `country`
Assign your answer to a variable named `answer3.1`. For example, `answer3.1 <- 'F'`.
Question 3.2
{points: 3}
Now we'll visualize the data. In order to do this, we need to take the correct subset of data from the table and use `ggplot` to plot the result. Note that we make the scatter plot slightly transparent (using `alpha = 0.01` in the code below) because there is so much data that it would otherwise be hard to see anything (overplotting).
In the cell below, you'll see some lines of code (currently commented out with `#` characters). Remove the comments and rearrange these lines of code to plot the ratio of pledged to goal funding as a function of project deadline date for all successful projects (those where pledged funding is greater than goal funding) in the dataset.

Note: there is a lot of data to plot here, so give it a moment to display!

Hint: you'll want to put all the data frame manipulation functions first, and then the plotting functions afterward. Also note that some lines have a `+` at the end, meaning they're in the middle of the plotting code!
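One reasonable arrangement is sketched below. The object name `prj` comes from the notebook's scaffolding, and the explicit `collect()` before plotting is an assumption about how that scaffolding is organized.

```r
# The dplyr verbs below are translated to SQL by dbplyr and run inside
# the database, so only the successful projects ever leave the server.
prj <- project_data %>%
  filter(pledged > goal) %>%        # "successful" projects
  mutate(ratio = pledged / goal) %>%
  select(deadline, ratio)

prj %>%
  collect() %>%                     # pull the subset into R for plotting
  ggplot(aes(x = deadline, y = ratio)) +
  geom_point(alpha = 0.01) +        # transparency to cope with overplotting
  xlab("Project deadline date") +
  ylab("Ratio of pledged to goal funding")
```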
Question 3.3
{points: 3}
Describe the relationship (if any) between the ratio of pledged/goal funding and time, as well as any patterns in the data that you might not have expected in advance.
DOUBLE CLICK TO EDIT THIS CELL AND REPLACE THIS TEXT WITH YOUR ANSWER.
Question 3.4
{points: 1}
Finally, we'll save the project data to a local file in the `data/` folder called `project_data.csv`. Recall that we don't want to try to download and save the entire dataset (way too much data!) from the database, but only the `tbl` object named `prj`. So you will need to use the `collect` function followed by the appropriate `write_*` function.

Assign the output of `collect` to an object called `project_df`.
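A sketch of those two steps, assuming `prj` holds the lazy table from Question 3.2:

```r
# collect() executes the query and pulls the result into an ordinary
# R data frame.
project_df <- collect(prj)

# Write the data frame to a csv file in the data/ directory.
write_csv(project_df, "data/project_data.csv")
```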
4 (Optional). Reading Data from the Internet
Question 4.0
More practice scraping! To keep ourselves out of legal hot water, we will get more practice scraping data using a website that was created for that purpose: http://books.toscrape.com/
Your task here is to scrape the prices of the science fiction novels on this page and determine the maximum, minimum and average price of science fiction novels at this bookstore. Tidy up and nicely present your results by creating a data frame called `sci_fi_stats` that has 2 columns: one called `stats` that contains the words `max`, `min` and `mean`, and one called `value` that contains the calculated value for each of these.
The functions for maximum, minimum and average in R are listed in the table below:
| Calculation to perform | Function in R |
|---|---|
| maximum | `max` |
| minimum | `min` |
| average | `mean` |
Some other helpful hints (a sketch that puts them together follows the list):

- If you end up scraping some characters other than numbers, you will have to use `str_replace_all` from the `stringr` library to remove them (similar to what we did with the commas in worksheet_02).
- Use `as.numeric` to convert your character type numbers to numeric type numbers before you pass them into the `max`, `min` and `mean` functions.
- If you have `NA` values in your objects that you need to pass into the `max`, `min` and `mean` functions, you will need to set the `na.rm` argument in these functions to `TRUE`.
- Use the function `c` to create the vectors that will go in your data frame; for example, to create a vector with the values 10, 16 and 13 named `ages`, we would type `ages <- c(10, 16, 13)`.
- Use the function `tibble` to create the data frame from your vectors.
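Here is a sketch that puts the hints together. The category URL and the `.price_color` selector are assumptions about the site's structure; inspect the page yourself to confirm them.

```r
library(rvest)
library(stringr)
library(tidyverse)

# The category URL is an assumption -- browse books.toscrape.com to
# confirm the exact path to the science fiction page.
url <- "http://books.toscrape.com/catalogue/category/books/science-fiction_16/index.html"

# On this site, prices sit in elements with the class "price_color"
# and look like "£30.57".
prices <- read_html(url) %>%
  html_nodes(".price_color") %>%
  html_text()

# Strip the currency symbol, then convert to numbers.
prices <- as.numeric(str_replace_all(prices, "£", ""))

# Assemble the 2-column data frame described above.
sci_fi_stats <- tibble(
  stats = c("max", "min", "mean"),
  value = c(max(prices, na.rm = TRUE),
            min(prices, na.rm = TRUE),
            mean(prices, na.rm = TRUE))
)

sci_fi_stats
```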
Question 4.1
In worksheet_02 you had practice scraping data from the web. Now that you have the skills, should you go scrape that cool/interesting website you have been dreaming of harvesting data from? Maybe, but maybe not... You really should check the website's Terms of Service first and also consider the application you have planned for the data after you scrape it.
List 3 websites you might be interested in scraping data from (for fun, for profit, or for research/education). List their URLs as part of your answer. For these 3 websites, search for their Terms of Service page. Tell us whether such a page exists, and if it does, provide the link to it as well as tell us whether or not they allow web scraping of their website.
DOUBLE CLICK TO EDIT THIS CELL AND REPLACE THIS TEXT WITH YOUR ANSWER.
Bonus/optional additional readings on legalities of web scraping:
Here are two news stories about web scraping and their legal implications this year: