Book a Demo!
CoCalc Logo Icon
StoreFeaturesDocsShareSupportNewsAboutPoliciesSign UpSign In
UBC-DSCI
GitHub Repository: UBC-DSCI/dsci-100-assets
Path: blob/master/2019-spring/materials/tutorial_02/tutorial_02.ipynb
2051 views
Kernel: R

Tutorial 2: Introduction to Reading Data

Any place you see ..., you must fill in the function, variable, or data to complete the code. Replace fail() with your completed code and run the cell!

### Run this cell before continuing. library(tidyverse) library(testthat) library(digest) library(repr) library(rvest) library(stringr)

1. Happiness Report

As you might remember from worksheet_02, we practiced loading data from the Sustainable Development Solutions Network's World Happiness Report. That data was the output of their analysis that calculated the happiness score and how much each variable contributed to each countries happiness score. In the tutorial, we are going to look at the data at an earlier stage of the study - the aggregated/averaged values (per country and per year) for many different social and health aspects that the researchers anticipated might contribute to happiness (Table2.1 from this Excel spreadsheet).

This goal for today is to produce a plot of 2017 healthy life expectancy at birth against positive affect, where we plot healthy life expectancy at birth on the x-axis and positive affect on the y-axis. For this study positive affect was defined as the average of three positive affect measures, specifically happiness, laugh and enjoyment. We would also like to covert the positive affect score from a scale of 0 - 1, to a scale from 0 - 10.

  1. use filter to subset the rows where the year is equal to 2017

  2. use select to subset the "Healthy life expectancy at birth" and "Positive affect" columns

  3. use mutate to convert the "Positive affect" score from a scale of 0 - 1, to a scale from 0 - 10

  4. use ggplot to create our plot of "Healthy life expectancy at birth" (x - axis) and "Positive affect" (y - axis)

Hints for success: Try going through all the steps on your own, but don't forget to talk to others (classmates, TAs, Instructor) if you need help getting unstuck. Work with different functions and if something doesn't work out, read the error message or use the help() function. Since there are a lot of steps to working and modifying data, feel free to look back at worksheet_02.

Question 1.1 Multiple Choice:

What is the maximum value for the Positive affect score (in the original data file that you read into R)?

A. 100

B. 10

C. 1

D. 0.1

E. 5

Assign your answer to an object called answer1.

# Assign your answer to an object called: answer1 # Make sure the correct answer is an uppercase letter. # Surround your answer with quotation marks. # Replace the fail() with your answer. # your code here fail() # No Answer - remove if you provide an answer
test_that('Solution is incorrect', { expect_equal(digest(answer1), '475bf9280aab63a82af60791302736f6') # we hid the answer to the test here so you can't see it, but we can still run the test }) print("Success!")

Question 1.2 Multiple Choice:

Which column's values will be used to filter the data?

A. countries

B. Generosity

C. Positive affect

D. year

Assign your answer to an object called answer2.

# Assign your answer to an object called: answer2 # Make sure the correct answer is an uppercase letter. # Surround your answer with quotation marks. # Replace the fail() with your answer. # your code here fail() # No Answer - remove if you provide an answer
test_that('Solution is incorrect', { expect_equal(digest(answer2), 'c1f86f7430df7ddb256980ea6a3b57a4') # we hid the answer to the test here so you can't see it, but we can still run the test }) print("Success!")

Question 1.3.0

Use the appropriate read_* function to read in the WHR2018Chapter2OnlineData (look in the tutorial_02 directory to ensure you use the correct relative path to read it in) and name it happy_df_csv.

# your code here fail() # No Answer - remove if you provide an answer head(happy_df_csv)
test_that('Solution is incorrect', { expect_equal(ncol(happy_df_csv), 19) expect_equal(digest(as.numeric(sum(happy_df_csv$year))), '17201cb95c3eb622e848455bd480ef23') }) print("Success!")

Question 1.3.1

Above you loaded the data from a file we already downloaded and converted to a .csv for you. But you can also use the readxl R package to directly load in Excel files into R. Given that the data we loaded above (WHR2018Chapter2OnlineData.csv) was originally sourced from an Excel file on the web, let's now directly read that Excel file into R using the read_excel function from that package. This Excel file has multiple sheets, the data we want is on the first one.

Note:

read_excel does not yet support putting a URL as the file path argument. So we need to first download the file and write it to disk using R's download.file function, and then we can read that saved Excel file into R using read_excel.*

To answer the question fill in the blanks in the code below. If you are unsure, try reading the documentation for the new functions and asking others for help!

library(readxl) url = "https://s3.amazonaws.com/happiness-report/2018/WHR2018Chapter2OnlineData.xls" #download.file(..., destfile = ...) #happy_df <- read_excel(path = ..., sheet = ...) # your code here fail() # No Answer - remove if you provide an answer head(happy_df)
test_that('Solution is incorrect', { expect_equal(ncol(happy_df), 19) expect_equal(digest(as.numeric(sum(happy_df$year))), '17201cb95c3eb622e848455bd480ef23') }) print("Success!")

Look at the column names - they contain spaces!!! This is not a best practice and will make it difficult to use our tidyverse functions... Run the cell below to use the make.names function that will replace all the spaces with a . so we don't have this problem. The colnames function is also needed to access the data frames column names.

# run this cell colnames(happy_df) <- colnames(happy_df) %>% make.names() head(happy_df)

Question 1.3.2

Using the scaffolding given in the cell below, filter, select, and mutate the happy_df data frame as needed to get it ready to create our desired scatterplot. Name this new data frame reduced_happy_df

Replace the ... to complete your answer.

# happy_df <- happy_df %>% #...(year == ...) %>% #select(..., ...) %>% #...(Positive.affect = ...) # your code here fail() # No Answer - remove if you provide an answer head(reduced_happy_df)
test_that('Solution is incorrect', { expect_equal(digest(as.numeric(sum(reduced_happy_df$Positive.affect, na.rm = TRUE))), '3c52d24546a2dbf7437bfbf2b953614f') expect_equal(digest(as.numeric(sum(reduced_happy_df$Healthy.life.expectancy.at.birth, na.rm = TRUE))), 'dee33d6f7860c9997daacdbfded438cf') }) print("Success!")

Question 1.4

Using the modified data set, reduced_happy_df, generate the scatterplot descibed above and make sure to label the axes in proper written English.

Assign your answer to an object called happy_plot.

options(repr.plot.width = 4, repr.plot.height = 3) # happy_plot <- ggplot(reduced_happy_df, ...(x = ..., y = ...)) + #geom_...() + #...("...") + #ylab("Positive affect score (out of ...)") # your code here fail() # No Answer - remove if you provide an answer happy_plot
test_that('Solution is incorrect', { expect_that("Healthy.life.expectancy.at.birth" %in% c(rlang::get_expr(happy_plot$mapping$x), rlang::get_expr(happy_plot$layers[[1]]$mapping$x)) , is_true()) expect_that("Positive.affect" %in% c(rlang::get_expr(happy_plot$mapping$y), rlang::get_expr(happy_plot$layers[[1]]$mapping$y)) , is_true()) expect_that("GeomPoint" %in% c(class(happy_plot$layers[[1]]$geom)) , is_true()) }) print("Success!")

Question 1.5

In sentence or two, describe what you see in the scatterplot above? Does there appear to be a correlation between life expectancy at birth and postive affect?

YOUR ANSWER HERE

Question 1.6

Choose another column in the data set to plot against postive affect. Create the plot below (and remember to create propoer English axis labels).

# your code here fail() # No Answer - remove if you provide an answer

Question 1.7

In a sentence or two, describe what you see in the scatterplot above? Does there appear to be a correlation between postive affect and the other variable you plotted?

YOUR ANSWER HERE

2. Whistler Snow

Skiing and snowboarding are huge in British Columbia. Some of the most perfect slopes are found close to home. In fact, just two hours north of Vancouver, we reach the mountain-bearing city of Whistler. With cold weather and plenty of snowfall, Whistler becomes the ideal destination for winter sports fanatics. Moreover, the 2010 Winter Olympics were also held here.

One thing skiiers and snowboarders want is fresh snow! When are they most likely to find this? In the data directory we have two year long data sets from Environment Canada from the Whistler Roundhouse Station (on Whistler mountain). This weather station is located 1,835 m above sea level.

To answer the question when are skiiers and snowboarders most likely to find fresh snow (and when you should plan your next ski trip to Whistler) you will create a line plot where the date is on the x-axis and the total snow per day in centimetres (the column named Total Snow cm in the data file) is on the y-axis. Given that we have data for 2 years (2017 & 2018) we will create two plots, once for each year and see if there is any consistent trend we can observe across the two years.

Question 2.1 Multiple Choice:

What are we going to plot on the y-axis?

A. total precipitation per day in centimetres

B. total snow per on the ground in centimetres

C. total snow per day in centimetres

D. total rain per day in centimetres

Assign your answer to an object called answer2.1.

# Assign your answer to an object called: answer2.1 # Make sure the correct answer is an uppercase letter. # Surround your answer with quotation marks. # Replace the fail() with your answer. # your code here fail() # No Answer - remove if you provide an answer
test_that('Solution is incorrect', { expect_equal(digest(as.character(answer2.1)), '475bf9280aab63a82af60791302736f6') # we hid the answer to the test here so you can't see it, but we can still run the test }) print("Success!")

Question 2.2.0

Read in the file named eng-daily-01012018-12312018.csv from the data directory. Preview the file to choose the correct read_* function and argument values to get the data into R. Name it whistler_2018.*

# your code here fail() # No Answer - remove if you provide an answer head(whistler_2018)
test_that('Solution is incorrect', { expect_equal(ncol(whistler_2018), 27) expect_equal(digest(as.numeric(sum(whistler_2018$Snow_ground, na.rm = TRUE))), '908d1fd10b357ed0ceaaec823abf81bc') }) print("Success!")

Question 2.2.1

Looking at the column names of the whistler_2018 data frame you can see we have the white space in column names problem again. Use make.names to remove these.

# your code here fail() # No Answer - remove if you provide an answer colnames(whistler_2018)
test_that('Solution is incorrect', { expect_equal(ncol(whistler_2018), 27) expect_equal(length(grep(pattern = " ", x = colnames(whistler_2018))), 0) }) print("Success!")

Question 2.3

Create a line plot where the Date is on the x-axis and the total snow per day in centimetres is on the y-axis by filling in the ... in the code below. Ensure you create axis labels that are formatted in proper English. Assign your plot to an object called whistler_2018_plot.

options(repr.plot.width = 8, repr.plot.height = 2.5) # ... <- ...(..., aes(x = ..., y = ...)) + #geom_line() + #xlab(...) + #ylab(...) + #scale_x_date(date_breaks = "1 month") + # labels every month #theme(axis.text.x = element_text(angle = 90, hjust = 1)) # rotates x axis labels to be vertical # your code here fail() # No Answer - remove if you provide an answer whistler_2018_plot
test_that('Solution is incorrect', { expect_that("Date.Time" %in% c(rlang::get_expr(whistler_2018_plot$mapping$x), rlang::get_expr(whistler_2018_plot$layers[[1]]$mapping$x)) , is_true()) expect_that("Total.Snow..cm." %in% c(rlang::get_expr(whistler_2018_plot$mapping$y), rlang::get_expr(whistler_2018_plot$layers[[1]]$mapping$y)) , is_true()) expect_that("GeomLine" %in% c(class(whistler_2018_plot$layers[[1]]$geom)) , is_true()) }) print("Success!")

Question 2.4

Looking at the line plot above for 2018, of the months when is snowed, what months had the most fresh snow?

YOUR ANSWER HERE

Question 2.5

Repeat the data loading and plot creation using the file eng-daily-01012017-12312017.csv to visualize the same data for the year 2017.

# your code here fail() # No Answer - remove if you provide an answer

Question 2.6

Looking at the line plot above for 2017, of the months when is snowed, what months had the most fresh snow? Are these the same as in 2018 (hint - you might want to add a code cell where you plot the two plots right after each other so you can easily compare them in one screen view)? Is there any advantage of looking at 2 years worth of data?

YOUR ANSWER HERE

3. Reading Data from the Internet

Question 3.0

More practice scraping! To keep ourselves out of legal hot water, we will get more practice scraping data using a website that was created for that purpose: http://books.toscrape.com/

Your task here is to scrape the prices of the science fiction novels on this page and determine the maximum, minimum and average price of science fiction novels at this bookstore. Tidy up and nicely present your results by creating a data frame called sci_fi_stats that has 2 columns, one called stats that contains the words max, min and mean and once called value that contains the calculated value for each of these.

The functions for maximum, minimum and average in R are listed in the table below:

Calculation to performFunction in R
maximummax
minimummin
averagemean

Some other helpful hints:

  • If you end up scraping some characters other than numbers you will have to use str_replace_all from the stringr library to remove them (similar to what we did with the commas in worksheet_02).

  • Use as.numeric to convert your character type numbers to numeric type numbers before you pass them into the max, min and mean functions.

  • If you have NA values in your objects that you need to pass into the max, min and mean functions, you will need to set the na.rm argument in these functions to TRUE.

  • use the function c to create the vectors that will go in your data frame, for example, to create a vector with the values 10, 16 and 13 named ages, we would type: ages <- c(10, 16, 13).

  • use the function tibble to create the data frame from your vectors.

# your code here fail() # No Answer - remove if you provide an answer sci_fi_stats

Question 4.1

In worksheet_02 you had practice scraping data from the web. Now that you have the skills, should you go scrape that cool/interesting website you have been dreaming of harvesting data from? Maybe, but maybe not... You really should check the website's Terms of Service first and also consider the application you have planned for the data after you scrape it.

List 3 websites you might be intersted in scraping data from (for fun, for profit, for research/education). List their URLs as part of your answer. For these 3 websites, search for their Terms of Service page. Tell us if such a page exists, and if it does, provide the link to it as well as tell us whether or not they allow web scraping of their website.

YOUR ANSWER HERE

Bonus/optional additional readings on legalities of web scraping:

Here are two news stories about web scraping and their legal implications this year: