GitHub Repository: UBC-DSCI/dsci-100-assets
Path: blob/master/2019-spring/materials/worksheet_02/worksheet_02.ipynb

Kernel: R

Worksheet 2: Introduction to Reading Data

You can read more about course policies on the course website.

Lecture and Tutorial Learning Goals:

After completing this week's lecture and tutorial work, you will be able to:

read data using an absolute path
read data using a relative path
read data from the internet using a URL
understand the difference between:
- read_csv
- read_tsv
- read_csv2
- read_delim
match the following tidyverse read_* function arguments to their descriptions:
- file
- delim
- col_names
- skip
choose the appropriate tidyverse read_* function and function arguments to load a given tabular data set into R.
identify metadata in a file and read the file using the argument skip
use the rvest html_nodes and html_text functions to scrape data from a .html file on the web
compare downloading tabular data from a plain text file (e.g. *.csv) from the web versus scraping data from a .html file

This worksheet covers parts of Chapter 2 of the online textbook. You should read this chapter before attempting the worksheet.

In [ ]:

### Run this cell before continuing. 
library(tidyverse)
library(testthat)
library(digest)
library(repr)

1. Comparing Absolute Paths, Relative Paths, and URLs

Question 1.1 Multiple Choice:

If you needed to read a file using an absolute path, what would be the first symbol in your argument (...) when using the read_csv function?

A. read_csv(">...")

B. read_csv(";...")

C. read_csv("...")

D. read_csv("/...")

Assign your answer to an object called answer1.

In [ ]:

# Assign your answer to an object called: answer1
# Make sure the correct answer is an uppercase letter. 
# Surround your answer with quotation marks.
# Replace the fail() with your answer. 

# your code here
fail() # No Answer - remove if you provide an answer

In [ ]:

test_that('Solution is incorrect', {
    expect_that(exists('answer1'), is_true())
    expect_equal(digest(answer1), 'c1f86f7430df7ddb256980ea6a3b57a4') # we hid the answer to the test here so you can't see it, but we can still run the test
    
})
print("Success!")

Question 1.2 True or False:

The file argument of the read_csv function, when it uses an absolute path, can never look like that of a relative path.

Assign your answer to an object called answer2.

In [ ]:

# Assign your answer to an object called: answer2
# Make sure the correct answer is written in lower-case (true / false)
# Surround your answer with quotation marks.
# Replace the fail() with your answer. 

# your code here
fail() # No Answer - remove if you provide an answer

In [ ]:

test_that('Solution is incorrect', {
    expect_equal(digest(answer2), '05ca18b596514af73f6880309a21b5dd') # we hid the answer to the test here so you can't see it, but we can still run the test
    
})
print("Success!")

Question 1.3 Match the following arguments with the corresponding path that they represent:

Definitions

A. /Users/my_user/Desktop/UBC/BIOL363/SciaticNerveLab/sn_trial_1.xlsx

B. https://www.ubc.ca

C. file_1.csv

D. /Users/name/Documents/Course_A/homework/my_first_homework.docx

E. homework/my_second_homework.docx

F. https://www.random_website.com

Path types

1. absolute
2. relative
3. URL

For every argument, create an object using the letter associated with the example and assign it to the corresponding number from the list of path types. For example: B <- 1

In [ ]:

# Assign your answer to a letter: A, B, C, D, E, F
# Make sure each answer is a number from 1-3
# Replace the fail() with your answer. 

# your code here
fail() # No Answer - remove if you provide an answer

In [ ]:

test_that('Solution is incorrect', {
    expect_equal(digest(A), '6717f2823d3202449301145073ab8719') # we hid the answer to the test here so you can't see it, but we can still run the test
    expect_equal(digest(B), 'e5b57f323c7b3719bbaaf9f96b260d39') # we hid the answer to the test here so you can't see it, but we can still run the test
    expect_equal(digest(C), 'db8e490a925a60e62212cefc7674ca02') # we hid the answer to the test here so you can't see it, but we can still run the test
    expect_equal(digest(D), '6717f2823d3202449301145073ab8719') # we hid the answer to the test here so you can't see it, but we can still run the test
    expect_equal(digest(E), 'db8e490a925a60e62212cefc7674ca02') # we hid the answer to the test here so you can't see it, but we can still run the test
    expect_equal(digest(F), 'e5b57f323c7b3719bbaaf9f96b260d39') # we hid the answer to the test here so you can't see it, but we can still run the test
    
})
print("Success!")

Question 1.4

If the absolute path to a data file looks like this: /Users/my_user/Desktop/UBC/BIOL363/SciaticNerveLab/sn_trial_1.xlsx

What would the relative path look like if the working directory (i.e., the folder containing the Jupyter notebook from which you are running your R code) is the UBC folder?

A. read_csv("sn_trial_1.xlsx")

B. read_csv("/SciaticNerveLab/sn_trial_1.xlsx")

C. read_csv("BIOL363/SciaticNerveLab/sn_trial_1.xlsx")

D. read_csv("UBC/BIOL363/SciaticNerveLab/sn_trial_1.xlsx")

E. read_csv("/BIOL363/SciaticNerveLab/sn_trial_1.xlsx")

Assign your answer to an object called answer4.

In [ ]:

# Assign your answer to an object called: answer4
# Make sure the correct answer is an uppercase letter. 
# Surround your answer with quotation marks.
# Replace the fail() with your answer. 

# your code here
fail() # No Answer - remove if you provide an answer

In [ ]:

test_that('Solution is incorrect', {
    expect_equal(digest(answer4), '475bf9280aab63a82af60791302736f6') # we hid the answer to the test here so you can't see it, but we can still run the test
    
})
print("Success!")

2. Argument Modifications to Read Data

Reading files is one of the first steps to wrangling data and consequently read_csv is a crucial function. However, despite how effortlessly it has worked so far, it has its limitations. read_csv works with particular files and does not accept differing formats.

Not all data sets come as perfectly organized as the ones you worked with last week. Time and effort were put into ensuring that those files were arranged with headers, that columns were separated by commas, and that the beginning of each file excluded metadata.
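As a small illustration of what a less tidy file can look like, here is a minimal sketch that writes a made-up file to a temporary location and reads it back. The file contents, the object name toy, and the column names fruit and count are all hypothetical and are not part of the worksheet data.

```r
library(readr)

# Hypothetical toy file (not worksheet data): two metadata lines,
# followed by comma-separated rows that have no header row
toy_path <- tempfile(fileext = ".csv")
writeLines(c("Source: made-up example",
             "Collected: not real data",
             "apple,3",
             "banana,5"),
           toy_path)

# skip ignores the two metadata lines; col_names supplies the missing labels
toy <- read_csv(toy_path,
                skip = 2,
                col_names = c("fruit", "count"))
toy
```

Without skip and col_names, the metadata lines would be treated as part of the data, which is why it helps to preview a file before reading it.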

Now that you understand how to read files located outside (or inside) of your working directory, you can begin to learn the tips and tricks necessary to overcome the limitations of read_csv.
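To see one such limitation in action, the sketch below reads the same made-up file in two ways. The file, the | separator, and the object names as_csv and as_delim are assumptions chosen for illustration only.

```r
library(readr)

# Hypothetical toy file (not worksheet data) whose columns are separated by |
toy_path <- tempfile(fileext = ".txt")
writeLines(c("city|population",
             "Springfield|30000",
             "Shelbyville|25000"),
           toy_path)

# read_csv looks for commas, finds none, and puts each line in a single column
as_csv <- read_csv(toy_path)

# read_delim lets us state the separator explicitly
as_delim <- read_delim(toy_path, delim = "|")

ncol(as_csv)
ncol(as_delim)
```

Comparing the two column counts shows that the function and its arguments have to match the way the file is laid out.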

In [ ]:

### Run this cell to learn more about the arguments used in read_csv
### Reading over the help file will assist with the next question. 

?read_csv

Question 2.1

Match the following definitions with the corresponding arguments used in read_csv:

Definitions

G. Character that separates columns in your file.

H. Specifies whether or not the first row of data in your file are column labels. Also allows you to create a vector that can be used to label columns.

I. This is the file name, path to a file, or URL.

J. Specifies the number of lines which must be ignored because they contain metadata.

Arguments

1. file
2. delim
3. col_names
4. skip

For every description, create an object using the letter associated with the definition and assign it to the corresponding number from the list of arguments. For example: G <- 1

In [ ]:

# Assign your answer to a letter: G, H, I, J
# Make sure each answer is a number from 1-4
# Replace the fail() with your answer. 

# your code here
fail() # No Answer - remove if you provide an answer

In [ ]:

test_that('Solution is incorrect', {
    expect_equal(digest(G), 'db8e490a925a60e62212cefc7674ca02') # we hid the answer to the test here so you can't see it, but we can still run the test
    expect_equal(digest(H), 'e5b57f323c7b3719bbaaf9f96b260d39') # we hid the answer to the test here so you can't see it, but we can still run the test
    expect_equal(digest(I), '6717f2823d3202449301145073ab8719') # we hid the answer to the test here so you can't see it, but we can still run the test
    expect_equal(digest(J), 'dbc09cba9fe2583fb01d63c70e1555a8') # we hid the answer to the test here so you can't see it, but we can still run the test
    
})
print("Success!")

Question 2.2 True or False:

read_csv2 and read_delim can both be used for reading files that have columns separated by ;.

Assign your answer to an object called answer2.2. Make sure to write in all lower-case.

In [ ]:

# Assign your answer to an object called: answer2.2
# Make sure the correct answer is written in lower-case (true / false)
# Surround your answer with quotation marks.
# Replace the fail() with your answer. 

# your code here
fail() # No Answer - remove if you provide an answer

In [ ]:

test_that('Solution is incorrect', {
    expect_equal(digest(answer2.2), '05ca18b596514af73f6880309a21b5dd') # we hid the answer to the test here so you can't see it, but we can still run the test
    
})
print("Success!")

Question 2.3 Multiple Choice:

read_tsv would be used for files that have columns separated by which of the following:

A. letters

B. tabs

C. numbers

D. commas

Assign your answer to an object called answer2.3.

In [ ]:

# Assign your answer to an object called: answer2.3
# Make sure the correct answer is an uppercase letter. 
# Surround your answer with quotation marks.
# Replace the fail() with your answer. 

# your code here
fail() # No Answer - remove if you provide an answer

In [ ]:

test_that('Solution is incorrect', {
    expect_that(exists('answer2.3'), is_true())
    expect_equal(digest(answer2.3), '3a5505c06543876fe45598b5e5e5195d') # we hid the answer to the test here so you can't see it, but we can still run the test
    
})
print("Success!")

3. Happiness Report (2017)

This data was taken from Kaggle and ranks countries on happiness based on rationalized factors like economic growth, social support, etc. The data was released by the United Nations at an event celebrating International Day of Happiness. According to the website, the file contains the following information:

Country = Name of the country.
Region = Region the country belongs to.
Happiness Rank = Rank of the country based on the Happiness Score.
Happiness Score = A metric measured by asking the sampled people the question: "How would you rate your happiness on a scale of 0 to 10 where 10 is the happiest."
Standard Error = The standard error of the happiness score.
Economy (GDP per Capita) = The extent to which GDP contributes to the calculation of the Happiness Score.
Family = The extent to which Family contributes to the calculation of the Happiness Score.
Health (Life Expectancy) = The extent to which Life expectancy contributed to the calculation of the Happiness Score.
Freedom = The extent to which Freedom contributed to the calculation of the Happiness Score.
Trust (Government Corruption) = The extent to which Perception of Corruption contributes to Happiness Score.
Generosity = The extent to which Generosity contributed to the calculation of the Happiness Score.
Dystopia Residual = The extent to which Dystopia Residual contributed to the calculation of the Happiness Score.

To clean up the file and make it easier to read, we only kept the country name, happiness score, economy (GDP per capita), life expectancy, and freedom.

Kaggle stores this information but it is compiled by the Sustainable Development Solutions Network. They survey these factors nearly every year (since 2012) and allow global comparisons to optimize political decision making. These landmark surveys are highly recognized and allow countries to learn and grow from one another. One day, they will provide a historical insight on the nature of our time.

Question 3.1 Fill in the Blank:

Trust is the extent to which _______________ contributes to the Happiness Score.

A. Corruption

B. Government Intervention

C. Perception of Corruption

D. Tax Money Designation

Assign your answer to an object called answer3.1.

In [ ]:

# Assign your answer to an object called: answer3.1
# Make sure the correct answer is an uppercase letter. 
# Surround your answer with quotation marks.
# Replace the fail() with your answer. 

# your code here
fail() # No Answer - remove if you provide an answer

In [ ]:

test_that('Solution is incorrect', {
    expect_equal(digest(answer3.1), '475bf9280aab63a82af60791302736f6') # we hid the answer to the test here so you can't see it, but we can still run the test
    
})
print("Success!")

Question 3.2 Multiple Choice:

What is the happiness report?

A. Study conducted by the governments of multiple countries.

B. Independent survey that was sampled by citizens.

C. Study conducted by the UN.

D. Survey given to international students by UBC's psychology department.

Assign your answer to an object called answer3.2.

In [ ]:

# Assign your answer to an object called: answer3.2
# Make sure the correct answer is an uppercase letter. 
# Surround your answer with quotation marks.
# Replace the fail() with your answer. 

# your code here
fail() # No Answer - remove if you provide an answer

In [ ]:

test_that('Solution is incorrect', {
    expect_equal(digest(answer3.2), '3a5505c06543876fe45598b5e5e5195d') # we hid the answer to the test here so you can't see it, but we can still run the test
    
})
print("Success!")

Question 3.3 Read the file "happiness_report.csv" using the shortest relative path. Hint: preview the data using Jupyter (as shown in this video) so you know which read_* function and which arguments to use.

Note, this file is found in the same folder as the worksheet you are completing.

Assign your answer to an object called happiness_report.

In [ ]:

# Load happiness_report.csv using read_csv and name it: happiness_report

# your code here
fail() # No Answer - remove if you provide an answer
head(happiness_report, n = 10) # the n = 10 argument tells head to print 10 lines instead of the default 6

In [ ]:

test_that('Solution is incorrect', {
    expect_equal(digest(ncol(happiness_report)), 'dd4ad37ee474732a009111e3456e7ed7')
    expect_equal(digest(nrow(happiness_report)), 'aa46ec0eda6b9268581f7b6334fe5368')
    expect_equal(digest(as.numeric(sum(happiness_report$GDP_per_capita))), 'cb8c845eeb80c799ec661eba30abe253') # we hid the answer to the test here so you can't see it, but we can still run the test
    
})
print("Success!")

Question 3.4 Multiple Choice:

If Norway is in "first place" based on the happiness score, at what position is Canada?

A. 3rd

B. 15th

C. 7th

D. 28th

Hint: create a new cell and run happiness_report.

Assign your answer to an object called answer3.4.

In [ ]:

# Assign your answer to an object called: answer3.4
# Make sure the correct answer is an uppercase letter. 
# Surround your answer with quotation marks.
# Replace the fail() with your answer. 

# your code here
fail() # No Answer - remove if you provide an answer

In [ ]:

test_that('Solution is incorrect', {
    expect_equal(digest(answer3.4), '475bf9280aab63a82af60791302736f6') # we hid the answer to the test here so you can't see it, but we can still run the test
    
})
print("Success!")

Question 3.5

Open all the files in the "data" folder in your working directory (which should be the tutorial_01 directory) using Jupyter (again, this video shows you how to do this). This will allow you to visualize the files and the organization of your data. Based on your findings, fill in the table below. This table will be very useful to refer back to in the coming weeks.

Double click on this cell to edit and fill out your table! We have filled the first row as an example.

Fill in your answers between the |.

| File Name | delim | Header (yes/no) | Metadata (yes/no) | # lines to skip | `read_*` function |
|---|---|---|---|---|---|
| `happiness_report.csv` | `,` | yes | no | NA | `read_csv` |

YOUR ANSWER HERE

For the questions below, fill in the ... in the cells below. Copy and paste your finished answer into the fail(). Refer to your table and don't be afraid to ask for help.

Question 3.6.1

Read in the file happiness_report_semicolon.csv using read_delim and name it happy_semi_df

In [ ]:

# happy_semi_df <- read_delim(file = "data/...", delim = "...")

# your code here
fail() # No Answer - remove if you provide an answer
head(happy_semi_df)

In [ ]:

test_that('Solution is incorrect', {
    expect_equal(digest(ncol(happy_semi_df)), 'dd4ad37ee474732a009111e3456e7ed7')
    expect_equal(digest(nrow(happy_semi_df)), 'aa46ec0eda6b9268581f7b6334fe5368')
    expect_equal(digest(sum(happy_semi_df$life_expectancy)), '17c02c6af15ebbf168dbc8a7766166c0') # we hid the answer to the test here so you can't see it, but we can still run the test
    
})
print("Success!")

Question 3.6.2

Read in the file happiness_report_semicolon.csv again, but this time use a different read_* function than read_delim (but that also works). Name it happy_semi_df2.

In [ ]:

# happy_semi_df2 <- ...("...")

# your code here
fail() # No Answer - remove if you provide an answer
head(happy_semi_df2)

In [ ]:

test_that('Solution is incorrect', {
    expect_equal(digest(as.numeric(sum(happy_semi_df2$happiness_score))), 'b09ed5fd56dcd21d2a2657ab4142da1f') # we hid the answer to the test here so you can't see it, but we can still run the test
    
})
print("Success!")

Question 3.6.3

Read in the file happiness_report.tsv using the appropriate read_* function and name it happy_tsv.

In [ ]:

# happy_tsv <- ...(file = "...")

# your code here
fail() # No Answer - remove if you provide an answer
head(happy_tsv)

In [ ]:

test_that('Solution is incorrect', {
    expect_equal(digest(as.numeric(sum(happy_tsv$life_expectancy))), '17c02c6af15ebbf168dbc8a7766166c0') # we hid the answer to the test here so you can't see it, but we can still run the test
    
})
print("Success!")

Question 3.6.4

Read in the file happiness_report_metadata.csv using the appropriate read_* function and name it happy_metadata.

In [ ]:

# happy_metadata <- ...("data/happiness_report_metadata.csv", skip = ...)

# your code here
fail() # No Answer - remove if you provide an answer
head(happy_metadata)

In [ ]:

test_that('Solution is incorrect', {
    expect_equal(digest(as.numeric(sum(happy_metadata$freedom))), 'd4de4637af7167e9136fe25573414bf6') # we hid the answer to the test here so you can't see it, but we can still run the test
    
})
print("Success!")

Question 3.6.5

Read in the file happiness_report_no_header.csv using the appropriate read_* function and name it happy_header.

In [ ]:

# happy_header <- ...("...", col_names = c("country", "happiness_score", "GDP_per_capita", "life_expectancy", "freedom"))

# your code here
fail() # No Answer - remove if you provide an answer
head(happy_header)

In [ ]:

test_that('Solution is incorrect', {
    expect_equal(digest(as.numeric(sum(happy_header$freedom))), 'd4de4637af7167e9136fe25573414bf6') # we hid the answer to the test here so you can't see it, but we can still run the test
    
})
print("Success!")

Question 3.7

Opening the data files in a text editor showed some clear differences. Do all the data sets look the same once they have been read into your R notebook?

yes / no

Assign your answer to an object called answer3.7. Make sure to write in all lower-case.

In [ ]:

# Assign your answer to an object called: answer3.7
# Make sure the correct answer is written in lower-case (yes / no)
# Surround your answer with quotation marks.
# Replace the fail() with your answer. 

# your code here
fail() # No Answer - remove if you provide an answer

In [ ]:

test_that('Solution is incorrect', {
    expect_equal(digest(answer3.7), '0590b0427c1b19a6eb612d19888aa52f') # we hid the answer to the test here so you can't see it, but we can still run the test
    
})
print("Success!")

Question 3.8

Using the happy_header data set that you read earlier, create a scatter plot with GDP_per_capita on the x-axis and life_expectancy on the y-axis.

Assign your answer to an object called header_plot. Make sure to use xlab and ylab to label your axes appropriately.

In [ ]:

options(repr.plot.width=4, repr.plot.height=3)
# Assign your plot to an object called: header_plot

# your code here
fail() # No Answer - remove if you provide an answer
header_plot

In [ ]:

test_that('GDP_per_capita should be on the x-axis', {
    expect_that("GDP_per_capita" %in% c(rlang::get_expr(header_plot$mapping$x), rlang::get_expr(header_plot$layers[[1]]$mapping$x)) , is_true())
    })
test_that('life_expectancy should be on the y-axis', {
    expect_that("life_expectancy" %in% c(rlang::get_expr(header_plot$mapping$y), rlang::get_expr(header_plot$layers[[1]]$mapping$y)) , is_true())
    })
test_that('geom should be geom_point', {
    expect_that("GeomPoint" %in% c(class(header_plot$layers[[1]]$geom)) , is_true())
    })
print("Success!")

4. Reading Data from the Internet

How has the gross world product changed throughout history?

As defined on Wikipedia, the "Gross world product (GWP) is the combined gross national product of all the countries in the world." Living in our modern age with our roaring (sometimes up and sometimes down) economies, one might wonder how the world economy has changed over history. To answer this question we will scrape data from the Wikipedia Gross world product page.

Your data set will include the following columns:

year
gwp_value

Specifically we will scrape the 2 columns named "Year" and "Real GWP" in the table under the header "Historical and prehistorical estimates". The end goal of this exercise is to create a line plot with year on the x-axis and GWP value on the y-axis.

Question 4.1.0 Multiple Choice:

Under which of the following headers is the table that we will scrape from the Wikipedia Gross world product page?

A. Gross world product

B. Recent growth

C. Historical and prehistorical estimates

D. See also

Assign your answer to an object called answer4.1.0.

In [ ]:

# Assign your answer to an object called: answer4.1.0
# Make sure the correct answer is an uppercase letter. 
# Surround your answer with quotation marks.
# Replace the fail() with your answer. 

# your code here
fail() # No Answer - remove if you provide an answer

In [ ]:

test_that('Solution is incorrect', {
    expect_equal(digest(as.character(answer4.1.0)), '475bf9280aab63a82af60791302736f6') # we hid the answer to the test here so you can't see it, but we can still run the test
    
})
print("Success!")

Question 4.1.1 Multiple Choice:

What is going to be on the x-axis of the line plot we create?

A. compound annual growth rate

B. the value of the gross world product

C. year

Assign your answer to an object called answer4.1.1.

In [ ]:

# Assign your answer to an object called: answer4.1.1
# Make sure the correct answer is an uppercase letter. 
# Surround your answer with quotation marks.
# Replace the fail() with your answer. 

# your code here
fail() # No Answer - remove if you provide an answer

In [ ]:

test_that('Solution is incorrect', {
    expect_equal(digest(as.character(answer4.1.1)), '475bf9280aab63a82af60791302736f6') # we hid the answer to the test here so you can't see it, but we can still run the test
    
})
print("Success!")

We need to now load the rvest package to begin our web scraping!

In [ ]:

# Run this cell 
library(rvest)
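Before working with a real web page, it can help to see the rvest functions on a tiny page. The sketch below parses a made-up HTML string instead of a URL; the table contents, the class name toy, and the object names are hypothetical and unrelated to the Wikipedia page.

```r
library(rvest)

# Hypothetical two-row HTML table supplied as text rather than downloaded
toy_page <- read_html("<html><body><table class='toy'>
  <tr><td>2000</td><td>41.0</td></tr>
  <tr><td>2010</td><td>63.5</td></tr>
</table></body></html>")

# The CSS selector picks the first cell of every row in the table with class 'toy';
# html_text then returns the text inside those cells as a character vector
first_col <- toy_page %>%
    html_nodes(".toy td:nth-child(1)") %>%
    html_text()
first_col
```

Note that html_text always returns text, so numbers scraped this way still need to be converted before they can be used in calculations.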

Question 4.2

Use read_html to download information from the URL given in the cell below.

Assign your answer to an object called gwp.

In [ ]:

# Assign your answer to an object called: gwp
# Instead of copying the entire URL, you can simply use the object (url) after read_html()

url <- 'https://en.wikipedia.org/wiki/Gross_world_product'

# your code here
fail() # No Answer - remove if you provide an answer
print(gwp)

In [ ]:

test_that('Solution is incorrect', {
    expect_that(is.list(gwp), is_true())
    expect_equal(digest(as.numeric(length(gwp))), 'db8e490a925a60e62212cefc7674ca02') # we hid the answer to the test here so you can't see it, but we can still run the test
    expect_that('xml_document' %in% attributes(gwp)$class, is_true())
})
print("Success!")

Question 4.3.0

Run the cell below to create the first column of your data set (the year from the table under the "Historical and prehistorical estimates" header). The node was obtained using SelectorGadget.

In [ ]:

# Run this cell to create the first column for your data set. 
year <- gwp %>%
    html_nodes(".wikitable tbody:nth-child(1) td:nth-child(1)") %>%
    html_text() 
head(year)

In [ ]:

test_that('Solution is incorrect', {
    expect_equal(digest(as.character(head(year))), '60024ffc973833818fa99549a0a5ee93') # we hid the answer to the test here so you can't see it, but we can still run the test
    
})
print("Success!")

We can see that although we want numbers for the year, the data we scraped includes the characters AD and \n (a newline character). We will have to do some string manipulation and then convert the years from characters to numbers.

First we use the str_replace_all function to match the string " AD\n" and replace it with nothing "":

In [ ]:

# run this cell
# use stringr library
library(stringr)
# replace " AD\n" with nothing
year <- str_replace_all(string = year, pattern = " AD\n", replacement = "")

print(year)

When we print year, we can see we were able to remove " AD\n", but we missed that there is also " BC\n" on the earliest years! There are also commas (",") in the large BC years that we will have to remove. We also need to put a - sign in front of the BC numbers so we don't confuse them with the AD numbers after we convert everything to numbers. To do this we will need to use a similar strategy to clean this all up!

This week we will provide you the code to do this cleaning, next week you will learn to do these kinds of things yourself. After we do all the string/text manipulation then we use the as.numeric function to convert the text to numbers.
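To get a feel for this kind of cleaning on a small scale, here is a minimal sketch on a made-up vector. It is an alternative illustration, not the provided cleaning code, and the names toy_year, is_bc, and toy_num are hypothetical.

```r
library(stringr)

# Hypothetical scraped text, similar in shape to the year column
toy_year <- c("1,000,000 BC\n", "5000 BC\n", "1500 AD\n")

# Remember which entries are BC before any text is removed
is_bc <- str_detect(toy_year, " BC\n")

# Remove the commas, keep only the digits, and convert the text to numbers
toy_year <- str_replace_all(toy_year, ",", "")
toy_num <- as.numeric(str_extract(toy_year, "[0-9]+"))

# Make the BC years negative so they sort before the AD years
toy_num[is_bc] <- -toy_num[is_bc]
toy_num
```

The order of the steps matters here: the BC entries have to be identified before the letters are stripped away.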

In [ ]:

# run this cell to clean up the year data and convert it to a number
# use grepl to select the elements containing " BC\n" and put a - at the beginning of them
year[grepl(pattern = " BC\n", x = year)] <- str_replace_all(string = year[grepl(pattern = " BC\n", x = year)], pattern = "^", replacement = "-")

# replace all commas with nothing
year <- str_replace_all(string = year, pattern = ",", replacement = "")
# extract the minus symbol and the numbers
year <- str_extract(string = year, pattern = "-?[0-9]+") %>% 
    as.numeric()

print(year)

Question 4.4

Create a new column for the gross world product (GWP) from the table we are scraping. Don't forget to use SelectorGadget to obtain the CSS selector needed to scrape the GWP values from the table we are scraping. Assign your answer to an object called gwp_value.

Fill in the ... in the cell below. Copy and paste your finished answer into the fail().

Refer to Question 4.3 and don't be afraid to ask for help.

In [ ]:

# gwp_value <- gwp %>%
  # ...("...") %>%
  # html_text() 

# your code here
fail() # No Answer - remove if you provide an answer
head(gwp_value)

In [ ]:

test_that('Solution is incorrect', {
    expect_that(is.vector(gwp_value), is_true())
    expect_equal(digest(as.character(head(gwp_value))), '8aea984d028a6cb86143f8bd9c961a26') # we hid the answer to the test here so you can't see it, but we can still run the test
    
})
print("Success!")

Again, looking at the output of head(gwp_value) we see we have some cleaning and type conversions to do. We need to remove the commas, the extraneous trailing information in the first 3 values, and the "\n" character again. We provide the code to do this below:

In [ ]:

# run this cell to clean up the gwp_value data and convert it to a number
# save a copy of the scraped text in a variable called gwp_value_clean
gwp_value_clean <-  gwp_value
# replace all commas with nothing
gwp_value <- str_replace_all(string = gwp_value, pattern = ",", replacement = "")
# extract the numbers and decimals
gwp_value <- str_extract(string = gwp_value, pattern = "[0-9.]+") %>% 
    as.numeric()
print(gwp_value)

Question 4.5

Use the tidyverse tibble function to create a data frame named gwp with year and gwp_value as columns. The general form for creating data frames from vectors/lists using the tibble function is as follows:

tibble(COLUMN1_NAME, COLUMN2_NAME, COLUMN3_NAME, ...)
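As a quick illustration of this general form, the sketch below builds a data frame from two made-up vectors. The names toy_letters, toy_numbers, and toy_df are hypothetical.

```r
library(tibble)

# Two hypothetical vectors of the same length
toy_letters <- c("a", "b", "c")
toy_numbers <- c(1, 2, 3)

# Each vector becomes a column, and the column takes the name of its vector
toy_df <- tibble(toy_letters, toy_numbers)
toy_df
```

The vectors need to have the same length (or length one) so that every row has a value in each column.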

In [ ]:

# create data.frame with columns year and gwp_value named gwp
# fill in the blanks in the code skeleton provided below
#... <- tibble(..., ...)

# your code here
fail() # No Answer - remove if you provide an answer
head(gwp)

In [ ]:

test_that('Solution is incorrect', {
    expect_equal(digest(as.numeric(sum(gwp$year))), 'a531a2d37eb64bf5b3ffb7ee17dbaedc')
    expect_equal(digest(as.numeric(sum(gwp$gwp_value))), 'f555139e88b37cd50d8c5d170ea575bd')
})
print("Success!")

One last piece of data transformation/wrangling we will do before we get to data visualization is to create another column called sqrt_year, which rescales the year values so that they will be more informative when we plot them (if you look at our year data, we have a lot of years in the recent past, and fewer and fewer as we go back in time). Often you can just transform the scale within ggplot (for example, see what we do with gwp_value later on), but the year value is tricky to scale because it contains negative values. So we first need to make everything positive, then take the square root, and then make the values that should be negative negative again. We provide the code to do this below.
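The effect of this signed square root can be checked on a few made-up years. The sketch below uses sign() as an equivalent shortcut for the "make positive, take the root, restore the sign" steps; the data frame toy and its values are hypothetical.

```r
library(dplyr)

# Hypothetical years chosen so the square roots are whole numbers
toy <- tibble(year = c(-1000000, -62500, 0, 1600))

# sign(year) is -1, 0, or 1, so multiplying by it restores the original sign
toy <- toy %>%
    mutate(sqrt_year = sign(year) * sqrt(abs(year)))
toy
```

Here year -1000000 maps to -1000 and year 1600 maps to 40, so the distant past is compressed far more than recent years.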

In [ ]:

gwp <- gwp %>% 
    mutate(sqrt_year = sqrt(abs(year)))  %>% 
    mutate(sqrt_year = if_else(year < 0, sqrt_year * -1, sqrt_year))
head(gwp)

Question 4.6

Create a line plot using the gwp data frame where sqrt_year is on the x-axis and gwp_value is on the y-axis. We provide the plot code to relabel the x-axis with the human-understandable years instead of the transformed ones we plot. Name your plot object gwp_historical. To make a line plot instead of a scatter plot, you should use the geom_line() function instead of the geom_point() function.

In [ ]:

# Assign your answer to an object called: gwp_historical
# Fill in the missing parts of the code below to make the plot

#... <- ggplot(gwp, aes(x = ..., y = ...)) +
    #geom_line() +
    #scale_y_continuous(trans='log10') +
    #scale_x_continuous(breaks = c(-1000, -750, -500, -250, -70.7, 0, 38.7), 
    #                   labels = c("-1000000", "-562500", "-250000", "-62500", "-5000", "0", "1500")) +
    #ylab("...") +
    #xlab("Year")

options(repr.plot.width=8, repr.plot.height=3)
# your code here
fail() # No Answer - remove if you provide an answer
gwp_historical

In [ ]:

test_that('Solution is incorrect', {
    expect_that("sqrt_year" %in% c(rlang::get_expr(gwp_historical$mapping$x), rlang::get_expr(gwp_historical$layers[[1]]$mapping$x)) , is_true())
    expect_that("gwp_value" %in% c(rlang::get_expr(gwp_historical$mapping$y), rlang::get_expr(gwp_historical$layers[[1]]$mapping$y)) , is_true())
    expect_that("GeomLine" %in% c(class(gwp_historical$layers[[1]]$geom)), is_true())
    })
print("Success!")

Question 4.7

Looking at the line plot, when does the gross world product first start to increase more rapidly (i.e., when does the slope of the line first change)?

A. roughly around year -1,000,000

B. roughly around year -250,000

C. roughly around year -5000

D. roughly around year 1500

Assign your answer to an object called answer4.7.

In [ ]:

# Assign your answer to an object called: answer4.7
# Make sure the correct answer is an uppercase letter. 
# Surround your answer with quotation marks.
# Replace the fail() with your answer. 

# your code here
fail() # No Answer - remove if you provide an answer
print(answer4.7)

In [ ]:

test_that('Solution is incorrect', {
    expect_equal(digest(as.character(answer4.7)), '475bf9280aab63a82af60791302736f6') # we hid the answer to the test here so you can't see it, but we can still run the test
    
})
print("Success!")