
Worksheet 3: Cleaning and Wrangling Data

Lecture and Tutorial Learning Goals:

After completing this week's lecture and tutorial work, you will be able to:

  • define the term “tidy data”

  • discuss the advantages and disadvantages of storing data in a tidy data format

  • recall and use the following tidyverse functions for their intended data wrangling tasks:

    • select

    • filter

    • mutate

  • match the following tidyverse functions to their descriptions:

    • summarize

    • group_by

    • map

    • gather

### Run this cell before continuing.
library(tidyverse)
library(testthat)
library(digest)
library(repr)

Question 0.1 Multiple Choice:

Which of the following does not characterize a tidy dataset?

A. each row is a single observation

B. each value should not be in a single cell

C. each variable is a single column

D. each value is a single cell

Assign your answer to an object called answer0.1.

# Assign your answer to an object called: answer0.1
# Make sure the correct answer is an uppercase letter.
# Surround your answer with quotation marks.
# Replace the fail() with your answer.

# your code here
fail() # No Answer - remove if you provide an answer
test_that('Solution is incorrect', {
    expect_equal(digest(as.character(answer0.1)), '3a5505c06543876fe45598b5e5e5195d') # we hid the answer to the test here so you can't see it, but we can still run the test
})
print("Success!")

Question 0.2 Multiple Choice:

For which scenario would using group_by() + summarize() be appropriate?

A. To apply the same function to every row.

B. To apply the same function to every column.

C. To apply the same function to groups of rows.

D. To apply the same function to groups of columns.

Assign your answer to an object called answer0.2.

# Assign your answer to an object called: answer0.2
# Make sure the correct answer is an uppercase letter.
# Surround your answer with quotation marks.
# Replace the fail() with your answer.

# your code here
fail() # No Answer - remove if you provide an answer
test_that('Solution is incorrect', {
    expect_equal(digest(as.character(answer0.2)), '475bf9280aab63a82af60791302736f6') # we hid the answer to the test here so you can't see it, but we can still run the test
})
print("Success!")

Question 0.3 Multiple Choice:

For which scenario would using one of the purrr map functions be appropriate?

A. To apply the same function to every row.

B. To apply the same function to every column.

C. To apply the same function to elements in a list.

D. All of the above.

Assign your answer to an object called answer0.3.

# Assign your answer to an object called: answer0.3
# Make sure the correct answer is an uppercase letter.
# Surround your answer with quotation marks.
# Replace the fail() with your answer.

# your code here
fail() # No Answer - remove if you provide an answer
test_that('Solution is incorrect', {
    expect_equal(digest(as.character(answer0.3)), 'c1f86f7430df7ddb256980ea6a3b57a4') # we hid the answer to the test here so you can't see it, but we can still run the test
})
print("Success!")

1. Assessing avocado prices to inform restaurant menu planning

It is well known that millennials LOVE avocado toast (joking, well mostly 😉), and so many restaurants offer menu items centred around this delicious food! Like many food items, avocado prices fluctuate. So a restaurant that wants to maximize profits on avocado-containing dishes might ask whether there are times when avocados are less expensive to purchase. If such times exist, that is when the restaurant should put avocado-containing dishes on the menu to maximize its profits on those dishes.

To answer this question we will analyze a data set of avocado sales from multiple US markets. This data was downloaded from the Hass Avocado Board website in May of 2018 and compiled into a single CSV. Each row in the data set contains weekly sales data for a region. The data set spans the years 2015-2018.

Some relevant columns in the dataset:

  • average_price - The average price of a single avocado.

  • type - conventional or organic

  • year - The year

  • region - The city or region of the observation

  • small_hass_volume - The sales volume of small Hass avocados

  • large_hass_volume - The sales volume of large Hass avocados

  • extra_l_hass_volume - The sales volume of extra-large Hass avocados

  • week - integer number for the calendar week in the year (e.g., first week of January is 1, and last week of December is 52).

To answer our question of whether there are times in the year when avocados are typically less expensive (and thus when a restaurant could make more profitable avocado-containing menu items), we will create a scatter plot of average_price (y-axis) versus Date (x-axis).

Question 1.1 Multiple Choice:

Which of the following is not included in the CSV file?

A. Average price of a single avocado.

B. The farming practice (production with/without the use of chemicals).

C. Average price of a bag of avocados.

D. All options are included in the data set.

Assign your answer to an object called answer1.1.

# Assign your answer to an object called: answer1.1
# Make sure the correct answer is an uppercase letter.
# Surround your answer with quotation marks.
# Replace the fail() with your answer.

# your code here
fail() # No Answer - remove if you provide an answer

print(answer1.1)
test_that('Solution is incorrect', {
    expect_equal(digest(as.character(answer1.1)), '475bf9280aab63a82af60791302736f6') # we hid the answer to the test here so you can't see it, but we can still run the test
})
print("Success!")

Question 1.2 Multiple Choice:

The rows in the data frame represent:

A. daily avocado sales data for a region

B. weekly avocado sales data for a region

C. bi-weekly avocado sales data for a region

D. yearly avocado sales data for a region

Assign your answer to an object called answer1.2.

# Assign your answer to an object called: answer1.2
# Make sure the correct answer is an uppercase letter.
# Surround your answer with quotation marks.
# Replace the fail() with your answer.

# your code here
fail() # No Answer - remove if you provide an answer

print(answer1.2)
test_that('Solution is incorrect', {
    expect_equal(digest(as.character(answer1.2)), '3a5505c06543876fe45598b5e5e5195d') # we hid the answer to the test here so you can't see it, but we can still run the test
})
print("Success!")

Question 1.3

The first step to plotting total volume against average price is to read the file "avocado_prices.csv" using the shortest relative path. The data file was given to you along with this worksheet, but you will have to look to see where it is in the worksheet_03 directory to correctly load it. When you do this, you should also preview the file to help you choose an appropriate read_* function to read the data.
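As a refresher on how read_csv works (not the answer itself), here is a minimal, self-contained sketch; the column names and values are invented for illustration, and we exploit the fact that read_csv can also parse a literal string:

library(readr)

# A literal string containing CSV data makes the demo self-contained
toy <- read_csv("fruit,price\napple,0.5\navocado,1.2")
head(toy) # a 2-row tibble with columns fruit and price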

Assign your answer to an object called avocado.

# Load avocado_prices.csv using read_csv and name it: avocado

# your code here
fail() # No Answer - remove if you provide an answer

head(avocado)
tail(avocado) # lets you look at the last 6 rows of a data frame
test_that('Solution is incorrect', {
    expect_equal(ncol(avocado), 9)
    expect_equal(nrow(avocado), 17911)
    expect_equal(digest(as.numeric(sum(avocado$small_hass_volume))), '5c68093945e78d372cac50a8eacec8fa')
})
print("Success!")

Question 1.4 Multiple Choice

Why are the 2nd to 5th columns col_double instead of col_integer?

A. They aren't "real" numbers.

B. They contain decimals.

C. They are numbers created using text/letters.

D. They are col_integer...

Assign your answer to an object called answer1.4.

# Assign your answer to an object called: answer1.4
# Make sure the correct answer is an uppercase letter.
# Surround your answer with quotation marks.
# Replace the fail() with your answer.

# your code here
fail() # No Answer - remove if you provide an answer

print(answer1.4)
test_that('Solution is incorrect', {
    expect_equal(digest(as.character(answer1.4)), '3a5505c06543876fe45598b5e5e5195d') # we hid the answer to the test here so you can't see it, but we can still run the test
})
print("Success!")

Question 1.5

To answer our question, let's now create the scatter plot where we plot average_price on the y-axis versus Date on the x-axis. Fill in the ... in the cell below. Copy and paste your finished answer into the fail(). Assign your answer to an object called avocado_plot. Don't forget to create proper English axis labels.

options(repr.plot.width = 4, repr.plot.height = 3) # Modifies the size of the plots

#... <- ... %>%
#    ggplot(aes(x = ..., y = ...)) +
#    geom_...() +
#    xlab("...") +
#    ylab("...")

# your code here
fail() # No Answer - remove if you provide an answer

avocado_plot
test_that('Solution is incorrect', {
    expect_that("Date" %in% c(rlang::get_expr(avocado_plot$mapping$x), rlang::get_expr(avocado_plot$layers[[1]]$mapping$x)), is_true())
    expect_that("average_price" %in% c(rlang::get_expr(avocado_plot$mapping$y), rlang::get_expr(avocado_plot$layers[[1]]$mapping$y)), is_true())
    expect_that("GeomPoint" %in% class(avocado_plot$layers[[1]]$geom), is_true())
})
print("Success!")

We might be able to squint and start to see some pattern in the data above, but what we see in the plot is really not very informative. Why? Because there is a lot of overplotting (data points sitting on top of other data points). What can we do? One solution is to reduce/aggregate the data in a meaningful way to help answer our question. Remember that we are interested in determining whether there are times when the price of avocados is lower, so that we can recommend when restaurants should put avocado-containing dishes on the menu to maximize their profits on those dishes.

In the data we plotted above, each row is the total avocado sales for one region for one week. Let's use group_by + summarize to calculate the average price for each week across years and regions. We can then plot that aggregated price against the week and perhaps get a clearer picture.
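Before you write the real thing, here is a minimal sketch of the group_by + summarize pattern on a toy data frame (the column names and numbers are invented for illustration, not the avocado data):

library(dplyr)

toy_sales <- tibble(
    week  = c(1, 1, 2, 2),
    price = c(1.10, 1.30, 0.90, 1.00)
)

toy_sales %>%
    group_by(week) %>%                                    # one group per week
    summarize(average_price = mean(price, na.rm = TRUE))  # one row per group

The result has one row per week containing the mean price for that week, which is exactly the kind of reduction we want here.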

Question 1.6

Create a reduced/aggregated version of the avocado data set and name it avocado_aggregate. To do this you will want to group_by the week column and then use summarize to calculate the average price (name that column average_price).

#... <- ... %>%
#    group_by(...) %>%
#    summarize(... = mean(average_price, na.rm = TRUE))

# your code here
fail() # No Answer - remove if you provide an answer

head(avocado_aggregate)
test_that('Solution is incorrect', {
    expect_equal(digest(as.numeric(sum(avocado_aggregate$average_price))), '124d515821b05dcb2a472f2999ca770e')
    expect_equal(digest(as.numeric(sum(avocado_aggregate$week))), 'd27e825e408c446c586593f719b5545e')
})
print("Success!")

Question 1.7

Now let's take the avocado_aggregate data frame and use it to create a scatter plot where we plot average_price on the y-axis versus week on the x-axis. Assign your answer to an object called avocado_aggregate_plot. Don't forget to create proper English axis labels.

#... <- ... %>%
#    ggplot(aes(x = ..., y = ...)) +
#    ...() +
#    ...("...") +
#    ...("...")

# your code here
fail() # No Answer - remove if you provide an answer

avocado_aggregate_plot
test_that('Solution is incorrect', {
    expect_that("week" %in% c(rlang::get_expr(avocado_aggregate_plot$mapping$x), rlang::get_expr(avocado_aggregate_plot$layers[[1]]$mapping$x)), is_true())
    expect_that("average_price" %in% c(rlang::get_expr(avocado_aggregate_plot$mapping$y), rlang::get_expr(avocado_aggregate_plot$layers[[1]]$mapping$y)), is_true())
    expect_that("GeomPoint" %in% class(avocado_aggregate_plot$layers[[1]]$geom), is_true())
})
print("Success!")

We can now see that avocado prices do indeed fluctuate throughout the year, and we could use this information to recommend that restaurants wanting to maximize profit from avocado-containing menu items should only offer them roughly between December and May.

Why might this happen? Perhaps price has something to do with supply. We can use this data set to get some insight into that question by plotting total avocado volume (y-axis) versus week (x-axis). To do this, we will first have to create a column called total_volume whose value is the sum of the small, large and extra-large avocado volumes. For this we will have to go back to the original avocado data frame we loaded.
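As a quick illustration of how mutate sums columns row-wise, here is a toy sketch (the column names and values are invented, not the avocado columns):

library(dplyr)

toy_volumes <- tibble(
    small = c(10, 20),
    large = c(5, 5),
    xl    = c(1, 2)
)

toy_volumes %>%
    mutate(total = small + large + xl) # adds a new column, computed row by row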

Question 1.8

Our next step toward plotting total volume against week is to use mutate to create a new column in the avocado data frame called total_volume, equal to the sum of all three volume columns:

Fill in the ... in the cell below. Copy and paste your finished answer into the fail().

# avocado <- ... %>%
#    mutate(... = ... + ... + ...)

# your code here
fail() # No Answer - remove if you provide an answer

head(avocado)
test_that('Solution is incorrect', {
    expect_equal(digest(as.numeric(sum(avocado$total_volume))), '44199981c2b66aad74b02e3c1014001b')
})
print("Success!")

Question 1.9

Now, create another reduced/aggregated version of the avocado data frame and name it avocado_aggregate_2. To do this you will want to group_by the week column and then use summarize to calculate the average total volume (name that column total_volume).

#... <- ... %>%
#    group_by(...) %>%
#    summarize(...)

# your code here
fail() # No Answer - remove if you provide an answer

head(avocado_aggregate_2)
test_that('Solution is incorrect', {
    expect_equal(digest(as.numeric(sum(avocado_aggregate_2$total_volume))), '6c1676dd13517d0eed2be5e246dc8ef1')
    expect_equal(digest(as.numeric(sum(avocado_aggregate_2$week))), 'd27e825e408c446c586593f719b5545e')
})
print("Success!")

Question 1.10

Now let's take the avocado_aggregate_2 data frame and use it to create a scatter plot where we plot average total_volume on the y-axis versus week on the x-axis. Assign your answer to an object called avocado_aggregate_plot_2. Don't forget to create proper English axis labels.

#... <- ... %>%
#    ggplot(aes(x = ..., y = ...)) +
#    ...() +
#    ...("...") +
#    ...("...")

# your code here
fail() # No Answer - remove if you provide an answer

avocado_aggregate_plot_2
test_that('Solution is incorrect', {
    expect_that("week" %in% c(rlang::get_expr(avocado_aggregate_plot_2$mapping$x), rlang::get_expr(avocado_aggregate_plot_2$layers[[1]]$mapping$x)), is_true())
    expect_that("total_volume" %in% c(rlang::get_expr(avocado_aggregate_plot_2$mapping$y), rlang::get_expr(avocado_aggregate_plot_2$layers[[1]]$mapping$y)), is_true())
    expect_that("GeomPoint" %in% class(avocado_aggregate_plot_2$layers[[1]]$geom), is_true())
})
print("Success!")

We can see from the plot of average total volume versus week that more avocados are sold (perhaps reflecting what is available for sale) roughly between January and May. This period of increased volume corresponds with the lower avocado prices. We can hypothesize, then, that the lower prices may be due to an increased availability of avocados during this time period.

2. Sea Surface Temperatures in Departure Bay

The next data set that we will be looking at contains environmental data from 1914 to 2018. The data was collected by the DFO (Canada's Department of Fisheries and Oceans) at the Pacific Biological Station (Departure Bay). Daily sea surface temperature and salinity observations have been carried out at several locations on the coast of British Columbia. The number of stations reporting at any given time has varied as sampling has been discontinued at some stations, and started or resumed at others.

The program, presently termed the British Columbia Shore Station Oceanographic Program (BCSOP), has 12 participating stations, most of which are staffed by Fisheries and Oceans Canada. You can look at data from other stations at http://www.pac.dfo-mpo.gc.ca/science/oceans/data-donnees/lightstations-phares/index-eng.html

Further information from the Government of Canada's website indicates:

Observations are made daily using seawater collected in a bucket lowered into the surface water at or near the daytime high tide. This sampling method was designed long ago by Dr. John P. Tully and has not been changed in the interests of a homogeneous data set. This means, for example, that if an observer starts sampling one day at 6 a.m. and continues to sample at the daytime high tide, the sample will be taken at about 06:50 the next day, 07:40 the day after, etc. When the daytime high tide gets close to 6 p.m., the observer will begin again to sample early in the morning, and the cycle continues. Since there is a day/night variation in sea surface temperatures, the daily time series will show a signal that varies with the 14-day tidal cycle. This artifact does not affect the monthly sea surface temperature data.

In this Worksheet, we want to see if the sea surface temperature has been changing over time.

Question 2.1 True or False:

The sampling of surface water occurs at the same time each day.

Assign your answer to an object called answer2.1.

# Assign your answer to an object called: answer2.1
# Make sure the correct answer is written in lower-case (true / false)
# Surround your answer with quotation marks.
# Replace the fail() with your answer.

# your code here
fail() # No Answer - remove if you provide an answer
test_that('Solution is incorrect', {
    expect_equal(digest(answer2.1), 'd2a90307aac5ae8d0ef58e2fe730d38b') # we hid the answer to the test here so you can't see it, but we can still run the test
})
print("Success!")

Question 2.2 Multiple Choice

If high tide occurred at 9am today, what time would the scientist collect data tomorrow?

A. 11:10 am

B. 9:50 am

C. 10:00 pm

D. Trick question... you skip days when collecting data.

Assign your answer to an object called answer2.2.

# Assign your answer to an object called: answer2.2
# Make sure the correct answer is an uppercase letter.
# Surround your answer with quotation marks.
# Replace the fail() with your answer.

# your code here
fail() # No Answer - remove if you provide an answer
test_that('Solution is incorrect', {
    expect_equal(digest(answer2.2), '3a5505c06543876fe45598b5e5e5195d') # we hid the answer to the test here so you can't see it, but we can still run the test
})
print("Success!")

Question 2.3

To begin working with this data, read the file departure_bay_temperature.csv using a relative path. Note that this file (just like the avocado data set) is found within the worksheet_03 directory.

Assign your answer to an object called sea_surface.

# your code here
fail() # No Answer - remove if you provide an answer

head(sea_surface)
test_that('Solution is incorrect', {
    expect_equal(ncol(sea_surface), 13)
    expect_equal(nrow(sea_surface), 105)
    expect_equal(digest(as.numeric(sum(sea_surface$Year))), 'f56bfa7ed2b5549995da2c88407099a4')
})
print("Success!")

Question 2.4

Given that ggplot expects tidy data, we need to convert our data into that format. To do this we will use the gather() function. We would like our data to end up looking like the table below (a toy sketch of gather follows the table):

Year   Month   Temperature
1914   Jan     7.2
1914   Jan     7.2
1914   Jan     7.2
...    ...     ...
2016   Dec     5.5
2017   Dec     6.9
2018   Dec     NA
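As promised, here is a minimal sketch of gather on a toy wide table (two invented months and made-up temperatures, not the real data):

library(tidyr)
library(tibble)

toy_wide <- tibble(
    Year = c(1914, 1915),
    Jan  = c(7.2, 6.8),
    Feb  = c(7.0, 6.9)
)

# Every column except Year collapses into Month/Temperature pairs
gather(toy_wide, key = "Month", value = "Temperature", -Year)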

Fill in the ... in the cell below. Copy and paste your finished answer into the fail().

Assign your answer to an object called tidy_temp.

# ... <- sea_surface %>%
#    ...(key = '...', value = 'Temperature', -Year)

# your code here
fail() # No Answer - remove if you provide an answer

head(tidy_temp)
tail(tidy_temp)
test_that('Solution is incorrect', {
    expect_equal(ncol(tidy_temp), 3)
    expect_equal(digest(as.numeric(sum(tidy_temp$Temperature, na.rm = TRUE))), '378da034470b139e7a84f82e2fcdcb08')
})
print("Success!")

Question 2.5

Now that we have our data in a tidy format, we can create our plot comparing the average monthly sea surface temperatures to the year they were recorded. To make our plots more informative, we should plot each month separately. We can use filter to do this before we pipe our data into the ggplot function. Let's start out by plotting just the data for the month of November. As usual, use proper English to label your axes 😃

Assign your answer to an object called nov_temp_plot.

options(repr.plot.width = 5, repr.plot.height = 2.5)

#... <- ... %>%
#    filter(... == ...) %>%
#    ggplot(aes(x = ..., y = ...)) +
#    geom_point() +
#    xlab(...) +
#    ylab(...)

# your code here
fail() # No Answer - remove if you provide an answer

nov_temp_plot
test_that('Solution is incorrect', {
    expect_that("Year" %in% c(rlang::get_expr(nov_temp_plot$mapping$x), rlang::get_expr(nov_temp_plot$layers[[1]]$mapping$x)), is_true())
    expect_that("Temperature" %in% c(rlang::get_expr(nov_temp_plot$mapping$y), rlang::get_expr(nov_temp_plot$layers[[1]]$mapping$y)), is_true())
    expect_that("GeomPoint" %in% class(nov_temp_plot$layers[[1]]$geom), is_true())
})
print("Success!")

We can see that there may be fewer cold temperatures in recent years, and/or that the temperatures in recent years look less variable compared to the years before 1975. What about the other months? Let's plot them!

Instead of repeating the code above for the 11 other months, we'll take advantage of a ggplot2 function that we haven't met yet: facet_wrap. We will learn more about this function next week; this week we will give you the code for it.
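If you want a preview of what facet_wrap does, here is a minimal sketch using the mpg data set that ships with ggplot2 (nothing to do with temperatures):

library(ggplot2)

ggplot(mpg, aes(x = displ, y = hwy)) +
    geom_point() +
    facet_wrap(~ class) # one small panel per vehicle class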

Question 2.6

Fill in the missing code below to plot the average monthly sea surface temperatures against the year they were recorded, for all months. Assign your answer to an object called all_temp_plot.

options(repr.plot.width = 8, repr.plot.height = 8)

#... <- ... %>%
#    ggplot(aes(x = ..., y = ...)) +
#    geom_point() +
#    facet_wrap(~ factor(Month, levels = c("Jan","Feb","Mar","Apr","May","Jun",
#                                          "Jul","Aug","Sep","Oct","Nov","Dec"))) +
#    xlab(...) +
#    ylab(...)

# your code here
fail() # No Answer - remove if you provide an answer

all_temp_plot
test_that('Solution is incorrect', {
    expect_that("Year" %in% c(rlang::get_expr(all_temp_plot$mapping$x), rlang::get_expr(all_temp_plot$layers[[1]]$mapping$x)), is_true())
    expect_that("Temperature" %in% c(rlang::get_expr(all_temp_plot$mapping$y), rlang::get_expr(all_temp_plot$layers[[1]]$mapping$y)), is_true())
    expect_that("GeomPoint" %in% class(all_temp_plot$layers[[1]]$geom), is_true())
})
print("Success!")

We can see above that some months show a small, but general increase in temperatures, whereas others don't. And some months show a change in variability and others do not. From this it is clear to us that if we are trying to understand temperature changes over time, we best keep data from different months separate.

3. Pollution in Madrid

We're working with a data set from Kaggle once again! This data was collected under instructions from Madrid's City Council and is publicly available on their website. In recent years, high levels of pollution during certain dry periods have forced the authorities to take measures against the use of cars and have served as justification for proposing certain regulations. This data includes daily and hourly measurements of air quality from 2001 to 2008. Pollutants are categorized based on their chemical properties.

There are a number of stations set up around Madrid, and each station's data frame contains all the particle measurements that the station registered from 01/2001 - 04/2008. Not every station has the same equipment, therefore each station can measure only a certain subset of particles. The complete list of possible measurements and their explanations, as given on the website:

  • SO_2: sulphur dioxide level measured in μg/m³. High levels can produce irritation in the skin and membranes, and worsen asthma or heart diseases in sensitive groups.

  • CO: carbon monoxide level measured in mg/m³. Carbon monoxide poisoning involves headaches, dizziness and confusion in short exposures and can result in loss of consciousness, arrhythmias, seizures or even death.

  • NO_2: nitrogen dioxide level measured in μg/m³. Long-term exposure is a cause of chronic lung diseases and is harmful to vegetation.

  • PM10: particles smaller than 10 μm. Even though they cannot penetrate the alveoli, they can still penetrate the lungs and affect other organs. Long-term exposure can result in lung cancer and cardiovascular complications.

  • NOx: nitrogen oxides level measured in μg/m³. These affect the human respiratory system, worsening asthma and other diseases, and are responsible for the yellowish-brown color of photochemical smog.

  • O_3: ozone level measured in μg/m³. High levels can produce asthma, bronchitis or other chronic pulmonary diseases in sensitive groups or outdoor workers.

  • TOL: toluene (methylbenzene) level measured in μg/m³. Long-term exposure to this substance (present in tobacco smoke as well) can result in kidney complications or permanent brain damage.

  • BEN: benzene level measured in μg/m³. Benzene is an eye and skin irritant, and long exposure may result in several types of cancer, leukaemia and anaemias. Benzene is considered a Group 1 carcinogen (carcinogenic to humans).

  • EBE: ethylbenzene level measured in μg/m³. Long term exposure can cause hearing or kidney problems and the IARC has concluded that long-term exposure can produce cancer.

  • MXY: m-xylene level measured in μg/m³. Xylenes can affect not only air but also water and soil, and long exposure to high levels of xylenes can result in diseases affecting the liver, kidneys and nervous system.

  • PXY: p-xylene level measured in μg/m³. See MXY for xylene exposure effects on health.

  • OXY: o-xylene level measured in μg/m³. See MXY for xylene exposure effects on health.

  • TCH: total hydrocarbons level measured in mg/m³. This group of substances can be responsible for different blood, immune system, liver, spleen, kidney or lung diseases.

  • NMHC: non-methane hydrocarbons (volatile organic compounds) level measured in mg/m³. Long exposure to some of these substances can result in damage to the liver, kidney, and central nervous system. Some of them are suspected to cause cancer in humans.

The goal of this assignment is to see whether pollutants are decreasing (i.e., is air quality improving?) and to compare which pollutant has decreased the most over the span of 5 years (2001 - 2006).

  1. First, plot one of the pollutants (CO).

  2. Next, group it by month and year; calculate the maximum value and plot it (to see the trend through time).

  3. Now we will look at which pollutant decreased the most. Repeat the same thing for every column; to speed up the process, use the map() function. First we will look at pollution in 2001 (get the maximum value for each of the pollutants), and then do the same for 2006.

Question 3.1 Multiple Choice:

What big picture question are we trying to answer?

A. Did CO decrease in Madrid between 2001 and 2006?

B. Of all the pollutants, which decreased the least between 2001 and 2006?

C. Of all the pollutants, which increased the most between 2001 and 2006?

D. Did CO increase in Madrid between 2001 and 2006?

Assign your answer to an object called answer3.1.

# Assign your answer to an object called: answer3.1
# Make sure the correct answer is an uppercase letter.
# Surround your answer with quotation marks.
# Replace the fail() with your answer.

# your code here
fail() # No Answer - remove if you provide an answer
test_that('Solution is incorrect', {
    expect_equal(digest(as.character(answer3.1)), '3a5505c06543876fe45598b5e5e5195d') # we hid the answer to the test here so you can't see it, but we can still run the test
})
print("Success!")

Question 3.2

To begin working with this data, read the file madrid_pollution.csv. Note that this file (just like the avocado and sea surface data sets) is found in the worksheet_03 directory.

Assign your answer to an object called madrid.

# your code here
fail() # No Answer - remove if you provide an answer

head(madrid)
test_that('Solution is incorrect', {
    expect_equal(digest(ncol(madrid)), '20e70f4a08bdc6a54e53ad0a7d1498b6')
    expect_equal(digest(nrow(madrid)), '1d5acdfd45c7f7b957ba8f1f097a6d9e')
    expect_equal(digest(sum(madrid$BEN)), '9c9393e1464352cd4fbea94dfadfa02a')
})
print("Success!")

Question 3.3

Now that the data is loaded in R, create a plot that compares carbon monoxide (CO) values against the date they were recorded. This graph will show the concentration of carbon monoxide in Madrid over time. As usual, label your axes:

  • x = Date

  • y = Carbon Monoxide

Assign your answer to an object called CO_pollution.

options(repr.plot.width = 6, repr.plot.height = 3)

# Assign your plot to an object called: CO_pollution

# your code here
fail() # No Answer - remove if you provide an answer

CO_pollution # Are levels increasing or decreasing?
test_that('Solution is incorrect', {
    expect_that("date" %in% c(rlang::get_expr(CO_pollution$mapping$x), rlang::get_expr(CO_pollution$layers[[1]]$mapping$x)), is_true())
    expect_that("CO" %in% c(rlang::get_expr(CO_pollution$mapping$y), rlang::get_expr(CO_pollution$layers[[1]]$mapping$y)), is_true())
    expect_that("GeomPoint" %in% class(CO_pollution$layers[[1]]$geom), is_true())
})
print("Success!")

We can see from this plot that, over time, there are fewer and fewer high (> 5) CO values.

Question 3.4

The question above asked you to write code that visualizes all CO recordings, which are taken every single hour of every day. Consequently, the graph consists of many points and appears densely plotted. In this question, we are going to clean up the graph and focus on the maximum carbon monoxide reading from each month. To further investigate whether this trend is changing over time, we will use group_by and summarize to create a new data set.

Fill in the ... in the cell below. Copy and paste your finished answer into the fail().

Assign your answer to an object called madrid_pollution.

# ... <- ... %>%
#    group_by(year, ...) %>%
#    ...(max_co = max(CO, na.rm = TRUE))

# your code here
fail() # No Answer - remove if you provide an answer

head(madrid_pollution)
test_that('Solution is incorrect', {
    expect_equal(ncol(madrid_pollution), 3)
    expect_equal(digest(as.numeric(sum(madrid_pollution$year))), '625a52cbf322663507323f452293484f')
})
print("Success!")

Question 3.5

Plot the new maximum CO values against the month they were recorded. We will use faceting again (this time with facet_grid, more on this next week) to plot each year side by side. We will also use the theme function to rotate the axis labels and make them more readable (more on this is coming next week too!).

Assign your answer to an object called madrid_plot. Remember to label your axes.

options(repr.plot.width = 8, repr.plot.height = 3)

#... <- ... %>%
#    ggplot(aes(x = ..., y = ...)) +
#    geom_point() +
#    xlab(...) +
#    ylab(...) +
#    facet_grid(~ year) +
#    theme(axis.text.x = element_text(angle = 90, hjust = 1))

# your code here
fail() # No Answer - remove if you provide an answer

madrid_plot
test_that('Solution is incorrect', {
    expect_that("month" %in% c(rlang::get_expr(madrid_plot$mapping$x), rlang::get_expr(madrid_plot$layers[[1]]$mapping$x)), is_true())
    expect_that("max_co" %in% c(rlang::get_expr(madrid_plot$mapping$y), rlang::get_expr(madrid_plot$layers[[1]]$mapping$y)), is_true())
    expect_that("GeomPoint" %in% class(madrid_plot$layers[[1]]$geom), is_true())
})
print("Success!")

Question 3.6

Now we want to see which of the pollutants has decreased the most. Therefore, we must repeat the same thing that we did in the questions above, but for every pollutant (using the original data set)! This is where purrr's map_* functions can be really helpful!

First we will look at Madrid pollution in 2001 (filter for this year). Next we have to deselect the columns that should be excluded (such as the date). Lastly, use the map_df() function to compute max values for all remaining columns.
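Here is a minimal sketch of the map_df pattern on a toy data frame (invented columns and values). A data frame is a list of columns, so mapping over it applies the function to each column:

library(purrr)
library(tibble)

toy <- tibble(
    a = c(1, 5, NA),
    b = c(2, 2, 9)
)

toy %>%
    map_df(~ max(., na.rm = TRUE)) # a one-row data frame of column maxima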

Fill in the ... in the cell below. Copy and paste your finished answer into the fail().

Assign your answer to an object called pollution_2001.

# ... <- madrid %>%
#    ...(year == 2001) %>%
#    select(-..., -year, -month) %>%
#    map_df(~ ...(., na.rm = TRUE))

# your code here
fail() # No Answer - remove if you provide an answer

pollution_2001
test_that('Solution is incorrect', {
    expect_equal(digest(as.numeric(sum(pollution_2001))), '7641b9c141801f7dfd1e36295531cd81')
})
print("Success!")

Question 3.7

Now repeat what you did for Question 3.6, but filter for 2006 instead.

Assign your answer to an object called pollution_2006.

# your code here
fail() # No Answer - remove if you provide an answer

pollution_2006
test_that('Solution is incorrect', {
    expect_equal(digest(as.numeric(sum(pollution_2006))), 'd5563a7b0d7d1a9eb53269ad69ed4e0b')
})
print("Success!")

Question 3.8

Which pollutant decreased by the greatest magnitude between 2001 and 2006? Given that the two objects you just created, pollution_2001 and pollution_2006, are data frames with the same columns, you should be able to subtract one from the other to find which pollutant decreased by the greatest magnitude between the two years.
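In case you have not subtracted two data frames before: arithmetic on data frames with identical columns happens element-wise, as this toy sketch shows (invented values):

library(tibble)

df_old <- tibble(x = 10, y = 4)
df_new <- tibble(x = 7,  y = 5)

df_new - df_old # a one-row data frame: x = -3, y = 1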

Assign your answer to an object called answer3.8. Make sure to write the answer exactly as it is given in the data set. Example:

answer3.8 <- "BEN"
# In this cell, write your answer to Question 3.8
# Note: negative values in the cell above refer to increase in pollutant from 2001 to 2006
# Assign your answer to an object called: answer3.8
# Make sure the correct answer is uppercase and includes the appropriate symbols.
# Surround your answer with quotation marks.
# Replace the fail() with your answer.

# your code here
fail() # No Answer - remove if you provide an answer
test_that('Solution is incorrect', {
    expect_equal(digest(as.character(answer3.8)), '1ce38a3fa8946d5768f4fc87b739ec31') # we hid the answer to the test here so you can't see it, but we can still run the test
})
print("Success!")

Question 3.9

Given that there were only 14 columns in the data frame above, you could use your human eyeballs to pick out which pollutant decreased by the greatest magnitude between 2001 and 2006. But what would you do if you had 100 columns? Or 1000 columns? It would take A LONG TIME for your human eyeballs to find the biggest difference. Maybe you could use the min function:

# run this cell
min(pollution_2006 - pollution_2001)

This is a step in the right direction, but you get the value and not the column name... What are we to do? Tidy our data! Our data is not in tidy format, so it's difficult to access the values for the variable pollutant because they are stuck as column headers. Let's use gather to tidy our data and make it look like the table below (a toy sketch of gather with everything() follows the table):

pollutant   value
BEN         -33.04
CO          -6.91
...         ...
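Here is a toy sketch of this use of gather with everything() on a one-row data frame (invented values, hypothetical column names):

library(tidyr)
library(tibble)

toy_diff <- tibble(BEN = -3.1, CO = -0.5)

# everything() selects all columns, so each column becomes one row
gather(toy_diff, key = "pollutant", value = "value", everything())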

To answer this question, fill in the ... in the cell below. Copy and paste your finished answer into the fail().

Assign your answer to an object called pollution_diff and ensure it has the same column names as the table pictured above.

pollution_diff <- pollution_2006 - pollution_2001
#pollution_diff <- gather(..., key = ..., value = ..., everything())

# your code here
fail() # No Answer - remove if you provide an answer

head(pollution_diff)
test_that('Solution is incorrect', {
    expect_equal(digest(as.numeric(sum(pollution_diff$value))), 'd4f4caf121e7237dc6328782e399d193') # we hid the answer to the test here so you can't see it, but we can still run the test
    expect_equal(colnames(pollution_diff), c('pollutant', 'value'))
    expect_equal(nrow(pollution_diff), 14)
    expect_equal(ncol(pollution_diff), 2)
})
print("Success!")

Question 3.10

Now that you have tidy data, you can use arrange and desc to order the data in descending order. The output of that can be piped to head, and head can be given the argument n = 1 to return only the first row of the data frame.
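For instance, here is a minimal sketch of the arrange(desc(...)) %>% head(n = 1) pattern on a toy data frame (hypothetical labels and values):

library(dplyr)

toy <- tibble(
    pollutant = c("A", "B", "C"),
    value     = c(1.5, 9.2, 4.3)
)

toy %>%
    arrange(desc(value)) %>% # largest value first
    head(n = 1)              # keep only the top row ("B" here)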

To answer this question, fill in the ... in the cell below. Copy and paste your finished answer into the fail().

Assign your answer to an object called max_pollution_diff.

#... <- ... %>%
#    arrange(desc(...)) %>%
#    head(n = 1)

# your code here
fail() # No Answer - remove if you provide an answer

max_pollution_diff
test_that('Solution is incorrect', {
    expect_equal(digest(as.numeric(pollution_diff$value)), '8f7af6d08af08577fa68dfd6f3aa7f7d') # we hid the answer to the test here so you can't see it, but we can still run the test
    expect_equal(colnames(max_pollution_diff), c('pollutant', 'value'))
    expect_equal(nrow(max_pollution_diff), 1)
    expect_equal(ncol(max_pollution_diff), 2)
})
print("Success!")

At the end of this data wrangling worksheet, we'll leave you with a couple of quotes to ponder:

“Happy families are all alike; every unhappy family is unhappy in its own way.” –– Leo Tolstoy

“Tidy datasets are all alike, but every messy dataset is messy in its own way.” –– Hadley Wickham

source: Tidy data chapter from R for Data Science by Garrett Grolemund & Hadley Wickham