GitHub Repository: UBC-DSCI/dsci-100-assets
Path: blob/master/2019-spring/materials/tutorial_03/tutorial_03.ipynb
²⁶⁸² views

Kernel: R

Tutorial 3: Cleaning and Wrangling Data

Any place you see ..., you must fill in the function, variable, or data to complete the code. Replace fail() with your completed code and run the cell!

In [ ]:

### Run this cell before continuing. 
library(repr)
library(tidyverse)
library(testthat)
library(digest)

Revision Question Match the following definitions with the corresponding functions used in R:

Definitions

A. Transforms the input by applying a function to each element and returning a vector the same length as the input.

B. Reads files that have columns separated by tabs.

C. Most data operations are done on groups defined by variables. This function takes an existing data set and converts it into a grouped data set where operations are performed "by group".

D. Works in an analogous way to mutate, except instead of adding columns to an existing data frame, it creates a new data frame.

E. Takes multiple columns and collapses them into key-value pairs, duplicating all other columns as needed. You use it when you notice that you have columns that are not variables.

F. Labels the x-axis.

Functions

group_by
map
read_tsv
summarise
xlab
gather

For every description, create an object using the letter associated with the definition and assign it to the corresponding number from the list of functions. For example: B <- 1

In [ ]:

# Assign your answer to a letter: A, B, C, D, E, F
# Make sure the correct answer is a numerical number from 1-6 
# Replace the fail() with your answer. 

# your code here
fail() # No Answer - remove if you provide an answer

In [ ]:

test_that('Solution is incorrect', {
    expect_equal(digest(A), 'db8e490a925a60e62212cefc7674ca02') # we hid the answer to the test here so you can't see it, but we can still run the test
    expect_equal(digest(B), 'e5b57f323c7b3719bbaaf9f96b260d39') # we hid the answer to the test here so you can't see it, but we can still run the test
    expect_equal(digest(C), '6717f2823d3202449301145073ab8719') # we hid the answer to the test here so you can't see it, but we can still run the test
    expect_equal(digest(D), 'dbc09cba9fe2583fb01d63c70e1555a8') # we hid the answer to the test here so you can't see it, but we can still run the test
    expect_equal(digest(E), '0aee9b78301d7ec8998971363be87c03') # we hid the answer to the test here so you can't see it, but we can still run the test
    expect_equal(digest(F), '5e338704a8e069ebd8b38ca71991cf94') # we hid the answer to the test here so you can't see it, but we can still run the test
    
})
print("Success!")

1. Historical Data on Avocado Prices

In the tutorial, we will be finishing off our analysis of the avocado data set.

You might recall from the lecture that millennials LOVE avocado toast. However, avocados are expensive and this is costing millennials a lot more than you think (joking again 😉, well mostly...). To ensure that they can save enough to buy a house, it would be beneficial for an avocado fanatic to move to a city with low avocado prices. From Worksheet 3 we saw that the price of the avocados is less in the months between December and May, but we still don't know which region contains the cheapest avocados.

image source: https://media.giphy.com/media/8p3ylHVA2ZOIo/giphy.gif

As a reminder, here are some relevant columns in the dataset:

average_price - The average price of a single avocado.
type - conventional or organic
year - The year
region - The city or region of the observation
small_hass_volume
large_hass_volume
extra_l_hass_volume

The last three columns will create total_volume. The goal for today is to find the region with the cheapest avocados and then produce a plot of the total number of avocados sold against the average price per avocado (in US dollars) for that region. To do this, you will follow the steps below.

use read_csv to download the csv file onto your notebook
use group_by + summarize to find the region with the cheapest avocados.
use filter to specifically look at data from the region of interest.
use mutate to add up the volume for all types of avocados (small, large, and extra)
use ggplot to create our plot of volume vs average price

Question 1.1

Read the file avocado_prices.csv found in the tutorial_03 directory using a relative path.

Assign your answer to an object called avocado.

In [ ]:

# your code here
fail() # No Answer - remove if you provide an answer
head(avocado)

In [ ]:

test_that('Solution is incorrect', {
    expect_equal(ncol(avocado), 9)
    expect_equal(nrow(avocado), 17911)
    expect_equal(digest(sum(avocado$average_price)), 'b987e782cc203bbe06ef8fde8da42991')    
})
print("Success!")

Question 1.2

Second step: Find the region with the cheapest avocados in 2018. To do this calculate the average price for each region. Your answer should be the row from a data frame with the lowest average price. The data frame you create should have two columns, one named region that has the region, and the other that contains the average price for that region.

Assign your answer to an object called cheapest.

In [ ]:

# cheapest <- ... %>%
#    ... %>%
#    group_by(...) %>%
#    summarize(...) %>%
#    arrange(...) %>% 
#    head(...)

# your code here
fail() # No Answer - remove if you provide an answer
cheapest

In [ ]:

test_that('Solution is incorrect', {
    expect_equal(nrow(cheapest), 1)
    expect_equal(ncol(cheapest), 2)
    expect_equal(digest(cheapest$region[0]), '5152ac13bdd09110d9ee9c169a3d9237') # we hid the answer to the test here so you can't see it, but we can still run the test
    expect_equal(digest(as.numeric(unlist(select(cheapest, -region)))), '481ea83b5704f345de5f42f139ee11c7')
})
print("Success!")

Question 1.3

The next step to plotting total volume against average price for the cheapest region is to filter the avocado data frame for the region found in Question1.2. Then you need to mutate the data frame such that total_volume is equal to the addition of all three volume columns. Finally, you will have the data so that you can create a scatter plot with:

x = total_volume
y = average_price

Fill in the ... in the cell below. Copy and paste your finished answer into the fail(). We will be scaling the axes - the function is added in the scaffolding for you.

Assign your answer to an object called avocado_plot.

In [ ]:

options(repr.plot.width = 5, repr.plot.height = 2.5)

# avocado_plot <- ... %>%
#     mutate(...) %>%
#     filter(...) %>%
#     ggplot(aes(x = ..., y = ...)) + 
#         ... + 
#         xlab(...) +
#         ylab(...) +
#         scale_x_log10()

# your code here
fail() # No Answer - remove if you provide an answer
avocado_plot

In [ ]:

test_that('Solution is incorrect', {
    expect_that("total_volume" %in% c(rlang::get_expr(avocado_plot$mapping$x), rlang::get_expr(avocado_plot$layers[[1]]$mapping$x)) , is_true())
    expect_that("average_price" %in% c(rlang::get_expr(avocado_plot$mapping$y), rlang::get_expr(avocado_plot$layers[[1]]$mapping$y)) , is_true())
    expect_that("GeomPoint" %in% c(class(avocado_plot$layers[[1]]$geom)) , is_true())
    })
print("Success!")

Question 1.4

What do you notice? Discuss your plot with the person next to you.

To further investigate this trend, let's colour the data points to see if the type of avocado (either organic or not, which is called conventional in this data set) affects the volume and price of avocados sold in our region of interest.

Run the cell below to colour the data points by avocado type.

In [ ]:

# Run this cell to see if type plays a role in production and price. 
avocado_plot <- avocado_plot + 
    geom_point(aes(colour = type))
avocado_plot

Question 1.4 Continued

In 2-3 sentences, describe what you seen in the graph above. Comment specifically on whether there is any evidence/indication that avocado type might influence price?

Hint: Make sure to include information about volume, average price, and type in your answer.

YOUR ANSWER HERE

2. Historical Data on Avocado Prices (Continued)

Question 2.1

Now that we know the region that sells the cheapest avocados (on average), which region sells the most expensive avocados (on average). And for that region, what role might avocado type play in sales? Repeat the analysis you did above, but now apply it to investigate the region which sells the most expensive avocados (on average).

In [ ]:


# your code here
fail() # No Answer - remove if you provide an answer

Question 2.2

In 2-3 sentences, describe what you seen in the graph above for the region with the most expensive avocados (on average). Comment specifically on whether there is any evidence/indication that avocado type might influence price?

Hint: Make sure to include information about volume, average price, and type in your answer.

YOUR ANSWER HERE

Question 2.3

Plot the scatterplots for the two regions so that they are in adjacent cells (so it is easier for you to compare them). Compare the price and volume data across the two regions. And argue for or against the hypothesis that the region that has the cheapest avocados, has the cheapest avocados because it sells less of the organic (expensive) type of avocados compared to conventional cheaper ones.

YOUR ANSWER HERE

3. Sea Surface Salinity in Departure Bay

As mentioned in this week's Worksheet, Canada's Department of Fisheries and Oceans (DFO) compiled environmentally essential data from 1914 to 2018. The data was collected at the Pacific Biological Station (Departure Bay). Daily sea surface temperature and salinity observations have been carried out at several locations on the coast of British Columbia. The number of stations reporting at any given time has varied as sampling has been discontinued at some stations, and started or resumed at others.

In Worksheet 3 we already worked with the temperature observations. Now, we will be focussing on salinity! Specifically, we want to see if the maximum salinity of the month has been changing over the years. We will only be focussing our attention on the winter months December, January and February.

Question 3.1

To begin working with this data, read the file max_salinity.csv into R. Note, this file (just like the avocado data set) is found within the tutorial_03 folder.

Assign your answer to an object called sea_surface.

In [ ]:

# your code here
fail() # No Answer - remove if you provide an answer
head(sea_surface)

In [ ]:

test_that('Solution is incorrect', {
    expect_equal(digest(sum(sea_surface$Dec)), '9c9393e1464352cd4fbea94dfadfa02a')
    expect_equal(digest(as.numeric(ncol(sea_surface))), '709a7faa72cb6f3be79d683e234ccb25')    
})
print("Success!")

Question 3.2

Given that ggplot prefers tidy data, we must tidy the data! Use the gather() function to create a tidy data frame with three columns: Year, Month and Salinity. Remember we only want to look at the summer months (December, January and February) so don't forget to reduce the data to just those months!

Assign your answer to an object called max_salinity.

In [ ]:

# ... <- sea_surface %>%
#     ...(...) %>%
#     ...(key = '...', value = 'Salinity', -Year)

# your code here
fail() # No Answer - remove if you provide an answer
head(max_salinity)
tail(max_salinity)

Question 3.3

Now that we've created new columns, we can finally create our plot that compares the maximum salinity observations to the year they were recorded. As usual, label your axes!

Assign your answer to an object called max_salinity_plot.

In [ ]:

# your code here
fail() # No Answer - remove if you provide an answer
max_salinity_plot

Question 3.4

In 1-2 sentences, describe what you seen in the graph above. Comment specifically on whether there is a change in salinity across time for the winter months.

YOUR ANSWER HERE

4. Pollution in Madrid

The goal of this analysis (which we started in worksheet_03) is to see if pollutants are decreasing (is air quality improving) and also determine which pollutant has decreased the most over the span of 5 years (2001 - 2006). In worksheet_03 we investigated what happened with the maximum values of each pollutant over time, now we will investigate the average values of each pollutant over time. To do this we will:

Calculate the average monthly value for each pollutant for each year.
Create a scatter plot for the average monthly value for each month. Plot these values for each pollutant and each year so that a trend over time for each pollutant can be observed.
Now we will look at which pollutant decreased the most between 2001 - 2006 when we look at the average instead of the maximum values.

Question 4.1

To begin working with this data, read the file madrid_pollution.csv. Note, this file (just like the other data sets in this tutorial) is found in the tutorial_03 directory.

Assign your answer to an object called madrid.

In [ ]:

# your code here
fail() # No Answer - remove if you provide an answer
head(madrid)

Given that we are going to plotting months, which are dates, let's tell R how they should be ordered. We can do this by changing the month column from a character vector to a factor vector. Factors in R are useful for categorical data and they have an order.

In [ ]:

# run the cell to order the column month by month (date) and not alphabetically
madrid <- madrid %>% 
    mutate(month = factor(month, levels = c('January','February','March','April',
                                            'May','June','July','August',
                                            'September','October','November','December')))

Question 4.2

Calculate the average monthly value for each pollutant for each year and store that as a data frame. Your data frame should have the following 4 columns:

year
month
pollutant
monthly_avg

In [ ]:

# your code here
fail() # No Answer - remove if you provide an answer

Question 4.3

Create a scatter plot for the average monthly value for each month. Plot these values for each pollutant and each year so that a trend over time for each pollutant can be observed. To do this all in one plot, you are going to want to use a facet_grid layer (makes subplots within one plot when data are "related") and a theme layer (to adjust the angle of the text on the x-axis. We provide you with the code for these two layers in the scaffolding for this plot.

In [ ]:

options(repr.plot.width = 9, repr.plot.height = 16)
#... <- ... %>%
#    ggplot(aes(x = ..., y = ...)) + 
#    geom_point() +
#    xlab(...) + 
#    ylab(...) +
#    facet_grid(pollutant ~ year, scales = "free") +
#    theme(axis.text.x = element_text(angle = 90, hjust = 1)) 

# your code here
fail() # No Answer - remove if you provide an answer
madrid_avg_plot

Question 4.4

By looking at the plots above, which monthly average pollutant levels appear to have decreased over time? Which appear to have increased?

YOUR ANSWER HERE

Question 4.5

Now we will look at which pollutant decreased the most between 2001 - 2006 when we look at the average yearly values for each pollutant. Your final result should be a data frame that has at least these two columns: pollutant and yearly_avg_diff and one row (the most decreased pollutant when looking at yearly average between 2001 - 2006).

There are several different ways to solve this problem. My solution included using a function called spread which is the inverse of gather. If you would like to use that function, see here for more info.

In [ ]:

# your code here
fail() # No Answer - remove if you provide an answer

Question 4.6

Did using the average to find the most decreased pollutant between 2001 and 2006 give you the same answer as using the maximum in the worksheet? Is your answer to the previous question surprising? Explain.

YOUR ANSWER HERE

** Optional question** (for fun not and does not count for grades):

Consider doing the same analysis Question 4.5, except this time calculate the difference as a percent or fold differen (as opposed to absolute difference as we did in Question 4.5). The scales for the pollutants are very different, and so we might want to take this into consideration when trying to answer the question "Which pollutant decreased the most"?

Tutorial 3: Cleaning and Wrangling Data

1. Historical Data on Avocado Prices

2. Historical Data on Avocado Prices (Continued)

3. Sea Surface Salinity in Departure Bay

4. Pollution in Madrid

Product

Resources

Company