UBC-DSCI
GitHub Repository: UBC-DSCI/dsci-100-assets
Path: blob/master/2019-spring/materials/worksheet_04/worksheet_04.ipynb
Kernel: R

Worksheet 4: Effective Data Visualization

Lecture and Tutorial Learning Goals:

Expand your data visualization knowledge and tool set beyond what we have seen and practiced so far. We will move beyond scatter plots and learn other effective ways to visualize data, as well as some general rules of thumb to follow when creating visualizations. All visualization tasks this week will be applied to real world data sets.

After completing this week's lecture and tutorial work, you will be able to:

  1. Define the three key aspects of ggplot objects:

    • aesthetic mappings

    • geometric objects

    • scales

  2. Use the ggplot2 function in R to create the following visualizations:

    • 2-D scatter plot

    • 2-D scatter plot with a third variable that stratifies the groups

    • count bar chart for multiple groups

    • proportion bar chart for multiple groups

    • stacked bar chart for multiple groups

  3. List the rules of thumb for effective visualizations

  4. Given a visualization and a sentence describing its intended task, evaluate its effectiveness and suggest ways to improve the visualization with respect to that intended task

This worksheet covers parts of Chapter 4 of the online textbook. You should read this chapter before attempting the worksheet.

### Run this cell before continuing.
library(tidyverse)
library(testthat)
library(digest)
library(repr)

Question 0.1 True or False:

Is colour an aesthetic?

Assign your answer to an object called answer0.1.

# Assign your answer to an object called: answer0.1
# Make sure the correct answer is written in lower-case (true / false)
# Surround your answer with quotation marks.
# Replace the fail() with your answer.

# your code here
fail() # No Answer - remove if you provide an answer

test_that('Solution is incorrect', {
    expect_equal(digest(answer0.1), '05ca18b596514af73f6880309a21b5dd') # we hid the answer to the test here so you can't see it, but we can still run the test
})
print("Success!")

Question 0.2 Multiple Choice:

When deciding on the size of your visualization we recommend that you:

A. Only make the plot area (where the dots, lines, bars are) as big as needed

B. Make it as big as your screen allows

C. Use the default given by ggplot

Assign your answer to an object called answer0.2.

# Assign your answer to an object called: answer0.2
# Make sure the correct answer is an uppercase letter.
# Surround your answer with quotation marks.
# Replace the fail() with your answer.

# your code here
fail() # No Answer - remove if you provide an answer

test_that('Solution is incorrect', {
    expect_equal(digest(answer0.2), '75f1160e72554f4270c809f041c7a776') # we hid the answer to the test here so you can't see it, but we can still run the test
})
print("Success!")

Question 0.3 Under what circumstance would you use a 3D plot?

A. When you have 3 variables that you want to show the relationship between

B. When you want to emphasize the large difference between groups

C. When you need to grab attention of your audience

D. Never, we don't see well in 3D

Assign your answer to an object called answer0.3.

# Assign your answer to an object called: answer0.3
# Make sure the correct answer is an uppercase letter.
# Surround your answer with quotation marks.
# Replace the fail() with your answer.

# your code here
fail() # No Answer - remove if you provide an answer

test_that('Solution is incorrect', {
    expect_equal(digest(answer0.3), 'c1f86f7430df7ddb256980ea6a3b57a4') # we hid the answer to the test here so you can't see it, but we can still run the test
})
print("Success!")

Question 0.4 What is the symbol used to add a new layer to a ggplot object?

A. <-

B. %>%

C. %&%

D. +

Assign your answer to an object called answer0.4.

# Assign your answer to an object called: answer0.4
# Make sure the correct answer is an uppercase letter.
# Surround your answer with quotation marks.
# Replace the fail() with your answer.

# your code here
fail() # No Answer - remove if you provide an answer

test_that('Solution is incorrect', {
    expect_equal(digest(answer0.4), 'c1f86f7430df7ddb256980ea6a3b57a4') # we hid the answer to the test here so you can't see it, but we can still run the test
})
print("Success!")

Data scientists find work in all sectors of the economy and all types of organizations! Some work in collaboration with public sector organizations to solve problems that affect society, either at a local or at a global scale. Today we will be looking at a global problem with real data from the World Health Organization (WHO). According to WHO, Polio is a disease that affects mostly children younger than 5 years old, and to date there is no cure. However, with vaccines, children can develop sufficient antibodies to be immune to the disease. Another disease, Hepatitis B, is also known to affect infants, but in a chronic manner. There is also a vaccine available for Hepatitis B.

The columns in the dataset are:

  • who_region - The WHO region of the world

  • year - The year

  • pct_vaccinated - Estimated percentage of people vaccinated in the region

  • vaccine - Whether it's the polio or the hepatitis_b vaccine

We want to know two things. First, has there been a change in Polio or Hepatitis B vaccination patterns throughout the years? And if so, what is that pattern? Second, have the vaccination patterns for one of these diseases changed more than the other? The goal for today is to produce a plot of the estimated percentage of people vaccinated per year. To do this, you will follow the steps outlined below.

The original datasets are available here:

However, the data set we will work with for this worksheet is named world_vaccination.csv, and it lives in the worksheet_04 directory.

Question 1.0

Consider the columns/variables year and pct_vaccinated. Are they:

A. both quantitative (e.g., numerical).

B. both categorical

C. one is categorical and one is quantitative

D. None of the above.

Assign your answer to an object called answer1.0.

# Assign your answer to an object called: answer1.0
# Make sure the correct answer is an uppercase letter.
# Surround your answer with quotation marks.
# Replace the fail() with your answer.

# your code here
fail() # No Answer - remove if you provide an answer

test_that('Solution is incorrect', {
    expect_equal(digest(answer1.0), '75f1160e72554f4270c809f041c7a776') # we hid the answer to the test here so you can't see it, but we can still run the test
})
print("Success!")

Question 1.1

Read the world_vaccination.csv file. Next, filter the data so that we don't have any NA's in our columns. We also want to filter out the (WHO) Global region as it is just the average of the other regions. Assign your filtered data to an object called world_vaccination.

When you want to keep only the rows where something is not true, you can use the ! (not) operator in R, for example as part of != (not equal to). For example, to remove all the cars with 6 cylinders from the mtcars data set we would do the following:

head(mtcars)
filter(mtcars, cyl != 6) %>%
    head()
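The same idea extends to missing values: putting ! in front of is.na() keeps only the rows where a value is present. Below is a minimal sketch that uses R's built-in airquality data set rather than the worksheet data, so the column names Ozone and Month are specific to this illustration and are not columns of world_vaccination.

```r
library(dplyr)

# airquality is a built-in data set; its Ozone column contains NAs.
# Keep rows where Ozone is present AND the month is not May (5).
clean_air <- airquality %>%
    filter(!is.na(Ozone),
           Month != 5)

head(clean_air)
```

Separating conditions with a comma inside filter means that a row must satisfy all of them to be kept.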
#... <- read_...(...) %>%
#    filter(!is.na(...),
#           ... != ...)

# your code here
fail() # No Answer - remove if you provide an answer
head(world_vaccination)

test_that('Solution is incorrect', {
    expect_equal(digest(as.numeric(sum(world_vaccination$year))), 'ed828cd9c4dbc736fb12a1a8643aaeec')
})
print("Success!")

Question 1.2

Create a scatter plot of the percentage of people vaccinated (y-axis) against the year (x-axis) for all the regions in world_vaccination. Make sure to label your axes with human readable labels.

Assign your plot to an object called world_vacc_plot

# your code here
fail() # No Answer - remove if you provide an answer
world_vacc_plot

test_that('Solution is incorrect', {
    expect_that(digest(rlang::get_expr(world_vacc_plot$mapping$x)) == '707549965697b4525f5ab2b94106b632', is_true())
    expect_that(digest(rlang::get_expr(world_vacc_plot$mapping$y)) == 'da16a50a39585b5f38e60840bb096ae2', is_true())
})
print("Success!")

Wow! That's a big plot! And unnecessarily so! It is so big that at 100% zoom I cannot even read the labels and see all of the data within my screen! Given that the "ink" (dots) in this plot is not very complex, let's take charge and reduce the plot size so that the plot area where the ink is, is only as big as it needs to be. We do this using the options function, setting the plot-size options provided by the repr R package. In past worksheets and tutorials we did this for you; from now on you need to take charge and choose an appropriate plot size!

The options function takes 2 arguments for plot size, repr.plot.width and repr.plot.height. Uncomment the first line of code in the code cell below and fill in the ... to choose a more appropriate size, then replot the figure.

#options(repr.plot.width = ..., repr.plot.height = ...)
world_vacc_plot
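As a rough illustration of how these two options are set, here is a small sketch using the built-in mtcars data set. The 6 by 4 size is only an example value (the repr package measures these in inches), not a recommended answer, and the object name example_plot is made up for this sketch. The options only change the displayed size when the plot is rendered in a Jupyter notebook.

```r
library(ggplot2)

# Width and height are in inches; 6 x 4 is only an illustrative choice.
options(repr.plot.width = 6, repr.plot.height = 4)

# A small example plot built from the built-in mtcars data set.
example_plot <- ggplot(mtcars, aes(x = wt, y = mpg)) +
    geom_point() +
    xlab("Weight (1000 lbs)") +
    ylab("Miles per gallon")
example_plot
```

Once set, the options apply to every plot drawn afterwards, so they need to be set again whenever a later plot needs a different size.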

Question 1.3 Multiple Choice

Now that we see how the percentage of people vaccinated with each of these vaccines varies over time, we should look at whether the percentage vaccinated differs between the two diseases. What should we do next to compare the differences (if they exist) most effectively?

A. Filter the data by the type of vaccine and make two separate plots

B. Colour the data by the type of vaccine

C. Have a different shaped "dot"/point for each type of vaccine

D. Colour the data by the type of vaccine, and have a different shaped "dot"/point for each type of vaccine

Assign your answer to an object called answer1.3.

# Assign your answer to an object called: answer1.3
# Make sure the correct answer is an uppercase letter.
# Surround your answer with quotation marks.
# Replace the fail() with your answer.

# your code here
fail() # No Answer - remove if you provide an answer

test_that('Solution is incorrect', {
    expect_equal(digest(answer1.3), 'c1f86f7430df7ddb256980ea6a3b57a4') # we hid the answer to the test here so you can't see it, but we can still run the test
})
print("Success!")

Question 1.5

Now that we know how we will separate the data for our visualization, let's do it. Copy your code from Question 1.2 and add an aes function inside of the geom_point function to map vaccine to colour and shape. So your geom_point function and layer should look something like this:

geom_point(aes(colour = ..., shape = ...)) +

Assign your answer to an object called compare_vacc_plot.
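To see the colour and shape mapping in a complete plot, here is a minimal sketch using the built-in mtcars data set instead of the vaccination data. The object name cyl_plot and the axis labels are made up for this example.

```r
library(ggplot2)

# cyl is numeric in mtcars, so convert it to a factor to get
# discrete colours and shapes (one per number of cylinders).
cyl_plot <- ggplot(mtcars, aes(x = wt, y = mpg)) +
    geom_point(aes(colour = factor(cyl), shape = factor(cyl))) +
    xlab("Weight (1000 lbs)") +
    ylab("Miles per gallon")
cyl_plot
```

Because colour and shape are mapped to the same variable, ggplot2 merges them into a single legend.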

# your code here
fail() # No Answer - remove if you provide an answer
compare_vacc_plot

test_that('Solution is incorrect', {
    expect_equal(digest(rlang::get_expr(compare_vacc_plot$mapping$x)), '707549965697b4525f5ab2b94106b632')
    expect_equal(digest(rlang::get_expr(compare_vacc_plot$mapping$y)), 'da16a50a39585b5f38e60840bb096ae2')
    expect_equal(digest(rlang::get_expr(compare_vacc_plot$layers[[1]]$mapping)), 'b2d92f9e48bcf264f42ccd00f3202e95')
    expect_that('GeomPoint' %in% class(rlang::get_expr(compare_vacc_plot$layers[[1]]$geom)), is_true())
})
print("Success!")

We now see that, although the percentage vaccinated first rose above 0 in different years for each vaccine type, both increased at similar rates and currently sit at about the same percentage vaccinated. There is some variation that still exists in the data, however, and perhaps that could be attributed to region? Let's create some more visualizations to see if that is indeed the case!

To get started, let's focus on the Polio vaccine data, and then we'll look at both together.

Question 1.6

Create a data frame object named polio that contains only the rows where the vaccine is "polio":

# your code here
fail() # No Answer - remove if you provide an answer
head(polio)

test_that('Solution is incorrect', {
    expect_equal(digest(as.numeric(sum(polio$pct_vaccinated))), 'cfd7ed9e50ed446d50289ff89ef338a4') # we hid the answer to the test here so you can't see it, but we can still run the test
})
print("Success!")

Question 1.7

Now create a scatter plot using the polio data where percentage vaccinated is on the y-axis, year is on the x-axis and each group has a different coloured point, and a different shape. Name it polio_regions. You might want to use options to change the plot size if the size of the last few plots isn't ideal for this plot.

#polio_regions <- ggplot(polio, aes(x = year, y = pct_vaccinated)) +
#    geom_point(aes(colour = who_region, shape = who_region)) +
#    xlab('Year') +
#    ylab('Percentage Vaccinated')

# your code here
fail() # No Answer - remove if you provide an answer
polio_regions

test_that('Solution is incorrect', {
    expect_equal(digest(rlang::get_expr(polio_regions$mapping$x)), '707549965697b4525f5ab2b94106b632')
    expect_equal(digest(rlang::get_expr(polio_regions$mapping$y)), 'da16a50a39585b5f38e60840bb096ae2')
    expect_equal(digest(rlang::get_expr(polio_regions$layers[[1]]$mapping)), '48b131db0961401bf69e069def79a8b9')
    expect_that('GeomPoint' %in% class(rlang::get_expr(polio_regions$layers[[1]]$geom)), is_true())
})
print("Success!")

Question 1.8

Although when we have multiple groups, it's easier for us to see the differences when we change point colour and shape, at some point there are too many groups to keep things straight. We are approaching that on the plot above, and so we need to do something different... One thing we could try is to change the point to a line to reduce the noise/chaos of the plot above. We would also not have a shape. Do that in the cell below and name the plot object polio_regions_line.
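For a sense of what a line plot with one coloured line per group looks like, here is a minimal sketch using the built-in Orange data set (growth of five orange trees) rather than the polio data. The object name orange_lines is made up for this example.

```r
library(ggplot2)

# Orange has one row per tree per measurement date.
# Mapping colour to Tree draws one line per tree; no shape is needed.
orange_lines <- ggplot(Orange, aes(x = age, y = circumference)) +
    geom_line(aes(colour = Tree)) +
    xlab("Age (days)") +
    ylab("Trunk circumference (mm)")
orange_lines
```

Mapping colour to a categorical variable also tells geom_line which observations belong together, so each group gets its own connected line.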

#polio_regions <- ggplot(polio, aes(x = year, y = pct_vaccinated)) +
#    geom_point(aes(colour = who_region, shape = who_region)) +
#    xlab('Year') +
#    ylab('Percentage Vaccinated')

# your code here
fail() # No Answer - remove if you provide an answer
polio_regions_line

test_that('Solution is incorrect', {
    expect_equal(digest(rlang::get_expr(polio_regions_line$mapping$x)), '707549965697b4525f5ab2b94106b632')
    expect_equal(digest(rlang::get_expr(polio_regions_line$mapping$y)), 'da16a50a39585b5f38e60840bb096ae2')
    expect_equal(digest(rlang::get_expr(polio_regions_line$layers[[1]]$mapping)), '0f44763313cf0a1aa041b6f8a0cb53c1')
    expect_that('GeomLine' %in% class(rlang::get_expr(polio_regions_line$layers[[1]]$geom)), is_true())
})
print("Success!")

One thing that is still not ideal with the visualization above is the legend title, which is not very readable. Let's add another layer to polio_regions_line to fix that. To do this we use the labs function and choose the aesthetic mapping (here color) that we want to apply the legend title to. Also, given that we created an object from our previous plot, we do not need to retype all our code, but instead can just say:

PLOT_OBJECT <- PLOT_OBJECT + NEW_LAYER
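Here is a small sketch of that pattern on the built-in Orange data set (not the worksheet data). The object name orange_lines and the legend title "Tree ID" are made up for this example.

```r
library(ggplot2)

# Build a line plot first and store it in an object.
orange_lines <- ggplot(Orange, aes(x = age, y = circumference)) +
    geom_line(aes(colour = Tree))

# Add a new layer to the stored plot: labs() renames the legend that
# was created by the colour aesthetic.
orange_lines <- orange_lines +
    labs(colour = "Tree ID")
orange_lines
```

The argument name inside labs has to match the aesthetic that produced the legend, which is why colour is used here.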
#polio_regions_line <- polio_regions_line +
#    labs(... = "Region of the world")

# your code here
fail() # No Answer - remove if you provide an answer
polio_regions_line

test_that('Solution is incorrect', {
    expect_equal(digest(rlang::get_expr(polio_regions_line$mapping$x)), '707549965697b4525f5ab2b94106b632')
    expect_equal(digest(rlang::get_expr(polio_regions_line$mapping$y)), 'da16a50a39585b5f38e60840bb096ae2')
    expect_equal(digest(rlang::get_expr(polio_regions_line$layers[[1]]$mapping)), '0f44763313cf0a1aa041b6f8a0cb53c1')
    expect_that('GeomLine' %in% class(rlang::get_expr(polio_regions_line$layers[[1]]$geom)), is_true())
    expect_equal(digest(polio_regions_line$labels$colour), 'f656da07b6f9d2efec6e3334865f25ab')
})
print("Success!")

Question 1.9

Now that we know how to effectively plot the percentage vaccinated against Polio over time for each region, how might we compare this to what we see for each region for the percentage vaccinated against Hepatitis B? In this case we would like two side-by-side or two vertically arranged plots. If the data are in the same data frame (as ours were in the world_vaccination data frame) then we can use a technique called facetting to do this. We saw facetting last week, and now we will take some time to learn it.

There are two facetting functions in ggplot2, facet_wrap and facet_grid; the one we will use here is facet_grid. The basic syntax for this ggplot layer is the following:

# creates side by side plots for each member of the category in COLUMN_X
facet_grid(~ COLUMN_X)

or

# creates vertically arranged plots for each member of the category in COLUMN_X
facet_grid(COLUMN_X ~ .)
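To make the two forms concrete, here is a minimal sketch using the mpg data set that ships with ggplot2 (not the worksheet data). The object names base_plot, side_by_side and stacked are made up for this example, and drv (drive train type) plays the role of COLUMN_X.

```r
library(ggplot2)

# A plain scatter plot stored in an object so it can be reused.
base_plot <- ggplot(mpg, aes(x = displ, y = hwy)) +
    geom_point() +
    xlab("Engine displacement (litres)") +
    ylab("Highway miles per gallon")

# One panel per value of drv, arranged in columns (side by side).
side_by_side <- base_plot + facet_grid(~ drv)

# One panel per value of drv, arranged in rows (vertically).
stacked <- base_plot + facet_grid(drv ~ .)

side_by_side
stacked
```

In the formula, the variable on the right of ~ defines columns and the variable on the left defines rows; the dot means that no variable is used for that dimension.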

Create a plot like the one named polio_regions_line but instead of using the polio data frame, use the world_vaccination data frame, and facet on the column vaccine so that the two plots are side-by-side. Name this plot object side_by_side_world. Make sure you set options to make the plot size fit well within the worksheet.

#... <- ... +
#    geom_...(...) +
#    labs(...) +
#    xlab(...) +
#    ylab(...) +
#    facet_grid(...)

# your code here
fail() # No Answer - remove if you provide an answer
side_by_side_world

test_that('Solution is incorrect', {
    expect_equal(digest(rlang::get_expr(side_by_side_world$mapping$x)), '707549965697b4525f5ab2b94106b632')
    expect_equal(digest(rlang::get_expr(side_by_side_world$mapping$y)), 'da16a50a39585b5f38e60840bb096ae2')
    expect_equal(digest(rlang::get_expr(side_by_side_world$layers[[1]]$mapping)), '866b1a2ea42bbb1b66225f4245e2bd94')
    expect_that('GeomLine' %in% class(rlang::get_expr(side_by_side_world$layers[[1]]$geom)), is_true())
    expect_equal(digest(side_by_side_world$labels$colour), 'f656da07b6f9d2efec6e3334865f25ab')
    expect_that('FacetGrid' %in% class(rlang::get_expr(side_by_side_world$facet)), is_true())
})
print("Success!")

Question 1.9.1

Now use facet_grid to arrange the same two plots vertically. Name this plot vertical_world. Again, make sure you set options to create a suitable plot size.

# your code here
fail() # No Answer - remove if you provide an answer
vertical_world

test_that('Solution is incorrect', {
    expect_equal(digest(rlang::get_expr(vertical_world$mapping$x)), '707549965697b4525f5ab2b94106b632')
    expect_equal(digest(rlang::get_expr(vertical_world$mapping$y)), 'da16a50a39585b5f38e60840bb096ae2')
    expect_equal(digest(rlang::get_expr(vertical_world$layers[[1]]$mapping)), '866b1a2ea42bbb1b66225f4245e2bd94')
    expect_that('GeomLine' %in% class(rlang::get_expr(vertical_world$layers[[1]]$geom)), is_true())
    expect_equal(digest(vertical_world$labels$colour), 'f656da07b6f9d2efec6e3334865f25ab')
    expect_that('FacetGrid' %in% class(rlang::get_expr(vertical_world$facet)), is_true())
})
print("Success!")

Which arrangement is better? Depends on what you are asking! If you are interested in comparing the rate at which things changed over time, then the vertical arrangement is more effective. However, if you are interested in comparing the exact percentage values between the lines at certain points then the side-by-side arrangement is more effective.

Question 1.9.2 Multiple Choice

Which WHO region had the greatest progress in the shortest period of time for either Hepatitis B or Polio (using the data we plotted above)?

A. Americas

B. Eastern Mediterranean

C. Europe

D. Western Pacific

Assign your answer to an object called answer1.9.2.

# Assign your answer to an object called: answer1.9.2
# Make sure the correct answer is an uppercase letter.
# Surround your answer with quotation marks.
# Replace the fail() with your answer.

# your code here
fail() # No Answer - remove if you provide an answer

test_that('Solution is incorrect', {
    expect_equal(digest(answer1.9.2), 'c1f86f7430df7ddb256980ea6a3b57a4') # we hid the answer to the test here so you can't see it, but we can still run the test
})
print("Success!")

2. Fast-Food Chains in the United States

With their cheap meals and convenient drive-thrus, fast food restaurants are in growing demand in many countries. Despite their questionable ingredients and nutritional value, most Americans count on fast food in their daily lives (it is often delicious and so hard to resist...).

Source: https://media.giphy.com/media/NS6SKs3Lt8cPHhe0es/giphy.gif

According to Wikipedia,

Fast food was originally created as a commercial strategy to accommodate the larger numbers of busy commuters, travelers and wage workers who often didn't have the time to sit down at a public house or diner and wait the normal way for their food to be cooked. By making speed of service the priority, this ensured that customers with strictly limited time (a commuter stopping to procure dinner to bring home to their family, for example, or an hourly laborer on a short lunch break) were not inconvenienced by waiting for their food to be cooked on-the-spot (as is expected from a traditional "sit down" restaurant). For those with no time to spare, fast food became a multi-billion dollar industry.

Currently, fast food is the norm and lots of businesses are investing in advertisement as well as new ideas to make their chain stand out in the sea of restaurants. In fact, one business is hiring you. They want to know the layout of the landscape:

  1. Which is the most popular franchise on the West-Coast?

  2. Which state has the highest number of fast-food restaurants?

  3. Is the most dominant franchise consistent across the West-Coast?

In this assignment, you will pretend to assist in the opening of a new restaurant somewhere on the West Coast of the United States (California, Oregon, or Washington). Your goal is to figure out which chain to recommend and figure out which state would be the least competitive.

Question 2.1

From the list below, what are you not trying to determine:

A. The West-Coast franchise with the greatest popularity

B. Is the least dominant franchise consistent across the West-Coast

C. The state on the West-Coast with the greatest number of fast-food restaurants

Assign your answer to an object called answer2.1.

# Assign your answer to an object called: answer2.1
# Make sure the correct answer is an uppercase letter.
# Surround your answer with quotation marks.
# Replace the fail() with your answer.

# your code here
fail() # No Answer - remove if you provide an answer

test_that('Solution is incorrect', {
    expect_equal(digest(answer2.1), '3a5505c06543876fe45598b5e5e5195d') # we hid the answer to the test here so you can't see it, but we can still run the test
})
print("Success!")

Question 2.2

Read the fast_food.csv file (found in the worksheet_04 directory) and assign it to an object called fast_food.

# your code here
fail() # No Answer - remove if you provide an answer
head(fast_food)

test_that('Solution is incorrect', {
    expect_equal(digest(as.character(fast_food[[3,1]])), '2e716500dfeb89b1b087089a5b1355f8') # we hid the answer to the test here so you can't see it, but we can still run the test
    expect_equal(digest(as.character(fast_food[[4,2]])), 'd599245d7d7e3f56863ba3a6112ca71b')
})
print("Success!")

Question 2.3

Next, find the top 9 restaurants on the West Coast (in the states "CA", "WA" or "OR") and name them top_restaurants.

Fill in the ... in the cell below. Copy and paste your finished answer into the fail().

Assign your answer to an object called top_restaurants.
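As a general illustration of counting rows per group and keeping the largest groups, here is a sketch on the mpg data set from ggplot2 (not the fast food data). It shows one possible sequence of dplyr verbs; the worksheet's blanks may be intended to be filled differently, and the object name top_manufacturers is made up for this example.

```r
library(dplyr)
library(ggplot2) # loaded only for the mpg example data set

# Count the rows for each manufacturer, sort the counts from largest
# to smallest, and keep the first 3 rows.
top_manufacturers <- mpg %>%
    group_by(manufacturer) %>%
    summarize(n = n()) %>%
    arrange(desc(n)) %>%
    head(3)

top_manufacturers
```

The result has one row per group and two columns: the grouping column and the count n.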

#top_restaurants <- fast_food %>%
#    filter(state %in% c("CA", "WA", "OR")) %>%
#    group_by(...) %>%
#    ...(n = n()) %>%
#    ...(...) %>%
#    ...(...)

# your code here
fail() # No Answer - remove if you provide an answer
top_restaurants

test_that('Solution is incorrect', {
    expect_equal(ncol(top_restaurants), 2)
    expect_equal(nrow(top_restaurants), 9)
    expect_equal(digest(sum(as.numeric(top_restaurants$n, na.rm = TRUE))), '4745bd90664c8cb9935cfb1a4cf51d77')
})
print("Success!")

Question 2.4

Plot the counts for the top 9 fast food restaurants on the West Coast as a bar chart using geom_bar. The number of restaurants should be on the y-axis and the restaurant names should be on the x-axis. Because we are not counting up the number of rows in our data frame, but instead are plotting the actual values in the n column, we need to use the stat = "identity" argument inside geom_bar.
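To show what stat = "identity" does, here is a minimal sketch with a tiny hand-made data frame. The data frame fruit_counts and its values are hypothetical and exist only for this illustration.

```r
library(ggplot2)

# Hypothetical, hand-made counts: the n column already holds the
# bar heights, so nothing needs to be counted.
fruit_counts <- data.frame(
    fruit = c("apple", "banana", "cherry"),
    n = c(12, 7, 3)
)

# stat = "identity" tells geom_bar to use the y values as given
# instead of counting rows.
fruit_bar <- ggplot(fruit_counts, aes(x = fruit, y = n)) +
    geom_bar(stat = "identity") +
    xlab("Fruit") +
    ylab("Count")
fruit_bar
```

Without stat = "identity", geom_bar counts rows for each x value and does not accept a y aesthetic.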

To do this fill in the ... in the cell below. Copy and paste your finished answer into the fail(). Make sure to label your axes and choose an appropriate figure size using options.

Assign your answer to an object called count_bar_chart.

#... <- ...(..., aes(x = ..., y = ...)) +
#    geom_bar(...) +
#    xlab(...) +
#    ylab(...)

# your code here
fail() # No Answer - remove if you provide an answer
count_bar_chart

test_that('Solution is incorrect', {
    expect_equal(digest(rlang::get_expr(count_bar_chart$mapping$x)), 'b009e2c5c340ac154e889c878f9d27de')
    expect_equal(digest(rlang::get_expr(count_bar_chart$mapping$y)), 'd731df15d5f885e2ddffa01f617515bd')
    expect_equal(digest(rlang::get_expr(count_bar_chart$layers[[1]]$mapping)), 'f9e884084b84794d762a535f3facec85')
    expect_that('GeomBar' %in% class(rlang::get_expr(count_bar_chart$layers[[1]]$geom)), is_true())
})
print("Success!")

Question 2.5

The x-axis labels don't look great unless you make the bars on the bar plot above quite wide, wider than is actually useful or effective. What can we do? Well, we can add a theme layer and rotate the labels! Choose an angle that you think is appropriate, somewhere between 20 and 90 for the angle argument. Use the hjust = 1 argument to ensure your labels don't sit on top of the bars as you rotate them (try removing that argument and see what happens...)
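Here is a small sketch of rotating axis labels with a theme layer. The data frame long_labels and its values are hypothetical, and the 45 degree angle is just one choice within the suggested range.

```r
library(ggplot2)

# Hypothetical data with long category names that would overlap
# if printed horizontally under narrow bars.
long_labels <- data.frame(
    category = c("A fairly long label", "Another long label", "Yet another long label"),
    n = c(5, 9, 4)
)

rotated_bar <- ggplot(long_labels, aes(x = category, y = n)) +
    geom_bar(stat = "identity") +
    xlab("Category") +
    ylab("Count") +
    # Rotate the x-axis text by 45 degrees; hjust = 1 right-aligns each
    # label so that it ends at its tick mark.
    theme(axis.text.x = element_text(angle = 45, hjust = 1))
rotated_bar
```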

#count_bar_chart <- ... +
#    theme(axis.text.x = element_text(angle = ..., hjust = 1))

# your code here
fail() # No Answer - remove if you provide an answer
count_bar_chart

test_that('Solution is incorrect', {
    expect_equal(digest(rlang::get_expr(count_bar_chart$mapping$x)), 'b009e2c5c340ac154e889c878f9d27de')
    expect_equal(digest(rlang::get_expr(count_bar_chart$mapping$y)), 'd731df15d5f885e2ddffa01f617515bd')
    expect_equal(digest(rlang::get_expr(count_bar_chart$layers[[1]]$mapping)), 'f9e884084b84794d762a535f3facec85')
    expect_that('GeomBar' %in% class(rlang::get_expr(count_bar_chart$layers[[1]]$geom)), is_true())
    expect_equal(digest(as.numeric(count_bar_chart$theme$axis.text.x$hjust)), '6717f2823d3202449301145073ab8719')
    expect_that(count_bar_chart$theme$axis.text.x$angle <= 90 & count_bar_chart$theme$axis.text.x$angle >= 20, is_true())
})
print("Success!")

Question 2.6

Which is the most popular franchise on the West-Coast? Save your answer as answer2.6 and be sure to surround the restaurant name with quotations. Pay attention to case and punctuation when answering.

# your code here
fail() # No Answer - remove if you provide an answer
answer2.6

test_that('Solution is incorrect', {
    expect_equal(digest(answer2.6), '948a9b527842ee791d4f18fb5594fbf7')
})
print("Success!")

Question 2.7

Next, let's answer the question: which state on the West-Coast has the greatest number of fast-food restaurants? To do this we need to use the names in top_restaurants to get the counts of each restaurant in each of the 3 states from the fast_food data frame. You will need to use semi_join to do this (it keeps only the rows of the first data frame that have a match in the second). Name this data frame state_counts. Fill in the ... in the cell below. Copy and paste your finished answer into the fail().

If you are interested in learning more about joining data frames in R, see this cheatsheet.
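As a quick illustration of what semi_join returns, here is a sketch with two tiny hand-made data frames. The data frames orders and shops_of_interest, and all of their values, are hypothetical and unrelated to the fast food data.

```r
library(dplyr)

# Hypothetical data: one row per order, with the shop that took it.
orders <- tibble(
    shop = c("A", "A", "B", "C", "C", "C"),
    city = c("X", "Y", "X", "X", "Y", "Y")
)

# Hypothetical lookup table of the shops we care about.
shops_of_interest <- tibble(
    shop = c("A", "C"),
    n = c(2, 3)
)

# Keep only the rows of orders whose shop appears in shops_of_interest.
# No columns from shops_of_interest are added to the result.
kept_orders <- orders %>%
    semi_join(shops_of_interest, by = "shop")

kept_orders
```

If by is left out, semi_join matches on all columns that the two data frames have in common and prints a message saying which ones it used.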

#... <- fast_food %>%
#    semi_join(top_restaurants) %>% # semi_join gives the intersection of two data frames
#    filter(state %in% c("CA", "WA", "OR")) %>%
#    ...(...) %>%
#    ...(n = n())

# your code here
fail() # No Answer - remove if you provide an answer
head(state_counts)

test_that('Solution is incorrect', {
    expect_equal(ncol(state_counts), 2)
    expect_equal(nrow(state_counts), 3)
    expect_equal(digest(sum(as.numeric(state_counts$n, na.rm = TRUE))), '4745bd90664c8cb9935cfb1a4cf51d77')
})
print("Success!")

Question 2.8

Now, create a bar plot that has restaurant count on the y-axis and US state on the x-axis. Name the plot state_counts_plot. Remember to choose an appropriate plot size using options.

# your code here
fail() # No Answer - remove if you provide an answer
state_counts_plot

test_that('Solution is incorrect', {
    expect_equal(digest(rlang::get_expr(state_counts_plot$mapping$x)), 'a47fba8c92b6a5ffd5f9caf0907b9977')
    expect_equal(digest(rlang::get_expr(state_counts_plot$mapping$y)), 'd731df15d5f885e2ddffa01f617515bd')
    expect_equal(digest(rlang::get_expr(state_counts_plot$layers[[1]]$mapping)), 'f9e884084b84794d762a535f3facec85')
    expect_that('GeomBar' %in% class(rlang::get_expr(state_counts_plot$layers[[1]]$geom)), is_true())
})
print("Success!")

Question 2.9

Which state has the highest number of fast-food restaurants? Save your answer as answer2.9 and be sure to surround the state name with quotations. Pay attention to case and punctuation when answering.

# your code here
fail() # No Answer - remove if you provide an answer

test_that('Solution is incorrect', {
    expect_equal(digest(answer2.9), '2bedd54d48692762c119b27f5ec7a320')
})
print("Success!")

Consider the populations of California (39.54 million), Oregon (4.143 million) and Washington (7.406 million) (source: United States Census Bureau). Discuss with your neighbour whether using the raw restaurant count for each state is the best measure of competition. Would restaurants per capita be a better alternative?

Question 2.9.1

Is the most dominant/top franchise consistent across the West-Coast? To answer this question we need a data frame that has three columns: name (restaurant), state and n (restaurant count). We will need to use the semi-join strategy as we did above to use the names in top_restaurants to get the counts of each restaurant in each of the 3 states from the fast_food data frame. This time, however, we'll need to group_by both name and state. Name this new data frame top_n_state.

#... <- fast_food %>%
#    semi_join(top_restaurants) %>% # semi_join gives the intersection of two data frames
#    filter(state %in% c("CA", "WA", "OR")) %>%
#    ...(..., ...) %>%
#    ...(...)

# your code here
fail() # No Answer - remove if you provide an answer
head(top_n_state)

test_that('Solution is incorrect', {
    expect_equal(ncol(top_n_state), 3)
    expect_equal(nrow(top_n_state), 27)
    expect_equal(digest(sum(as.numeric(top_n_state$n, na.rm = TRUE))), '4745bd90664c8cb9935cfb1a4cf51d77')
})
print("Success!")

Question 2.9.2

Plot the counts (y-axis) for the top 9 fast food restaurants on the West Coast, per US state (x-axis), as a bar chart using geom_bar. Use fill = name inside aes to colour the bars by restaurant name. Use position = "dodge" inside geom_bar to place the bars for each state side by side. To rename the legend, use a labs layer. This time, within labs use the fill argument instead of color (this is because you need to modify the aesthetic that the legend was made from; here it was fill, not color as earlier in the worksheet).

To do this fill in the ... in the cell below. Copy and paste your finished answer into the fail(). Make sure to label your axes and choose an appropriate plot size.

Assign your answer to an object called top_n_state_plot.

#... <- ggplot(..., aes(x = state, y = n, fill = ...)) +
#    ...(stat = ..., position = "...") +
#    xlab(...) +
#    ylab(...) +
#    labs(fill = "Restaurant")

# your code here
fail() # No Answer - remove if you provide an answer
top_n_state_plot

test_that('Solution is incorrect', {
    expect_equal(digest(rlang::get_expr(top_n_state_plot$mapping$x)), 'a47fba8c92b6a5ffd5f9caf0907b9977')
    expect_equal(digest(rlang::get_expr(top_n_state_plot$mapping$y)), 'd731df15d5f885e2ddffa01f617515bd')
    expect_equal(digest(rlang::get_expr(top_n_state_plot$layers[[1]]$mapping)), 'f9e884084b84794d762a535f3facec85')
    expect_that('GeomBar' %in% class(rlang::get_expr(top_n_state_plot$layers[[1]]$geom)), is_true())
    expect_equal(digest(top_n_state_plot$labels$fill), '8b8460e5db816b64e2aefd1c37d67c23')
})
print("Success!")

How easy is it to compare the restaurants and states to answer our question: Is the most dominant/top franchise consistent across the West-Coast? If we look carefully at this plot we can answer this question, but it takes us a while to process. If we instead visualize this as a stacked bar chart using proportions instead of counts, we might be able to do this more easily (making it a more effective visualization).

Question 2.9.3

Copy your code from Question 2.9.2 and modify position = "dodge" to position = "fill" to change from doing a grouped bar chart to a stacked bar chart with the data represented as proportions instead of counts.
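To compare the two position settings side by side, here is a sketch on the mpg data set from ggplot2 (not the fast food data). The object names class_drv, grouped_bars and proportion_bars are made up for this example.

```r
library(dplyr)
library(ggplot2)

# Count the cars for every combination of vehicle class and drive train.
class_drv <- mpg %>%
    group_by(class, drv) %>%
    summarize(n = n(), .groups = "drop")

# position = "dodge": bars for each drive train sit side by side
# within each class, and bar height is the raw count.
grouped_bars <- ggplot(class_drv, aes(x = class, y = n, fill = drv)) +
    geom_bar(stat = "identity", position = "dodge") +
    xlab("Vehicle class") +
    ylab("Count") +
    labs(fill = "Drive train")

# position = "fill": bars are stacked and rescaled so that every
# class sums to 1, which shows proportions instead of counts.
proportion_bars <- ggplot(class_drv, aes(x = class, y = n, fill = drv)) +
    geom_bar(stat = "identity", position = "fill") +
    xlab("Vehicle class") +
    ylab("Proportion") +
    labs(fill = "Drive train")

grouped_bars
proportion_bars
```

Only the position argument and the y-axis label differ between the two plots; the data and aesthetic mappings are the same.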

# your code here
fail() # No Answer - remove if you provide an answer
top_n_state_plot

test_that('Solution is incorrect', {
    expect_equal(digest(rlang::get_expr(top_n_state_plot$mapping$x)), 'a47fba8c92b6a5ffd5f9caf0907b9977')
    expect_equal(digest(rlang::get_expr(top_n_state_plot$mapping$y)), 'd731df15d5f885e2ddffa01f617515bd')
    expect_equal(digest(rlang::get_expr(top_n_state_plot$layers[[1]]$mapping)), 'f9e884084b84794d762a535f3facec85')
    expect_that('GeomBar' %in% class(rlang::get_expr(top_n_state_plot$layers[[1]]$geom)), is_true())
    expect_equal(digest(top_n_state_plot$labels$fill), '8b8460e5db816b64e2aefd1c37d67c23')
})
print("Success!")

Question 2.9.4

Is the most dominant franchise consistent across the West-Coast? Answer "yes" or "no". Save your answer as answer2.9.4 and be sure to surround the answer with quotations. Pay attention to case and punctuation when answering.

# Assign your answer to an object called: answer2.9.4
# Make sure the correct answer is written in lower-case (yes / no)
# Surround your answer with quotation marks.
# Replace the fail() with your answer.

# your code here
fail() # No Answer - remove if you provide an answer

test_that('Solution is incorrect', {
    expect_equal(digest(answer2.9.4), '0590b0427c1b19a6eb612d19888aa52f') # we hid the answer to the test here so you can't see it, but we can still run the test
})
print("Success!")

We are just scratching the surface of how to create effective visualizations in R. For example, we haven't covered how to change from the default colour palette ggplot2 provides. We'll learn more in the tutorial; however, it's a big world out there, and to learn more, read the two chapters we pointed to in the reading and practice, practice, practice! Go forth and make beautiful and effective plots!