GitHub Repository: UBC-DSCI/dsci-100-assets
Path: blob/master/2019-fall/materials/tutorial_04/tutorial_04.ipynb

Kernel: R

Tutorial 4: Effective Data Visualization

Any place you see ..., you must fill in the function, variable, or data to complete the code. Replace fail() with your completed code and run the cell!

In [ ]:

### Run this cell before continuing. 

library(tidyverse)
library(repr)
source("tests_tutorial_04.R")

Question 0.1
{points: 1}

Match the following definitions with the corresponding aesthetic mapping or function used in R:

Definitions

A. Prevents a chart from being stacked. It preserves the vertical position of a plot while adjusting the horizontal position.

B. In bar charts, this aesthetic fills in the bars by a specific colour or separates the counts by a variable different from the x-axis.

C. In bar charts, it outlines the bars but in scatterplots, it fills in the points (colouring them based on a particular variable aside from the x/y-axis).

D. By default, a bar chart makes the height of each bar equal to the number of cases in each group, which is incompatible with mapping values to the y aesthetic. This stat instead allows the y-axis to represent particular values from the data rather than just counts.

E. This aesthetic allows further visualization of data by varying data points by shape (modifying their shape based on a particular variable aside from the x/y-axis).

F. Labels the y-axis.

Aesthetics and Functions

colour
dodge
fill
identity
ylab
shape

For every definition, create an object named with the letter of that definition and assign to it the number of the matching item in the list above (numbered 1 to 6 from top to bottom). For example: B <- 1

In [ ]:

# Assign your answer to a letter: A, B, C, D, E, F
# Make sure each answer is a number from 1 to 6
# Replace the fail() with your answer. 
# for example, we could match B with 1 by doing
# B <- 1

# your code here
fail() # No Answer - remove if you provide an answer

In [ ]:

test_0.1()

Question 0.2 True or False:
{points: 1}

We should save a plot as an .svg file if we want to be able to rescale it without losing quality.

Assign your answer (either "true" or "false") to a variable named answer0.2.

In [ ]:

# Assign your answer to an object called: answer0.2
# Make sure the correct answer is written in lower-case (true / false)
# Surround your answer with quotation marks.
# Replace the fail() with your answer.
 
# your code here
fail() # No Answer - remove if you provide an answer

In [ ]:

test_0.2()

1. Data on Personal Medical Costs

As we saw in the worksheet, data scientists work in all types of organizations and on all kinds of problems. One such type of organization is private-sector companies that work with health data. Today we will be looking at data on personal medical costs. Many factors affect health and, consequently, medical costs. Our goal for today is to determine how the variables in the data are related to the medical costs billed by health insurance companies.

To analyze this, we will be looking at a dataset that includes the following columns:

age: age of primary beneficiary
sex: insurance contractor gender: female, male
bmi: body mass index, an objective index of body weight relative to height (kg/$m^{2}$), computed as weight divided by height squared; it indicates whether a weight is relatively high or low for a given height, ideally 18.5 to 24.9
children: number of children covered by health insurance / number of dependents
smoker: whether the beneficiary smokes
region: the beneficiary's residential area in the US: northeast, southeast, southwest, northwest.
charges: individual medical costs billed by health insurance

This dataset was taken from the collection of data sets created and curated for the book Machine Learning with R by Brett Lantz.
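To make the units of the bmi column concrete, here is a minimal sketch of the standard BMI calculation (weight in kilograms divided by height in metres squared). The weight and height values are hypothetical and are not taken from the insurance data.

```r
# Hypothetical example values (not from the insurance data)
weight_kg <- 70
height_m  <- 1.75

# BMI = weight (kg) divided by height (m) squared
bmi <- weight_kg / height_m^2
round(bmi, 1)

# Check whether the value falls in the 18.5 to 24.9 range quoted above
bmi >= 18.5 & bmi <= 24.9
```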

Question 1.1 Yes or No:
{points: 1}

Based on the information given in the cell above, do you think the column charges includes quantitative, numerical data?

Assign your answer (either "yes" or "no") to an object called answer1.1.

In [ ]:

# Assign your answer to an object called: answer1.1
# Make sure the correct answer is written in lower-case (yes / no)
# Surround your answer with quotation marks.
# Replace the fail() with your answer. 

# your code here
fail() # No Answer - remove if you provide an answer

In [ ]:

test_1.1()

Question 1.2 Multiple Choice:
{points: 1}

Assuming overplotting is not an issue, which plot would be the most effective for visualizing the relationship between age and charges?

A. Scatterplot

B. Stacked Bar Plot

C. Bar Plot

D. Histogram

Assign your answer to an object called answer1.2.

In [ ]:

# Assign your answer to an object called: answer1.2
# Make sure the correct answer is an uppercase letter. 
# Surround your answer with quotation marks.
# Replace the fail() with your answer. 

# your code here
fail() # No Answer - remove if you provide an answer

In [ ]:

test_1.2()

Question 1.3
{points: 1}

Read the insurance.csv file in the data/ folder and look at the last 6 individuals presented.

Assign your answer to an object called insurance.
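As a reminder of the general pattern (not the answer to this question), the sketch below writes a small made-up table to a temporary file, reads it back with read_csv, and inspects the end of it with tail. The file, the object name toy, and the column names are hypothetical.

```r
library(readr)

# Hypothetical toy data written to a temporary file, standing in for a real CSV
toy_path <- tempfile(fileext = ".csv")
write_csv(data.frame(id = 1:10, score = seq(10, 100, by = 10)), toy_path)

# read_csv() returns a tibble; tail() shows the last 6 rows by default
toy <- read_csv(toy_path)
tail(toy)
```

Passing a second argument, for example tail(toy, 3), changes how many rows are shown.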

In [ ]:

# your code here
fail() # No Answer - remove if you provide an answer
tail(insurance)

In [ ]:

test_1.3()

Question 1.4
{points: 3}

Looking over the loaded data shown above, what observations can you make about medical charges and age? How about medical charges and BMI? Finally, what about medical charges and smoking?

Also, comment on whether our observations might change if we visualize the data, and on whether visualizing the data might let us observe relationships more easily than reading them directly from the data table.

Answer in the cell below.

YOUR ANSWER HERE

Question 1.5
{points: 1}

According to the National Heart, Lung and Blood Institute of the US: "The higher your BMI, the higher your risk for certain diseases such as heart disease, high blood pressure, type 2 diabetes, gallstones, breathing problems, and certain cancers".

Based on this information, we can hypothesize that individuals with a higher BMI are likely to have more medical costs. Let's use our data and see if this holds true. Create a scatter plot of charges (y-axis) versus bmi (x-axis).

In the scaffolding we provide below, we suggest that you set alpha to a value between 0.2 and 0.4. alpha sets the opacity of points on a scatter plot (lower values are more transparent), and increasing the transparency of points is one tool you can use to deal with overplotting.

Assign your answer to an object called bmi_plot. Make sure to label your axes appropriately.
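To see what alpha does before applying it to the insurance data, here is a small sketch on a different data set: diamonds, which ships with ggplot2 and is large enough to overplot heavily. The object name alpha_demo and the label text are illustrative choices.

```r
library(ggplot2)

# diamonds has roughly 54,000 rows, so points overlap heavily
alpha_demo <- ggplot(diamonds, aes(x = carat, y = price)) +
    geom_point(alpha = 0.2) +   # 0 = fully transparent, 1 = fully opaque
    xlab("Carat") +
    ylab("Price (US dollars)") +
    ggtitle("Diamond price versus carat")
alpha_demo
```

With translucent points, regions where many observations overlap appear darker, so dense areas stay distinguishable from sparse ones.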

In [ ]:

#options(repr.plot.width = ..., repr.plot.height = ...) # Remember to set your plot to an appropriate size

#bmi_plot <- insurance %>%
#    ggplot(aes(x = ..., y =  ...)) + 
#        geom_...(alpha = ...) + # Controls the transparency of the points; set it to an appropriate value
#        xlab(...) +
#        ylab(...) +
#        ggtitle(...)

# your code here
fail() # No Answer - remove if you provide an answer
bmi_plot

In [ ]:

test_1.5()

Question 1.6
{points: 3}

Analysis: Comment on the effectiveness of the plot. Take into consideration the rules of thumb discussed in lecture. Also comment on what could be improved for this plot and what is done correctly.

Answer in the cell below.

YOUR ANSWER HERE

Question 1.7
{points: 3}

Analysis: What do you observe? Do the data suggest that there might be evidence that BMI may affect the medical costs of individuals?

Answer in the cell below.

YOUR ANSWER HERE

Question 1.8
{points: 3}

Again, based on information from the National Heart, Lung and Blood Institute of the US, smoking cigarettes is said to be a risk factor for obesity. Create the same plot as you did in Question 1.5 but this time add the colour aesthetic to observe if smoking might affect the body mass of individuals. Also, use labs to format your legend title.

Assign your answer to an object called smoke_plot. Make sure to label your axes appropriately.
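As an illustration of the two ingredients mentioned above, the sketch below maps a third variable to the colour aesthetic and renames the legend with labs. It uses the mpg data set that ships with ggplot2 rather than the insurance data; the object name colour_demo and the label text are illustrative choices.

```r
library(ggplot2)

# drv (drive train) is a categorical variable, so each level gets its own colour
colour_demo <- ggplot(mpg, aes(x = displ, y = hwy, colour = drv)) +
    geom_point(alpha = 0.4) +
    xlab("Engine displacement (litres)") +
    ylab("Highway fuel economy (miles per gallon)") +
    labs(colour = "Drive train")   # legend title for the colour aesthetic
colour_demo
```

The argument name inside labs must match the aesthetic being labelled: labs(colour = ...) for a colour legend, labs(fill = ...) for a fill legend.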

In [ ]:

# your code here
fail() # No Answer - remove if you provide an answer
smoke_plot

In [ ]:

# Most of the tests for this question are hidden. You have to decide whether you've created a good visualization!
# Make sure you label all your axes and legend with human-readable labels
# You may want to pass alpha = 0.4 to the scatter geometric object to make the scatter points translucent
# (just for your own ease of visualization; you don't have to and we won't check that when grading)

# here's one test to at least ensure you named the plot object correctly:
expect_true(exists("smoke_plot"))

Question 1.9.0 (Analyzing the Graph) True or False:
{points: 1}

Smokers generally have a lower BMI than non-smokers.

Assign your answer to an object called answer1.9.0.

In [ ]:

# Assign your answer to an object called: answer1.9.0
# Make sure the correct answer is written in lower-case (true / false)
# Surround your answer with quotation marks.
# Replace the fail() with your answer. 

# your code here
fail() # No Answer - remove if you provide an answer

In [ ]:

test_1.9.0()

Question 1.9.1 (Analyzing the Graph) True or False:
{points: 1}

Smokers generally have higher medical charges than non-smokers.

Assign your answer to an object called answer1.9.1.

In [ ]:

# Assign your answer to an object called: answer1.9.1
# Make sure the correct answer is written in lower-case (true / false)
# Surround your answer with quotation marks.
# Replace the fail() with your answer. 

# your code here
fail() # No Answer - remove if you provide an answer

In [ ]:

test_1.9.1()

Question 1.10
{points: 1}

Lastly, create a bar graph that displays the proportion of females vs. males in the data set (you will need to put sex on the x-axis) to assess whether there is evidence that sex might influence medical costs. Use the fill aesthetic to differentiate between smokers and non-smokers within each bar.

Assign your answer to an object called bar_plot. Make sure to label your axes appropriately.

In [ ]:

#bar_plot <- insurance %>%
#    ggplot(aes(x = ..., fill = ...)) + 
#    ..._...(position = 'fill') + 
#    xlab(...) +
#    ylab(...) +
#    labs(fill = "Does the person smoke") +
#    ggtitle(...)


# your code here
fail() # No Answer - remove if you provide an answer
bar_plot

In [ ]:

test_1.10()

Question 1.11
{points: 1}

Based on the graph, are there more male smokers in the data set or more female smokers?

Assign your answer to an object called answer1.11.

In [ ]:

# Assign your answer to an object called: answer1.11
# Make sure the correct answer is written in lower-case (male / female)
# Surround your answer with quotation marks.
# Replace the fail() with your answer. 

# your code here
fail() # No Answer - remove if you provide an answer

In [ ]:

test_1.11()

2. Color Palettes (beyond the defaults)

{points: 3}

In the worksheet and this tutorial, you have seen the same colours again and again. These are from the default ggplot2 colour palette. What if you want different colours? We can do this! In R, one of the libraries that provides alternative colour palettes is the RColorBrewer library.

For this question:

Load the RColorBrewer library
Print the list of palettes available to you with the display.brewer.all() function
Choose one of the palettes and apply it to the plot whose code is given in the cell below.
- For the fill aesthetic with a categorical variable, the function is: scale_fill_brewer(palette = '...')
- For the fill aesthetic with a numeric variable, the function is: scale_fill_distiller(palette = '...')

You can look more in depth into the documentation of the scale_fill_* functions here: https://ggplot2.tidyverse.org/reference/scale_brewer.html
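The sketch below shows both functions on data sets that ship with ggplot2 (mpg and faithfuld) rather than on the plot for this question. The palette names "Set2" and "Spectral" are arbitrary picks from the RColorBrewer list, and the object names are illustrative.

```r
library(ggplot2)
library(RColorBrewer)

# Draw every available palette next to its name
display.brewer.all()

# Categorical fill: class is a discrete variable in mpg
brewer_demo <- ggplot(mpg, aes(x = drv, fill = class)) +
    geom_bar(position = "fill") +
    xlab("Drive train") +
    ylab("Proportion") +
    labs(fill = "Vehicle class") +
    scale_fill_brewer(palette = "Set2")
brewer_demo

# Numeric fill: density is a continuous variable in faithfuld
distiller_demo <- ggplot(faithfuld, aes(x = waiting, y = eruptions, fill = density)) +
    geom_raster() +
    scale_fill_distiller(palette = "Spectral")
distiller_demo
```

Qualitative palettes such as "Set2" have a limited number of colours, so it is worth checking that the palette you choose has at least as many colours as your variable has categories.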

In [ ]:

diamonds_plot <- diamonds %>%
    ggplot(aes(x = color, fill = clarity)) + 
    geom_bar(position = 'fill') +
    xlab('Diamond color') +
    ylab('Proportion') +
    labs(fill = "Diamond clarity") 

#Below, insert your colour palette choice via
#diamonds_plot <- diamonds_plot + 
#                     ...(...)


# your code here
fail() # No Answer - remove if you provide an answer
diamonds_plot

3. Fast-Food Chains in the United States (Continued)

{points: 6}

In Worksheet 04, we explored this data set through some visualizations. Now, it is all up to you. The goal of this assignment is to create one plot that can help you figure out which restaurant to open and where! After that, you need to write a paragraph explaining your visualization and why you chose it. Also, explain your conclusion from the visualization and the reasoning that led you to it. If you need to bring in outside information to help you answer your question, please feel free to do so. Finally, if there is some way that you could improve your visualization, but don't yet know how to do it, please explain what you would do if you knew how.

In answering this question, there is no need to restrict yourself to the west coast of the USA. Consider all states that you have data for. You have a variety of graphs to choose from, but before starting the assignment, discuss with a partner which plot would be the most effective for answering this question.

In [ ]:

# write the code for your plot here
# your code here
fail() # No Answer - remove if you provide an answer

Write a paragraph explaining your visualization and why you chose it. Also explain your conclusion from the visualization and reasoning as to how you came to that conclusion. If you need to bring in outside information to help you answer your question, please feel free to do so. Finally, if there is some way that you could improve your visualization, but don't yet know how to do it, please explain what you would do if you knew how.

YOUR ANSWER HERE
