GitHub Repository: UBC-DSCI/dsci-100-assets
Path: blob/master/2020-fall/materials/worksheet_12/worksheet_12.ipynb
²⁷¹⁵ views

Kernel: R

Worksheet 12 - The Bootstrap

We will cover 2 things today:

Lecture and Tutorial Learning Goals:

After completing this week's lecture and tutorial work, you will be able to:

Explain why we don't have a sampling distribution in practice/real life.
Define bootstrapping.
Use R to create a bootstrap distribution to approximate a sampling distribution.
Contrast bootstrap and sampling distributions.

In [ ]:

### Run this cell before continuing.
library(tidyverse)
library(repr)
library(digest)
library(infer)
library(gridExtra)
source('tests_worksheet_12.R')
source("cleanup_worksheet_12.R")

Question 1.1 True/False:
{points: 1}

In real life, we typically take many samples from the population and create a sampling distribution when we perform estimation. True or false?

Assign your answer to an object called answer1.1. Your answer should be in lowercase letters and is surrounded by quotes (e.g. "true" or "false").

In [ ]:

# your code here
fail() # No Answer - remove if you provide an answer

In [ ]:

test_1.1()

Question 1.2 Ordering
{points: 1}

Correctly re-order the steps for creating a bootstrap sample from those listed below.

record the observation's value
repeat the above the same number of times as there are observations in the original sample
return the observation to the original sample
randomly draw an observation from the original sample (which was drawn from the population)

Create your answer by reordering values below in the answer1.2 vector with the correct order for the steps above for creating a bootstrap sample.

In [ ]:

answer1.2 <- c(1, 2, 3, 4) # reorder the values!
# your code here
fail() # No Answer - remove if you provide an answer

In [ ]:

test_1.2()

Question 1.3 Multiple choice
{points: 1}

From the list below, choose the correct explanation of a bootstrap distribution for a point estimate:

A. a list of point estimates calculated from many samples drawn with replacement from the population

B. a list of point estimates calculated from many samples drawn without replacement from the population

C. a list of point estimates calculated from bootstrap samples drawn with replacement from a single sample (that was drawn from the population)

D. a list of point estimates calculated from bootstrap samples drawn without replacement from a single sample (that was drawn from the population)

Assign your answer to an object called answer1.3. Your answer should be an uppercase letter and is surrounded by quotes. (e.g. "F")

In [ ]:

# your code here
fail() # No Answer - remove if you provide an answer

In [ ]:

test_1.3()

Question 1.4 Multiple choice
{points: 1}

From the list below, choose the correct explanation of why, when performing estimation, we want to report a plausible range for the true population quantity we are trying to estimate along with the point estimate:

A. The point estimate is our best guess at the true population quantity we are trying to estimate

B. The point estimate will often not be the exact value of the true population quantity we are trying to estimate

C. The value of a point estimate from one sample might very well be different than the value of a point estimate from another sample.

D. B & C

F. A & C

E. None of the above

Assign your answer to an object called answer1.4. Your answer should be an uppercase letter and is surrounded by quotes (e.g. "F").

In [ ]:

# your code here
fail() # No Answer - remove if you provide an answer

In [ ]:

test_1.4()

Continuing with our virtual population of Canadian seniors from last worksheet

Here we re-create the virtual population (ages of all Canadian seniors) we used in the last worksheet. It was bounded by realistic values ( $\geq$ 65 and $\leq$ 118):

In [ ]:

# run this cell to simulate a finite population
set.seed(4321) # DO NOT CHANGE
can_seniors <- tibble(age = (rexp(2000000, rate = 0.1)^2) + 65) %>% 
    filter(age <= 117, age >= 65)
head(can_seniors)

Let's remind ourselves of what this population looks like:

In [ ]:

# run this cell 
options(repr.plot.width = 8, repr.plot.height = 7)
pop_dist <- ggplot(can_seniors, aes(age)) + 
    geom_histogram(binwidth = 1) +
    xlab("Age (years)") +
    ggtitle("Population distribution") +
    theme(text = element_text(size = 20), plot.margin = margin(10, 100)) # last x value was getting cut off
pop_dist

Estimate the mean age of Canadian Seniors

Let's say we are interested in estimating the mean age of Canadian Seniors. Given that we have the population (we created it) we could just calculate the mean age from this population data. However in real life, we usually only have one small-ish sample from the population. Also, from our experimentation with sampling distributions, we know that different random samples will give us different point estimates. We also know from these experiments that the point estimates from different random samples will mostly be close to the true population quanitity we are trying to estimate, and how close depends on the sample size.

What about in real life though, when we only have one sample? Can we say how close? Or at least give some plausible range of where we would expect the population quanitity we are trying to estimate to fall? Yes! We can do this using a method called bootstrapping! Let's explore how to create a bootstrap distribution from a single sample using R. And then we will discuss how the bootstrap distribution relates to the sampling distribution, and what it can tell us about the true population quantity we are trying to estimate.

Let's draw a single sample of size 40 from the population and visualize it:

In [ ]:

# run this cell 
set.seed(12345) 
one_sample <- can_seniors %>% 
    rep_sample_n(40) %>% 
    ungroup() %>% # ungroup the data frame 
    select(age) # drop the replicate column 
head(one_sample)

An aside
To make this work with the infer package functions, we need to drop the first column (named replicate) as when we are bootstrapping we will be generating this column again. We also need to ungroup this data frame too...
This is because in real life you would only have one sample (and hence no replicate column and the data would be ungrouped).

In [ ]:

options(repr.plot.width = 8, repr.plot.height = 7)
# run this cell 
one_sample_dist <- ggplot(one_sample, aes(age)) + 
    geom_histogram(binwidth = 1) +
    xlab("Age (years)") +
    ggtitle("Distribution of one sample") +
    theme(text = element_text(size = 20))
one_sample_dist

Question 1.5
{points: 1}

Use summarise to calculate the mean age (our point estimate of interest) from the random sample you just took (one_sample):

Name this data frame one_sample_estimates which has a single column named mean.

In [ ]:

# your code here
fail() # No Answer - remove if you provide an answer
one_sample_estimates

In [ ]:

test_1.5()

Question 1.6
{points: 1}

To generate a single bootstrap sample in R, we can use the rep_sample_n function as we did when we were creating our sampling distribution, however this time we change the argument for replace from its default value of FALSE to TRUE.

Use rep_sample_n to take a single bootstrap sample from the sample you drew from the population. Name this bootstrap sample boot1.

Use 4321 as the seed.

In [ ]:

set.seed(4321)  # DO NOT CHANGE!
# ... <- ... %>%
#    rep_sample_n(size = ..., replace = ..., reps = ...)
# your code here
fail() # No Answer - remove if you provide an answer
head(boot1)

In [ ]:

test_1.6()

Question 1.7 Multiple choice
{points: 1}

Why do we change replace to TRUE?

A. Taking a bootstrap sample involves drawing observations from the original population without replacement

B. Taking a bootstrap sample involves drawing observations from the original population with replacement

C. Taking a bootstrap sample involves drawing observations from the original sample without replacement

D. Taking a bootstrap sample involves drawing observations from the original sample with replacement

Assign your answer to an object called answer1.7. Your answer should be an uppercase letter and is surrounded by quotes (e.g. "F").

In [ ]:

# your code here
fail() # No Answer - remove if you provide an answer

In [ ]:

test_1.7()

Question 1.8
{points: 1}

Visualize the distribution of the bootstrap sample you just took (boot1) that was just created by plotting a histogram using binwidth = 1 in the geom_histogram argument. Name the plot boot1_dist and give the plot (using ggtitle) and the x-axis a descriptive label.

In [ ]:

options(repr.plot.width = 8, repr.plot.height = 7)
# your code here
fail() # No Answer - remove if you provide an answer
boot1_dist

In [ ]:

test_1.8()

Let's now compare our bootstrap sample to the original random sample that we drew from the population:

In [ ]:

# run this code cell
options(repr.plot.width = 15, repr.plot.height = 7)
grid.arrange(one_sample_dist, boot1_dist, ncol = 2)

Earlier we calculate the mean of our original sample to be about 79.6 years. What is the mean of our bootstrap sample?

In [ ]:

# run this cell
summarise(boot1, mean = mean(age))

We see that original sample distrbution and the bootstrap sample distribution are of similar shape, but not identical. They also have different means. The difference of the frequency of the values in the bootstrap sample (and the difference of the value of the mean) comes from sampling from the original sample with replacement. Why sample with replacement? If we didn't we would end up with the original sample again. What we are trying to do with bootstrapping is to mimic drawing another sample from the population, without actually doing that.

Why are we doing this? As mentioned earlier, in real life we typically only have one sample and thus we cannot create a sampling distribution that we can use to tell us about how we might expected our point estimate to behave if we took another sample. So what we can do, is use our sample as an estimate of our population, and sample from that with replacement (i.e., bootstrapping) many times to create many bootstrap samples. We can then calculate point estimates for each bootstrap sample and create a bootstrap distribution of our point estimates and use this as a proxy for a sampling distribution. We can finally use this bootstrap distribution of our point estimates to suggest how we might expected our point estimate to behave if we took another sample.

Question 1.9
{points: 1}

What do 6 different bootstrap samples look like? Use rep_sample_n to create a single data frame with 6 bootstrap samples drawn from the original sample we drew from the population, one_sample. Name the data frame boot6.

Set the seed as 1234.

In [ ]:

set.seed(1234)
# your code here
fail() # No Answer - remove if you provide an answer
head(boot6)
tail(boot6)

In [ ]:

test_1.9()

Question 2.0
{points: 1}

Now visualize the six bootstrap sample distributions from boot6 by faceting the replicate column. Name the plot object boot6_dist and give the plot (using ggtitle) and the x-axis a descriptive label.

In [ ]:

options(repr.plot.width = 15, repr.plot.height = 7)
# ... <- ggplot(...) +
#   ...(binwidth = 1) +
#  facet_wrap(~ ...)
# your code here
fail() # No Answer - remove if you provide an answer
boot6_dist

In [ ]:

test_2.0()

Question 2.1
{points: 1}

Calculate the mean of these 6 bootstrap samples using group_by and summarize the data into a column called mean. Name the data frame boot6_means.

In [ ]:

# your code here
fail() # No Answer - remove if you provide an answer
boot6_means

In [ ]:

test_2.1()

Question 2.2
{points: 1}

Let's now take 1000 bootstrap samples from the original sample we drew from the population (one_sample) using rep_sample_n. Name the data frame boot1000.

Set the seed as 1234.

In [ ]:

set.seed(1234)
# your code here
fail() # No Answer - remove if you provide an answer
head(boot1000)
tail(boot1000)

In [ ]:

test_2.2()

Question 2.3
{points: 1}

Calculate the mean of these 1000 bootstrap samples using group_by and summarize the data into a column called mean. Name the data frame boot1000_means.

In [ ]:

# your code here
fail() # No Answer - remove if you provide an answer
head(boot1000_means)
tail(boot1000_means)

In [ ]:

test_2.3()

Question 2.4
{points: 1}

Visualize the distribution of the bootstrap sample point estimates (boot1000_means) you just calculated by plotting a histogram using binwidth = 1 in the geom_histogram argument. Name the plot boot_est_dist and give the plot (using ggtitle) and the x-axis a descriptive label.

In [ ]:

options(repr.plot.width = 8, repr.plot.height = 7)
# your code here
fail() # No Answer - remove if you provide an answer
boot_est_dist

In [ ]:

test_2.4()

How does the bootstrap distribution above compare to the sampling distribution? Let's visualize them side by side:

In [ ]:

# run this cell

# create sampling distribution histogram
set.seed(4321)
samples <- rep_sample_n(can_seniors, size = 40, reps = 1500)
sample_estimates <- samples %>% 
    group_by(replicate) %>% 
    summarise(mean = mean(age))
sampling_dist <-  ggplot(sample_estimates, aes(x = mean)) +
    geom_histogram(binwidth = 1) +
    xlab("Mean Age (years)") +
    ggtitle("Sampling distribution of the sample means") +
    annotate("text", x = 85, y = 200, label = paste("mean = ", round(mean(sample_estimates$mean), 1))) +
    theme(text = element_text(size = 20))

# annotate distribution mean to the bootstrap dist
boot_est_dist <- boot_est_dist + 
    annotate("text", x = 84, y = 160, label = paste("mean = ", round(mean(boot1000_means$mean), 1)))

# plot bootstrap distribution beside sampling distribution
options(repr.plot.width = 15, repr.plot.height = 5)
grid.arrange(boot_est_dist, sampling_dist, ncol = 2)

Reminder: the true population quantity we are trying to estimate, the population mean, is about 79.3 years. We know this because we created this population and calculated this value. In real life we wouldn't know this value.

Question 2.5 True/False
{points: 1}

The mean of the distribution of the bootstrap sample means is the same value as the mean of the sampling distribution of the sample means. True or false?

Assign your answer to an object called answer2.5. Your answer should be in lowercase letters and is surrounded by quotes (e.g. "true" or "false").

In [ ]:

# your code here
fail() # No Answer - remove if you provide an answer

In [ ]:

test_2.5()

Question 2.6 True/False
{points: 1}

The mean of the bootstrap sampling distribution is not the same value as the mean of the sampling distribution because the bootstrap sampling distribution was created from samples drawn from a single sample, whereas the sampling distribution was created from samples drawn from the population. True or false?

Assign your answer to an object called answer2.6. Your answer should be in lowercase letters and is surrounded by quotes (e.g. "true" or "false").

In [ ]:

# your code here
fail() # No Answer - remove if you provide an answer

In [ ]:

test_2.6()

Question 2.7 True/False
{points: 1}

The shape and spread (i.e. width) of the distribution of the bootstrap sample means is a poor approximation of the shape and spread of the sampling distribution of the sample means. True or false?

Assign your answer to an object called answer2.7. Your answer should be in lowercase letters and is surrounded by quotes (e.g. "true" or "false").

In [ ]:

# your code here
fail() # No Answer - remove if you provide an answer

In [ ]:

test_2.7()

Question 2.8 True/False
{points: 1}

In real life, where we only have one sample and cannot create a sampling distribution, the distribution of the bootstrap sample estimates (here means) can suggest how we might expect our point estimate to behave if we took another sample. True or false?

Assign your answer to an object called answer2.8. Your answer should be in lowercase letters and is surrounded by quotes (e.g. "true" or "false").

In [ ]:

# your code here
fail() # No Answer - remove if you provide an answer

In [ ]:

test_2.8()

Using the bootstrap distribution to calculate a plausible range for point estimates

Once we have created a bootstrap distribution, we can use it to suggest a plausible range where we might expect the true population quantity to lie. One formal name for a commonly used plausible range is called a confidence interval. Confidence intervals can be set at different levels, an example of a commonly used level is 95%. When we report a point estimate with a 95% confidence interval as the plausible range, formally we are saying that if we repeated this process of building confidence intervals more times with more samples, we’d expect ~ 95% of them to contain the value of the population quantity.

How do you choose a level for a confidence interval? You have to consider the downstream application of your estimation and what the cost/consequence of an incorrect estimate would be. The higher the cost/consequence, the higher a confidence level you would want to use. You will learn more about this in later Statistics courses.

To calculate an approximate 95% confidence interval using bootstrapping, we essentially order the values in our bootstrap distribution and then take the value at the 2.5th percentile as the lower bound of the plausible range, and the 97.5th percentile as the upper bound of the plausible range.

In [ ]:

# run this cell
boot1000_means %>% 
    select(mean) %>% 
    pull() %>% 
    quantile(c(0.025, 0.975))

Thus, to finish our estimation of the population quantity that we are trying to estimate, we would report the point estimate and the lower and upper bounds of our confidence interval. We would say something like this:

Our sample mean age for Canadian seniors was measured to be 77.8 years, and we’re 95% “confident” that the true population mean for Canadian seniors is between (73.7, 82.0).

Here our 95% confidence interval does contain the true population mean for Canadian seniors, 79.3 years - pretty neat! However, in real life we would never be able to know this because we only have observations from a single sample, not the whole population.

Question 2.9 True/False
{points: 1}

Assuming we knew the true population quantity we are trying to estimate (so we could verify this):

For any sample we take, if we use bootstrapping to calculate the 95% confidence intervals, the true population quantity we are trying to estimate would always fall within the lower and upper bounds of the confidence interval. True or false?

Assign your answer to an object called answer2.9. Your answer should be in lowercase letters and is surrounded by quotes (e.g. "true" or "false").

In [ ]:

# your code here
fail() # No Answer - remove if you provide an answer

In [ ]:

test_2.9()

Optional: getting R + Jupyter working for you outside of our course

At some point after the course is done, you will lose access to the JupyterHub server where you have been doing your course work. If you want to continue to use R + Jupyter (for other courses at UBC, or for your work after UBC) you have two options:

a server solution
a local installation solution

We will point you to how you can do both, as well as provide guidance to take a copy of your homework from our Canvas JupyterHub server.

Getting your files off of the Canvas JupyterHub

On the Canvas JupyterHub open a terminal and type: tar -zcvf dsci-100.tar.gz dsci-100
Go to the JupyterHub Control Panel/Home and click the box beside the file dsci-100.tar.gz, and then click "Download"
Log onto either the https://ubc.syzygy.ca/ (CWL authentication) or https://cybera.syzygy.ca/ (Google authentication) servers
Click the white "Upload" button (beside the "New" button), and select tar -zxvf dsci-100.tar.gz. Then click the blue "Upload" button to complete the upload.
open a terminal and type: tar -zxvf dsci-100.tar.gz

You should have your files now available on either of those servers above.

On Mac or Linux tar also works in terminal, and so you can use this strategy to have the files available locally, however you will need Jupyter installed to view them.

1. A Server Solution

As a student at UBC, you have access to another JupyterHub that you can access using your UBC CWL: https://ubc.syzygy.ca/
If you have a Google account, you have access to another JupyterHub that does not depend on you being a UBC student (i.e., having a valid CWL): https://cybera.syzygy.ca/

2. A Local Installation Solution

Depending on your device, you may be able to install Jupyter and R on it. Below we provide instructions for the 3 major operating systems (Mac, Windows and Linux)

A. Install Jupyter + R on Mac

A.1. Install R

Go to https://cran.r-project.org/bin/macosx/ and download the latest version of R for Mac (Should look something like this: R-3.5.1.pkg). Open the file and follow the installer instructions.

A.2. Install Jupyter

Head to https://www.anaconda.com/download/#macos and download the Anaconda version for Mac OS. Follow the instructions on that page to run the installer. Anaconda is a software bundle installs Jupyter and Python and a few other tools. We recommend using it because it makes installation much simpler than other methods.

A.3. Install the IR kernel

The IR kernel lets you connect R with Jupyter, to install it follow the instructions below:

Open terminal (how to video), and type R to open R.
Type the following commands into R:

install.packages('IRkernel', repos = 'http://cran.us.r-project.org')

IRkernel::installspec(user = FALSE)

Next, exit R by typing: q() (type n when prompted to not save the workspace)

A.4 Test if it works:

Open terminal and type: jupyter notebook

The above command should open a web browser, with Jupyter's home inside it. Try creating a new R notebook and running some simple R code (e.g., print("hello!")) to test that it works.

B. Install Jupyter + R on Windows

B.1. Install R

Go to https://cran.r-project.org/bin/windows/base/ and download the latest version of R for Windows (Should look something like this: Download R 3.5.1 for Windows). Open the file and follow the installer instructions.

B.2. Install Jupyter

Head to https://www.anaconda.com/download/#windows and download the Anaconda version for Windows with Python 3.6. After the download has finished, run the installer selecting the following options:

On the Advanced Installation Options page, check both boxes (see image below)
For all other pages, use the default options

B.3. Install the IR kernel

The IR kernel lets you connect R with Jupyter, to install it follow the instructions below:

Open R (double click on Desktop Icon or open it using the START menu)
Type the following commands into R:

install.packages('IRkernel', repos = 'http://cran.us.r-project.org')

IRkernel::installspec(user = FALSE)

Next, exit/close R

B.4 Test if it works:

Open Windows Command Prompt (select the Start button, type cmd, and click Command Prompt from the list) and type: jupyter notebook

The above command should open a web browser, with Jupyter's home inside it. Try creating a new R notebook and running some simple R code (e.g., print("hello!")) to test that it works.

C. Install Jupyter + R on Linux

C.1. Install R

Follow the instructiond here: https://ubc-mds.github.io/resources_pages/install_ds_stack_ubuntu/#r-irkernel-and-rstudio (note - you do not need to install RStudio).

C.2. Install Jupyter

Head to https://www.anaconda.com/download/#linux and download the Anaconda version for Linux with Python 3.6. Follow the instructions on that page to run the installer. Anaconda is a software bundle installs Jupyter and Python and a few other tools. We recommend using it because it makes installation much simpler than other methods.

C.3. Install the IR kernel

The IR kernel lets you connect R with Jupyter, to install it follow the instructions below:

Open the terminal, and type R to open R.
Type the following commands into R:

install.packages('IRkernel', repos = 'http://cran.us.r-project.org')

IRkernel::installspec(user = FALSE)

Next, exit R by typing: q() (type n when prompted to not save the workspace)

C.4 Test if it works:

Open terminal and type: jupyter notebook

The above command should open a web browser, with Jupyter's home inside it. Try creating a new R notebook and running some simple R code (e.g., print("hello!")) to test that it works.

In [ ]:

source("cleanup_worksheet_12.R")

Worksheet 12 - The Bootstrap

Lecture and Tutorial Learning Goals:

Continuing with our virtual population of Canadian seniors from last worksheet

Estimate the mean age of Canadian Seniors

An aside

Using the bootstrap distribution to calculate a plausible range for point estimates

Optional: getting R + Jupyter working for you outside of our course

Getting your files off of the Canvas JupyterHub

1. A Server Solution

2. A Local Installation Solution

A. Install Jupyter + R on Mac

A.1. Install R

A.2. Install Jupyter

A.3. Install the IR kernel

A.4 Test if it works:

B. Install Jupyter + R on Windows

B.1. Install R

B.2. Install Jupyter

B.3. Install the IR kernel

B.4 Test if it works:

C. Install Jupyter + R on Linux

C.1. Install R

C.2. Install Jupyter

C.3. Install the IR kernel

C.4 Test if it works:

Product

Resources

Company