Lab 6: Resampling and Bootstrapping
Due Saturday, November 12th at 11:59pm
Welcome to Lab 6! In this assignment, we'll develop a further understanding of parameter estimation and bootstrapping, which you can learn more about in CIT 13. The relevant lectures are Lectures 14, 18, and 19.
This lab is due on Saturday, November 12th at 11:59pm.
Supplemental video on for-loops and when NOT to use them
We put together a video reviewing some of the ways to perform repetitive tasks (e.g. random sampling, or applying an operation to every element of a column) without using a for-loop. We'll also look at when you do need a for-loop in this class (running an experiment many times). This is important, because using a for-loop when it isn't necessary is a bad idea: the resulting code is slow and hard to debug.
If you're feeling a little shaky on iteration and coding simulations, you may want to check it out!
0. Percentiles 🅿️
Before we start, we need to introduce the concept of percentiles. Percentiles associate the numbers in a dataset with their positions when the dataset is sorted in ascending order.
Given any sequence (i.e. list, array, or Series) of numerical values, imagine sorting the values in ascending order to create a ranked sequence. Roughly speaking, the pth percentile of this sequence is the value that is p percent of the way through the sequence. For example, the 10th percentile is only 10% of the way through (towards the beginning), the 50th percentile is halfway through (towards the middle), and the 90th percentile is 90% of the way through (towards the end).
There are many different ways to precisely define a percentile. In this class, we'll consider two different approaches. You should think of these as two separate, different ways to define a percentile. They don't always agree!
The mathematical definition
Let p be a number between 0 and 100. The pth percentile of a collection is the smallest value in the collection that is at least as large as p% of all the values.
With this definition, any percentile is always an element of the collection.
The numpy definition
The numpy package provides a function np.percentile that takes two inputs: an array of numbers and a value p. It returns a number that represents the pth percentile of the array. You don't need to know how it calculates this value, but you should know:
it's not always the same as the mathematical definition given above (though it is close), and
it's not always an element of the array.
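To see both points concretely, here's a small sketch (a toy array, not the lab's grades data) that implements the mathematical definition by hand next to np.percentile. The helper name percentile_math is our own, not part of numpy:

```python
import numpy as np

data = np.array([1, 7, 3, 9, 5])

def percentile_math(p, values):
    """The mathematical definition: the smallest value in the
    collection that is at least as large as p% of the values."""
    sorted_vals = np.sort(values)
    n = len(sorted_vals)
    # Smallest rank k such that k values are at least p% of n.
    k = max(int(np.ceil(p / 100 * n)), 1)
    return sorted_vals[k - 1]

print(percentile_math(50, data))   # 5 — always an element of the array
print(np.percentile(data, 50))     # 5.0 — here the two definitions agree
print(percentile_math(70, data))   # 7
print(np.percentile(data, 70))     # 6.6 — interpolated; not an element of the array
```

Notice that np.percentile interpolates between neighboring values by default, which is why its answer need not appear in the array.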
Question 0.1.
Say you are in a class with 10 students, and the grades of all students in the class are stored in the array grades. Your score is 84.
Which of the following statements are true? Use the mathematical definition of percentile here.
1. The highest score is the 100th percentile.
2. Your score is more than the 80th percentile.
3. Your score is less than the 81st percentile.
4. Your score is the 86th percentile.
5. A score of 78 is the 50th percentile.
Assign true_percentile to a list containing the numbers of the true statements.
Question 0.2.
Use np.percentile to calculate the 50th percentile of the grades array and save the result as p_50.
Question 0.3.
Use np.median to calculate the median value of the grades array and save the result as median_grade.
Manually compare it to your answer from Question 0.2. Set the variable same to True if the two values are the same, and False if they are different. Do not use if/else for this question.
1. Allied Intelligence Preliminaries 🧠
Throughout this lab, we will study a statistical problem known as the German tank problem.
In World War II, the Allies (led by the US, the UK, and the Soviet Union) wanted to know how many military tanks the Germans had produced. However, they didn't get to see every single tank produced by the Germans – rather, all they saw was a random sample of tanks.
To frame the problem more precisely, consider that tanks were given serial numbers ranging from 1 to N, where N was the total number of tanks produced. The Allies were trying to estimate N, a population parameter, using the serial numbers of the tanks in their sample. We will assume that the Allies' sample is a simple random sample of the population (drawn without replacement).

In this lab, given just a random sample of serial numbers, we'll estimate N, and then we'll use simulation to find out how accurate our estimate likely is, without ever looking at the whole population. This is an example of statistical inference – inferring something about a population using just the information in a sample.
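To make the setup concrete, here's a toy version with a made-up value of N (the lab hides the real one; the value 122 and the seed below are purely illustrative):

```python
import numpy as np

np.random.seed(0)                   # only so this illustration is reproducible
N = 122                             # made-up "true" number of tanks
population = np.arange(1, N + 1)    # serial numbers 1, 2, ..., N

# A simple random sample drawn without replacement, like the Allies' sample.
sample = np.random.choice(population, size=17, replace=False)

print(sample.max())                 # the sample max can never exceed N
print(2 * sample.mean())            # twice the sample mean: another estimate of N
```

In the lab itself you only get to see the sample, never the population, which is exactly what makes estimating N interesting.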
Question 1.1. Is N a population parameter or a statistic? If we compute a number using our random sample that's an estimate of N, is that a population parameter or a statistic? Assign either 1, 2, 3, or 4 to the variable preliminaries_q1 below.
1. N is a population parameter. An estimate of N from our random sample is a population parameter.
2. N is a population parameter. An estimate of N from our random sample is a statistic.
3. N is a statistic. An estimate of N from our random sample is a population parameter.
4. N is a statistic. An estimate of N from our random sample is a statistic.
To make the situation realistic, we're going to hide the true number of tanks from you. You'll have access only to this random sample:
Question 1.2. Define a function named plot_serial_numbers that draws a histogram of any DataFrame of serial numbers. It should take one argument, a DataFrame df with a single column called 'serial_number' (like observations). It should plot a histogram of the values in the 'serial_number' column using bins of width 1 ranging from 1 to 200 (inclusive) but return nothing. Then, call that function to make a histogram of observations.
Check your answer: Your histogram should have bars that are all the same height and the x-axis should range from 0 to 200.
Question 1.3. Since we are trying to estimate the population max, N, a natural statistic to use is the sample max. In other words, we can estimate the total number of tanks as being the biggest serial number in our sample.
Below, write a function called calculate_max_based_estimate that computes that statistic on a given Series of serial numbers. It should take as its argument a Series of serial numbers and return their max.
After that, use it to compute an estimate of N using the serial numbers in observations. Call the estimate max_based_estimate.
Question 1.4. Another way to estimate N is to take twice the mean of the serial numbers in our sample. Below, write a function called calculate_mean_based_estimate that computes that statistic. It should take as its argument a Series of serial numbers and return twice their mean.
After that, use it to compute an estimate of N using the serial numbers in observations. Call the estimate mean_based_estimate.
Question 1.5. Look at the values of max_based_estimate and mean_based_estimate that we happened to get for our dataset:
The value of max_based_estimate tells you something about mean_based_estimate. Could our current mean_based_estimate possibly be equal to N (at least if we round it to the nearest integer)? If not, is it definitely higher, definitely lower, or can we not tell? Assign one of the choices (1-6) to the variable preliminaries_q5 below.
1. Yes, our mean_based_estimate for this sample could equal N.
2. No, our mean_based_estimate for this sample cannot be equal to N; it is definitely lower by roughly 3.
3. No, our mean_based_estimate for this sample cannot be equal to N; it is definitely lower by at least 12.
4. No, our mean_based_estimate for this sample cannot be equal to N; it is definitely higher by roughly 3.
5. No, our mean_based_estimate for this sample cannot be equal to N; it is definitely higher by at least 12.
6. No, our mean_based_estimate for this sample cannot be equal to N, but we cannot tell if it is lower or higher.
We can't just confidently proclaim that max_based_estimate or mean_based_estimate are equal to N, because we don't know what N actually is. What if we're really far off? We want to get a sense of the accuracy of our estimates.
2. Resampling 🥾
If we had access to the entire population, we could repeatedly draw samples from the population and compute our estimate using each sample. This would give an empirical distribution of our estimate, which we could use to see how wrong our estimates tend to be. This is what we did in Lecture 14.
Unfortunately, we don't have access to the entire population (i.e. we don't know the value of N). All we have access to is a single sample of serial numbers. How do we tell how accurate our estimates are without being able to sample repeatedly from the population to create an empirical distribution? 🤔
One strategy is to repeatedly sample from our sample, or "resample", and use those resamples to compute an empirical distribution of our estimate. Let's talk about why this is a reasonable strategy.
When we tried to determine N, the number of tanks, we would have liked to use the whole population. Since we had only a sample, we used that to estimate N instead.

Similarly, now we would like to use the population of serial numbers to run a simulation to help us understand how different estimates of N might have turned out. But we still only have our sample, so can we use that instead? We can! Since large random samples resemble the populations they are drawn from, and our sample is relatively large, we can treat our sample as if it is the population, and sample from it.
When we resample from our original sample, we sample uniformly at random with replacement and create a resample that has the same number of elements as the original sample. (In Question 4, we'll look at why we must resample with replacement.)
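As a quick illustration of what "same size, with replacement" looks like in code (using a toy DataFrame, not the lab's observations):

```python
import pandas as pd

toy = pd.DataFrame({'serial_number': [12, 40, 57, 89, 135]})

# One bootstrap resample: same number of rows as the original,
# drawn with replacement, so some serial numbers may repeat
# and others may be left out entirely.
resample = toy.sample(n=len(toy), replace=True)
print(resample)
```

Run this a few times and you'll see that each resample can contain duplicates, which is exactly the behavior that makes bootstrap resamples differ from one another.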
Here's an analogy between estimating N and simulating the variability of our estimates:
The process of resampling from our original sample is known as bootstrap resampling. Run the cell below to walk through an animation that illustrates how bootstrapping works.
Bootstrapping is a really tricky idea, so please ask for help if you're confused!
Question 2.1. Write a function called simulate_resample. It should take no arguments, and it should generate a resample (again, with replacement) from the observed serial numbers in observations and return that resample. (The resample should be a DataFrame like observations.)
Hint: Use the .sample method.
Later, we'll use many resamples at once to see what estimates typically look like. We don't often pay attention to single resamples, so it's easy to misunderstand them. Let's examine some individual resamples before we start using them.
Question 2.2. Make a histogram of your one_resample and a separate histogram of the original observations. Make sure to use the function plot_serial_numbers that you defined earlier in the lab.
Question 2.3. Which of the following are true?

1. In the plot of the resample, there are no bars at locations that weren't there in the plot of the original observations.
2. In the plot of the original observations, there are no bars at locations that weren't there in the plot of the resample.
3. There are no duplicate serial numbers in the resample.
4. There are no duplicate serial numbers in the original observations.

Assign true_statements to a list of the numbers of the correct statements.
Question 2.4. Create 2 more resamples. For each resample, plot a histogram and compute the max-based and mean-based estimates using that resample.
There's a good chance that you'll find that the max-based estimates from the resamples are both exactly 135 (run the cell a few times and you'll almost surely see this happen). You'll also probably find that the two mean-based estimates differ from the sample mean-based estimate (and from each other).
Question 2.5. Compute the exact probability that a max-based estimate from one resample of our observations sample is 135 and assign it to the variable resampling_q5 below. It may be useful to recall that the size of observations is 17.
Note that this is a math question, not a programming one. It may help to figure out your answer on paper and then assign resampling_q5 to an expression that evaluates to the right answer.
Hint: Think about the "grandma" example from Lecture 12. What is the probability that any one of the elements in our resample is equal to 135?
The correct answer is high, above 60%. Think about why a mean-based estimate from a resample is less likely to be exactly equal to the mean-based estimate from the original sample as compared to a max-based estimate.
3. Resampling via Simulation 💻
Since resampling from a large random sample looks just like sampling from a population, the code should look almost the same, too. That means we can write a function that simulates either sampling from a population or resampling from a sample. If we pass it a population as its argument, it will do the former; if we pass it a sample, it will do the latter.
Question 3.1. Write a function called simulate_estimates. It should take 4 arguments:
- original_df: A DataFrame from which the data should be sampled, with 1 column named 'serial_number'.
- sample_size: The size of each sample, an integer. (For example, to do resampling, we would pass the number of rows in original_df for this argument.)
- statistic: A function that computes a statistic on a sample. This argument is the name of a function that takes a Series of serial numbers as its argument and returns a number (e.g. calculate_mean_based_estimate).
- repetitions: The number of repetitions to perform (i.e. the number of resamples to create).
It should simulate repetitions samples with replacement from the given DataFrame. For each of those samples, it should compute the statistic on that sample. Then it should return an array containing the value of that statistic for each sample (this means that the length of the returned array should be equal to repetitions).
The code below provides an example use of your function and describes how you can verify that you've written it correctly.
Check your answer: The histogram you see should be a bell-shaped curve centered at 1000 with most of its mass in [800, 1200].
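If the repeat-and-collect loop pattern feels unfamiliar, here's a toy sketch of it (simulating means of dice rolls, not the lab's solution; the function name simulate_die_means is ours):

```python
import numpy as np

def simulate_die_means(repetitions):
    # Collect one statistic (here, a sample mean) per repetition.
    results = np.array([])
    for _ in range(repetitions):
        rolls = np.random.choice(np.arange(1, 7), size=10, replace=True)
        results = np.append(results, rolls.mean())
    return results

means = simulate_die_means(1000)
print(len(means))    # 1000 — one statistic per repetition
```

Your simulate_estimates function follows the same shape: sample, compute the statistic, append, repeat.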
Now we can go back to the sample we actually observed (observations) and estimate how much our mean-based estimate of N would have varied from sample to sample.
Question 3.2. Using the bootstrap procedure and the sample observations, simulate the approximate distribution of mean-based estimates of N. Use 5,000 repetitions. Store the estimates in bootstrap_estimates. (Note that this only requires one line of code; call your simulate_estimates function.)
We have provided code that plots a histogram, allowing you to visualize the simulated estimates.
Question 3.3. Compute an interval that covers the middle 95% of the bootstrap estimates. Verify that your interval looks like it covers 95% of the area in the histogram above.
Hint 1: Use np.percentile here.
Hint 2: If you find yourself using 5 and 95 as the arguments to np.percentile, try again – only 90% of the data is between the 5th and 95th percentiles!
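The general pattern for a "middle X%" interval: leave out half of the remaining (100 − X)% in each tail. For instance, the middle 50% of a toy array (this is an illustration of the pattern, not the answer to this question):

```python
import numpy as np

toy = np.arange(1, 101)   # the numbers 1 through 100

# Middle 50%: cut off 25% in each tail.
left = np.percentile(toy, 25)
right = np.percentile(toy, 75)
print([left, right])
```

Work out which two percentiles leave out 2.5% in each tail to cover the middle 95%.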
Question 3.4. Let's say that N, the population parameter we've been trying to estimate, is actually 150. Write code that simulates the sampling and bootstrapping process again, as follows:
1. Generate a new set of random observations the Allies might have seen by sampling from the population DataFrame we have created for you below. Take a sample of size 70 without replacement. Store the sample in the variable name new_observations.
2. Using only new_observations – not population – compute 5,000 bootstrapped mean-based estimates of N. To do this, call your simulate_estimates function.
3. Compute an interval covering the middle 95% of these bootstrapped mean-based estimates.
Question 3.5. If you ran your cell above many, many times, approximately what percentage of the intervals you created would include N (150 in this case)? Assign either 1, 2, 3, 4, or 5 to the variable simulating_q5 below.
1. 100%
2. 97.5%
3. 95%
4. 5%
5. It's impossible to tell.
4. With or Without Replacement? 🔂
Each time we resampled from our original sample, we sampled with replacement. What would happen if we tried to resample without replacement? Let's find out!
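Before running the experiment, it may help to see the mechanical difference between the two kinds of sampling on a tiny toy DataFrame (the data here is made up):

```python
import pandas as pd

toy = pd.DataFrame({'serial_number': [10, 20, 30, 40]})

# With replacement: duplicates are possible, so resamples can differ.
with_rep = toy.sample(n=4, replace=True)

# Without replacement, at full size: every row appears exactly once.
without_rep = toy.sample(n=4, replace=False)
print(sorted(without_rep['serial_number']))   # always [10, 20, 30, 40]
```

Keep this contrast in mind when you interpret your results in Question 4.2.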
Below, we will collect another random sample of size 70 from population that we can then resample from. We'll call it original_sample.
Question 4.1. Below, 5,000 times, collect a resample of size 70 from original_sample without replacement. Compute the mean-based estimate on each resample, and store the estimates in the array estimates_without_replacement.
Note: You cannot use your simulate_estimates function here, because that samples with replacement. Instead, you'll have to write a new for-loop. It's a good idea to start by copying the code from your function in 3.1 and changing the necessary pieces.
Question 4.2. If you completed 4.1 correctly, you'll notice that all 5,000 of your estimates are identical, and are equal to roughly 149.5143. Furthermore, this number is equal to the mean-based estimate derived from original_sample, without any resampling:
Why are all of our estimates identical, and why must we sample with replacement when resampling?
Type your answer here, replacing this text.
Finish Line 🏁
Congratulations! You are done with Lab 6.
To submit your assignment:
1. Select Kernel -> Restart & Run All to ensure that you have executed all cells, including the test cells.
2. Read through the notebook to make sure everything is fine and all tests passed.
3. Run the cell below to run all tests, and make sure that they all pass.
4. Download your notebook using File -> Download as -> Notebook (.ipynb), then upload your notebook to Gradescope.