GitHub Repository: dsc-courses/dsc10-2022-fa
Path: blob/main/homeworks/hw06/hw06.ipynb
Kernel: Python 3 (ipykernel)

Homework 6: Permutation Testing, Percentiles, and Bootstrapping

Due Tuesday, November 15th at 11:59PM

Welcome to Homework 6! This homework covers permutation testing, percentiles, and bootstrapping.

Instructions

This assignment is due Tuesday, November 15th at 11:59PM. You are given six slip days throughout the quarter to extend deadlines. See the syllabus for more details. With the exception of using slip days, late work will not be accepted unless you have made special arrangements with your instructor.

Important: For homeworks, the otter tests don't usually tell you that your answer is correct. More often, they help catch careless mistakes. It's up to you to ensure that your answer is correct. If you're not sure, ask someone (not for the answer, but for some guidance about your approach). These are great questions for office hours (see the schedule on the Calendar) or EdStem. Directly sharing answers is not okay, but discussing problems with the course staff or with other students is encouraged.

# Don't change this cell; just run it.
import babypandas as bpd
import numpy as np
import matplotlib.pyplot as plt
plt.style.use('ggplot')

import otter
grader = otter.Notebook()

%reload_ext pandas_tutor

Aside: Random Seeds 🌱

Throughout this homework – and the Final Project – you'll notice that we frequently call the function np.random.seed with an integer argument. What exactly does that do?

To see for yourself, run the cell below several times.

np.random.seed(25)
print(np.random.multinomial(10, [0.5, 0.5]))
print(np.random.multinomial(10, [0.5, 0.5]))

np.random.multinomial(10, [0.5, 0.5]) should return a random result each time it's called. However, each time you ran the cell above, you saw the same output – [7 3] and [5 5].

If you call np.random.seed in a cell, then every time you run the cell, you will see the same results, even if there are calls to "random" functions and methods in the cell. Think of calling np.random.seed as "undoing" the randomness in the cell. If you change the 25 above to some other number, you may see something other than [7 3] and [5 5], but each time you run the changed cell, you will still see the same result.

We use seeds to make it easier to autograde questions that rely on randomness, such as those that require you to bootstrap. When we use a particular seed in a question, we know exactly what the correct answer should be. When we don't, the range of correct answers is much wider, so it's harder to tell whether you actually answered the question correctly.

You're not responsible for understanding how seeds and random number generators work under the hood – all you need to know is that when you see a call to np.random.seed:

  • Don't change it.

  • Don't be alarmed if you see the same results each time you run that cell.

If you're interested in learning more, read this Wikipedia article.

1. Rideshare Rates in Boston 🚕📱

In this section, we will work with a dataset of rideshare data from Kaggle. The dataset contains information on Lyft and Uber rides from Boston during November and December of 2018. The data has been cleaned and condensed for the purposes of this question.

The rideshare data contains six columns: 'app', 'mode', 'destination', 'source', 'distance', 'price'. Let's read it in and store it as a DataFrame called rideshare.

  • 'app': rideshare app ('Lyft' or 'Uber')

  • 'mode': ride tier/mode

  • 'destination': destination area in Boston

  • 'source': ride origin area in Boston

  • 'distance': ride distance (miles)

  • 'price': ride price (USD)

rideshare = bpd.read_csv('data/rideshare_boston.csv')
rideshare

Question 1.1. The 'mode' column contains several categories, which represent the ride tiers. A ride tier is an option for the type of car you're requesting. For example, when you request an Uber, one tier is an UberXL, which corresponds to a larger vehicle like an SUV. Another is Uber Pool, which is for a shared ride. (Uber Pool was phased out in 2020 due to COVID, but has recently been reintroduced as UberX Share.)

Below, assign lyft_modes to an array of the names of all unique Lyft ride modes, and uber_modes to an array of the names of all unique Uber ride modes.

lyft_modes = ...
uber_modes = ...

# Don't change the following two lines:
print('lyft_modes:', lyft_modes)
print('uber_modes:', uber_modes)
# lyft_modes, uber_modes
grader.check("q1_1")

For the next few problems, we will be working with basic Lyft and UberX ride modes only, since they are the most commonly used ride tiers for each company. Run the next cell to query only the basic Lyft rides and UberX rides. We are saving these rides in a DataFrame called economy_rides.

lyft_mode = (rideshare.get('app') == 'Lyft') & (rideshare.get('mode') == 'Lyft')
uber_x_mode = (rideshare.get('app') == 'Uber') & (rideshare.get('mode') == 'UberX')
economy_rides = rideshare[lyft_mode | uber_x_mode].reset_index(drop=True)
economy_rides

Moving forward, "Lyft" will refer to Lyft mode rides, and "Uber" will refer to UberX mode rides. Subsequent questions will use the economy_rides DataFrame, not the rideshare DataFrame.

Question 1.2. To compare the price of Lyft rides and Uber rides, let's find the price per mile for each ride. For example, a ride that cost $10 and covered a distance of 2.5 miles has a price per mile of $\frac{10}{2.5} = \$4$ per mile.

In the economy_rides DataFrame, add a new column named 'price_per_mile' which contains the price per mile for each ride. Then, find the max, min, median, and mean 'price_per_mile' of all rides in economy_rides, and save these values in this order in an array called price_stats.

economy_rides = ...
price_stats = ...
price_stats
grader.check("q1_2")

Question 1.3. Using the economy_rides DataFrame, calculate the difference between the mean 'price_per_mile' of Uber rides and Lyft rides. Assign your answer to observed_difference.

$$\text{observed difference} = \text{mean Uber price per mile} - \text{mean Lyft price per mile}$$

observed_difference = ...
observed_difference
grader.check("q1_3")

Question 1.4. What does the number you obtained for observed_difference mean? Assign q1_4 to 1, 2, 3, or 4, corresponding to the best explanation below.

  1. In our sample, the mean Uber price per mile is higher than the mean Lyft price per mile by about 44 percent.

  2. In our sample, the mean Uber price per mile is higher than the mean Lyft price per mile by about 44 cents per mile.

  3. In our sample, the mean Uber price per mile is lower than the mean Lyft price per mile by about 44 percent.

  4. In our sample, the mean Lyft price per mile is higher than the mean Uber price per mile by about 44 cents per mile.

q1_4 = ...
grader.check("q1_4")

Now we want to conduct a permutation test (i.e., an A/B test) to determine whether it is just by chance that the average price per mile for Uber rides is higher than that for Lyft rides in our sample, or whether Uber rides really are more expensive per mile on average than Lyft rides.

  • Null Hypothesis: The prices per mile of Uber rides and Lyft rides come from the same distribution.

  • Alternative Hypothesis: The prices per mile of Uber rides are higher on average than the prices per mile of Lyft rides.
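As a refresher from lecture, the shuffling logic behind such a test can be sketched on made-up toy data (the labels, values, and repetition count below are illustrative only, not the rideshare data):

```python
# Permutation test sketch: shuffle the group labels, keep the values fixed,
# and recompute the difference in group means for each shuffle.
import numpy as np

np.random.seed(0)
labels = np.array(['A'] * 5 + ['B'] * 5)
values = np.array([4.0, 5.0, 6.0, 5.5, 4.5, 2.0, 3.0, 2.5, 3.5, 2.8])

# Observed difference in means (group A minus group B).
observed = values[labels == 'A'].mean() - values[labels == 'B'].mean()

diffs = np.array([])
for i in np.arange(1000):
    shuffled = np.random.permutation(labels)   # shuffle labels only
    diff = values[shuffled == 'A'].mean() - values[shuffled == 'B'].mean()
    diffs = np.append(diffs, diff)
```

Under the null hypothesis, the observed difference should look like a typical member of diffs; under the alternative, it should fall in the tail.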

Question 1.5. Assign uber_lyft_price to a DataFrame with only two columns, 'app' and 'price_per_mile', since these are the only relevant columns in economy_rides for this permutation test.

uber_lyft_price = ...
uber_lyft_price
grader.check("q1_5")

Question 1.6. To perform the permutation test, repeat the following 500 times: create two random groups by shuffling the 'app' column of uber_lyft_price, leaving the 'price_per_mile' column unchanged. For each pair of random groups, calculate the difference in mean price per mile (Uber minus Lyft), and store your 500 differences in the differences array.

Note: Since we are working with a relatively large data set, it may take up to five minutes to generate 500 permutations. One suggestion is to make sure your code works correctly with fewer repetitions, say, 20, before using 500 repetitions.

differences = ...
...

# Just display the first ten differences.
differences[:10]
grader.check("q1_6")

Question 1.7. Compute a p-value for this hypothesis test and assign your answer to p_val. To decide whether to use <= or >= in the calculation of the p-value, think about whether larger values or smaller values of our test statistic favor the alternative hypothesis.

p_val = ...
p_val
grader.check("q1_7")
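For intuition, a p-value is the proportion of simulated statistics at least as extreme as the observed one. Here is a toy sketch with made-up numbers (not the answer to the question above):

```python
# Toy p-value: if larger values of the statistic favor the alternative,
# count how often the simulated statistics are >= the observed statistic.
import numpy as np

np.random.seed(0)
simulated = np.random.normal(0, 1, 500)   # stand-in for shuffled statistics
observed_stat = 1.5                       # stand-in observed statistic

p_val_toy = np.count_nonzero(simulated >= observed_stat) / len(simulated)
```

If smaller values favored the alternative instead, the comparison would use <=.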

Question 1.8. Assign the variable q1_8 to a list of all the true statements below.

  1. We accept the null hypothesis at the 0.05 significance level.

  2. We reject the null hypothesis at the 0.01 significance level.

  3. We fail to reject the null hypothesis at the 0.01 significance level.

  4. We accept the null hypothesis at the 0.01 significance level.

  5. We fail to reject the null hypothesis at the 0.05 significance level.

  6. We reject the null hypothesis at the 0.05 significance level.

q1_8 = ...
grader.check("q1_8")

Question 1.9. Suppose in this question you had shuffled the 'price_per_mile' column instead and kept the 'app' column in the same order. Assign q1_9 to either 1, 2, 3, or 4, corresponding to the true statement below.

  1. The new p-value from shuffling 'price_per_mile' would be $1 - p$, where $p$ is the old p-value from shuffling 'app' (i.e. your answer to Question 1.7).

  2. We would need to change our null hypothesis in order to shuffle the 'price_per_mile' column.

  3. There would be no difference in the conclusion of the test if we had shuffled the 'price_per_mile' column instead.

  4. The 'price_per_mile' column cannot be shuffled because it contains numbers.

q1_9 = ...
grader.check("q1_9")

Question 1.10. Which of the following choices best describes the purpose of shuffling one of the columns in our dataset in a permutation test? Assign q1_10 to either 1, 2, 3, or 4.

  1. Shuffling allows us to generate new data under the null hypothesis, which we can use in testing our hypothesis.

  2. Shuffling mitigates noise in our data by generating new permutations of the data.

  3. Shuffling is a special case of bootstrapping and allows us to produce interval estimates.

  4. Shuffling allows us to generate new data under the alternative hypothesis, which explains that the data come from different distributions.

q1_10 = ...
grader.check("q1_10")

2. Cerealsly Sugary Percentiles 🥣

Percentiles associate numbers in a dataset to their positions when the dataset is sorted in ascending order. You may be familiar with the idea of percentiles from height and weight measurements at the doctor's office, or from standardized test scores.

There are many different ways to precisely define a percentile. In Lecture 19, we saw two different approaches: a mathematical definition, and the function np.percentile.

In Questions 2.1 through 2.4, we will use the mathematical definition, and in Question 2.5, we will use np.percentile.
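To see how the two approaches can differ, here is a toy comparison on made-up numbers (not the cereal data):

```python
import numpy as np

data = np.array([10, 30, 50, 70, 90])     # sorted, n = 5

# Mathematical definition: k is the smallest integer greater than
# (p/100) * n, and the percentile is element k of the sorted data
# (counting from 1).
p = 30
k = int(np.ceil(p / 100 * len(data)))     # 1.5 rounds up to k = 2
math_percentile = data[k - 1]             # element 2 of the sorted data

# np.percentile linearly interpolates by default, so it can return a
# value that does not appear in the data at all.
numpy_percentile = np.percentile(data, p)
```

Here math_percentile is 30, an actual data value, while numpy_percentile is 34.0, a value interpolated between 30 and 50.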

The file cereal.csv contains some nutritional information on different breakfast cereals. The data comes from Kaggle. The 'mfr' column uses abbreviations for the manufacturer:

  • 'A': American Home Food Products

  • 'G': General Mills

  • 'K': Kelloggs

  • 'N': Nabisco

  • 'P': Post

  • 'Q': Quaker Oats

  • 'R': Ralston Purina

Other columns are:

  • 'name': the name of the cereal

  • 'type': 'C' for cold cereals, 'H' for hot cereals

  • 'calories': calories per serving

  • 'protein': grams of protein

  • 'fat': grams of fat

  • 'sodium': milligrams of sodium

  • 'fiber': grams of dietary fiber

  • 'carbo': grams of complex carbohydrates

  • 'sugars': grams of sugars

  • 'potass': milligrams of potassium

  • 'vitamins': vitamins and minerals, as 0, 25, or 100, indicating the typical percentage of the FDA recommended amount

  • 'shelf': display shelf (1, 2, or 3, counting from the floor)

  • 'weight': weight in ounces of one serving

  • 'cups': number of cups in one serving

  • 'rating': a nutritional rating of the cereal (higher means more nutritious)

cereal = bpd.read_csv('data/cereal.csv')
cereal

Question 2.1. Pick the best choice of bins below for a histogram showing the distribution of 'rating', then create the histogram.

Use one of the following:

  • rating_bins = np.arange(0, 50, 10)

  • rating_bins = np.arange(0, 120, 10)

  • rating_bins = np.arange(0, 150, 50)

  • rating_bins = np.arange(0, 1000, 10)

rating_bins = ...
# Now create a density histogram showing the distribution of rating using rating_bins.

Consider only the cereals that were manufactured by General Mills ('G') and have more than 7 grams of sugar.

general_mills_sugary_cereals = cereal[(cereal.get('mfr') == 'G') & (cereal.get('sugars') > 7)]
general_mills_sugary_cereals

Let's extract the 'rating' data for these cereals and store it as an array called general_mills_sugary_ratings. We'll sort the array, too.

general_mills_sugary_ratings = np.array(general_mills_sugary_cereals.get('rating'))
general_mills_sugary_ratings = np.sort(general_mills_sugary_ratings)
general_mills_sugary_ratings

Question 2.2. Calculate the 80th percentile of general_mills_sugary_ratings using the mathematical definition given in Lecture 19. That is:

  • Set n to be the number of elements in general_mills_sugary_ratings.

  • Set k to be the smallest integer greater than $\frac{80}{100} \cdot n$.

  • Assign the 80th percentile of the array general_mills_sugary_ratings to general_mills_sugary_80th.

You must use the variables provided for you when solving this problem. For this problem, do not use np.percentile.

n = ...
k = ...

# Don't change this line. In order to proceed, k needs to be stored as an int, not a float.
# This line is not changing the mathematical value of k, just how it is stored.
k = int(k)

general_mills_sugary_80th = ...
general_mills_sugary_80th
grader.check("q2_2")

Question 2.3. Now we'll compare the 80th percentile of the ratings of sugary General Mills cereals with the 80th percentile of the ratings of sugary cereals manufactured by all companies other than General Mills.

Create a DataFrame called sugary_cereals containing only the cereals whose 'mfr' is not 'G' and that have more than seven grams of sugar. Calculate the 80th percentile of ratings for these cereals using the same mathematical procedure, and assign to the variable absolute_difference the absolute difference between the 80th percentile of ratings for sugary General Mills cereals and the 80th percentile of ratings for all other sugary cereals.

As before, use the variables provided and do not use np.percentile.

Hint: Remember to sort the ratings using np.sort before computing percentiles.

sugary_cereals = ...
n_2 = ...
k_2 = ...
k_2 = int(k_2) # Don't change this.
sugary_80th = ...
absolute_difference = ...
absolute_difference
grader.check("q2_3")

Question 2.4. Say that General Mills wants to create a new sugary cereal for UCSD students called the "U Cereal SD" in honor of UCSD's record enrollment of 42,968 students this year. In a strange coincidence, the cereal got a nutritional rating of exactly 42.968!

Consider a new collection of values, containing all the values in general_mills_sugary_ratings, plus one more, 42.968:

new_collection = np.append(general_mills_sugary_ratings, 42.968)
new_collection = np.sort(new_collection)
new_collection

For what integer values of $p$ would we be able to say that this new collection of values has 42.968 as its $p$th percentile? Create a list called percentile_range of all integer values of $p$ such that the $p$th percentile of the new collection equals 42.968, according to the mathematical definition of percentile.

This is a math question, not a coding question. You should create the list percentile_range manually, by solving a math problem on paper and inputting your answer in the form of a Python list.

Do not use np.percentile.

percentile_range = ...
grader.check("q2_4")

Question 2.5. The first quartile of a numerical collection is the 25th percentile, the second quartile is the 50th percentile, and the third quartile is the 75th percentile. Quartiles are so named because they divide the collection into quarters.
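For instance, on a toy array of made-up numbers (not the cereal data), np.percentile computes all three quartiles directly:

```python
import numpy as np

# Quartiles of a toy dataset: the 25th, 50th, and 75th percentiles.
toy = np.array([1, 3, 5, 7, 9, 11, 13, 15])
toy_quartiles = [np.percentile(toy, q) for q in [25, 50, 75]]
# With linear interpolation these are 4.5, 8.0, and 11.5.
```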

Make a list called sugar_quartiles that contains the values for the first, second, and third quartiles (in that order) of the 'sugars' data provided in cereal. For this problem, calculate the percentiles using np.percentile.

sugar_quartiles = ...
sugar_quartiles
grader.check("q2_5")

3. Live Crystal Scoops 🔮

Over the last year, live crystal scoops have become popular on TikTok. There are TikTok pages that collect and sell crystals, which some believe have the power to heal both the body and the mind. These pages don't sell crystals individually, but rather they "scoop" a random collection of their inventory, put the collected crystals in a bag, and send that bag to the customer. What makes them live crystal scoops is that these pages typically livestream the act of scooping these crystals for every order they receive and include the order number in the stream, so that customers can verify that what they receive is actually what was scooped. For instance, @chloesmith.uk is one such page.

Last night, you were scrolling endlessly on TikTok, and came across crystal scooping livestreams by two accounts, Scoops by Shelly and Crystals by Cathy. Both are selling scoops for $29.99. Intrigued, you decide to order a scoop from Shelly, and in the livestream it seems that you pulled a hefty scoop. When your order is finally delivered, however, you're disappointed to find that the total weight of the crystals you received is much lower than what you expected given what you saw on the livestream. Should you have purchased a scoop from Cathy instead?

Question 3.1. Ideally, you want to determine the mean weight of all scoops from Scoops by Shelly. However, it's not feasible to do so, because her scoops are very expensive and she has many other customers. Instead, you will collect a sample of scoops to obtain a ____________ statistic to estimate this ____________ parameter.

Complete the sentence above by filling in the blanks. Set q3_1 to 1, 2, 3, or 4.

  1. population; sample

  2. sample; population

  3. test; population

  4. test; sample

q3_1 = ...
grader.check("q3_1")

Fortunately, you have an incredible crystal resource at your disposal, the Crystals Live Share Group on Facebook. You make a post and ask the members who've bought scoops from Scoops by Shelly and Crystals by Cathy to weigh their packages in grams. You're overwhelmed by the amazing community response and receive 80 different scoop weights in total from other buyers, 40 from Scoops by Shelly buyers and 40 from Crystals by Cathy buyers.

Let's look at all the data that you crowdsourced. Each entry in the 'Weight' column represents the weight of one scoop, in grams.

crystal_weights = bpd.read_csv('data/crystals.csv')
crystal_weights

Question 3.2. To start, we'll look at only the scoops in our sample from Scoops by Shelly. Below, assign shelly_scoops to a DataFrame with only the scoops from Scoops by Shelly. Then, assign shelly_mean to the mean weight of the Scoops by Shelly scoops in our sample.

shelly_scoops = ...
shelly_mean = ...
shelly_mean
grader.check("q3_2")

You're done! Or are you? You have a single estimate for the true mean weight of Shelly's scoops. However, you don't know how close that estimate is to the true mean weight of all of Shelly's scoops, or how much it could have varied if you'd had a different sample.

This is where the idea of resampling via bootstrapping comes in. Assuming that our sample resembles the population fairly well, we can resample from our original sample to produce more samples. From each of these resamples, we can produce another estimate for the true mean weight, which gives us a distribution of sample means that describes how the estimate might vary given different samples. We can then use this distribution to produce an interval that estimates the true mean weight of Shelly's scoops.
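The resampling loop described above can be sketched on made-up toy weights (the values, sample size, and names below are illustrative, not the crystal data):

```python
# Bootstrap sketch: resample with replacement from the original sample,
# using the same size as the original, and record each resample's mean.
import numpy as np

np.random.seed(42)
sample = np.array([100.0, 120.0, 95.0, 110.0, 105.0, 130.0, 90.0, 115.0])

boot_means = np.array([])
for i in np.arange(1000):
    resample = np.random.choice(sample, size=len(sample), replace=True)
    boot_means = np.append(boot_means, resample.mean())

# An approximate 95% interval comes from the middle 95% of the estimates.
lower = np.percentile(boot_means, 2.5)
upper = np.percentile(boot_means, 97.5)
```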

Question 3.3. Complete the following code to produce 1000 bootstrapped estimates for the mean weight of Shelly's scoops. Store your 1000 estimates in an array called resample_means.

resample_means = ...
for i in np.arange(1000):
    resample = ...
    resample_mean = ...
    resample_means = ...
resample_means
grader.check("q3_3")

Let's look at the distribution of your estimates:

bpd.DataFrame().assign(BootstrappedMeans = resample_means).plot(kind='hist', density=True, ec='w', bins=20, figsize=(10, 5));

Question 3.4. Using the array resample_means, compute an approximate 95% confidence interval for the true mean weight of Shelly's scoops. Save the lower and upper bounds of the interval as shelly_lower_bound and shelly_upper_bound, respectively.

Hint: Use np.percentile.

shelly_lower_bound = ...
shelly_upper_bound = ...

#: the confidence interval
print("Bootstrapped 95% confidence interval for the true mean weight of Shelly's scoops: [{:f}, {:f}]".format(shelly_lower_bound, shelly_upper_bound))
grader.check("q3_4")

Question 3.5. Which of the following would likely make the histogram from Question 3.3 wider? If you believe more than one would, choose the answer with the most substantial effect. Assign to q3_5 either 1, 2, 3, or 4.

  1. Starting with a larger sample of 100 scoops.

  2. Starting with a smaller sample of 20 scoops.

  3. Decreasing the number of resamples (repetitions of the bootstrap) to 500.

  4. Increasing the number of resamples (repetitions of the bootstrap) to 3000.

q3_5 = ...
q3_5
grader.check("q3_5")

Question 3.6. Suppose you want to estimate the weight of the lightest scoop Shelly has ever scooped, her biggest scam. Would bootstrapping be effective in estimating this weight? Assign bootstrapping_effective to either True or False, representing your answer.

bootstrapping_effective = ...
grader.check("q3_6")

Question 3.7. Now let's address a different question: how does the average weight of a Scoops by Shelly scoop compare to the average weight of a Crystals by Cathy scoop? Create a DataFrame called cathy_scoops that contains only the weights of scoops from Crystals by Cathy, and set cathy_mean equal to the mean weight of Cathy's scoops, as you did for Scoops by Shelly in Question 3.2. Then, set observed_diff_mean to the difference in mean scoop weight between Shelly's and Cathy's scoops in our sample.

$$\text{difference} = \text{mean weight of Shelly's scoops} - \text{mean weight of Cathy's scoops}$$

cathy_scoops = ...
cathy_mean = ...
observed_diff_mean = ...
observed_diff_mean
grader.check("q3_7")

If you completed Question 3.7 correctly, you should have found that Shelly and Cathy's mean scoop weights were quite different. Remember, all we have access to are samples of size 40 from each seller. Would we see a difference this large if we had access to the population – that is, the weights of all scoops ever produced by both sellers – or was it just by chance that our samples displayed this difference? Let's do a hypothesis test to find out. We'll state our hypotheses as follows:

  • Null Hypothesis: The mean weight of scoops from Scoops by Shelly is equal to the mean weight of scoops from Crystals by Cathy. Equivalently, the difference in the mean scoop weight for the two sellers equals 0 grams.

  • Alternative Hypothesis: The mean weight of scoops from Scoops by Shelly is not equal to the mean weight of scoops from Crystals by Cathy. Equivalently, the difference in the mean scoop weight for the two sellers does not equal 0 grams.

Since we were able to set up our hypothesis test as a question of whether a certain population parameter – the difference in mean scoop weight for Scoops by Shelly and Crystals by Cathy – is equal to a certain value, we can test our hypotheses by constructing a confidence interval for the parameter. This is the method we used in Lecture 20. You can read more about conducting a hypothesis test with a confidence interval in CIT 13.4.

Note: We are not conducting a permutation test here, although that would also be a valid approach to test these hypotheses.
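The decision rule that links a confidence interval to such a hypothesis test can be sketched with hypothetical endpoints (the numbers below are made up, not a computed interval):

```python
# CI-based test sketch: with a 95% confidence interval for the difference,
# reject the null "difference = 0" at the 0.05 level exactly when 0 falls
# outside the interval.
left, right = 3.2, 9.8             # hypothetical interval endpoints
reject = not (left <= 0 <= right)  # 0 is outside [3.2, 9.8], so reject
```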

Question 3.8. Compute 1000 bootstrapped estimates for the difference in the mean scoop weight for Scoops by Shelly and Crystals by Cathy. As in Question 3.7, do Shelly minus Cathy. Store your 1000 estimates in the difference_means array.

You should generate your Shelly resamples by sampling from shelly_scoops, and your Cathy resamples by sampling from cathy_scoops. You should not use crystal_weights at all.

np.random.seed(23) # Ignore this, and don't change it.
difference_means = ...

# Just display the first ten differences.
difference_means[:10]
grader.check("q3_8")

Let's visualize your estimates:

bpd.DataFrame().assign(BootstrappedDifferenceMeans = difference_means).plot(kind = 'hist', density=True, ec='w', bins=20, figsize=(10, 5));

Question 3.9. Compute a 95% confidence interval for the difference in mean weights of Shelly and Cathy's scoops (as before, Shelly minus Cathy). Assign the left and right endpoints of this confidence interval to left_endpoint and right_endpoint respectively. Use np.percentile to find the endpoints.

left_endpoint = ...
right_endpoint = ...

print("Bootstrapped 95% confidence interval for the mean difference in weights of Shelly and Cathy's scoops:\n [{:f}, {:f}]".format(left_endpoint, right_endpoint))
grader.check("q3_9")

Question 3.10. Based on the confidence interval you've created, would you reject the null hypothesis at the 0.05 significance level? Set reject_null to True if you would reject the null hypothesis, and False if you would not.

reject_null = ...
grader.check("q3_10")

Question 3.11. What if the Facebook group members had recorded all of their scoop weights in pounds instead of grams? Would your hypothesis test still come to the same conclusion either way? Set same_conclusion to True or False.

same_conclusion = ...
grader.check("q3_11")

4. Cheese, Please 🧀

You work for a small grocery store. You are interested in determining which types of cheese to stock in your store, so you survey 500 randomly-selected shoppers and ask which type of cheese they prefer the most among four options – 'Brie', 'Cheddar', 'Feta', 'Mozzarella'. You also record some indecisive shoppers as 'Undecided'.

Run the next cell to load in the results of the survey.

cheese = bpd.read_csv('data/cheese.csv')
cheese.reset_index().groupby('cheese').count()

Assume that your sample is a uniform random sample of the population of grocery store shoppers. Below, we compute the proportion of shoppers in your sample that prefer each type of cheese.

cheese.assign(counts=cheese.get('cheese')).groupby('cheese').count().get('counts') / cheese.shape[0]

What you're truly interested in, though, is the proportion of all shoppers that prefer each type of cheese. These are population parameters (plural, because there are 5 proportions).

In this question, we will start by computing a confidence interval for the true proportion of shoppers that prefer 'Mozzarella', and then later compute a confidence interval for the true difference in proportions of shoppers that prefer 'Mozzarella' over 'Brie'.

Below, we have given you code that computes 1000 bootstrapped estimates of the true proportion of shoppers who prefer 'Mozzarella' cheese over the other options. Run the next cell to calculate these estimates and display a histogram of their values.

def proportions_in_resamples():
    np.random.seed(55) # Ignore this, and don't change it.
    num_shoppers = cheese.shape[0]
    proportions = np.array([])
    for i in np.arange(1000):
        resample = cheese.sample(num_shoppers, replace=True)
        resample_proportion = np.count_nonzero(resample.get('cheese') == 'Mozzarella') / num_shoppers
        proportions = np.append(proportions, resample_proportion)
    return proportions

boot_mozzarella_proportions = proportions_in_resamples()
bpd.DataFrame().assign(Estimated_Proportion_Mozzarella=boot_mozzarella_proportions).plot(kind='hist', density=True, ec='w', figsize=(10, 5));

Question 4.1. Using the array boot_mozzarella_proportions, compute an approximate 99% (not 95%) confidence interval for the true proportion of shoppers who prefer 'Mozzarella' cheese. Compute the lower and upper ends of the interval, named mozzarella_lower_bound and mozzarella_upper_bound, respectively.

Note: As we did in lecture, use np.percentile whenever computing confidence intervals.

mozzarella_lower_bound = ...
mozzarella_upper_bound = ...

# Print the confidence interval:
print("Bootstrapped 99% confidence interval for the true proportion of shoppers who prefer Mozzarella cheese in the population:\n[{:f}, {:f}]".format(mozzarella_lower_bound, mozzarella_upper_bound))
grader.check("q4_1")

Question 4.2. Is it true that 99% of the population lies in the range mozzarella_lower_bound to mozzarella_upper_bound? Assign the variable q4_2 to either True or False.

q4_2 = ...
grader.check("q4_2")

Question 4.3. Is it true that the true proportion of shoppers who prefer 'Mozzarella' over the other cheeses is a random quantity with approximately a 99% chance of falling between mozzarella_lower_bound and mozzarella_upper_bound? Assign the variable q4_3 to either True or False.

q4_3 = ...
grader.check("q4_3")

Question 4.4. Suppose we were somehow able to produce 20,000 new samples, each one a uniform random sample of 500 shoppers taken directly from the population. For each of those 20,000 new samples, we create a 99% confidence interval for the proportion of shoppers who prefer 'Mozzarella'. Roughly how many of those 20,000 intervals should we expect to actually contain the true proportion of the population? Assign your answer to the variable how_many below. It should be of type int, representing the number of intervals, not the proportion or percentage.

how_many = ...
how_many
grader.check("q4_4")

Question 4.5. We also created 90%, 95%, and 99.9% confidence intervals from one sample (shown below), but forgot to label which confidence interval is which! Match each interval to the confidence level it represents and assign your choices (either 1, 2, or 3) to the variables ci_90, ci_95, and ci_999, corresponding to the 90%, 95%, and 99.9% confidence intervals, respectively.

Hint: Drawing the confidence intervals out on paper might help you visualize them better.

  1. [0.259, 0.389]

  2. [0.286, 0.358]

  3. [0.280, 0.362]

ci_90 = ...
ci_95 = ...
ci_999 = ...
ci_90, ci_95, ci_999
grader.check("q4_5")

Question 4.6. Based on the survey results shown at the start of the question, it seems that 'Mozzarella' is more popular than 'Brie' among shoppers. We would like to construct a range of likely values – that is, a confidence interval – for the difference in popularity, which we define as:

$$\text{(Proportion of shoppers who prefer Mozzarella)} - \text{(Proportion of shoppers who prefer Brie)}$$

Create a function, differences_in_resamples, that creates 1000 bootstrapped resamples of the original survey data in the cheese DataFrame, computes the difference in proportions for each resample, and returns an array of these differences. Store your bootstrapped estimates in an array called boot_differences and plot a histogram of these estimates.

Note: While this might sound like a job for permutation testing, this is instead a bootstrapping question. Note that our goal is to estimate a population parameter – the difference between the proportion of all shoppers that prefer Mozzarella and the proportion of all shoppers that prefer Brie – not to answer a question about whether two samples come from the same distribution.

Hint: Use the code for proportions_in_resamples given to you above as a starting point.

def differences_in_resamples():
    np.random.seed(55) # Ignore this, and don't change it.
    ...

boot_differences = ...

# Plot a histogram of boot_differences.
grader.check("q4_6")

Question 4.7. Compute an approximate 99% confidence interval for the difference in proportions. Assign the lower and upper bounds of the interval to diff_lower_bound and diff_upper_bound, respectively.

diff_lower_bound = ...
diff_upper_bound = ...

# Print the confidence interval:
print("Bootstrapped 99% confidence interval for the difference in popularity between Mozzarella and Brie:\n[{:f}, {:f}]".format(diff_lower_bound, diff_upper_bound))
grader.check("q4_7")

Question 4.8. In this question, you computed two 99% confidence intervals:

  • In Question 4.1, you found a 99% confidence interval for the proportion of shoppers who prefer 'Mozzarella' among the four cheese options. Let's call this the "mozzarella CI."

  • In Question 4.7, you found a 99% confidence interval for the difference between the proportion of shoppers who prefer 'Mozzarella' and the proportion of shoppers who prefer 'Brie'. Let's call this the "difference CI."

Which of the explanations below best describes the widths of these two confidence intervals? Set q4_8 to either 1, 2, 3, or 4.

  1. The mozzarella CI is wider than the difference CI because we have more certainty in an estimate of a single unknown parameter than in the difference between two unknown parameters.

  2. The mozzarella CI is wider than the difference CI because we have less certainty in an estimate of a single unknown parameter than in the difference between two unknown parameters.

  3. The mozzarella CI is narrower than the difference CI because we have more certainty in an estimate of a single unknown parameter than in the difference between two unknown parameters.

  4. The mozzarella CI is narrower than the difference CI because we have less certainty in an estimate of a single unknown parameter than in the difference between two unknown parameters.

q4_8 = ...
grader.check("q4_8")

Finish Line 🏁

Congratulations! You are done with Homework 6 – the second-to-last homework of the quarter!

To submit your assignment:

  1. Select Kernel -> Restart & Run All to ensure that you have executed all cells, including the test cells.

  2. Read through the notebook to make sure everything is fine and all tests passed.

  3. Run the cell below to run all tests, and make sure that they all pass.

  4. Download your notebook using File -> Download as -> Notebook (.ipynb), then upload your notebook to Gradescope.

# For your convenience, you can run this cell to run all the tests at once!
grader.check_all()