GitHub Repository: dsc-courses/dsc10-2022-fa
Path: blob/main/homeworks/hw07/hw07.ipynb
Kernel: Python 3 (ipykernel)

Homework 7: Confidence Intervals, the Normal Distribution, and the Central Limit Theorem

Due Tuesday, November 22nd at 11:59PM

Welcome to Homework 7! This week, we will cover confidence intervals, the normal distribution, and the Central Limit Theorem. You can find additional help on these topics in the following readings:

  • CIT 13.3: Confidence Intervals

  • CIT 13.4: Using Confidence Intervals

  • CIT 14.2: Variability, Standard Deviation, Standard Units, Chebyshev's Bounds

  • CIT 14.3: The Standard Deviation (SD) and the Normal Curve

  • CIT 14.4: The Central Limit Theorem

  • CIT 14.5: The Variability of the Sample Mean

  • CIT 14.6: Choosing a Sample Size

Instructions

This assignment is due Tuesday, November 22nd at 11:59PM. You are given six slip days throughout the quarter to extend deadlines. See the syllabus for more details. With the exception of using slip days, late work will not be accepted unless you have made special arrangements with your instructor.

Important: For homeworks, the otter tests don't usually tell you that your answer is correct. More often, they help catch careless mistakes. It's up to you to ensure that your answer is correct. If you're not sure, ask someone (not for the answer, but for some guidance about your approach). These are great questions for office hours (see the schedule on the Calendar) or EdStem. Directly sharing answers is not okay, but discussing problems with the course staff or with other students is encouraged.

# Please don't change this cell, but do make sure to run it
import babypandas as bpd
import numpy as np

import matplotlib.pyplot as plt
plt.style.use('ggplot')

import otter
grader = otter.Notebook()

from IPython.display import IFrame

def show_clt_slides():
    src = "https://docs.google.com/presentation/d/e/2PACX-1vTcJd3U1H1KoXqBFcWGKFUPjZbeW4oiNZZLCFY8jqvSDsl4L1rRTg7980nPs1TGCAecYKUZxH5MZIBh/embed?start=false&loop=false&delayms=3000"
    width = 700
    height = 370
    display(IFrame(src, width, height))

1. Comparing Video Game Sales 🎮

Suppose you're a big video game fan, and you're bored of playing all the games you have, so it's time for a change. You and your friends agree to play only one video game genre for the next few weeks, but you're unsure which genre to choose. Luckily, you have a data set on video game sales, which includes each game's genre. You're interested in seeing which genres have the highest sales, since these are probably the more popular genres.

The DataFrame below corresponds to a sample of video games. Each row corresponds to a particular video game. We have information on the 'Name' of the game, the 'Platform' it's played on, the 'Genre', the 'Publisher', and the 'Sales' in millions of dollars. Now it's time to analyze the popularity of each genre of video game!

vg_sales = bpd.read_csv('data/vgsales.csv')
vg_sales

Question 1.1. Let's start by determining the mean sales for each genre. Create a DataFrame called genre_means, indexed by 'Genre', with a 'Sales' column that contains the mean sales for each genre, in millions of dollars. Sort the genres in descending order of 'Sales'.

genre_means = ...
genre_means
grader.check("q1_1")

Question 1.2. The 'Platform' genre (not to be confused with the 'Platform' column!) seems to have a pretty high mean sales figure based on the data we have access to. However, the data we have access to is only a sample of all video games ever created, and thus the mean sales figure for the 'Platform' genre computed above is only a sample statistic, not a population parameter.

Produce 1,000 bootstrapped estimates for the mean sales of all games in the genre 'Platform', in millions of dollars. Store the estimates in the platform_averages array. Then, use the platform_averages array to calculate an approximate 99% confidence interval for the true mean sales, in millions of dollars. Assign the endpoints of your interval to lower_bound and upper_bound.

platform_averages = ...
lower_bound = ...
upper_bound = ...

# Display the estimates in a histogram.
bpd.DataFrame().assign(Estimated_Average_Sales=platform_averages).plot(kind='hist', density=True, ec='w', figsize=(10, 5), title="Platform");
plt.plot([lower_bound, upper_bound], [0, 0], color='gold', linewidth=10, label='99% confidence interval');

# Don't change the lines below (though you will need to copy and change them in 1.3)
genre_name = 'Platform'
f'A 99% confidence interval for average sales of {genre_name} video games is [{lower_bound}, {upper_bound}]'
grader.check("q1_2")

Question 1.3. You want to create a similar histogram for each of the other genres, and also calculate the corresponding confidence intervals, but repeating the process above 11 times would be time-consuming.

Create a function called ci_and_hist, which takes in a video game genre as a string, and:

  1. Plots the histogram of 1,000 bootstrapped estimates for the genre's mean sales.

  2. Returns a string describing the approximate 99% confidence interval for the genre's mean sales, formatted in the same way as the string displayed for 'Platform' in Question 1.2.

Start with the code from 1.2 and generalize it to work for any genre.

Notes:

  • Make sure your function both plots a histogram and returns a string. For example, ci_and_hist('Racing') should return a string that starts with 'A 99% confidence interval for average sales of Racing video games is'.

  • The string displayed at the end of 1.2 was created using a feature of Python called f-strings. You'll need to copy and change that f-string expression. Read this article for more details about f-strings.
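As a quick illustration of f-strings on their own (using a hypothetical genre name and placeholder interval endpoints, not values computed from the data), expressions inside braces are evaluated and substituted into the string:

```python
genre_name = 'Sports'    # hypothetical genre, just for illustration
lower, upper = 0.5, 1.5  # placeholder endpoints, not computed from the data

# The f prefix makes Python substitute the values of the braced expressions.
message = f'A 99% confidence interval for average sales of {genre_name} video games is [{lower}, {upper}]'
print(message)
```

Your function will build a string like this one, but with the genre name and bootstrapped interval endpoints for whichever genre is passed in.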

def ci_and_hist(genre_name):
    ...

# Example calls to the function. Don't change the lines below.
fighting_string = ci_and_hist('Fighting')
print(fighting_string)
racing_string = ci_and_hist('Racing')
print(racing_string)
grader.check("q1_3")

Question 1.4. Your friend claims that games of the 'Strategy' genre are actually more popular than the data depicts. In our sample, the mean sales for the 'Strategy' genre is about 0.26 million. She claims that since our sales data is only a sample of the full population of games, the actual mean sales for the 'Strategy' genre could be 0.36 million. You decide to perform a hypothesis test for the following pair of hypotheses:

  • Null Hypothesis: The mean sales for the 'Strategy' genre is 0.36 million.

  • Alternative Hypothesis: The mean sales for the 'Strategy' genre is not 0.36 million.

Run the cell below to use the ci_and_hist function you defined above to calculate an approximate 99% confidence interval for the mean sales of the 'Strategy' genre.

ci_and_hist('Strategy')

Do you reject the null hypothesis at a 0.01 p-value cutoff? Assign 1, 2, 3, or 4 to q1_4.

  1. No, because the confidence interval includes 0.36.

  2. No, because the confidence interval doesn't include 0.36.

  3. Yes, because the confidence interval includes 0.36.

  4. Yes, because the confidence interval doesn't include 0.36.

q1_4 = ...
grader.check("q1_4")

2. Testing the Central Limit Theorem: Coin Flips and Midterm Scores 💯

The Central Limit Theorem tells us that the probability distribution of the sum or mean of a large random sample drawn with replacement is roughly normal, regardless of the distribution of the population from which the sample is drawn.

That's a pretty big claim, but the theorem doesn't stop there. It further states that, if we're using the mean as our statistic, the standard deviation of this normal distribution is given by

$$\text{SD of Distribution of Possible Sample Means} = \frac{\text{Population SD}}{\sqrt{\text{sample size}}}$$

In other words, suppose we start with any distribution that has standard deviation $\sigma$, take a sample of size $n$ (where $n$ is a large number) from that distribution with replacement, and compute the mean of that sample. If we repeat this procedure many times, then those sample means will have a normal distribution with standard deviation $\frac{\sigma}{\sqrt{n}}$.
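As a minimal sketch of this claim (separate from the graded exercises, using a made-up uniform population), we can simulate many sample means and compare their SD to $\frac{\sigma}{\sqrt{n}}$:

```python
import numpy as np

np.random.seed(42)  # for reproducibility

# A decidedly non-normal population: 100,000 draws from a uniform distribution.
population = np.random.uniform(0, 1, 100_000)
sigma = np.std(population)

# Repeatedly draw samples of size n = 100 and record each sample's mean.
n = 100
sample_means = np.array([np.random.choice(population, n).mean() for _ in range(2000)])

# The CLT predicts that the SD of the sample means is sigma / sqrt(n).
predicted_sd = sigma / np.sqrt(n)
observed_sd = np.std(sample_means)
print(predicted_sd, observed_sd)
```

The two printed numbers should be very close, even though the population itself is nowhere near normal.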

That's an even bigger claim than the first one! The proof of the theorem is beyond the scope of this class, but we've seen examples in lecture of this formula in action, such as when we looked at flight delay data.

Run the cell below to see a short presentation that describes the CLT at a high level.

show_clt_slides()

In this exercise, we will be exploring some data to see the CLT in action.

Question 2.1. The CLT only applies when sample sizes are "sufficiently large." This isn't a very precise statement. Is 10 large? How about 50? The truth is that it depends both on the original population distribution and just how "normal" you want the result to look. Let's use a simulation to get a feel for how the distribution of the sample mean changes as the sample size increases.

Consider a coin flip. If we say heads is 1 and tails is 0, then there's a 50% chance of getting a 1 and a 50% chance of getting a 0, which is definitely not a normal distribution. The mean of these 1s and 0s for several coin tosses is equal to the proportion of heads in those coin tosses, so the CLT should apply if we compute the sample proportion of heads many times.
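For instance (a tiny sketch, not the function you'll write below), the mean of an array of 1s and 0s is exactly the proportion of 1s:

```python
import numpy as np

np.random.seed(23)  # for reproducibility

# Ten flips of a fair coin: 1 = heads, 0 = tails.
flips = np.random.choice([1, 0], 10)

# The mean of the 1s and 0s equals the proportion of heads.
print(flips.mean() == flips.sum() / len(flips))  # prints True
```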

Write a function called simulate_sample_n that takes in a sample size n. It should repeat, 5000 times, the process of:

  • simulating n flips of a fair coin, and

  • counting the proportion of flips that were heads.

simulate_sample_n should return an array that contains 5000 sample proportions, using the process outlined above.

def simulate_sample_n(n):
    ...

simulate_sample_n(5)
grader.check("q2_1")

The code below will use the function you just defined to plot the empirical distribution of the sample mean for several different sample sizes. We saw something similar in Lecture 22.

bins = np.arange(-0.01, 1.05, 0.02)
for sample_size in np.array([2, 5, 10, 20, 50, 100, 200, 400]):
    bpd.DataFrame().assign(**{'Sample_Size:{}'.format(sample_size): simulate_sample_n(sample_size)}) \
        .plot(kind='hist', density=True, ec='w', bins=bins, title=f'Sample Size {sample_size}', legend=None, figsize=(5, 3));
    plt.xlim(-0.01, 1.05)
    plt.ylim(0, 25);

You can see that even for samples of size 10, the distribution of sample proportions looks roughly bell-shaped. When we increase the sample size to 50, the resulting distribution looks quite bell-shaped. Note also that as the sample size increases, the distributions of sample proportions become narrower.

Now we will test the second claim of the CLT: that the SD of the distribution of the sample mean is the SD of the original distribution, divided by the square root of the sample size.

$$\text{SD of Distribution of Possible Sample Means} = \frac{\text{Population SD}}{\sqrt{\text{sample size}}}$$

Below, we will read in the scores of this quarter's Midterm Exam (which we have modified slightly for anonymity). We'll treat this DataFrame as our population, and we'll take samples directly from it. We've computed the standard deviation of the midterm scores for you; you will need to use it in the next question.

midterm = bpd.read_csv('data/fa22-midterm-scores.csv')
midterm
midterm_std = np.std(midterm.get('Score'))
midterm_std

Question 2.2. Write a function called predict_sd that takes in a sample size n. It returns the predicted standard deviation (according to the CLT) of the sample mean's distribution, for samples of size n taken from the midterm data.

Hint: Do not use simulate_sample_n.

def predict_sd(n):
    ...

predict_sd(10)
grader.check("q2_2")

Question 2.3. Write a function called empirical_sd that takes in a sample size n, draws 1,000 samples of size n from the midterm scores data set with replacement, and returns the standard deviation of the distribution of the sample means of those 1,000 samples.

Hint: This function will be similar to the simulate_sample_n function you wrote earlier.

def empirical_sd(n):
    sample_means = np.array([])
    ...
    return np.std(sample_means)

empirical_sd(10)
grader.check("q2_3")

The cell below will plot the predicted SDs (computed by your predict_sd function) and empirical SDs (computed by your empirical_sd function) for various sample sizes. It may take a few moments to run.

sd_df = bpd.DataFrame().assign(Sample_Size=np.arange(1, 101, 10))
predicted = sd_df.get('Sample_Size').apply(predict_sd)
empirical = sd_df.get('Sample_Size').apply(empirical_sd)
sd_df = sd_df.assign(Predicted_SD=predicted, Empirical_SD=empirical)

ax = sd_df.plot(kind='scatter', x='Sample_Size', y='Empirical_SD', label='Empirical_SD', color='red', alpha=0.6, s=200, figsize=(10, 5));
ax = sd_df.plot(kind='scatter', x='Sample_Size', y='Predicted_SD', label='Predicted_SD', color='blue', alpha=0.6, s=200, ax=ax)
ax.set_ylabel('Standard Deviation');

It appears that the formula $\text{SD of Distribution of Possible Sample Means} = \frac{\text{Population SD}}{\sqrt{\text{sample size}}}$ matches what we see in practice!

3. UCSD's Housing Crisis 🏠

In April 2021, UCSD's Housing Dining Hospitality (HDH) removed triple occupancy dorm rooms and eliminated its two-year housing guarantee. With enrollments rising to a record-breaking 42,875 in Fall 2021 (and even higher in Fall 2022), this led to a housing crisis in which many students struggled to secure housing for the 2021-22 school year. In response, UCSD

  • directed students to an off-campus housing website,

  • hosted an off-campus housing webinar, and

  • offered an option for students to live in local hotels at a discounted rate.

A data scientist at UCSD wanted to see if students were actually satisfied with the solutions UCSD provided. She polled a uniform random sample of all UCSD students, and determined that 210 of the 700 sampled students thought UCSD's solutions were satisfactory.

# Run this cell, but don't change it.
survey = bpd.DataFrame().assign(
    Opinion=np.array(["Satisfactory", "Unsatisfactory"]),
    Count=np.array([210, 490]))
sample_size = survey.get("Count").sum()
survey_results = survey.assign(
    Proportion=survey.get("Count") / sample_size)
survey_results

Next, she used 1,000 bootstrap resamples to compute a confidence interval for the proportion of all UCSD students who found the solutions satisfactory. Run the next cell to see the empirical distribution of 'Satisfactory' proportions in the 1,000 resamples.

Note that we're using np.random.multinomial to do the resampling here, since each element of the resample is either 1 (satisfactory) or 0 (unsatisfactory) with known probabilities. This accomplishes the same thing as using .sample with replace=True, but is much faster.
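To see what np.random.multinomial does in isolation (a small sketch using the survey's own proportions, 0.3 and 0.7): it draws a count for each category, and dividing the counts by the sample size gives resampled proportions:

```python
import numpy as np

np.random.seed(8)  # for reproducibility

# One resample of 700 opinions, where P(satisfactory) = 0.3 and P(unsatisfactory) = 0.7.
counts = np.random.multinomial(700, np.array([0.3, 0.7]))

# counts holds [# satisfactory, # unsatisfactory]; the two entries always total 700.
resampled_proportions = counts / 700
print(counts, resampled_proportions)
```

Each pass through the loop below does exactly this, keeping only the 'Satisfactory' proportion (element 0).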

boot_proportions = np.array([])
for i in np.arange(1000):
    resample = np.random.multinomial(sample_size, survey_results.get('Proportion')) / sample_size
    boot_proportions = np.append(boot_proportions, resample[0])

bpd.DataFrame().assign(boot_proportions=boot_proportions).plot(kind='hist', density=True, ec='w', bins=np.arange(0.15, 0.45, .01), figsize=(10, 5));

Recall, the Central Limit Theorem says

$$\text{SD of Distribution of Possible Sample Means} = \frac{\text{Population SD}}{\sqrt{\text{sample size}}}$$

Furthermore, in any collection of numbers where the only unique values are 0 and 1, there is a simple formula for the standard deviation of the collection:

$$\text{SD of Collection of 0s and 1s} = \sqrt{(\text{Proportion of 0s in Collection}) \times (\text{Proportion of 1s in Collection})}$$

Note that samples and populations are both possible examples of "collections."

(You're not responsible for deriving this formula, but if you're curious, it's possible to do so just by using the definition of standard deviation and a little algebra!)
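You can also verify the formula numerically on a small made-up collection (a quick check, not part of the graded work):

```python
import numpy as np

# A collection with three 1s and seven 0s: the proportion of 1s is 0.3.
collection = np.array([1, 1, 1, 0, 0, 0, 0, 0, 0, 0])
p = collection.mean()  # proportion of 1s

# The formula: sqrt(proportion of 0s * proportion of 1s).
formula_sd = np.sqrt((1 - p) * p)
print(np.std(collection), formula_sd)  # both print about 0.458
```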

Question 3.1. Without accessing the data in boot_proportions in any way, compute an approximation of the standard deviation of the array boot_proportions and assign it to the variable approximate_sd.

Instead of using boot_proportions directly, use both the Central Limit Theorem and the standard deviation formula above. Since you don't know the true proportions of 0s and 1s in the population, use the proportions in the sample instead (since they're likely to be similar).

approximate_sd = ...
approximate_sd
grader.check("q3_1")

Question 3.2. Compute the actual standard deviation of the array boot_proportions. Your answer should be close to your answer from 3.1.

exact_sd = ...
exact_sd
grader.check("q3_2")

Question 3.3. Still without accessing boot_proportions in any way, compute an approximate 95% confidence interval for the proportion of students that found UCSD's solutions satisfactory. The cell below grader.check("q3_3") draws your interval in gold below the histogram of boot_proportions; use that to verify that your answer looks right.

Hint: In the past, we've used np.percentile on the array of bootstrapped estimates to find the bounds for the confidence interval. Now, we're not allowed to use the bootstrapped distribution, so we can't do it that way. But we don't need to! The Central Limit Theorem tells us that the distribution of the sample mean is normal with a certain standard deviation. We also know that 95% of the area of the normal distribution falls within a certain number of standard deviations from the mean.

lower_limit = ...
upper_limit = ...
# Your interval is: [lower_limit, upper_limit]
grader.check("q3_3")
# Run this cell to plot your confidence interval.
bpd.DataFrame().assign(boot_proportions=boot_proportions).plot(kind='hist', density=True, ec='w', bins=np.arange(0.15, 0.45, 0.01), figsize=(10, 5));
plt.plot([upper_limit, lower_limit], [0, 0], color='gold', linewidth=10, label='Normal CI');
plt.legend();

Your confidence interval should make it clear that we're pretty confident that relatively few students were satisfied by UCSD's solutions. This makes sense, as the proportion of 'Satisfactory' opinions in the sample was only 0.30.

The data scientist is considering redoing the survey with a larger sample to estimate the population proportion of 'Satisfactory' opinions with greater precision. She would be happy if the standard deviation of the distribution of the sample mean were 0.006 (or less). She'll need to take a new sample that's large enough to achieve that. Polling is time-consuming, so the sample also shouldn't be bigger than necessary.

Instead of making the conservative assumption that the population standard deviation is 0.5 (the largest possible SD of a collection of 0s and 1s), she decides to assume that it's equal to the standard deviation of her first sample. That is,

$$\text{Population SD} \approx \text{Sample SD} = \sqrt{(\text{Proportion of 0s in Sample}) \times (\text{Proportion of 1s in Sample})}$$

Under that assumption, she computes the smallest sample size necessary in order to be confident that the standard deviation of the distribution of the sample mean is at most 0.006.

Question 3.4. What sample size did she find? Assign your answer to the variable new_sample_size, which should be of type int.

Use the fact that

$$\text{SD of Distribution of Possible Sample Means} = \frac{\text{Population SD}}{\sqrt{\text{sample size}}}$$

Hints:

  • There is only one unknown in the equation above.

  • Think about how you should round your answer to satisfy the constraints of the problem.

new_sample_size = ...
new_sample_size
grader.check("q3_4")

Question 3.5. Suppose the data scientist wants to be even more precise and take a sample of sufficient size such that the standard deviation of the sample mean distribution is 0.0015. Is it possible for her to do this? Choose the best answer and explanation, then assign q3_5 to either 1, 2, 3, or 4.

  1. Yes. She can repeat the sample again until she comes across a sample with a standard deviation of 0.0015.

  2. Yes. Since 0.0015 is a quarter of 0.006, the required sample size is a fourth of new_sample_size.

  3. Yes. Since 0.0015 is a quarter of 0.006, the required sample size is four times new_sample_size.

  4. No, the sample size required to reach that sample mean standard deviation is larger than the number of students at UCSD.

q3_5 = ...
grader.check("q3_5")

By the way, UCSD eventually decided to partially reverse some of the decisions that led to the housing crisis. Now, in the 2022-23 school year, some triple-occupancy dorm rooms are again being used to house three students. UCSD even brought back the two-year housing guarantee!

4. Key Concepts 🔑

Question 4.1. How do we convert the value 116 to standard units if it comes from a data set where the mean is 133 and the standard deviation is 14? Assign q4_1 to either 1, 2, 3, or 4.

  1. $\dfrac{(133-116)^2}{14}$

  2. $\dfrac{116-133}{14}$

  3. $\dfrac{133-116}{14}$

  4. $\dfrac{116-133}{\sqrt{14}}$

q4_1 = ...
grader.check("q4_1")

Question 4.2. According to Chebyshev's inequality, for any data set, at least one quarter of the data falls within how many standard deviations of the mean? Assign the smallest correct answer to q4_2.

  1. 1.00

  2. 1.16

  3. 1.28

  4. 1.50

q4_2 = ...
grader.check("q4_2")

Question 4.3. Assign q4_3 to a list of all statements below that are always true.

  1. If we know the mean and SD of a distribution, we can calculate a 95% confidence interval by stepping out two standard deviations from the mean in either direction.

  2. An empirical histogram of the sample median of a large random sample drawn with replacement from a population will be roughly normal.

  3. An empirical histogram of the sample mean of a large random sample drawn with replacement from a population will be roughly normal.

  4. For any distribution, at least 68% of the data falls within two standard deviations of the mean.

  5. For any distribution, 68% of the data falls within one standard deviation of the mean.

q4_3 = ...
grader.check("q4_3")

Question 4.4. Consider drawing a large random sample with replacement from some population. Let $x$ be the sample size such that the standard deviation of the distribution of sample means is 0.04. What sample size is required to guarantee that the standard deviation of the distribution of sample means is no more than 0.01? Assign q4_4 to either 1, 2, 3, or 4.

  1. $2x$

  2. $4x$

  3. $8x$

  4. $16x$

q4_4 = ...
grader.check("q4_4")

Finish Line 🏁

Congratulations! You are done with Homework 7 – the final homework of the quarter! 🎉

To submit your assignment:

  1. Select Kernel -> Restart & Run All to ensure that you have executed all cells, including the test cells.

  2. Read through the notebook to make sure everything is fine and all tests passed.

  3. Run the cell below to run all tests, and make sure that they all pass.

  4. Download your notebook using File -> Download as -> Notebook (.ipynb), then upload your notebook to Gradescope.

grader.check_all()