GitHub Repository: dsc-courses/dsc10-2022-fa
Path: blob/main/labs/lab07/lab07.ipynb

Kernel: Python 3 (ipykernel)

Lab 7: Center, Spread, and the Normal Distribution

Due Saturday, November 19th at 11:59PM PST

Welcome to Lab 7! In this lab you will practice calculating variance and standard deviation, and converting values to standard units. You will use these skills to compare grades in a course and you will use Chebyshev’s inequality to predict how hard the students should work in order to rank in the top 5% of the class. Finally, you will use confidence intervals to help college administrators plan for next quarter by predicting the enrollment in a new course. The topics from this lab are covered in the following readings:

CIT 13.4: Hypothesis Tests for Parameters using Confidence Intervals
CIT 14.2: Variability, Standard Deviation, Standard units, Chebyshev's Bounds.
CIT 14.3: The Standard Deviation (SD) and the Normal Curve
CIT 14.4: The Central Limit Theorem
CIT 14.5: The Variability of the Sample Mean
CIT 14.6: Choosing a Sample Size

This lab is due Saturday, 11/19 at 11:59PM.

As usual, run the cell below to prepare the lab and the automatic tests.

In [ ]:

import numpy as np
import babypandas as bpd

import matplotlib.pyplot as plt
plt.style.use('ggplot')

import otter
grader = otter.Notebook()
%reload_ext pandas_tutor

0. Comparing Grades Using Standard Units

Two of your friends, Cathy and Sam, just took their midterms. Cathy took her BILD 1 midterm and Sam took his Math 18 midterm. Cathy received a B+ on her midterm (87%) and Sam received an A- (92%). Cathy claims that while she received a lower grade on her midterm, she actually did better (relative to the rest of the class) than Sam. Sam disagrees. Knowing that you are taking DSC 10, your two friends come to you to settle their argument.

They show you two DataFrames: bild_midterm and math_midterm that represent the grades for their classes. Both exams are out of 100 points. Each DataFrame has a column called 'Student' with student ID numbers and 'Score' with the midterm scores.

Note: You do not need to make any changes to the below cell. It is for you to visualize the two datasets.

In [ ]:

# Cathy's exam
bild_midterm = bpd.read_csv("data/bild1_scores.csv")
bild_midterm.plot(y='Score', kind='hist', density=True, bins=range(0, 101, 1), ec='w', title='Distribution of BILD 1 Midterm Scores')
cathy_score = 87
print("Cathy's Score: " + str(cathy_score))

# Sam's exam
math_midterm = bpd.read_csv("data/math18_scores.csv")
math_midterm.plot(y='Score', kind='hist', density=True, bins=range(0, 101, 1), ec='w', title='Distribution of Math 18 Midterm Scores')
sam_score = 92
print("Sam's Score: " + str(sam_score))

You know that instead of comparing their actual scores, you should first convert their scores into standard units. Recall from lecture that if $x$ is a numerical variable and $x_i$ is one value of that variable, the function $$z(x_i) = \frac{x_i - \text{mean of } x}{\text{SD of } x}$$ converts $x_i$ to **standard units**, which represent the number of standard deviations $x_i$ is above the mean.
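To make the definition concrete, here is a minimal sketch on a small, made-up array (`toy_scores` and the other names are hypothetical and are not part of the lab's data). It uses `np.std` only for illustration and shows a handy property: once values are in standard units, their mean is 0 and their SD is 1.

```python
import numpy as np

# Hypothetical scores, not taken from either midterm dataset.
toy_scores = np.array([60, 70, 80, 90, 100])

toy_mean = np.mean(toy_scores)
toy_sd = np.std(toy_scores)  # divides by n, matching the variance formula used in this lab

# Standard units: how many SDs each value sits above (+) or below (-) the mean.
toy_su = (toy_scores - toy_mean) / toy_sd

print("Standard units:", toy_su)
print("Mean of standard units:", np.mean(toy_su))
print("SD of standard units:", np.std(toy_su))
```

A negative value in standard units means the original value is below the mean; a positive value means it is above.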

To compute the midterm score in standard units for each friend, we need to:

Compute the average grade for each class. We will use the function np.mean to do this.
Compute the standard deviation (SD) of the midterm scores for each class. We could use np.std, but we will write our own function to do that.

Note that standard deviation is the square root of variance. So, we'll proceed by defining a function that computes the variance first. Recall that the variance is the mean squared deviation from the average:

$$\text{variance} = \frac{(\text{value}_1 - \text{average})^2 + (\text{value}_2 - \text{average})^2 + \dots + (\text{value}_n - \text{average})^2}{n}$$

where n is the number of values (e.g. number of exam scores, in our case).
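As a quick sanity check of the formula, consider three hypothetical values, 2, 4, and 6 (chosen for easy arithmetic, not taken from either dataset). Their average is 4, so with $n = 3$:

$$\text{variance} = \frac{(2 - 4)^2 + (4 - 4)^2 + (6 - 4)^2}{3} = \frac{4 + 0 + 4}{3} = \frac{8}{3} \approx 2.67$$

The standard deviation of these three values is then $\sqrt{8/3} \approx 1.63$.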

Question 0.1. Fill in the missing code to complete the function compute_variance. It takes as input an array of numbers (data) and returns the variance as a single number.

Then, use the compute_variance function to compute the variance of the two classes' midterm scores, and assign them to the two specified variable names.

Do not use np.std in your solution. Instead, use the above formula for variance as guidance.

Hint: To extract the values in a Series as an array, use .values on the Series.

In [ ]:

def compute_variance(data):
    average = ...
    diff = ...
    square_diff = ...
    sum_square_diff = ...
    n = ...
    variance = ...
    return variance

bild_midterm_var = ...
print("Variance of BILD 1 midterm: " + str(bild_midterm_var))

math_midterm_var = ...
print("Variance of Math 18 midterm: " + str(math_midterm_var))

In [ ]:

grader.check("q0_1")

Question 0.2. Now that we have a function that computes the variance, we want to write a function that computes the standard deviation. Fill in the missing code to complete the function compute_sd. It takes as input an array of numbers (data) and returns the standard deviation as a single number.

Then, use the compute_sd function to compute the standard deviation of scores of the two midterms.

Hint: Your implementation of compute_sd should take only one line, which uses both the return keyword and the function compute_variance.

In [ ]:

def compute_sd(data):
    ...

bild_midterm_sd = ...
print("Standard Deviation of BILD 1 midterm: " + str(bild_midterm_sd))

math_midterm_sd = ...
print("Standard Deviation of Math 18 midterm: " + str(math_midterm_sd))

In [ ]:

grader.check("q0_2")

Question 0.3. Now that you can compute the standard deviation, you are equipped to write a function that converts a given score to standard units. Fill in the missing code to complete the function compute_su. It takes in a score (score), the average score (avg), and the standard deviation (sd), and returns the score in standard units.

Then, use the compute_su function to transform the scores earned by each friend into standard units.

Warning: Be careful with the order of operations!

In [ ]:

def compute_su(score, avg, sd):
    standard_units = ...
    return standard_units

cathy_su = ...
print("Cathy's Score in Standard Units: " + str(cathy_su))

sam_su = ...
print("Sam's Score in Standard Units: " + str(sam_su))

In [ ]:

grader.check("q0_3")

Question 0.4. Cathy's score is higher than Sam's score when we convert to standard units, which can be seen as evidence that she did better on her exam relative to her classmates than Sam did relative to his.

Another way to measure their relative performances is to calculate, for both Cathy and Sam individually, the proportion of students they scored higher than (or the same as). Comparing Cathy's proportion to Sam's proportion will give us another way of measuring who did better relative to their classmates. Calculate Cathy's proportion and Sam's proportion below. (This will require looking at both bild_midterm and math_midterm.)
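The idea of "proportion scored higher than or the same as" can be sketched on a small made-up array. The names `toy_scores` and `toy_my_score` are hypothetical, and this is only one of several reasonable ways to count:

```python
import numpy as np

# Hypothetical class scores and a hypothetical student's score.
toy_scores = np.array([55, 70, 70, 82, 90, 95, 95, 100])
toy_my_score = 90

# Comparing an array to a number gives an array of True/False values;
# counting the True values and dividing by the total gives a proportion.
at_or_below = np.count_nonzero(toy_scores <= toy_my_score)
toy_proportion = at_or_below / len(toy_scores)

print("Proportion at or below", toy_my_score, "is", toy_proportion)
```

Unlike standard units, this measure depends only on the ordering of the scores, not on how far apart they are.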

In [ ]:

cathy_proportion = ...
print("Cathy's Proportion: " + str(cathy_proportion))

sam_proportion = ...
print("Sam's Proportion: " + str(sam_proportion))

In [ ]:

grader.check("q0_4")

1. Chebyshev's Bounds and Normal Curves

Let's look at the histograms of the scores of the two midterms again.

In [ ]:

bild_midterm.plot(y='Score', kind='hist', density=True, bins=range(0, 101, 1), ec='w', title='Distribution of BILD 1 Midterm Scores');
math_midterm.plot(y='Score', kind='hist', density=True, bins=range(0, 101, 1), ec='w', title='Distribution of Math 18 Midterm Scores');

Question 1.1. Which of the two graphs roughly resembles a normal curve? Assign the variable q1_1 to either 1, 2, 3, or 4.

1. Only the first graph (distribution of BILD 1 midterm scores) is roughly normal.
2. Only the second graph (distribution of Math 18 midterm scores) is roughly normal.
3. Both graphs are roughly normal.
4. Neither graph is roughly normal.

Remember all normal curves have the following characteristics:

The mean (average) is always in the center of a normal curve.
A normal curve has only one mode (peak).

In [ ]:

q1_1 = ...
q1_1

In [ ]:

grader.check("q1_1")

Question 1.2. By looking at the distribution of Math 18 midterm scores above, rank the following values in order from smallest to largest.

1. The mean score.
2. The median score.
3. The most common score (the mode).

Set variable q1_2 to a list containing the numbers 1, 2, 3 in the appropriate order. Don't compute any of these values manually!

In [ ]:

q1_2 = ...
q1_2

In [ ]:

grader.check("q1_2")

Recap: Chebyshev's inequality (i.e. Chebyshev's bounds)

Chebyshev's inequality states that no matter what the shape of the distribution is, the proportion of the values that fall in the range

$$\text{average} \pm z \text{ standard deviations}$$

is at least $1 - \frac{1}{z^{2}}$.

It's important to note that these are lower bounds, not approximations: at least 75% of the data is guaranteed to lie within plus or minus 2 standard deviations of the mean, but 100% of the data might also lie within plus or minus 2 standard deviations of the mean.

On the other hand...

If we know that our data forms a normal curve, the standard deviation is even more informative.

| Percent in Range | All Distributions (via Chebyshev's Inequality) | Normal Distributions |
| --- | --- | --- |
| $\text{average} \pm 1 \ \text{SD}$ | $\geq 0\%$ | $\approx 68\%$ |
| $\text{average} \pm 2 \ \text{SDs}$ | $\geq 75\%$ | $\approx 95\%$ |
| $\text{average} \pm 3 \ \text{SDs}$ | $\geq 88\%$ | $\approx 99.73\%$ |

Note that for a normal distribution, the numbers in the last column of the table above are approximations, not lower bounds.

If the distribution is perfectly normal, then about 68% of the data (not substantially more or less) will lie within one standard deviation of the mean.
Additionally, because a normal curve is symmetric, we know that about 34% of the data lies between the average and the average plus one standard deviation.
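The difference between a lower bound and an approximation can be checked with a short simulation. The sketch below is illustrative only: the two datasets are randomly generated (they are not the midterm scores), and `proportion_within`, the seed, and the distribution parameters are all assumptions made for this example.

```python
import numpy as np

rng = np.random.default_rng(10)  # arbitrary seed, for reproducibility


def proportion_within(data, z):
    """Proportion of values in data within z SDs of the mean of data."""
    mean = np.mean(data)
    sd = np.std(data)
    return np.mean(np.abs(data - mean) <= z * sd)


# Two simulated datasets: one bell-shaped, one strongly right-skewed.
normal_data = rng.normal(loc=70, scale=10, size=100_000)
skewed_data = rng.exponential(scale=10, size=100_000)

for z in [2, 3]:
    bound = 1 - 1 / z**2
    print(f"z = {z}: Chebyshev bound {bound:.3f}, "
          f"normal {proportion_within(normal_data, z):.3f}, "
          f"skewed {proportion_within(skewed_data, z):.3f}")
```

Both datasets should satisfy Chebyshev's bound, but only the bell-shaped one should land close to the percentages in the normal column.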

Question 1.3. Cathy, who is Pre-Med, really wanted to score in the top 5% of the class. But before taking the exam, she did not know if the scores would be normally distributed or not.

Without making any assumptions about the distribution of scores, how many standard deviations above the mean would she have needed to score to guarantee that she fell in the top 5% of the class? Set the variable q1_3 to either 1, 2, 3, or 4, depending on your answer.

1. Cathy would need to score roughly 4.5 standard deviations above the average to guarantee being in the top five percent. Using Chebyshev's inequality, setting $z = \sqrt{20} \approx 4.5$ gives that 95% of the data will lie between plus or minus 4.5 SDs. If Cathy scores above 4.5 SDs, then she is guaranteed to have scored better than 95% of the other students.
2. Cathy would need to score above 2 SDs. Since 95% of the data falls between plus or minus 2 SDs, if Cathy scores above 2 SDs, she is guaranteed to score above 95% of the class.
3. Cathy would need to score slightly less than 2 SDs. 50% of the class will have scored below the average. Which means that if Cathy scores 2 standard deviations above the average she'll have scored higher than 50% + (95% / 2) = 97.5% of the class.
4. No matter how many standard deviations above the mean Cathy scores, there is no guarantee that she will score in the top 5% of the class.

In [ ]:

q1_3 = ...
q1_3

In [ ]:

grader.check("q1_3")

Question 1.4. Now, assuming that the scores for the exam will be normally distributed (as many exams are), what is the minimum number of standard deviations above the mean Cathy would have needed to score to guarantee that she fell in the top 5% of the class? Set variable q1_4 to either 1, 2, 3, or 4, depending on your answer.

1. Cathy would need to score roughly 4.5 standard deviations above the average to guarantee being in the top five percent. Using Chebyshev's inequality, setting $z = \sqrt{20} \approx 4.5$ gives that 95% of the data will lie between plus or minus 4.5 SDs. If Cathy scores above 4.5 SDs, then she is guaranteed to have scored better than 95% of the other students.
2. Cathy would need to score above 2 SDs. Since 95% of the data falls between plus or minus 2 SDs, if Cathy scores above 2 SDs, she is guaranteed to score above 95% of the class.
3. Cathy would need to score slightly less than 2 SDs. 50% of the class will have scored below the average. Which means that if Cathy scores 2 standard deviations above the average she'll have scored higher than 50% + (95% / 2) = 97.5% of the class.
4. No matter how many standard deviations above the mean Cathy scores, there is no guarantee that she will score in the top 5% of the class.

In [ ]:

q1_4 = ...
q1_4

In [ ]:

grader.check("q1_4")

Cathy and Sam thank you for your analysis 👋, and go on their way to start studying for their finals, which are just around the corner.

2. Choosing Sample Size

A new class is being offered at UCSD and the administration wants to know how many students will be taking the class so they know how big of a classroom it will need. To take the class, a student must have satisfied the prerequisites first.

The administration knows there are 900 students eligible to take the class, but they don't have the resources to ask each of them whether they are going to take the class. They decide to ask a sample of the students, but they don't know how many students to ask. They want the width of their confidence interval to be at most 10 students.

For example, if the results of their sample concluded that with 95% confidence between 200 and 210 students would take the class, the administration would be happy with that sample. However, if the results of the sample concluded that with 95% confidence between 200 and 300 students would take the class, the sample would not have been informative enough because that range is too wide. We are going to help determine how big of a sample the administration should take.

The population parameter we are interested in measuring is the proportion of eligible students who will take the class. We will estimate this using a sample statistic, the proportion of eligible students in the sample who plan to take the class.

So where do we start?

The Central Limit Theorem tells us that regardless of the distribution of our population, the distribution of the sample mean or sum will be roughly normal, provided the sample is large enough. Fortunately, our sample statistic (the proportion of eligible students in the sample who plan to take the class) is also a sample mean, because proportions are just means of 0s and 1s. Let's run a simulation to see this for ourselves.
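The claim that a proportion is a mean of 0s and 1s is easy to verify on a small made-up array (`toy_responses` is hypothetical and unrelated to the lab's population data):

```python
import numpy as np

# Hypothetical survey responses: 1 = will take the class, 0 = will not.
toy_responses = np.array([1, 0, 0, 1, 1, 0, 0, 0, 1, 0])

# Proportion computed by counting the 1s and dividing by the total...
proportion_by_counting = np.count_nonzero(toy_responses == 1) / len(toy_responses)

# ...equals the mean, because summing 0s and 1s just counts the 1s.
proportion_by_mean = np.mean(toy_responses)

print(proportion_by_counting, proportion_by_mean)
```

This is why results about sample means, including the Central Limit Theorem, apply to sample proportions as well.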

Below is the data for the whole population of eligible students. (If the administration had the resources to ask every student whether they were going to take the class, this is what they would see. "0" means they won't take the class and "1" means they will.)

In [ ]:

population = bpd.read_csv("data/population.csv")
population

In [ ]:

population.plot(y='Planning on taking', kind='hist', density=True, bins=np.arange(-0.5, 2.5, 1), ec='w');
plt.xticks([0, 1], [0, 1]);

Question 2.1. Below is partially implemented code to run a simulation. The simulation will repeatedly take samples of size sample_size (without replacement) from population and calculate the proportion of students who plan on taking the class. Fill in the missing parts.

In [ ]:

def simulation(population, num_iterations, sample_size):
    results = np.array([])
    for i in np.arange(num_iterations):
        sampled = ...
        proportion_taking_class = ...
        results = ...
        
    bpd.DataFrame(data=results, columns=["Proportion"]).plot(kind='hist', 
                                                             y='Proportion', 
                                                             density=True,
                                                             ec='w',
                                                             bins=np.arange(0, 1, 1/(sample_size+1)),
                                                             title=f'Distribution of Sample Proportions (sample size = {sample_size})');
    plt.xlim(0, 1);

In [ ]:

grader.check("q2_1")

Run the cell below to see the empirical distribution of 10000 simulated sample proportions with a sample size of 40.

In [ ]:

simulation(population, 10000, 40)

Question 2.2. Does the distribution of the sample proportion look more like a normal curve or more like the population distribution?

1. More like a normal curve.
2. More like the original population.

In [ ]:

q2_2 = ...
q2_2

In [ ]:

grader.check("q2_2")

We also know that as we increase the sample size, the standard deviation of our sample proportion's distribution will decrease. Again we decide to run a simulation to double check. Run the following cell to see how the distribution of the sample proportion changes as we increase the size of our sample. It might take a while to run.

In [ ]:

simulation(population, 10000, 40)
simulation(population, 10000, 120)
simulation(population, 10000, 360)

This trend is expressed by the formula $\text{SD of Distribution of Possible Sample Means} = \frac{\text{Population SD}}{\sqrt{\text{sample size}}}$

Since proportions are means, we can use this formula to find the sample size we need to get a desired standard deviation of the sample proportion, and thus a certain confidence interval for that sample proportion. However, before taking our sample, we don't have any way of knowing the standard deviation of our population. Lecture 23, the CIT textbook, and Homework 7 include some ways to get around this problem; here we will use the actual population standard deviation.
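The square-root relationship can be checked empirically. The sketch below is a simplified illustration under stated assumptions: it uses a made-up population of 0s and 1s (`toy_population`), an arbitrary seed, and samples with replacement so that the formula applies directly. When sampling without replacement from a small population, the observed SD would be somewhat smaller than the formula predicts.

```python
import numpy as np

rng = np.random.default_rng(23)  # arbitrary seed, for reproducibility

# Hypothetical population: 30% ones, 70% zeros.
toy_population = np.array([1] * 300 + [0] * 700)
toy_population_sd = np.std(toy_population)

repetitions = 5_000
sd_comparison = {}  # sample size -> (empirical SD, SD predicted by the formula)

for sample_size in [25, 100, 400]:
    # Each row is one simulated sample; the row means are the sample proportions.
    samples = rng.choice(toy_population, size=(repetitions, sample_size), replace=True)
    sample_means = samples.mean(axis=1)

    empirical_sd = np.std(sample_means)
    predicted_sd = toy_population_sd / np.sqrt(sample_size)
    sd_comparison[sample_size] = (empirical_sd, predicted_sd)

    print(f"n = {sample_size}: empirical SD {empirical_sd:.4f}, predicted SD {predicted_sd:.4f}")
```

Multiplying the sample size by 4 should roughly halve the SD of the sample proportions.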

Question 2.3. The administration wants the confidence interval to have a width of 10 students, but we have been calculating the proportion of eligible students who are planning on taking the class. Using the number of students who are eligible to take the class, determine what proportion of that number equals 10 students, and save the result as width_as_proportion.

For example, if 500 students are eligible to take the class, then 10 students as a proportion is 0.02.

In [ ]:

num_eligible_students = ...
print('Number of eligible students:', num_eligible_students)
width_as_proportion = ...
print('Desired confidence interval width, as a proportion:', width_as_proportion)

In [ ]:

grader.check("q2_3")

Question 2.4. Now let's calculate the standard deviation that the distribution of the sample proportion would need to have for our 95% confidence interval to have a width of width_as_proportion. Remember that for a normal distribution, 95% of the data lies between plus and minus 2 SDs of the mean. Set the variable target_sd to equal this standard deviation.

In [ ]:

target_sd = ...
target_sd

In [ ]:

grader.check("q2_4")

Question 2.5. We also need to calculate the standard deviation of the total population. Calculate this value using the compute_sd function that you wrote earlier and store it in the variable population_sd.

In [ ]:

population_sd = ...
population_sd

In [ ]:

grader.check("q2_5")

Question 2.6. Now calculate the required sample size and store your result as req_sample_size. Recall that

$$\text{SD of Distribution of Possible Sample Means} = \frac{\text{Population SD}}{\sqrt{\text{sample size}}}$$

You have already calculated $\text{SD of Distribution of Possible Sample Means}$ in 2.4 and $\text{Population SD}$ in 2.5.
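To see how the formula can be rearranged, here is a small sketch with made-up numbers. Every value below is hypothetical (none comes from the lab's data), and the names are prefixed with `hypothetical_` to keep them separate from the lab's variables. Solving the formula for the sample size gives $\text{sample size} = \left(\frac{\text{Population SD}}{\text{SD of Distribution of Possible Sample Means}}\right)^2$.

```python
import math

# Hypothetical inputs, chosen so the arithmetic is exact.
hypothetical_population_sd = 0.5
hypothetical_width = 0.125  # desired width of a 95% interval, as a proportion

# A 95% interval spans about 2 SDs on each side of the center, so 4 SDs in total.
hypothetical_target_sd = hypothetical_width / 4

# Rearranged formula; round up because a sample size must be a whole number.
hypothetical_sample_size = math.ceil(
    (hypothetical_population_sd / hypothetical_target_sd) ** 2
)

print("Hypothetical required sample size:", hypothetical_sample_size)
```

Because the sample size grows with the square of the ratio, halving the desired width quadruples the required sample size.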

In [ ]:

req_sample_size = ...
req_sample_size

In [ ]:

grader.check("q2_6")

Question 2.7. Our required sample size is bigger than our entire population. For each part, say whether it is True or False.

1. The administration will have to settle for a wider interval to get 95% confidence.
2. Sampling with replacement will be a feasible way to determine the information the administration needs.
3. The administration will have to settle for a lower degree of confidence to get an interval of width 10.
4. We should increase the size of the population until the sample size is smaller than the size of the population.

Set each variable below to either True or False.

In [ ]:

statement_1 = ...
statement_2 = ...
statement_3 = ...
statement_4 = ...

In [ ]:

grader.check("q2_7")

Finish Line 🏁

Congratulations! You are done with Lab 7.

To submit your assignment:

Select Kernel -> Restart & Run All to ensure that you have executed all cells, including the test cells.
Read through the notebook to make sure everything is fine and all tests passed.
Run the cell below to run all tests, and make sure that they all pass.
Download your notebook using File -> Download as -> Notebook (.ipynb), then upload your notebook to Gradescope.

In [ ]:

# For your convenience, you can run this cell to run all the tests at once!
grader.check_all()