Homework 5: Simulation, Sampling, and Hypothesis Testing
Due Tuesday, November 8th at 11:59PM
Welcome to Homework 5! This homework will cover:
Simulations (see CIT 9.3-9.4)
Sampling and Empirical Distributions (see CIT 10-10.4)
Models and Hypothesis Testing (see CIT 11.2)
Instructions
This assignment is due on Tuesday, November 8th at 11:59PM. You are given six slip days throughout the quarter to extend deadlines. See the syllabus for more details. With the exception of using slip days, late work will not be accepted unless you have made special arrangements with your instructor.
Important: For homeworks, the otter tests don't usually tell you that your answer is correct. More often, they help catch careless mistakes. It's up to you to ensure that your answer is correct. If you're not sure, ask someone (not for the answer, but for some guidance about your approach). These are great questions for office hours (see the schedule on the Calendar) or EdStem. Directly sharing answers is not okay, but discussing problems with the course staff or with other students is encouraged.
1. Lucky Triton Lotto, Continued 🔱 🎱 🧜
In the last homework, we calculated the probability of winning the grand prize (free housing) on a Lucky Triton Lotto lottery ticket, and found that it was quite low 😭.
In this question, we'll approach the same question not using math, but using simulation.
It's important to remember how this lottery works:
When you buy a Lucky Triton Lotto ticket, you first pick five different numbers, one at a time, from 1 to 62. Then you separately pick a number from 1 to 16, which may or may not be the same as one of the first five. These are your numbers. For example, you may select (15, 1, 13, 3, 61, 8). This is a sequence of six numbers - order matters!
The winning numbers are chosen by King Triton drawing five balls, one at a time, without replacement, from a pot of white balls numbered 1 to 62. Then, he draws a gold ball, the Tritonball, from a pot of gold balls numbered 1 to 16. Both pots are completely separate, hence the different ball colors. For example, maybe the winning numbers are (13, 15, 62, 3, 5, 8).
We’ll assume for this problem that in order to win the grand prize (free housing), all six of your numbers need to match the winning numbers and be in the exact same positions. In other words, your entire sequence of numbers must be exactly the same. However, if some numbers in your sequence match up with the corresponding number in the winning sequence, you will still win some Triton Cash.
Suppose again that your numbers are (15, 1, 13, 3, 61, 8) and the winning numbers are (13, 15, 62, 3, 5, 8). In this case, two of your numbers are considered to match two of the winning numbers. Notice that although both sequences include the number 15 within the first five numbers (representing a white ball), since they are in different positions, that's not considered a match.
Your numbers: (15, 1, 13, 3, 61, 8)
Winning numbers: (13, 15, 62, 3, 5, 8)
Question 1.1. Write a function called simulate_one_ticket. It should take no arguments, and it should return an array with 6 random numbers, simulating how the numbers are selected for a single Lucky Triton Lotto ticket. The first five numbers should all be randomly chosen without replacement, from 1 to 62. The last number should be between 1 and 16.
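Here's a minimal sketch of one way such a function might look, assuming NumPy has been imported as np:

```python
import numpy as np

def simulate_one_ticket():
    # Five different white-ball numbers, chosen without replacement from 1 to 62.
    white_balls = np.random.choice(np.arange(1, 63), 5, replace=False)
    # One Tritonball number from 1 to 16, chosen separately from its own pot.
    tritonball = np.random.choice(np.arange(1, 17))
    return np.append(white_balls, tritonball)
```

Note that np.arange(1, 63) stops at 62, and replace=False ensures the first five numbers are all different.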
Question 1.2. It's draw day. You checked the winning numbers King Triton drew, which happened to be (55, 12, 3, 51, 23, 5). You didn't win free housing, and you are quite sad.
Suppose you want to remind yourself how unlikely it is to win the grand prize. Call the function simulate_one_ticket 100,000 times. In your 100,000 tickets, how many times did you win the grand prize (free housing)? Assign your answer to count_free_housing. (It would cost a fortune if you were to buy 100,000 tickets – it's pretty nice to be able to simulate this experiment instead of doing it in real life!)
Hints:
First, implement a simulation where you only buy 10 tickets. Once you are sure you have that figured out, change it to 100,000 tickets. It may take a little while (up to a minute) for Python to perform the calculations when you are buying 100,000 tickets.
You'll have to count how many of the numbers you chose match the numbers that were drawn. One way to do this involves
np.count_nonzero. Remember you need all the numbers to match to win the grand prize.
Remember, the mathematical probability of winning free housing is quite low, on the order of $10^{-10}$. That's a lot lower than 1 in 100,000, which is $10^{-5}$.
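One way the simulation might look, with simulate_one_ticket repeated from Question 1.1 so this sketch is self-contained:

```python
import numpy as np

def simulate_one_ticket():
    # From Question 1.1: five white balls without replacement, then a Tritonball.
    white_balls = np.random.choice(np.arange(1, 63), 5, replace=False)
    return np.append(white_balls, np.random.choice(np.arange(1, 17)))

winning_numbers = np.array([55, 12, 3, 51, 23, 5])

count_free_housing = 0
for i in np.arange(100_000):
    ticket = simulate_one_ticket()
    # All six positions must match to win the grand prize.
    if np.count_nonzero(ticket == winning_numbers) == 6:
        count_free_housing = count_free_housing + 1
```

Don't be surprised if count_free_housing comes out to 0 – the grand prize probability is far smaller than 1 in 100,000.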
Question 1.3. As we've seen, you would need to be extremely lucky to win the grand prize. To encourage more students to buy Lucky Triton Lotto tickets, students can win Triton Cash if some of their numbers match the corresponding winning numbers, as described in the introduction. Again, simulate the act of buying 100,000 tickets, but this time find the greatest number of matches achieved by any of your tickets, and assign this number to most_matches.
The winning numbers are the same from the previous part: (55, 12, 3, 51, 23, 5).
For example, if 90,000 of your tickets matched 1 winning number and 10,000 of your tickets matched 2 winning numbers, then you would set most_matches to 2. If 99,999 of your tickets matched 1 winning number and one of your tickets matched 4 winning numbers, you would set most_matches to 4. If you happened to win the grand prize on one of your tickets, you would set most_matches to 6. Remember, order matters.
Hint: There are several ways to approach this; one way involves storing the number of matches per ticket in an array and finding the largest number in that array.
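A sketch of the array-based approach from the hint, with simulate_one_ticket repeated from Question 1.1:

```python
import numpy as np

def simulate_one_ticket():
    # From Question 1.1.
    white_balls = np.random.choice(np.arange(1, 63), 5, replace=False)
    return np.append(white_balls, np.random.choice(np.arange(1, 17)))

winning_numbers = np.array([55, 12, 3, 51, 23, 5])

# Number of positional matches for each of the 100,000 tickets.
matches = np.zeros(100_000, dtype=int)
for i in np.arange(100_000):
    ticket = simulate_one_ticket()
    matches[i] = np.count_nonzero(ticket == winning_numbers)

most_matches = int(matches.max())
```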
Question 1.4. Suppose one Lucky Triton Lotto ticket costs $2.
The Lucky Triton Lotto advertisement on Instagram promises you will never lose money because of the following generous prizes:
Win $10 with a 1-number match
Win $25 with a 2-number match
Win $100 with a 3-number match
Win $1,000 with a 4-number match
Win $5,000 with a 5-number match
Win $20,000 with a 6-number match (free housing!)
If you had the money to buy 100,000 tickets, what would be your net winnings from buying these tickets? Since this is net winnings, this should account for the prizes you win and the cost of buying the tickets. Assign the amount to net_winnings. Note that a positive value means you won money overall, and a negative value means you lost money overall. Do you believe the advertisement's claims?
The winning numbers are the same from the previous part: (55, 12, 3, 51, 23, 5).
Hint: Again, there are a few ways you could approach this problem. One way involves generating another 100,000 random tickets and counting the amount earned per ticket, adding to a running total. Alternatively, if you created an array of the number of matches per ticket in Question 1.3, you could loop through that array.
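The running-total approach from the hint might look like this (again with simulate_one_ticket repeated so the sketch stands alone):

```python
import numpy as np

def simulate_one_ticket():
    # From Question 1.1.
    white_balls = np.random.choice(np.arange(1, 63), 5, replace=False)
    return np.append(white_balls, np.random.choice(np.arange(1, 17)))

winning_numbers = np.array([55, 12, 3, 51, 23, 5])

# Prize amounts indexed by number of matches; 0 matches wins nothing.
prizes = np.array([0, 10, 25, 100, 1000, 5000, 20000])

net_winnings = 0
for i in np.arange(100_000):
    ticket = simulate_one_ticket()
    num_matches = np.count_nonzero(ticket == winning_numbers)
    # Add this ticket's prize and subtract its $2 cost.
    net_winnings = net_winnings + prizes[num_matches] - 2
```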
2. Sampling with Netflix 🍿
In this question, we will use a dataset consisting of information about all Netflix Original movies to get some practice with sampling. Run the cell below to load the data into a DataFrame, indexed by title.
We've provided a function called compute_statistics that takes as input a DataFrame with two columns, 'Runtime' and 'IMDb Score', and then:
draws a histogram of 'Runtime',
draws a histogram of 'IMDb Score', and
returns a two-element array containing the mean 'Runtime' and mean 'IMDb Score'.
Run the cell below to define the compute_statistics function, and a helper function called histograms. Don't worry about how this code works, and please don't change anything.
We can use this compute_statistics function to show the distribution of 'Runtime' and 'IMDb Score' and compute their means, for any collection of movies.
Run the next cell to show these distributions and compute the means for all Netflix Original movies. Notice that an array containing the mean 'Runtime' and mean 'IMDb Score' values is displayed before the histograms.
Now, imagine that instead of having access to the full population of movies, we only have access to data on a smaller subset of movies, or a sample. For 584 movies, it's not so unreasonable to expect to see all the data, but usually we aren't so lucky. Instead, we often make statistical inferences about a large underlying population using a smaller sample.
Statistical inference is the process of using data in a sample to infer some characteristic about the population from which the sample was drawn. A common strategy for statistical inference is to estimate parameters of the population by computing the same statistics on a sample. This strategy sometimes works well and sometimes doesn't. The degree to which it gives us useful answers depends on several factors.
One very important factor in the utility of samples is how they were gathered. Let's look at some different sampling strategies.
Convenience sampling
One sampling methodology, which is generally a bad idea, is to choose movies which are somehow convenient to sample. For example, you might choose movies that you have personally watched, since it's easier to collect information about them. This is called, somewhat pejoratively, convenience sampling.
Question 2.1. Suppose you love scary movies 👻 and you decide to manually look up information on all Netflix Original movies in the following genres:
'Horror'
'Thriller'
'Horror thriller'
Assign convenience_sample to a subset of movie_data that contains only the rows for movies that are in one of these genres.
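A sketch of one possible query. The column name 'Genre' and the toy movie_data below are assumptions standing in for the real DataFrame loaded earlier:

```python
import pandas as pd

# Toy stand-in for movie_data (hypothetical rows); the real DataFrame is
# loaded from a file and indexed by title.
movie_data = pd.DataFrame({
    'Title': ['Movie A', 'Movie B', 'Movie C', 'Movie D'],
    'Genre': ['Horror', 'Comedy', 'Thriller', 'Horror thriller'],
    'Runtime': [95, 110, 102, 88],
    'IMDb Score': [6.1, 7.2, 5.9, 6.5],
}).set_index('Title')

# Keep only the rows whose genre is one of the three scary genres.
convenience_sample = movie_data[(movie_data.get('Genre') == 'Horror') |
                                (movie_data.get('Genre') == 'Thriller') |
                                (movie_data.get('Genre') == 'Horror thriller')]
```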
Question 2.2. Assign convenience_stats to an array of the mean 'Runtime' and mean 'IMDb Score' of your convenience sample. Since they're computed on a sample, these are called sample means.
Hint: Use the function compute_statistics; it's okay if histograms are displayed as well.
Next, we'll compare the distribution of 'Runtime' in our convenience sample to the distribution of 'Runtime' for all the movies in our dataset.
Question 2.3. From what you see in the histograms above, did the convenience sample give us an accurate picture of the runtimes for the full population of movies? Why or why not?
Assign either 1, 2, 3, or 4 to the variable sampling_q3 below.
1. Yes. The sample is large enough, so it is an accurate representation of the population.
2. No. Normally convenience samples give us an accurate representation of the population, but only if the sample size is large enough. Our convenience sample here was too small.
3. No. Normally convenience samples give us an accurate representation of the population, but we just got unlucky.
4. No. Convenience samples generally don't give us an accurate representation of the population.
Simple random sampling
A more principled approach is to sample uniformly at random from the movies. If we ensure that each movie is selected at most once, this is a random sample without replacement, sometimes abbreviated to "simple random sample" or "SRS". Imagine writing down each movie's title on a card, putting the cards in a hat, and shuffling the hat. To sample, pull out cards one by one and set them aside, stopping when the specified sample size is reached.
We've produced two simple random samples of ratings_data: the variable small_srs_data contains a SRS of size 70, and the variable large_srs_data contains a SRS of size 180.
Now we'll run the same analyses on the small simple random sample, the large simple random sample, and the convenience sample. The subsequent code draws the histograms and computes the means for 'Runtime' and 'IMDb Score'.
Producing simple random samples
Often it's useful to take random samples even when we have a larger dataset available. One reason is that doing so can help us understand how inaccurate other samples are.
As we saw in Lecture 14, DataFrames have a .sample method for producing simple random samples. Note that its default is to sample without replacement, which aligns with how simple random samples are drawn.
Question 2.4. Produce a simple random sample without replacement of size 70 from movie_data. Store an array containing the mean 'Runtime' and mean 'IMDb Score' of your SRS in my_small_stats. Again, it's fine if histograms are displayed.
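The sampling step itself is one line. Here's a sketch with a hypothetical stand-in for movie_data, computing the means directly rather than through compute_statistics:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for movie_data, with 584 rows of made-up values.
movie_data = pd.DataFrame({
    'Runtime': np.random.randint(70, 160, size=584),
    'IMDb Score': np.round(np.random.uniform(4, 9, size=584), 1),
})

# .sample draws without replacement by default, i.e. a simple random sample.
my_small_srs = movie_data.sample(70)
my_small_stats = np.array([my_small_srs.get('Runtime').mean(),
                           my_small_srs.get('IMDb Score').mean()])
```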
Run the cell in which my_small_stats is defined many times, to collect new samples and compute their sample means.
Now, recall, small_stats is an array containing the mean 'Runtime' and mean 'IMDb Score' for the one small SRS that we provided you with:
Answer the following two-fold question:
Are the values in my_small_stats (the mean 'Runtime' and 'IMDb Score' for your small SRS) similar to the values in small_stats (the mean 'Runtime' and 'IMDb Score' for the small SRS we provided you with)?
Each time you collect a new sample – i.e. each time you re-run the cell where my_small_stats is defined – do the values in my_small_stats change a lot?
Assign either 1, 2, 3, or 4 to the variable sampling_q4 below.
The values in
my_small_statsare identical to the values insmall_stats, and change a bit each time a new sample is collected.The values in
my_small_statsare identical to the values insmall_stats, and don't change at all each time a new sample is collected.The values in
my_small_statsare very different from the values insmall_stats, and don't change at all each time a new sample is collected.The values in
my_small_statsare slightly different from the values insmall_stats, and change a bit each time a new sample is collected.
Question 2.5. Similarly, create a simple random sample of size 180 from movie_data and store an array of the sample's mean 'Runtime' and mean 'IMDb Score' in my_large_stats.
Run the cell in which my_large_stats is defined many times. Do the histograms and mean statistics (mean 'Runtime' and mean 'IMDb Score') seem to change more or less across samples of size 180 than across samples of size 70?
Assign either 1, 2, or 3 to the variable sampling_q5 below.
1. The statistics change less across samples of size 180 than across samples of size 70.
2. The statistics change an equal amount across samples of size 180 and across samples of size 70.
3. The statistics change more across samples of size 180 than across samples of size 70.
3. Was it by Random Chansey? 🎲

You recently decided to buy the video game Pokémon Yellow from someone on eBay. The seller tells you that they've modified the game so that the probabilities of encountering certain Pokémon in certain locations have been altered. However, the seller doesn't tell you which specific locations have had their probability models changed, or what they've been changed to.
As you are playing Pokémon Yellow, you arrive at the Safari Zone, one of the most iconic locations in the game. You're curious as to your chances of encountering your favorite Pokémon, Chansey, in this location. You go onto Bulbapedia to find the probability model for this location, and you discover that for each Pokémon encounter in the Safari Zone, there is a 4% chance of encountering Chansey.
After a few hours of gameplay in the Safari Zone, you have encountered Chansey only 23 times out of 784 total Pokémon encounters (around 2.9%). You start to suspect that the Safari Zone may have been one of the locations in which the previous owner of the game changed the probability model.
To test this, you decide to run a hypothesis test with the following hypotheses:
Null Hypothesis: In your copy of Pokémon Yellow, the probability of encountering Chansey at each Pokémon encounter in the Safari Zone is 4%.
Alternative Hypothesis: In your copy of Pokémon Yellow, the probability of encountering Chansey at each Pokémon encounter in the Safari Zone is less than 4%.
Question 3.1. Complete the implementation of the function one_simulation, which has no arguments. It should randomly generate 784 Pokémon encounters in the Safari Zone and return the proportion of encountered Pokémon that were Chansey.
Hint: Use np.random.multinomial.
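Following the hint, one way to write this: np.random.multinomial(784, [0.04, 0.96]) returns an array of two counts – Chansey encounters and non-Chansey encounters – that sum to 784.

```python
import numpy as np

def one_simulation():
    # Simulate 784 encounters, each of which is Chansey with probability 0.04.
    counts = np.random.multinomial(784, [0.04, 0.96])
    # counts[0] is the number of simulated Chansey encounters.
    return counts[0] / 784
```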
Question 3.2. The test statistic for our hypothesis test will be the difference between the proportion of Chansey encounters in a given sample of 784 Safari Zone encounters and the expected proportion of Chansey encounters, i.e.

$$\text{proportion of Chansey encounters} - 0.04$$
Let's conduct 10,000 simulations. Create an array named proportion_diffs containing 10,000 simulated values of the test statistic described above. Utilize the function created in the previous question to perform this task.
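A sketch of the simulation loop, with one_simulation repeated from Question 3.1 so the cell is self-contained:

```python
import numpy as np

def one_simulation():
    # From Question 3.1.
    counts = np.random.multinomial(784, [0.04, 0.96])
    return counts[0] / 784

proportion_diffs = np.array([])
for i in np.arange(10_000):
    # Test statistic: simulated proportion of Chansey encounters, minus 0.04.
    proportion_diffs = np.append(proportion_diffs, one_simulation() - 0.04)
```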
Question 3.3. Calculate the p-value for this hypothesis test, and assign the result to safari_zone_p.
Hint: Do large values of our test statistic favor the alternative hypothesis, or do small values of our test statistic favor the alternative hypothesis?
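Since the alternative hypothesis says Chansey appears less often than 4%, small (very negative) values of the test statistic favor the alternative. A self-contained sketch, rebuilding the simulated statistics in vectorized form:

```python
import numpy as np

# Observed test statistic: 23 Chanseys out of 784 encounters, minus 0.04.
observed_diff = 23 / 784 - 0.04

# 10,000 simulated test statistics under the null (vectorized version of 3.2).
simulated_counts = np.random.multinomial(784, [0.04, 0.96], size=10_000)
proportion_diffs = simulated_counts[:, 0] / 784 - 0.04

# Small values favor the alternative, so the p-value is the proportion of
# simulated statistics that are as small as or smaller than the observed one.
safari_zone_p = np.count_nonzero(proportion_diffs <= observed_diff) / 10_000
```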
Question 3.4. Using the standard p-value cutoff of 0.05, what can we conclude from our hypothesis test? Assign either 1, 2, 3, or 4 to the variable safari_zone_conclusion, corresponding to the best conclusion.
1. We reject the null hypothesis. There is not enough evidence to say whether the observed data is consistent with the model.
2. We reject the null hypothesis. The observed data is inconsistent with the model.
3. We accept the null hypothesis. The observed data is consistent with the model.
4. We fail to reject the null hypothesis. There is not enough evidence to say that the observed data is inconsistent with the model.
Question 3.5. In this question, we chose as our test statistic the proportion of Chansey encounters in the Safari Zone minus 0.04. But this is not the only statistic we could have chosen; there are many that could have worked here.
From the options below, choose the test statistic that would not have worked for this hypothesis test, and assign 1, 2, 3, or 4 to the variable bad_choice.
1. The number of Chansey encounters out of 784 encounters in the Safari Zone.
2. The proportion of Chansey encounters in the Safari Zone.
3. 0.04 minus the proportion of Chansey encounters in the Safari Zone.
4. The absolute difference between 0.04 and the proportion of Chansey encounters in the Safari Zone.
Hint: Our goal is to find a test statistic that will help us determine whether we encounter Chansey less often than expected.
4. Surprise Mini Brands! 🍭🧴🩹
When you buy a Surprise Mini Brands toy, you open it up to reveal tiny replicas of branded supermarket products. Here are some of the possible items you may see when opening a Surprise Mini Brands toy: 
No, that is not real pasta sauce!
There are four types of replicas in a Surprise Mini Brands toy: 'Gold', 'Metallic', 'Glow in the Dark', and 'Common'. The first three are "rare" types, which are made of special materials.
Unfortunately, Zuru, the company behind Surprise Mini Brands, doesn't make public the probability of getting any of the four types of replicas. A DSC 10 tutor proposed the following probability distribution:
| Type | Estimated Probability of Type |
|---|---|
| Gold | |
| Metallic | |
| Glow in the Dark | |
| Common | |
We'll store this distribution in an array, in the order 'Gold', 'Metallic', 'Glow in the Dark', and 'Common':
To assess the validity of their model, the tutor surveyed many individuals who purchased Surprise Mini Brands toys and asked them for the types of replicas they received. In total, they were given information about 15,525 replicas, out of which:
818 were 'Gold',
976 were 'Metallic',
412 were 'Glow in the Dark', and
the rest were 'Common'.
We can calculate the empirical type distribution using survey data and store it in an array as well (in the same order as before):
While empirical_type_distribution is not identical to type_distribution_tutor, it's still possible that the tutor's model is plausible, and that the observed differences are due to random chance. Let's run a hypothesis test to investigate further, using the following hypotheses:
Null Hypothesis: The types of Surprise Mini Brands toys are drawn randomly from the distribution type_distribution_tutor.
Alternative Hypothesis: The types of Surprise Mini Brands toys are not drawn randomly from the distribution type_distribution_tutor.
Note that this hypothesis test involves four proportions – one for each of 'Gold', 'Metallic', 'Glow in the Dark', and 'Common'.
Question 4.1. Which of the following is not a reasonable choice of test statistic for this hypothesis test? Assign 1, 2, or 3 to the variable unreasonable_test_statistic.
1. The total variation distance between the proposed distribution (expected proportion of types) and the empirical distribution (actual proportion of types).
2. The sum of the absolute differences between the proposed distribution (expected proportion of types) and the empirical distribution (actual proportion of types).
3. The absolute difference between the sum of the proposed distribution (expected proportion of types) and the sum of the empirical distribution (actual proportion of types).
Question 4.2. We'll use the TVD, i.e. total variation distance, as our test statistic. Below, complete the implementation of the function total_variation_distance, which takes in two distributions (stored as arrays) as arguments and returns the total variation distance between the two arrays.
Then, use the function total_variation_distance to determine the TVD between the type distribution proposed by the tutor and the empirical type distribution observed. Assign this TVD to observed_tvd.
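A sketch of both steps. The empirical counts come from the survey; the tutor's proposed distribution isn't reproduced in this write-up, so the array below is a hypothetical placeholder – use the notebook's actual type_distribution_tutor.

```python
import numpy as np

def total_variation_distance(dist1, dist2):
    # TVD: half the sum of the absolute differences between two distributions.
    return np.abs(dist1 - dist2).sum() / 2

# Empirical distribution from the survey, in the order
# 'Gold', 'Metallic', 'Glow in the Dark', 'Common'.
counts = np.array([818, 976, 412, 15525 - 818 - 976 - 412])
empirical_type_distribution = counts / 15525

# Hypothetical placeholder for the tutor's proposed distribution.
type_distribution_tutor = np.array([0.05, 0.06, 0.03, 0.86])

observed_tvd = total_variation_distance(type_distribution_tutor,
                                        empirical_type_distribution)
```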
Question 4.3. Now, we'll calculate 5,000 simulated TVDs to see what a typical TVD between the proposed distribution and an empirical distribution would look like if the tutor's model were accurate. Since our real-life data includes 15,525 replicas, in each trial of the simulation, we'll:
draw 15,525 replicas at random from the tutor's proposed distribution, then
calculate the TVD between the type distribution proposed by the tutor and the empirical type distribution from the simulated sample.
Store these 5,000 simulated TVDs in an array called simulated_tvds.
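A sketch of the simulation loop; total_variation_distance is repeated from Question 4.2, and the proposed distribution is again a hypothetical placeholder:

```python
import numpy as np

def total_variation_distance(dist1, dist2):
    # From Question 4.2.
    return np.abs(dist1 - dist2).sum() / 2

# Hypothetical placeholder for the tutor's proposed distribution.
type_distribution_tutor = np.array([0.05, 0.06, 0.03, 0.86])

simulated_tvds = np.array([])
for i in np.arange(5_000):
    # Draw 15,525 simulated replicas from the proposed distribution...
    sample_counts = np.random.multinomial(15525, type_distribution_tutor)
    sample_distribution = sample_counts / 15525
    # ...and record the TVD between the sample and the proposed distribution.
    simulated_tvds = np.append(
        simulated_tvds,
        total_variation_distance(sample_distribution, type_distribution_tutor))
```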
Question 4.4. Now, check the p-value of our test by computing the proportion of times in our simulation that we saw a TVD greater than or equal to our observed TVD. Assign your result to type_p_value.
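The p-value itself is a one-liner; here's a self-contained sketch that rebuilds the pieces from Questions 4.2 and 4.3 in vectorized form (the proposed distribution is again a hypothetical placeholder):

```python
import numpy as np

type_distribution_tutor = np.array([0.05, 0.06, 0.03, 0.86])  # placeholder
empirical_type_distribution = np.array([818, 976, 412, 13319]) / 15525

observed_tvd = np.abs(type_distribution_tutor - empirical_type_distribution).sum() / 2

samples = np.random.multinomial(15525, type_distribution_tutor, size=5_000) / 15525
simulated_tvds = np.abs(samples - type_distribution_tutor).sum(axis=1) / 2

# p-value: proportion of simulated TVDs at least as large as the observed one.
type_p_value = np.count_nonzero(simulated_tvds >= observed_tvd) / 5_000
```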
Question 4.5. Using the standard p-value cutoff of 0.05, what can we conclude from our hypothesis test? Assign either 1, 2, 3, or 4 to the variable type_conclusion, corresponding to the best conclusion.
1. We reject the null hypothesis. There is not enough evidence to say whether the observed data is consistent with the model.
2. We reject the null hypothesis. The observed data is inconsistent with the model.
3. We accept the null hypothesis. The observed data is consistent with the model.
4. We fail to reject the null hypothesis. There is not enough evidence to say that the observed data is inconsistent with the model.
Finish Line 🏁
To submit your assignment:
1. Select Kernel -> Restart & Run All to ensure that you have executed all cells, including the test cells.
2. Read through the notebook to make sure everything is fine and all tests passed.
3. Run the cell below to run all tests, and make sure that they all pass.
4. Download your notebook using File -> Download as -> Notebook (.ipynb), then upload your notebook to Gradescope.