
Welcome to the Final Project!
Deadlines 📅: This assignment has two deadlines.
The first deadline is Thursday, November 17th at 11:59PM. This is a checkpoint by which you must submit Sections 0 and 1 of the project. You may not use slip days to extend this deadline.
The second deadline is Tuesday, November 29th at 11:59PM, and this is for your completed project submission. You may use up to two slip days to extend this deadline. If working with a partner and using slip days, slip days will be deducted from each person’s allocation individually. If one or both partners has run out of slip days and you submit the project late, we will reallocate slip days towards the final project, away from lesser-weighted assignments, as described in the syllabus.
10% of your grade will come from your score on the checkpoint, and 90% of your grade will come from your score on the final submission. See the EdStem post titled "Final Project Released!" for more details.
Partners 👯: You are very much encouraged to find a partner to work through the project with. If you work with a partner, you are both required to actively contribute to all parts of the project. Updated partner guidelines are available on the course website.
Rules 📜: Don't share your code with anybody but your partner. You are welcome to discuss questions with other students, but don't share the answers. The experience of solving the problems in this project will prepare you for the final exam and your future in data science. If someone asks you for the answer, resist! Instead, you can demonstrate how you would solve a similar problem.
Support 🤝: You are not alone! Come to office hours, post on EdStem, and talk to your classmates. If you want to ask about the details of your solution to a problem, make a private EdStem post and course staff will try to respond. All of the concepts necessary for this project were covered in lecture and can be found in the textbook and babypandas notes. If you are stuck on a particular problem, reading through the relevant textbook section or referencing the Jupyter notebook from lecture will often help clarify the concept.
Tests 🧪: The otter tests don't usually tell you that your answer is correct. More often, they help catch basic mistakes. It's up to you to ensure that your answer is correct. Additional tests will be applied to verify the correctness of your submission in order to assign your final score, so be careful and check your work!
Advice 🦉: First, start early. As you may know from the Midterm Project, projects are complex and time-consuming. Second, develop your answers incrementally. To perform a complicated task, break it up into steps, perform each step on a different line, give a new name to each result, and check that each intermediate result is what you expect. You can add any additional names or functions you want to the provided cells, and you can add additional cells as needed. Don't try to do everything in one cell without seeing the intermediate output. In particular, for simulations where you need to do something many times, first just do the process once and make sure the results look reasonable. Then wrap your code inside a for-loop to repeat it. Similarly, for defining functions, first write code that will produce the desired output for a single fixed input. Then, once you know it's working, you can put that code inside a function and change the input to be a variable.
Random Seeds 🌱: This project uses random seeds, as described in Homework 6. When we provide a random seed, we set the internal configurations of Python's random number generator so that it produces the same results every time, making it easier to grade your submission. You don't need to understand how random seeds work, just be aware that when you see a call to np.random.seed:
Don't change it.
Don't be alarmed if you see the same results each time you run that cell.
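To see what "the same results every time" means in practice, here is a minimal sketch. The seed value 23 and the names first and second are arbitrary choices for illustration, not values used in the project.

```python
import numpy as np

np.random.seed(23)                 # fix the generator's starting state
first = np.random.choice(10, 5)    # five random draws from 0..9

np.random.seed(23)                 # reset to the same starting state
second = np.random.choice(10, 5)   # the same five draws again

print(np.array_equal(first, second))  # True: same seed, same results
```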
Long Simulations ⏳: If any of your cells are taking more than five minutes to run, you are probably doing something wrong. You can sometimes speed things up by making sure you have a DataFrame of only the rows and columns you need to do your analysis, which should be defined outside the for loop of your simulation. When possible, try to avoid using additional for-loops and queries inside a simulation, and see if a faster method, like a numpy method or groupby, could be used instead. If you haven't yet watched the video we provided on when not to use for-loops, it should be well worth your time.
Let's get started!
Run the cell below to load the packages you'll need to do this project. Please do not import any additional packages – you don't need them, and the Gradescope autograder may not be able to run your code if you do.
Outline
Use the outline below to help you quickly navigate to the part of the project you're working on. Most questions are worth 1 point. A few are worth 0 points in that they are not directly graded, but they will be indirectly graded, as the results are used in subsequent questions. Any questions that are worth more than 1 point will be marked with one ⭐ for each point. You can expect questions with ⭐ markings to be longer and more challenging than the other questions.
There are two major comic publishing companies in the US, Marvel Comics and DC Comics. These companies have been rivals for decades, and devoted comic fans will not hesitate to share their opinions about which company is better. As one article put it,
For decades Marvel and DC were the Coke and Pepsi … the McDonald's and Burger King … the Yankees and Red Sox of superhero publishers.

In this project, we'll work with data gathered from Marvel Wikia and DC Wikia, which are publicly editable databases of all things Marvel and DC. This means our data was inputted by adoring comic fans. Our data was collected from these sites in 2014 by FiveThirtyEight and is publicly available here. We've stored the data sets in two files, data/dc-wikia-data.csv and data/marvel-wikia-data.csv. We provide FiveThirtyEight's description of the columns in the data sets below.
| Variable | Definition |
|---|---|
| 'page_id' | The unique identifier for that character's page within the wikia |
| 'name' | The name of the character |
| 'urlslug' | The unique url within the wikia that takes you to the character |
| 'ID' | The identity status of the character (Secret Identity, Public Identity, [on Marvel only: No Dual Identity]) |
| 'ALIGN' | If the character is Good, Bad or Neutral |
| 'EYE' | Eye color of the character |
| 'HAIR' | Hair color of the character |
| 'SEX' | Sex of the character (e.g. Male, Female, etc.) |
| 'GSM' | If the character is a gender or sexual minority (e.g. Homosexual characters, Bisexual characters) |
| 'ALIVE' | If the character is alive or deceased |
| 'APPEARANCES' | The number of appearances of the character in comic books (as of Sep. 2, 2014. Number will become increasingly out of date as time goes on.) |
| 'FIRST APPEARANCE' | The month and year of the character's first appearance in a comic book, if available |
| 'YEAR' | The year of the character's first appearance in a comic book, if available |
We've modified the original data to fill in missing values.
For characters with no available data for 'ID', 'ALIGN', 'EYE', 'HAIR', 'SEX', or 'ALIVE', we've replaced missing values with the string 'Data Unavailable'.
The original data only included an entry in the 'GSM' column for gender or sexual minorities. We've replaced missing values here with the string 'Not Minority'.
For characters with no available data for 'APPEARANCES' or 'YEAR', we've replaced missing values with zeros.
Keep in mind that since our data came from publicly editable databases, the data can (and in fact, does) have mistakes. We'll ignore these mistakes and just analyze the data as it's given. Don't try to correct any issues with the data or you may cause problems for our Gradescope autograder. (You'll learn more about how to deal with these issues in future courses, like DSC 80.)
Let's read in the data and see what we'll be working with.
Question 0.1. There are a couple of modifications we should make to dc_raw and marvel_raw to clean them before we can proceed with our analyses.
We will not be using the 'page_id', 'urlslug', and 'FIRST APPEARANCE' columns, so these should be dropped.
'APPEARANCES' and 'YEAR' are both stored as strings, but we need them to be stored as ints.
Below, complete the implementation of the function clean_dataframe, which takes a single argument, df, and returns a cleaned version of df, as detailed above. Then, use clean_dataframe to clean both dc_raw and marvel_raw, and store the cleaned DataFrames in the variables dc_clean and marvel_clean, respectively.
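Both cleaning steps can be tried on a small table before writing the function. This is a minimal sketch that assumes plain pandas as a stand-in for babypandas; toy_raw and its values are invented for illustration and are not the project data.

```python
import pandas as pd

# Toy stand-in for a raw table; every value here is invented.
toy_raw = pd.DataFrame({
    'name': ['A', 'B'],
    'page_id': ['10', '11'],
    'YEAR': ['1940', '0'],
})

# Step 1: drop a column we will not use.
toy_clean = toy_raw.drop(columns=['page_id'])

# Step 2: replace a column of strings with the same values stored as ints.
toy_clean = toy_clean.assign(YEAR=toy_clean.get('YEAR').astype(int))

print(toy_clean)
```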
Question 0.2. Currently, both dc_clean and marvel_clean are indexed by the default babypandas index of 0, 1, 2, 3, etc. We want to try to find a more informative index. To do that, we need to find a column whose values are all distinct (i.e. unique).
Below, complete the implementation of the function all_distinct, which takes in a DataFrame (df) and the name of a column in that DataFrame (column_name), and returns True if all values in that column are distinct, and False otherwise.
Hint: Use np.unique or the Series method .unique().
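The idea behind the hint is that a collection has no repeats exactly when removing duplicates leaves its length unchanged. A minimal sketch on plain lists follows; has_all_distinct is a hypothetical helper for illustration, not the function you are asked to write.

```python
import numpy as np

def has_all_distinct(values):
    """Hypothetical helper: True when no value in `values` repeats."""
    values = np.asarray(values)
    # np.unique keeps one copy of each value, so equal lengths mean no repeats.
    return len(np.unique(values)) == len(values)

print(has_all_distinct(['Batman', 'Robin', 'Alfred']))  # True
print(has_all_distinct(['Batman', 'Robin', 'Batman']))  # False
```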
To check your work, run the following two cells.
If your implementation of all_distinct is correct, you should see that the 'name' of each DC character is unique, and the 'name' of each Marvel character is unique, so 'name' is a good choice of index. Run the next cell to create new DataFrames dc and marvel indexed by name.
For the remainder of the project, we'll work with the dc and marvel DataFrames.
(Checkpoint) Section 1: Just a Rumor, or Insider Information? 👀
You spend a lot of time in online forums for comic lovers. One day, you see a post by a user named "DC Bigshot" who claims to be a DC employee. DC Bigshot claims that when DC creates a new male character, they make him a bad character with probability 50%, a good character with probability 40%, and a neutral character with probability 10%. You want to determine whether this claim is supported by the data, so you know whether to trust DC Bigshot's claims in general.
Question 1.1. Assign dc_align to a DataFrame that only includes the male characters from dc that are considered 'Bad Characters', 'Good Characters', or 'Neutral Characters'.
Question 1.2. Assign observed_dist to an array containing the proportion of male DC characters that are bad, good, and neutral (in that order). Since we'll only consider these three values for 'ALIGN', the three proportions in your array should sum to one.
We now have everything we need to perform a hypothesis test for the distribution of "goodness" among male DC characters. Our hypotheses are as follows:
Null Hypothesis: Among male characters from DC, there is a 50% chance that the character is bad, a 40% chance the character is good, and a 10% chance that the character is neutral. Any observed differences from this distribution are due to chance.
Alternative Hypothesis: Among male characters from DC, there is a different distribution of bad, good, and neutral characters.
In each iteration of our simulation, we will draw the same number of characters as there are in dc_align at random from a population that is 50% bad, 40% good, and 10% neutral. We will then determine the proportion of characters in this sample that are bad, good, and neutral; this will give us an observed categorical distribution. We will compare this distribution to the distribution of goodness according to the null, [0.5, 0.4, 0.1], using the total variation distance (TVD) as our test statistic.
Below, we've provided an implementation of the total variation distance.
The next cell calculates the total variation distance between our observed distribution and the null distribution, which we'll call align_model.
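For reference, the total variation distance between two categorical distributions is half the sum of the absolute differences between their proportions. The sketch below follows that definition; tvd_sketch and toy_observed are hypothetical names with invented values, and the notebook's own provided implementation may be written differently.

```python
import numpy as np

def tvd_sketch(dist1, dist2):
    """Half the sum of absolute differences between two distributions."""
    return np.abs(np.asarray(dist1) - np.asarray(dist2)).sum() / 2

toy_observed = np.array([0.45, 0.40, 0.15])  # invented proportions
toy_null = np.array([0.5, 0.4, 0.1])         # the claimed distribution

# (0.05 + 0 + 0.05) / 2, i.e. 0.05 up to floating-point rounding
print(tvd_sketch(toy_observed, toy_null))
```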
Question 1.3. ⭐⭐ Generate 10,000 simulated values of the test statistic using the approach described above and place them in an array called tvds.
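One way to picture a single iteration and its repetition is sketched below. It assumes np.random.multinomial as the sampling tool (the course may intend a different helper), and it uses an invented sample size of 200 with only 1,000 repetitions; toy_tvd and simulated are hypothetical names.

```python
import numpy as np

np.random.seed(0)                       # arbitrary seed, for repeatability only
null_model = np.array([0.5, 0.4, 0.1])
sample_size = 200                       # invented; not the size of dc_align
repetitions = 1000

def toy_tvd(dist1, dist2):
    return np.abs(dist1 - dist2).sum() / 2

simulated = np.array([])
for _ in range(repetitions):
    # Draw category counts for one sample under the null model.
    counts = np.random.multinomial(sample_size, null_model)
    sim_dist = counts / sample_size     # convert counts to proportions
    simulated = np.append(simulated, toy_tvd(sim_dist, null_model))

print(simulated[:5])
```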
Run the cell below to draw a histogram of your simulated TVDs, with a black line drawn at the observed TVD.
Question 1.4. Compute the p-value of our hypothesis test by computing the proportion of times in our simulation that we saw a TVD equal to the observed TVD or more extreme in the direction of the alternative hypothesis. Assign your result to align_p.
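Because larger TVDs are the ones that favor the alternative, "equal or more extreme" means "greater than or equal to" here. A toy illustration with invented numbers (toy_stats, toy_observed_stat, and toy_p are hypothetical names):

```python
import numpy as np

toy_stats = np.array([0.01, 0.02, 0.03, 0.04, 0.05])  # invented simulated TVDs
toy_observed_stat = 0.04                               # invented observed TVD

# Proportion of simulated statistics at least as large as the observed one.
toy_p = np.count_nonzero(toy_stats >= toy_observed_stat) / len(toy_stats)
print(toy_p)  # 2 of the 5 values qualify
```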
Question 1.5. What can we conclude based on the value of align_p? Assign q1_conclusion to 1, 2, or 3.
1. Using a 5% cutoff, we reject the null hypothesis that among male characters from DC, there is a 50% chance that the character is bad, a 40% chance the character is good, and a 10% chance that the character is neutral.
2. Using a 5% cutoff, we accept the null hypothesis.
3. Using a 5% cutoff, the null hypothesis is consistent with what we observed.
Question 1.6. Set the variable new_model to an array containing proportions for [bad, good, neutral] such that if we did another hypothesis test with the following hypotheses, the conclusion would be different than the hypothesis test performed above.
Null Hypothesis: Among male characters from DC, the distribution of bad, good, and neutral characters is given by the proportions in new_model.
Alternative Hypothesis: Among male characters from DC, there is a different distribution of bad, good, and neutral characters.
Note: There are many possible correct answers to this question.
Question 1.7. To verify that you chose new_model correctly, conduct a hypothesis test using the total variation distance between the observed distribution and new_model as your test statistic.
Generate 10,000 values of the test statistic and place them in an array called new_tvds. You should be able to do this by taking your code from 1.3 and making only small changes.
Again, we have provided code that plots a histogram with a black vertical line, allowing you to visualize the simulated estimates and observed TVD.
Question 1.8. Assign to new_p the p-value for this hypothesis test. Confirm that your hypothesis test has a different conclusion than before with a p-value cutoff of 5%.
Congratulations! You've reached the end of the checkpoint portion of the project. Follow the instructions to submit your work to the Final Project (Checkpoint) assignment on Gradescope.
Part 2: Comparing Demographics 🙋🙋‍♂️
In this part, we will compare the goodness of characters from different groups, using permutation tests.
Section 2: DC vs. Marvel 🥊
Let's start by comparing the goodness of DC characters and the goodness of Marvel characters. Before we conduct our permutation test, we'll need to perform a bit of DataFrame manipulation to get our data in the right format.
Question 2.1. Below, assign all_characters to a DataFrame with all characters from both comic companies. Make sure the rows for DC characters appear before the rows for Marvel characters. all_characters should include all of the columns in dc and marvel, plus an additional column called 'COMPANY', containing a string, either 'DC' or 'Marvel'.
Hint: You may find the function np.repeat and the DataFrame method .append useful.
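The sketch below shows the pattern on two tiny invented tables. It assumes plain pandas, where pd.concat stands in for the .append method named in the hint (recent pandas versions no longer provide DataFrame.append); toy_a, toy_b, and combined are hypothetical names.

```python
import numpy as np
import pandas as pd

toy_a = pd.DataFrame({'ALIGN': ['Good Characters', 'Bad Characters']},
                     index=['A1', 'A2'])
toy_b = pd.DataFrame({'ALIGN': ['Bad Characters']}, index=['B1'])

# np.repeat builds an array holding the same label once per row.
toy_a = toy_a.assign(COMPANY=np.repeat('DC', toy_a.shape[0]))
toy_b = toy_b.assign(COMPANY=np.repeat('Marvel', toy_b.shape[0]))

# Stack the rows of toy_a on top of the rows of toy_b.
combined = pd.concat([toy_a, toy_b])
print(combined)
```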
Question 2.2. Create a new DataFrame, all_characters_goodness, which contains only the rows in all_characters where there is data in the 'ALIGN' column (i.e., all of the rows where the value in 'ALIGN' isn't 'Data Unavailable').
all_characters_goodness should contain all of the columns in all_characters plus one additional column, called 'GOOD', that has:
The value 1 for 'Good Characters'
The value 0 for 'Neutral Characters' and 'Reformed Criminals'
The value -1 for 'Bad Characters'
Hint: You may want to create your own function and use .apply.
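As an illustration of the .apply pattern in the hint, the sketch below filters a toy table and then converts labels to numbers. It assumes plain pandas as a stand-in for babypandas; to_score and the toy rows are invented for illustration.

```python
import pandas as pd

def to_score(label):
    """Hypothetical helper: map an alignment label to a number."""
    if label == 'Good Characters':
        return 1
    elif label == 'Bad Characters':
        return -1
    return 0  # every remaining label is scored as 0 in this sketch

toy = pd.DataFrame({'ALIGN': ['Good Characters', 'Data Unavailable',
                              'Bad Characters', 'Neutral Characters']})

# Keep only rows with data, then add the numeric column.
with_data = toy[toy.get('ALIGN') != 'Data Unavailable']
scored = with_data.assign(GOOD=with_data.get('ALIGN').apply(to_score))
print(scored)
```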
Important: The rest of the assignment will rely on both all_characters and all_characters_goodness being correct. Make sure you've passed all the grader checks for these questions before proceeding, and check your work carefully!
As stated above, we're interested in comparing characters from DC to characters from Marvel. We can start to do this by grouping all_characters_goodness by 'COMPANY':
It appears that the DC characters we have in our data set are more "good" than the Marvel characters in our data set on average. However, we have to ask ourselves the question, "is this difference reflective of a difference in the population of all comic characters, or did it happen by chance in our sample?"
We'll conduct a permutation test to answer that very question, but before we do that, we need to create one more function to calculate the test statistic, the difference in means.
Question 2.3. Complete the implementation of the function diff_of_means, which takes in five arguments:
df, a DataFrame
group_column, the name of a column in df that contains two distinct values
group_1 and group_2, the two distinct values in group_column
data_column, a column containing numerical data
and returns the difference in the mean value of data_column for the two groups (do group_1 mean minus group_2 mean).
After that, use your function to assign observed_diff to the difference in mean goodness for DC and Marvel (do DC minus Marvel).
Hint: For guidance, look at the code cell immediately above this question. You will need to generalize that code as part of your solution.
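The underlying computation is a groupby followed by a subtraction. A minimal sketch on invented data, assuming plain pandas; the column names 'group' and 'value' and the labels 'x' and 'y' are hypothetical.

```python
import pandas as pd

toy = pd.DataFrame({'group': ['x', 'x', 'y', 'y'],
                    'value': [1, 0, -1, 0]})

# One row per group, holding that group's mean.
means = toy.groupby('group').mean()

# Mean of 'x' minus mean of 'y': 0.5 - (-0.5)
toy_diff = means.get('value').loc['x'] - means.get('value').loc['y']
print(toy_diff)
```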
We're now ready to run a permutation test to compare the goodness of DC characters to the goodness of Marvel characters. Our hypotheses are as follows:
Null Hypothesis: The goodness of DC characters and Marvel characters come from the same distribution.
Alternative Hypothesis: DC characters are more good than Marvel characters, on average.
Question 2.4. ⭐⭐ 100 times, shuffle either the 'COMPANY' or 'GOOD' column in all_characters_goodness, and calculate the difference in mean goodness between the resulting DC characters and resulting Marvel characters (again, using DC minus Marvel). Store your differences in the array differences.
all_characters_goodness has lots of rows and takes a long time to shuffle, so we're only doing 100 shuffles. Ideally, we'd do more, but even 100 shuffles might take up to a couple of minutes to run.
Note: We've defined a new DataFrame, to_shuffle, with only the columns relevant to this question. Feel free to use it if you'd like; you don't have to.
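The shuffling step can be sketched on a tiny invented table as follows. This assumes plain pandas and uses np.random.permutation to shuffle one column while leaving the other in place; group_diff, toy, and toy_diffs are hypothetical names, and the 200 repetitions are arbitrary.

```python
import numpy as np
import pandas as pd

np.random.seed(1)  # arbitrary seed, for repeatability only
toy = pd.DataFrame({'group': ['x'] * 4 + ['y'] * 4,
                    'value': [1, 1, 0, 1, -1, 0, -1, 0]})

def group_diff(df):
    means = df.groupby('group').mean().get('value')
    return means.loc['x'] - means.loc['y']

toy_diffs = np.array([])
for _ in range(200):
    # Shuffling the values breaks any link between group and value.
    shuffled_values = np.random.permutation(np.array(toy.get('value')))
    shuffled = toy.assign(value=shuffled_values)
    toy_diffs = np.append(toy_diffs, group_diff(shuffled))

print(group_diff(toy), toy_diffs[:5])
```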
Run the cell below to draw a histogram of your simulated differences in means, with a black line drawn at the observed difference in means.
Question 2.5. Assign goodness_p to the proportion of times in our simulation that we saw a difference in means equal to the observed difference or more extreme in the direction of the alternative hypothesis.
Our histogram and p-value show it's unlikely that a difference in means as extreme as the one we observed could have happened by chance, according to our null hypothesis. Per any reasonable p-value cutoff, we'd reject the null hypothesis in this case, and conclude that DC characters are more "good" than Marvel characters, on average.
The evidence seems to support what DC fans have been saying for decades: "DC characters are better than Marvel characters!" 👊
Section 3: Male vs. Female Marvel Characters 🙋‍♂️🙋‍♀️
Next, we'll conduct a permutation test to compare the goodness of male and female Marvel characters.
However, instead of writing code specifically for this one example, in this section you will write code that will ultimately allow you to repeat a permutation test for any two groups of characters with just a single function call. Throughout this section, it will help to use your code from Section 2 as a starting point and generalize it.
Question 3.1. Complete the implementation of the function add_good, which takes in a DataFrame df and returns only the rows in df where there is data in the 'ALIGN' column (i.e., all of the rows where the value in 'ALIGN' isn't 'Data Unavailable'). The returned DataFrame should also have one additional column, called 'GOOD', which contains the numerical goodness of each character as defined in Question 2.2.
Hint: If you defined your own function in Question 2.2, you may want to use it again here.
Question 3.2. Let's create a DataFrame with only the rows and columns we'll use in our permutation test. Assign male_female to a DataFrame containing rows for only the Marvel characters that are male or female. male_female should only have two columns, 'SEX' and 'GOOD', as defined above.
Question 3.3. ⭐⭐⭐⭐ In Questions 2.4 and 2.5, you...
Computed 100 simulated differences in the mean goodness of two groups in particular (DC and Marvel),
Drew a histogram of the simulated differences, with a vertical black line placed at the observed difference in means, and
Computed a p-value, which was the proportion of simulations in which the simulated difference in means was equal to the observed difference in means or more extreme in the direction of the alternative hypothesis.
Below, complete the implementation of the function permutation_test. It should do all three steps above, but for any two groups. permutation_test takes in the same 5 arguments as diff_of_means, which you defined in Question 2.3. (It contains an additional optional argument, for_autograder; you should ignore this.)
Remember that we've defined the difference in group means to be group_1's mean minus group_2's mean. So, if the observed difference in means is positive, this suggests that the mean of group_1 may be larger than the mean of group_2. In that case, we'll formulate our null and alternative hypotheses like this:
Null Hypothesis: group_1's data and group_2's data come from the same distribution.
Alternative Hypothesis: group_1 has larger data values than group_2, on average.
This is the setup we used in Section 2 when comparing DC characters (group_1) to Marvel characters (group_2), because the observed difference in means was positive.
Conversely, if the observed difference in means is negative, this suggests that the mean of group_1 may be smaller than the mean of group_2. In that case, we'll formulate our null and alternative hypotheses like this:
Null Hypothesis: group_1's data and group_2's data come from the same distribution.
Alternative Hypothesis: group_1 has smaller data values than group_2, on average.
How you set up the null and alternative hypotheses has implications for how you calculate the p-value since the p-value represents the proportion of simulations in which the simulated difference in means was equal to the observed difference in means or more extreme in the direction of the alternative hypothesis.
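The effect of the direction on the p-value can be sketched as below. one_sided_p is a hypothetical helper with invented inputs; treating an observed difference of exactly zero like the positive case is an arbitrary choice made here, not something the project specifies.

```python
import numpy as np

def one_sided_p(simulated, observed):
    """Hypothetical helper: p-value in the direction of the observed sign."""
    simulated = np.asarray(simulated)
    if observed >= 0:
        # Alternative: group_1 is larger, so large differences are extreme.
        return np.count_nonzero(simulated >= observed) / len(simulated)
    # Alternative: group_1 is smaller, so small differences are extreme.
    return np.count_nonzero(simulated <= observed) / len(simulated)

toy_sims = np.array([-0.3, -0.1, 0.0, 0.2, 0.4])  # invented differences
print(one_sided_p(toy_sims, 0.2))    # counts 0.2 and 0.4
print(one_sided_p(toy_sims, -0.1))   # counts -0.3 and -0.1
```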
Question 3.4. Use the male_female DataFrame and your newly-defined permutation_test function to determine a p-value for a permutation test comparing the goodness of male (group_1) and female (group_2) Marvel characters. Save your result as male_female_p.
As with Question 2.4, it might take up to a couple of minutes for this permutation test to run.
Question 3.5. What can we conclude based on the value of male_female_p? Assign male_female_conclusion to either 1, 2, or 3.
1. Our data is consistent with the null hypothesis that male and female Marvel characters are equally good, on average.
2. Our results suggest that female Marvel characters are significantly more good than male Marvel characters, on average.
3. Our results suggest that male Marvel characters are significantly more good than female Marvel characters, on average.
Section 4: Even More Comparisons 🆚
Now that we have a framework to perform a permutation test, we can easily compare different demographics to see if one group is statistically significantly more good than another group.
For each pair of groups in the table below, assign 1, 2, or 3 to the given variable name, according to this scheme:
1. Group 1 is more good than Group 2 (at a 5% p-value cutoff).
2. Group 2 is more good than Group 1 (at a 5% p-value cutoff).
3. Neither group is statistically significantly more good than the other (at a 5% p-value cutoff).
| Question | Group 1 | Group 2 | Variable Name |
|---|---|---|---|
| 4.1 | DC Living | DC Deceased | living_test |
| 4.2 | DC Blond Hair | DC Black Hair | hair_test |
| 4.3 | DC Secret Identity | DC Public Identity | identity_test |
| 4.4 | DC GSM Minority | DC Not Minority | minority_test |
Be careful: Even though you'll be comparing two groups, not all characters will necessarily fall into one of those two groups, as some variables have more than two distinct values. When calling permutation_test, make sure that the DataFrame you give it only has rows for the two groups that you're trying to compare, otherwise the results of your permutation test will be invalid.
Important: All you need to do for Section 4 is set each of the variables in the table above to 1, 2, or 3. You'll do this in the cells provided. Of course, you'll need to do some work to figure out the correct answer choice. Please follow these instructions:
Add new cells as needed to do the work required to determine the correct answer choice. You can add new cells by clicking the plus sign icon in the top menu.
Assign the variable to the correct answer choice in the cell provided.
Before you submit, convert any cells you added to 'Raw NBConvert' format. To change the format of a cell, click the down arrow next to the word "Code" in the top menu and select "Raw NBConvert" from the dropdown menu. If you need to modify your work later, you can convert these cells back to code. Make sure that any cells you added are in "Raw NBConvert" format before submitting, but that the cells that we originally provided you with are still in the original "Code" format.

As we mentioned earlier, the permutation_test function is very slow to run on our data sets, since they are so large. You'll have to run permutation_test to determine the correct answer to each of the four questions in this section, but by following the steps above, you'll ensure that your submission will run quickly on the Gradescope autograder.
Question 4.1. ⭐⭐ First, compare DC characters who are living (group 1) with those that are deceased (group 2). The 'ALIVE' column contains whether each character is living or deceased.
In the cell provided, assign the variable living_test to 1, 2, or 3, according to the instructions at the top of Section 4.
Do all of your work in separate cells, and remember to change the format of any cells you added to "Raw NBConvert" before submitting.
Question 4.2. ⭐⭐ Next, compare DC characters with blond hair (group 1) to DC characters with black hair (group 2). The 'HAIR' column contains the hair color of each character.
In the cell provided, assign the variable hair_test to 1, 2, or 3, according to the instructions at the top of Section 4.
Do all of your work in separate cells, and remember to change the format of any cells you added to "Raw NBConvert" before submitting.
Question 4.3. ⭐⭐ Next, compare DC characters with secret identities (group 1) to DC characters with public identities (group 2). The 'ID' column contains the identity status of each character.
In the cell provided, assign the variable identity_test to 1, 2, or 3, according to the instructions at the top of Section 4.
Do all of your work in separate cells, and remember to change the format of any cells you added to "Raw NBConvert" before submitting.
Question 4.4. ⭐⭐ Finally, compare DC characters who are a gender or sexual minority (group 1) to DC characters who are not a gender or sexual minority (group 2). The 'GSM' column contains information about whether or not each character is a gender or sexual minority.
Before you proceed, note that there are three unique values in the 'GSM' column of dc. One of them is 'Not Minority'; both of the other two should be counted as 'Minority' for the purposes of this test. This means that before calling permutation_test, you'll need to create a function and use it with .apply to create a DataFrame with a new column containing only the values 'Not Minority' and 'Minority'.
In the cell provided, assign the variable minority_test to 1, 2, or 3, according to the instructions at the top of Section 4.
Do all of your work in separate cells, and remember to change the format of any cells you added to "Raw NBConvert" before submitting.
Nice work! You've now compared many different groups of characters. See how having a function to do the permutation testing in general was really helpful? If there are any other groups of characters you're curious about, you can use the same framework to explore some more.
Before moving on, make sure that any cells you added are in "Raw NBConvert" format, and that you didn't accidentally change any provided cells to "Raw NBConvert".
Part 3: Fact or Cap? 🧢
In this part, we'll use the power of the bootstrap to evaluate the validity of two claims involving character demographics.
Section 5: What Could Have Been... 💭
DC Bigshot, the comics forum user from Part 1, is still bitter about something that happened in the 1980s. DC Bigshot rants online that they had a brilliant idea for a new comic character but the character never made it past management. The character would have had red hair and blue eyes. DC Bigshot claims that the character probably would have been a big hit and had several hundred appearances by now.

We want to get a sense of how many appearances typical red-haired, blue-eyed DC characters from the 1980s have, by bootstrapping to estimate the median number of appearances of such characters. We'll treat the data in our dc DataFrame as a random sample from a larger population that includes more DC characters. Our goal is to use this sample to estimate a population parameter – the median number of appearances of red-haired, blue-eyed DC characters from the 1980s.
Question 5.1. Assign with_decade to a DataFrame with all of the columns in dc, plus a new 'DECADE' column of type int containing the decade in which each character was introduced.
For example, all characters with a 'YEAR' value between 1940 and 1949 should have a 'DECADE' value of 1940. If the 'YEAR' is recorded as 0, the 'DECADE' should also be 0.
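Integer division gives a compact way to round a year down to its decade. A minimal sketch on invented years (toy_years and toy_decades are hypothetical names):

```python
import numpy as np

toy_years = np.array([1940, 1949, 1987, 0])

# Dividing by 10 without remainder drops the last digit; multiplying restores the scale.
toy_decades = toy_years // 10 * 10
print(toy_decades)  # 1940, 1940, 1980, and 0
```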
Question 5.2. Now, assign dc_red_blue_80s to a DataFrame containing only DC characters with red hair and blue eyes from the 1980s for whom we know the number of appearances.
Question 5.3. ⭐⭐ The rows in dc_red_blue_80s constitute our sample of DC characters with red hair and blue eyes from the 1980s. Below, use the bootstrap procedure to generate 5000 bootstrapped resamples of this sample. Compute the median 'APPEARANCES' of each resample, and store these medians in the array boot_medians.
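The bootstrap procedure resamples the original sample with replacement, keeping the sample size the same each time. The sketch below runs it on an invented array using np.random.choice; in the project you would resample rows of a DataFrame instead, and toy_sample, toy_boot_medians, and the 1,000 repetitions are illustrative assumptions.

```python
import numpy as np

np.random.seed(42)  # arbitrary seed, for repeatability only
toy_sample = np.array([3, 5, 8, 12, 20, 41, 7, 9])  # invented appearance counts

toy_boot_medians = np.array([])
for _ in range(1000):
    # Same size as the original sample, drawn with replacement.
    resample = np.random.choice(toy_sample, len(toy_sample), replace=True)
    toy_boot_medians = np.append(toy_boot_medians, np.median(resample))

print(toy_boot_medians[:5])
```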
Run the cell below to visualize the distribution of your bootstrapped medians.
Question 5.4. What can we conclude about the histogram above? Assign q5_hist to 1, 2, or 3.
1. This histogram is not especially bell-shaped, but it would look more bell-shaped if we did more repetitions of the bootstrap.
2. This histogram is not especially bell-shaped, but it would look more bell-shaped if we had started with a larger sample.
3. Even if we increased the sample size and number of repetitions, this histogram probably wouldn't look bell-shaped.
Question 5.5. Assign left_endpoint and right_endpoint to the left and right endpoints of a 95% confidence interval for the true median number of appearances of all DC characters with red hair and blue eyes from the 1980s.
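A 95% interval built from bootstrapped estimates keeps the middle 95%, so its endpoints are the 2.5th and 97.5th percentiles. A minimal sketch on a stand-in array (toy_estimates is invented and is not bootstrap output):

```python
import numpy as np

toy_estimates = np.arange(1, 1001)  # stand-in for an array of bootstrapped estimates

toy_left = np.percentile(toy_estimates, 2.5)    # cuts off the lowest 2.5%
toy_right = np.percentile(toy_estimates, 97.5)  # cuts off the highest 2.5%
print(toy_left, toy_right)
```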
Question 5.6. Which of the following is a correct interpretation of our results? Assign q5_interpretation to 1, 2, 3, or 4.
1. There is a 95% chance that DC Bigshot's character would have had between left_endpoint and right_endpoint appearances.
2. 95% of red-haired blue-eyed DC characters from the 1980s had between left_endpoint and right_endpoint appearances.
3. There is a 95% chance that the median number of appearances of red-haired blue-eyed DC characters from the 1980s falls between left_endpoint and right_endpoint.
4. None of the above.
Section 6: Nonbinary Characters 🏳️‍🌈
Lately, Marvel has come under scrutiny for having very few nonbinary characters. A nonbinary individual is someone who does not identify as male or female. Marvel's CEO responds to the criticism with a statement emphasizing their commitment to enhancing character diversity. As part of this statement, the CEO states that while the proportion of their characters that are neither male nor female is admittedly small, that's just a reflection of reality, since the proportion of nonbinary people in the United States is small.
Let's investigate this claim by looking at some data. To start, let's determine the proportion of Americans that are nonbinary. A recent pioneering study by the Williams Institute at UCLA estimated the number of nonbinary American adults to be 1.2 million. According to the 2020 US Census, there are 258.3 million American adults. Thus, the proportion of American adults that are nonbinary is:
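The calculation is a single division of the two figures quoted above. The sketch below assumes the result is stored in `nonbinary_prop_reality`, the name used later in this section; the two intermediate variable names are hypothetical.

```python
# Figures quoted in the text above.
nonbinary_adults = 1.2e6    # Williams Institute estimate
all_adults = 258.3e6        # 2020 US Census count of American adults

# Proportion of American adults that are nonbinary.
nonbinary_prop_reality = nonbinary_adults / all_adults
```

This works out to roughly 0.0046, or a little under half a percent.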
Question 6.1. In the marvel DataFrame, what proportion of characters with available data in the 'SEX' column are nonbinary? Save your result as nonbinary_prop_marvel.
It appears that in the sample of Marvel characters for which we have data, the proportion of nonbinary characters is slightly different than the proportion of nonbinary Americans, nonbinary_prop_reality. But is this difference present in the population of all Marvel characters, or just in our sample? Let's conduct a hypothesis test to find out.
Null Hypothesis: The proportion of nonbinary Marvel characters equals the proportion of nonbinary Americans.
Alternative Hypothesis: The proportion of nonbinary Marvel characters is not equal to the proportion of nonbinary Americans.
Since we were able to set up our hypothesis test as a question of whether a certain population parameter – the proportion of nonbinary characters among all Marvel characters – is equal to a certain known value, we can test our hypotheses by constructing a confidence interval for the parameter. We'll test our hypotheses at a 1% p-value cutoff, meaning we'll need to construct a 99% confidence interval.
To construct a 99% confidence interval for the proportion of nonbinary characters among all Marvel characters, we need to bootstrap the sample of data we have and create many estimates for that population proportion, then take the middle 99% of those estimates.
Question 6.2. Before we can conduct this hypothesis test, we need a column that tells us whether a character is nonbinary.
Below, assign nonbinary_df to a DataFrame with the same columns as marvel but with an additional column, 'NONBINARY', that contains the value 1 for nonbinary characters and 0 for male or female characters. Only include rows where we have data on the 'SEX' of the character.
Question 6.3. ⭐⭐⭐ Now, implement the bootstrap procedure to create an array called boot_proportions containing 10,000 estimates for the proportion of nonbinary Marvel characters.
Instead of using .sample with replace=True like you'd normally do for bootstrapping, here's a clever strategy that allows you to calculate resample proportions without the .sample method, which can be slow.
Since the 'NONBINARY' column contains only 0s and 1s and bootstrapping requires us to sample with replacement, this means each element of our resample has a certain probability of being a 0 and some other probability of being a 1. These probabilities should add up to one. You can think of resampling from a sample of 0s and 1s as a lot like flipping a biased coin. You can find the proportion of 1s in your resample in much the same way you might find the proportion of heads in many coin flips, using np.random.multinomial. (It's also possible to do this with np.random.choice, but please use np.random.multinomial here.)
This alternate strategy adds up to a huge time savings. With .sample, it takes about 1 minute to do 100 repetitions of the bootstrap on a DataFrame of this size, which means to do 10,000 repetitions, it would take about 100 minutes (over an hour 🥱). With np.random.multinomial, you should be able to do 10,000 repetitions of the bootstrap in only a few seconds!
Do not use .sample or np.random.choice for this question. Instead, you must use np.random.multinomial; think carefully about the probability distribution you will provide as its second argument.
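To make the coin-flip analogy concrete, here is a small sketch that simulates many sets of biased coin flips with np.random.multinomial and computes the proportion of heads in each set. The coin, its probability of heads, and all variable names are made up for illustration; adapting the idea to the 'NONBINARY' column is left to you.

```python
import numpy as np

np.random.seed(42)  # fixed seed so the sketch is reproducible

n_flips = 50      # flips per repetition (hypothetical)
p_heads = 0.3     # probability of heads for a hypothetical biased coin

# Each row of counts is one repetition: [number of heads, number of tails].
counts = np.random.multinomial(n_flips, [p_heads, 1 - p_heads], size=10_000)

# Proportion of heads in each of the 10,000 repetitions.
heads_proportions = counts[:, 0] / n_flips
```

Because all 10,000 repetitions come from a single call, no loop is needed, which is where the time savings come from.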
Run the cell below to visualize the distribution of your bootstrapped proportions.
Question 6.4. Assign nonbinary_left and nonbinary_right to the left and right endpoints of a 99% confidence interval for the true proportion of nonbinary Marvel characters.
Question 6.5. Use your confidence interval to decide whether to reject the null hypothesis at a 1% p-value cutoff. Set reject_null to True if we should reject the null hypothesis, and False if not. Then, assign q6_interpretation to either 1, 2, 3, or 4, depending on which of the following four statements is best supported by the data.
The CEO was definitely wrong.
The CEO was likely wrong.
The CEO was likely right.
The CEO was definitely right.
Part 4: Fun and Games 🎮
In the last part of the project, we will switch our focus to probability.
Section 7: Guess Who? 🤔
You and your friend like quizzing each other on your knowledge of comic characters from both companies. Your friend chooses a single character at random from the all_characters DataFrame. They then tell you one piece of information about that character, and you have to guess which character they're talking about.
Note: All of Section 7 relies on all_characters being defined correctly, so make sure you've completed Question 2.1 correctly before proceeding. For this section, do not filter out rows where the character’s 'SEX' is 'Data Unavailable'.
Question 7.1. Your friend picks a character at random and tells you that they have blue eyes. What is the probability that the character is Superman (Clark Kent)? Assign your answer to the variable p_superman_given_blue_eyes.
Hint: Start by determining the number of characters with blue eyes.
Question 7.2. As you saw above, the probability of correctly guessing your friend's character given just one piece of information is extremely low. So, instead of guessing the name of the character your friend is talking about, you will try and guess some other information about them, like whether or not they are good, or what their hair color is.
Your friend picks a character at random and tells you that they're a DC character. What is the probability that they are a good character (meaning that their value in the 'ALIGN' column is 'Good Characters')? Assign your answer to the variable p_good_given_dc.
Question 7.3. Now your friend picks a character at random and tells you that they're a good character (defined the same way as in the previous question). What is the probability that they are a DC character? Assign your answer to the variable p_dc_given_good.
Question 7.4. ⭐⭐ In both of the previous two questions, the code you wrote likely looked similar. Let's generalize these calculations so that we can more easily compute conditional probabilities.
In this question, you'll implement the function conditional_probability. It has two arguments, find and given, both of which are lists. Let's walk through how it works, using an example – suppose we want to use it to compute the probability that a randomly selected character from all_characters is from DC, given that they are good. (Note that this is the same probability that you computed in the previous question.)
find is a list of two elements:
The first element in find is the column in all_characters that contains the event that we are trying to find the probability of. This can be any column in all_characters; in our example, this is 'COMPANY'.
The second element in find is the value in the aforementioned column that we're trying to find; in our example, this is 'DC'.
given is a list of two elements:
The first element in given is the column in all_characters that contains the event that we are given. This can also be any column in all_characters; in our example, this is 'ALIGN'.
The second element in given is the value in the aforementioned column; in our example, this is 'Good Characters'.
Putting this all together, this means that conditional_probability(['COMPANY', 'DC'], ['ALIGN', 'Good Characters']) should evaluate to your answer from the previous question (but conditional_probability should work for any example, not just this one).
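If the find/given format feels abstract, the sketch below shows the underlying counting idea on a tiny made-up dataset stored as a list of dictionaries. Everything here is hypothetical (the five toy rows, the name `toy_conditional_probability`, and the extra `rows` argument), and it deliberately avoids DataFrames, so it is an illustration of the logic and not the function you are asked to write.

```python
# Five made-up characters; only the two columns used in the example.
toy_characters = [
    {'COMPANY': 'DC', 'ALIGN': 'Good Characters'},
    {'COMPANY': 'DC', 'ALIGN': 'Bad Characters'},
    {'COMPANY': 'Marvel', 'ALIGN': 'Good Characters'},
    {'COMPANY': 'Marvel', 'ALIGN': 'Good Characters'},
    {'COMPANY': 'Marvel', 'ALIGN': 'Bad Characters'},
]

def toy_conditional_probability(rows, find, given):
    """P(find | given), computed by counting rows in a list of dicts."""
    find_col, find_val = find
    given_col, given_val = given
    # First restrict attention to the rows where the given event happened...
    given_rows = [r for r in rows if r[given_col] == given_val]
    # ...then count how many of those rows also satisfy the find event.
    both = [r for r in given_rows if r[find_col] == find_val]
    return len(both) / len(given_rows)

toy_p_dc_given_good = toy_conditional_probability(
    toy_characters, ['COMPANY', 'DC'], ['ALIGN', 'Good Characters']
)
```

In the toy data, three characters are good and one of those three is from DC, so `toy_p_dc_given_good` is 1/3.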
Question 7.5. Now, use the function conditional_probability to determine the following two probabilities:
p_blue_eyes_given_black_hair: the probability that a randomly selected character has blue eyes given that they have black hair.
p_black_hair_given_blue_eyes: the probability that a randomly selected character has black hair given that they have blue eyes.
Question 7.6. In the previous question, you computed two probabilities. Just by looking at those two probabilities, is it possible to determine which of the numbers below is larger?
The number of characters with blue eyes in all_characters.
The number of characters with black hair in all_characters.
Below, set can_determine to True if it is possible to determine which number is larger based on these probabilities alone, and False if not.
Question 7.7. ⭐⭐⭐ Your friend realizes that you're still pretty bad at this guessing game, and instead starts to give you multiple characteristics about their randomly selected character. However, the function conditional_probability only allows for a single given characteristic.
In this question, you'll complete the implementation of conditional_probability_multiple, which takes in two arguments, find and given_list.
The list find is formatted the same way as it is for conditional_probability.
The list given_list is a list of lists. Each of the lists inside given_list is formatted in the same way that the given list was formatted for conditional_probability; each list corresponds to a single condition.
For instance,
computes the probability that a randomly selected character from all_characters is good, given that they are a DC character and are not a gender or sexual minority.
Question 7.8. Using conditional_probability_multiple, determine the probability that a randomly selected character from all_characters is from DC given that they have red hair and blue eyes. Assign your answer to the variable p_dc_given_red_blue.
Section 8: BuzzFeed 🐝
Your friend gets tired of quizzing you about comic characters. You both decide to instead take a BuzzFeed quiz titled "Which Marvel Character Are You Internally, And Which Are You Externally?". (If you're looking to take a break from working on the project, take the quiz!)
A question from the BuzzFeed quiz.
The way the quiz works is that you answer a few questions, and it gives you back the names of two different Marvel characters, one that represents your "internal" personality and one that represents your "external" personality. You and your friend notice something weird – you both selected the same answers to all questions, yet you got different results.
It turns out that the quiz actually does nothing with your answers. Instead, it randomly shows you characters for your internal and external personalities according to the following probability distributions:
| Character | Probability for Internal | Probability for External |
|---|---|---|
| Spider-Man | 0.4 | 0.25 |
| Captain America | 0.12 | 0.15 |
| Wolverine | 0.08 | 0.37 |
| Iron Man | 0.11 | 0.08 |
| Thor | 0.29 | 0.15 |
Each time the quiz is completed, results are generated randomly according to the distributions listed above, separately for internal and external characters. Therefore, it is possible to get the same character for both your internal and external personality.
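Although the questions in this section are meant to be solved with math, a quick sanity check of the table can be sketched in code. The snippet below confirms that each column is a valid probability distribution and, as one example of using the fact that the two results are generated separately, multiplies two probabilities for a pair that is not asked about in any question. The dictionary and variable names are hypothetical.

```python
import math

# Probabilities copied from the table above.
internal = {'Spider-Man': 0.4, 'Captain America': 0.12, 'Wolverine': 0.08,
            'Iron Man': 0.11, 'Thor': 0.29}
external = {'Spider-Man': 0.25, 'Captain America': 0.15, 'Wolverine': 0.37,
            'Iron Man': 0.08, 'Thor': 0.15}

# Each column should sum to 1 (up to floating-point rounding).
internal_is_valid = math.isclose(sum(internal.values()), 1.0)
external_is_valid = math.isclose(sum(external.values()), 1.0)

# Internal and external results are generated separately, so the probability
# of a specific (internal, external) pair is the product of the two entries.
p_spiderman_internal_thor_external = internal['Spider-Man'] * external['Thor']
```

Here the product is 0.4 times 0.15, which is about 0.06.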
Note that all questions in this section are math questions, not coding questions.
Question 8.1. You take the quiz once. What is the probability that the quiz tells you that you're Wolverine internally and Spider-Man externally? Assign your answer to the variable p_wolverine_spiderman.
Question 8.2. You take the quiz once. What is the probability that you get Iron Man as one of your characters and Thor as the other? Assign your answer to the variable p_iron_thor.
For your convenience, we've repeated the distribution table from the start of this section below.
| Character | Probability for Internal | Probability for External |
|---|---|---|
| Spider-Man | 0.4 | 0.25 |
| Captain America | 0.12 | 0.15 |
| Wolverine | 0.08 | 0.37 |
| Iron Man | 0.11 | 0.08 |
| Thor | 0.29 | 0.15 |
Question 8.3. You take the quiz once. What is the probability that the two characters the quiz gives you are different? Assign your answer to the variable p_both_different.
Question 8.4. You and five other friends (so 6 people total) each take the quiz once. What is the probability that the quiz tells at least one person that they are Iron Man internally? Assign your answer to the variable p_ironman_internal.
Question 8.5. Again, suppose you and five other friends (so 6 people total) each take the quiz once. What is the probability that the quiz tells at least one person that they are Iron Man internally or externally? In other words, what is the probability that Iron Man appears at least once among the 12 characters that you and your friends receive? Assign your answer to the variable p_ironman_internal_external.
Section 9: Action Figure Bundles 💥
This holiday season, Marvel and DC decide to put aside their differences and produce bundles of action figures consisting of characters from both companies.
The bundles will consist of randomly selected characters from among the 40 characters with the most appearances, across both Marvel and DC. As evidenced by the code below, of the top 40 characters, 10 are from DC and 30 are from Marvel.
To create a bundle, we select 5 characters from this set of 40, in a way such that each of the 40 characters is equally likely to be chosen, and that characters can only be selected once.
Question 9.1. What is the probability that a bundle of 5 characters consists solely of DC characters? Assign your answer to the variable p_bundle_dc_only.
Note: This is a math question, not a coding question.
Question 9.2. ⭐⭐ Now we're interested in determining the probability that a bundle of 5 characters consists of 2 DC characters (and thus 3 Marvel characters). In future data science courses, you will learn to compute probabilities like these exactly, but for now we'll turn to the power of simulation to approximate this probability.
Below, simulate 100,000 times the act of creating a bundle of 5 randomly selected characters. In each simulation, determine whether or not the number of DC characters was 2. Set p_2_dc to the approximate probability of this happening.
Hint: Start by defining an array, all_40_characters, that contains the value 'DC' 10 times and the value 'Marvel' 30 times. Do not create this array by typing 40 strings manually. Also, remember that the same character cannot appear in a bundle twice.
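For a sense of how a simulation like this is structured, here is a sketch on a smaller, made-up example: drawing 3 marbles without replacement from an urn of 4 red and 6 blue marbles, and estimating the probability of getting exactly 1 red. It uses Python's standard library instead of the tools from lecture, and every name in it is hypothetical.

```python
import random

random.seed(1)  # fixed seed so the sketch is reproducible

# Build the urn by repetition instead of typing every entry by hand.
toy_urn = ['red'] * 4 + ['blue'] * 6

n_trials = 20_000
hits = 0
for _ in range(n_trials):
    # random.sample draws without replacement, so no marble is picked twice.
    draw = random.sample(toy_urn, 3)
    if draw.count('red') == 1:
        hits += 1

# Proportion of trials with exactly 1 red marble.
toy_estimate = hits / n_trials
```

The estimate is the fraction of simulated draws in which the event occurred, and more trials generally make it more stable.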
Question 9.3. ⭐⭐ We're now interested in the probability that a bundle of 5 characters contains a different number of DC characters, not just 2. To do this, rather than writing many different simulations, your job is to write a single simulation that 100,000 times generates a bundle, counts the number of DC characters in that bundle, and stores the result in the array simulated_dc_counts. At the end, simulated_dc_counts will contain the number of DC characters in many simulated bundles, and you can use it to approximate the probability of a bundle containing 0, 1, 2, 3, 4, or 5 DC characters.
Complete the simulation below.
Now that you've completed the simulation, run the cell below to see the empirical distribution of the number of DC characters in a 5 character bundle. You should notice that the probability that all 5 characters are from DC is quite low, as you discovered in Question 9.1.
Question 9.4. In the histogram above, the most likely number of DC characters in a bundle should be clear. Using simulated_dc_counts, assign p_most_frequent to an estimate of the probability that a bundle contains this most likely number of DC characters. (As a reminder, the probability is an estimate because we’re computing it through a simulation rather than with math.)
It doesn't seem like DC is getting a great deal out of this arrangement!
Congratulations! You've completed the Final Project – the last assignment of this course!
If you're interested in learning more about the data and analysis that inspired this project, check out the article Comic Books Are Still Made By Men, For Men And About Men from FiveThirtyEight. Here are a few visualizations from their analysis you may find interesting.

If you're not exactly in the mood to look at more data after completing this project, we don't blame you. How about winding down with a comic book or movie? Marvel's new movie, Black Panther: Wakanda Forever, is in theaters now! Or take a trip to Balboa Park to visit the Comic Con Museum.
Submission Instructions
Make sure that any cells you added in Question 4 are in "Raw NBConvert" format, and that you didn't accidentally change any provided cells to "Raw NBConvert".
As usual, follow these steps to submit your assignment:
Select Kernel -> Restart & Run All to ensure that you have executed all cells, including the test cells.
Read through the notebook to make sure everything is fine and all tests passed.
Run the cell below to run all tests, and make sure that they all pass.
Download your notebook using File -> Download as -> Notebook (.ipynb), then upload your notebook to Gradescope. Don't forget to add your partner to your group on Gradescope!
If running all the tests at once causes a test to fail that didn't fail when you ran the notebook in order, check to see if you changed a variable's value later in your code. Make sure to use new variable names instead of reusing ones that are used in the tests.