GitHub Repository: dsc-courses/dsc10-2022-fa
Path: blob/main/homeworks/hw01/hw01.ipynb

Kernel: Python 3 (ipykernel)

Homework 1: Causality and Basic Python

Due Tuesday, October 4th at 11:59PM

Welcome to Homework 1! This week's HW will cover causality and basic Python. You can find additional help on these topics in Chapter 2 of Computational and Inferential Thinking and BPD 1-6 in the babypandas notes.


Instructions

This assignment is due Tuesday, October 4th at 11:59PM. You are given six slip days throughout the quarter to extend deadlines. See the syllabus for more details. With the exception of using slip days, late work will not be accepted unless you have made special arrangements with your instructor.

Remember to start early and submit often.

Important: For homeworks, the otter tests don't usually tell you that your answer is correct. More often, they help catch careless mistakes. It's up to you to ensure that your answer is correct. If you're not sure, ask someone (not for the answer, but for some guidance about your approach). These are great questions for office hours (the schedule can be found here) or EdStem. Directly sharing answers is not okay, but discussing problems with the course staff or with other students is encouraged.

In [ ]:

# Please don't change this cell, but do make sure to run it.
import babypandas as bpd
import matplotlib.pyplot as plt
import numpy as np
import otter
grader = otter.Notebook()

plt.style.use('ggplot')

1. College Graduates in Alaska 🎓

According to the USDA Economic Research Service, about 108 million people in the United States of America have completed college (received a bachelor's degree or higher). Only 220,017 of those people lived in Alaska. That's a proportion of $\frac{220{,}017}{108{,}000{,}000} \approx 0.00204$, or $0.204\%$, which certainly doesn't sound like a lot.
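
As a quick check, the arithmetic above can be reproduced in Python. This is only a sketch; the variable names are illustrative and are not used elsewhere in the assignment.

```python
# Figures quoted in the paragraph above; the variable names are illustrative.
alaska_grads = 220_017
us_grads = 108_000_000

proportion = alaska_grads / us_grads   # fraction of US college graduates who lived in Alaska
percentage = proportion * 100          # the same quantity expressed as a percent

print(round(proportion, 5))  # 0.00204
print(round(percentage, 3))  # 0.204
```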

However, it's hard to evaluate the meaning of this data without more information. If you could request one additional piece of data (one number) to better understand the education level of Alaskans as compared to all Americans, what would you want to know? Explain how you would use that piece of data to determine whether there are more people with a college degree in Alaska as compared to elsewhere in the US.

Note: This is a manually graded question. It will not be "autograded"; our tutors will read and grade your work.

Type your answer here, replacing this text.

2. Characters in Winnie-the-Pooh 🧸🍯🐷

In Lecture 1, we counted the number of times that the characters Amy, Beth, Jo, Meg, and Laurie were named in each chapter of the classic book, Little Women. In this question, we'll look at another classic book – Winnie-the-Pooh (1926), written by A. A. Milne and illustrated by Ernest H. Shepard. At the start of 2022, the copyright protections of the original book (but not the Disney franchise!) expired, and so sites like Project Gutenberg are now able to post copies of the book without violating any copyright laws. Click here to read a version of the book that has all of its original illustrations!

Four of the main characters in Winnie-the-Pooh are Pooh (🧸), Piglet (🐷), Eeyore (🐴), and Christopher Robin (🧍).

Below, we've written code that shows the number of mentions of each of these four characters in each of the 10 chapters of the book. However, instead of displaying this information in a scatter plot, as was done in Lecture 1, we will use a bar chart.

Run the cell below.

In [ ]:

# This cell contains code that hasn't yet been covered in the course.
# It isn't expected that you'll understand the code, but you should be able to 
# interpret the bar chart it generates.

# Open the book and split it into chapters
book_file = 'data/winnie-the-pooh.txt'
raw_book = open(book_file, encoding="utf-8").read()
end_pos = raw_book.index('*** END OF THE PROJECT GUTENBERG EBOOK WINNIE-THE-POOH ***')
chapters = raw_book[:end_pos].split('CHAPTER ')[1:]

# Find the number of words in each chapter
chapter_lengths = (np.array([len(c.split(' ')) for c in chapters]) / 100)

# Find the number of mentions per 100 words for each character and chapter
characters = bpd.DataFrame().assign(
    Chapter=np.arange(1, 11),
    Pooh=np.char.count(chapters, 'Pooh') / chapter_lengths,
    Piglet=np.char.count(chapters, 'Piglet') / chapter_lengths,
    Eeyore=np.char.count(chapters, 'Eeyore') / chapter_lengths,
    Christopher=np.char.count(chapters, 'Christopher') / chapter_lengths
)

characters.plot(kind='bar', x='Chapter', figsize=(14, 8));
plt.ylabel('Mentions per 100 words in chapter');

Looking at the bar chart, we see that the height of the bar for Pooh in Chapter 1 is 1.5 and the y-axis of this graph is “Mentions per 100 words in chapter”; this means that 1.5 of every 100 words in Chapter 1 are "Pooh" (or in other words, $1.5\%$ of the words in Chapter 1 are "Pooh").
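
Because the bars show a rate rather than a count, the number of mentions also depends on how long the chapter is. The sketch below illustrates this with made-up chapter lengths; none of these numbers come from the book.

```python
# Hypothetical numbers for illustration; they are not taken from the book.
mentions_per_100_words = 1.5
short_chapter_words = 2000
long_chapter_words = 4000

# A rate per 100 words turns into a count once it is scaled by the chapter length.
short_chapter_mentions = mentions_per_100_words * short_chapter_words / 100
long_chapter_mentions = mentions_per_100_words * long_chapter_words / 100

print(short_chapter_mentions)  # 30.0
print(long_chapter_mentions)   # 60.0
```

Two bars of equal height can therefore correspond to different numbers of mentions when the chapters have different lengths.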

Question 2.1. The very first time Eeyore’s name is used in the story is in the following sentence:

The Old Grey Donkey, Eeyore, stood by himself in a thistly corner of the forest, his front feet well apart, his head on one side, and thought about things.

In which chapter is this sentence? Assign an integer between 1 and 10 to the variable pooh_part1.

In [ ]:

pooh_part1 = ...

In [ ]:

grader.check("q2_1")

Question 2.2. Pooh is mentioned 47 times in Chapter 5. How many times is Christopher mentioned in Chapter 5? Assign 1, 2, 3, 4, or 5 to the variable pooh_part2.

In [ ]:

pooh_part2 = ...

In [ ]:

grader.check("q2_2")

Question 2.3. Which of the following is a valid conclusion we can make based only on the above plot? Assign 1, 2, 3, 4, or 5 to pooh_part3. There is only one correct answer.

1. Piglet is mentioned more times in Chapter 3 than he is in Chapter 5.
2. The chapter that Pooh is mentioned the most in is Chapter 2.
3. Christopher is mentioned roughly the same number of times in each of Chapters 1, 2, and 3.
4. Pooh is mentioned roughly the same number of times in Chapters 4 and 5.
5. Christopher and Eeyore are mentioned roughly the same number of times in Chapter 10.

In [ ]:

pooh_part3 = ...

In [ ]:

grader.check("q2_3")

Note: The tests in this section only check that you set each variable to a number in the correct range. Unlike in labs, tests in homeworks do not check that you answered correctly; they only check that your answer is reasonable, or in the correct format. To put it another way: all of your tests might pass, but that doesn't mean you'll get full credit – some of your answers may still be wrong. It's up to you to make sure that they're right!
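
To make the distinction concrete, here is a rough sketch of what a format-only check might look like. The function `looks_reasonable` is hypothetical and written purely for illustration; it is not the actual otter test used in this course.

```python
# Hypothetical format check, written for illustration only;
# it is not the actual otter test used by the course.
def looks_reasonable(answer, low, high):
    """Return True if answer is an integer between low and high, inclusive."""
    return isinstance(answer, int) and low <= answer <= high

print(looks_reasonable(7, 1, 10))    # True: right type and range, but not necessarily correct
print(looks_reasonable(11, 1, 10))   # False: outside the allowed range
print(looks_reasonable(..., 1, 10))  # False: the placeholder was never replaced
```

A check like this would pass for every integer from 1 to 10, which is why a passing test says nothing about whether the answer is right.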

3. Python Basics 🐍

Question 3.1. When you run the following cell, Python produces a cryptic error message.

In [ ]:

2022 = 2020 + 2.0

Choose the best explanation of what's wrong with the code, and then assign 1, 2, 3, or 4 to basics_part1 below to indicate your answer.

1. Python is not able to add an int to a float because they are of different data types.
2. The left hand side is an int, while the right hand side is a float. It should be 2022.0 = 2020 + 2.0.
3. The result should be written after the calculation. It should be 2020 + 2.0 = 2022.
4. This is creating a variable called 2022, which doesn't make sense because 2022 is a number.

Important: Once you have finished this question, "comment out" the above code cell by replacing its contents with # 2022 = 2020 + 2.0. This will prevent the error message from appearing when your notebook is graded.

In [ ]:

basics_part1 = ...

In [ ]:

grader.check("q3_1")

Question 3.2. Consider the following poorly-written code.

In [ ]:

three = 3
three = three * three
three = three + three
three = three * three
three = -three

As this code executes, what values does the variable three take on? Assign 1, 2, 3, or 4 to basics_part2 to indicate your answer.

1. The variable three takes on the values 3, 9, 18, 324, -324.
2. The variable three takes on the values 3, 9, 81, 243, -243.
3. The variable three takes on the values 3, 6, 12, 36, -36.
4. The variable three takes on the values 3, 9, 18, -54, 54.

In [ ]:

basics_part2 = ...

In [ ]:

grader.check("q3_2")

4. Road Trip 🚘

You and your friend recently went on a road trip, and you want to perform some calculations on data you gathered throughout your journey. Answer the questions below, using Python to perform all the intermediate calculations, such as adding, squaring, and dividing.

Note that the math package has not been imported. You don't need it for this question, and you should not import it; otherwise, the autograder may produce an error.

Question 4.1. On the first day of the trip, your friend drove the car at three different speeds, for varying lengths of time, as shown below:

Journey	Speed (miles per hour)	Time (hours)
Part 1	18	2
Part 2	47	1
Part 3	65	4

Using this information, calculate the average speed, in miles per hour, at which your friend drove the car that day, and assign your answer to the variable means_part1. Recall from math and physics that average speed is the total distance divided by total time.
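
As a reminder of how this works, here is a small worked example using a made-up trip. The speeds and times are hypothetical and deliberately different from the values in the table above.

```python
# Hypothetical trip, chosen for illustration; these are not the values in the table above.
# 30 mph for 1 hour, then 60 mph for 3 hours.
example_distance = 30 * 1 + 60 * 3   # distance = speed * time, summed over the parts
example_time = 1 + 3                 # total hours on the road
example_average_speed = example_distance / example_time

print(example_average_speed)  # 52.5
```

Note that the result is not the plain average of the two speeds (45 mph), because more time was spent at the higher speed.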

In [ ]:

# Feel free to define intermediate variables to use in your solution.


total_distance = ...
total_time = ...
means_part1 = ...
means_part1

In [ ]:

grader.check("q4_1")

Question 4.2. On the second day of the trip, your friend drove the car three times again, but this time at the speeds and distances seen below:

Journey	Speed (miles per hour)	Distance (miles)
Part 1	18	2
Part 2	47	1
Part 3	65	4

Using this information, calculate the average speed, in miles per hour, at which your friend drove the car that day, and assign your answer to the variable means_part2.

Note that the third column is Distance (miles), not Time (hours). Unlike in Question 4.1, you aren't given the amount of time that each part of the journey took; you need to compute these times yourself. To calculate the time taken for each part of the journey, divide the distance for that part by the speed for that part. Finally, add up the times for the three parts of the trip to find the total time.

In [ ]:

# Feel free to define intermediate variables to use in your solution.


total_distance = ...
total_time = ...

means_part2 = ...
means_part2

In [ ]:

grader.check("q4_2")

Question 4.3. On the way back home, your friend stops at a pet store to buy an aquarium. The only one available is a rectangular tank, which unfortunately doesn't fit in the car because of your suitcases. This tank has a height of 18 inches, a width of 47 inches, and a length of 65 inches.

Your friend thinks that the aquarium would have fit in the car with all your suitcases if it had the same volume, but was shaped as a cube instead. What would the length of each side of such an aquarium be in inches? Save your answer in the variable means_part3.

In [ ]:

# Feel free to define intermediate variables to use in your solution.


means_part3 = ...
means_part3

In [ ]:

grader.check("q4_3")

In this problem, though you calculated three different quantities in three different ways, all of your results are actually considered means, of various kinds!

In Question 4.1, given $n$ values $x_1, x_2, ..., x_n$, you found an arithmetic mean, using the formula ${x_1+x_2+...+x_n \over n},$ where the numerator represented total distance and the denominator represented total time. An arithmetic mean is the usual type of mean or average you're used to seeing. It turns out that you actually computed a more sophisticated arithmetic mean, known as a weighted arithmetic mean, $\frac{w_1 x_1 + w_2 x_2 + ... + w_n x_n}{w_1 + w_2 + ... + w_n},$ where the values $x_1, x_2, x_3$ were the speeds and the weights $w_1, w_2, w_3$ were the times travelled in each part of the journey.
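
The weighted arithmetic mean formula can be sketched as a short Python function. The function name and the example numbers are assumptions made for illustration; they are not part of the assignment.

```python
def weighted_arithmetic_mean(values, weights):
    """Sum of weight * value, divided by the sum of the weights."""
    weighted_total = sum(w * x for x, w in zip(values, weights))
    return weighted_total / sum(weights)

# Hypothetical speeds (mph) weighted by hypothetical times (hours).
print(weighted_arithmetic_mean([20, 50], [3, 2]))  # 32.0
# With equal weights, the result is the ordinary arithmetic mean.
print(weighted_arithmetic_mean([20, 50], [1, 1]))  # 35.0
```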

In Question 4.2, given $n$ values $x_1, x_2, ..., x_n$, you found a harmonic mean, using the formula ${n \over {{1 \over x_1}+{1 \over x_2}+ ... + {1 \over x_n}}},$ where the numerator represented total distance and the denominator represented total time. To calculate the total time, you needed to sum the time taken for each part of the trip, calculated using the fact that time is distance over speed. Again, it turns out that you actually computed the weighted harmonic mean, but this time the weights were the distances travelled. If you're curious, see the formula here.
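
For reference, the weighted harmonic mean divides the sum of the weights by the sum of each weight over its value. A minimal sketch, with a hypothetical function name and made-up numbers:

```python
def weighted_harmonic_mean(values, weights):
    """Sum of the weights, divided by the sum of weight / value."""
    return sum(weights) / sum(w / x for x, w in zip(values, weights))

# Hypothetical speeds (mph) weighted by hypothetical distances (miles):
# 60 miles at 30 mph takes 2 hours, and 60 miles at 60 mph takes 1 hour.
print(weighted_harmonic_mean([30, 60], [60, 60]))  # 40.0
```

Here the numerator is the total distance (120 miles) and the denominator is the total time (3 hours), matching the description above.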

Finally, in Question 4.3, given $n$ values $x_1, x_2, ..., x_n$, you found a geometric mean, using the formula ${\sqrt[n]{x_1 \cdot x_2 \cdot ... \cdot x_n}},$ where each value represented a dimension of the rectangular tank.
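
The geometric mean can be computed without the math package by raising the product to the power $1/n$. The sketch below uses a hypothetical function name and a made-up box, not the tank from Question 4.3.

```python
def geometric_mean(values):
    """n-th root of the product of n positive values."""
    product = 1
    for x in values:
        product = product * x
    return product ** (1 / len(values))

# Hypothetical box measuring 2 x 4 x 8; a cube with the same volume has sides of about 4.
side = geometric_mean([2, 4, 8])
print(round(side, 6))  # 4.0
```

The result is rounded before printing because fractional powers are computed with floating-point arithmetic and may be off by a tiny amount.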

As you can see, there are many different notions of mean. You'll learn about some of them if you take DSC 40A!

5. Beverage Consumption Among Youth 🧃

In this problem, we want to quantify how dissimilar three different age categories (little kids, big kids, and teens) are, in terms of their beverage consumption, using three commonly consumed beverages (water, milk, and soft drinks).

The data below comes from the CDC's Beverage Consumption Among Youth in the United States, 2013-2016.

Percent of Total Beverage Consumption	Little Kids (Ages 2-5)	Big Kids (Ages 6-11)	Teens (Ages 12-19)
Water	39.5	41.9	47.0
Milk	32.1	24.4	14.5
Soft Drinks	13.0	20.9	22.3

We define the dissimilarity between two age groups as the largest absolute difference between their 3 respective consumption percentages.

To better understand dissimilarity, consider the following hypothetical situation.

Age group A's consumption of water is 10 percent more than age group B's.
Age group A's consumption of milk is 3 percent less than age group B's.
Age group A's consumption of soft drinks is 7 percent less than age group B's.

Here, we would say the dissimilarity between age group A and age group B is 10, since 10 is larger than both 3 and 7.
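
The hypothetical situation above can be sketched in code. The percentages below are made up so that the three differences are 10, 3, and 7; they are not taken from the CDC table.

```python
# Hypothetical percentages chosen to match the situation described above.
a_water, a_milk, a_soft = 50, 20, 13
b_water, b_milk, b_soft = 40, 23, 20

# Largest absolute difference across the three beverages.
example_dissimilarity = max(abs(a_water - b_water),
                            abs(a_milk - b_milk),
                            abs(a_soft - b_soft))

print(example_dissimilarity)  # 10
```

Taking absolute values first means it does not matter which group has the higher percentage for a given beverage.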

Question 5.1. Using this method, compute the dissimilarity between little kids and big kids. Assign the result to the variable dissimilarity. Use a single expression (a single line of code) to compute the answer. Let Python perform all the arithmetic (like subtracting) rather than simplifying the expression yourself.

Hint: The built-in abs function computes absolute values.

In [ ]:

dissimilarity = ...
dissimilarity

In [ ]:

grader.check("q5_1")

Question 5.2. Which pair of age groups is most dissimilar, according to this measurement? Assign either 1, 2, or 3 to the variable most_dissimilar below.

1. little kids and big kids
2. big kids and teens
3. little kids and teens

In [ ]:

most_dissimilar = ...

In [ ]:

grader.check("q5_2")

Question 5.3. It turns out that if we eliminated a certain one of the three beverage percentages in the table (for example, getting rid of the soft drinks row) and recalculated dissimilarities based on the remaining two percentages only, we would find the dissimilarity between each pair of age groups to be the same as if we had used all three percentages. In other words, one of the three rows of the table ends up not factoring into the calculation for dissimilarity for all three pairs of age groups.

Which percentage can be eliminated without changing the dissimilarity of any pair of age groups in the table? Assign either 1, 2, or 3 to the variable disposable below.

1. The consumption percentage of water.
2. The consumption percentage of milk.
3. The consumption percentage of soft drinks.

In [ ]:

disposable = ...

In [ ]:

grader.check("q5_3")

6. COVID-19 and Brain Damage 🦠🧠

A study released on September 23, 2022 found that people who had COVID-19 have a higher risk for a host of brain injuries than those who never had COVID-19. This Reuters article is a good summary of the study.

The original research article by Evan Xu, Yan Xie & Ziyad Al-Aly, published in Nature Medicine, states:

"Our results show that in the postacute phase of COVID-19, there was increased risk of an array of incident neurologic sequelae including ischemic and hemorrhagic stroke, cognition and memory disorders, peripheral nervous system disorders, episodic disorders (for example, migraine and seizures), extrapyramidal and movement disorders, mental health disorders, musculoskeletal disorders, sensory disorders, Guillain–Barré syndrome, and encephalitis or encephalopathy. We estimated that the hazard ratio of any neurologic sequela was 1.42 (95% confidence intervals 1.38, 1.47) and burden 70.69 (95% confidence intervals 63.54, 78.01) per 1,000 persons at 12 months. The risks and burdens were elevated even in people who did not require hospitalization during acute COVID-19. Limitations include a cohort comprising mostly White males. Taken together, our results provide evidence of increased risk of long-term neurologic disorders in people who had COVID-19."

Question 6.1. Does this study establish that COVID-19 causes brain damage?

If you believe the answer is yes, set the variable covid_q1 to 1; if you believe the answer is no, set covid_q1 to 2.

In [ ]:

covid_q1 = ...

In [ ]:

grader.check("q6_1")

Question 6.2. Do you think this article describes an observational study or a randomized controlled experiment?

If you believe this is an observational study, set the variable covid_q2 to 1; if you believe this is a randomized controlled experiment, set covid_q2 to 2.

In [ ]:

covid_q2 = ...

In [ ]:

grader.check("q6_2")

Question 6.3. How could we establish whether COVID-19 does indeed cause lasting brain damage? Choose the best answer and assign either 1, 2, or 3 to the variable covid_q3 below.

1. We could do an observational study to test this.
2. We could do a randomized controlled trial to test this.
3. None of the above.

In [ ]:

covid_q3 = ...

In [ ]:

grader.check("q6_3")

7. The Taste of Kale Makes Fetuses Grimace? 🤢🥬

The news article below (source) discusses whether fetuses differentiate specific flavours. The full study, published September 21, 2022 can be found here.

"The team looked at ultrasound scans from almost 70 pregnant women, aged 18 to 40 from the north-east of England, who were split into two groups. One group was asked to take a capsule of powdered kale 20 minutes before an ultrasound scan, and the other was asked to take a capsule of powdered carrot. Vegetable consumption by the mothers did not differ between the kale and carrot group.

The team then carried out a frame-by-frame analysis of the frequency of a host of different facial movements of the foetuses, including combinations that resembled laughing or crying. Overall, the researchers examined 180 scans from 99 foetuses, scanned at either 32 weeks, 36 weeks, or at both time points.

Among the results, the team found foetuses showed a crying expression about twice as often when the mother consumed a kale capsule compared with a carrot capsule or no capsule. When the mother consumed a carrot capsule however, the foetuses adopted a laughter-like expression about twice as often as they did when either a kale capsule or no capsule was swallowed by the mother.

But she cautioned the pregnant women were not randomised to experimental or control groups, and that prior exposure of the foetuses in the control group to different vegetables – including carrots and kale – was not known."

Question 7.1. Was this an observational study, a randomized controlled trial, or neither?

If you believe this was an observational study, set the variable kale_q1 to 1; if you believe this was a randomized controlled trial, set kale_q1 to 2; if you believe this was neither, set kale_q1 to 3.

In [ ]:

kale_q1 = ...

In [ ]:

grader.check("q7_1")

Question 7.2. Consider the last sentence of the article excerpt.

"But she cautioned the pregnant women were not randomised to experimental or control groups, and that prior exposure of the foetuses in the control group to different vegetables – including carrots and kale – was not known."

Why do you think the authors disclose this confounding factor despite the evidence that her team collected? Choose the best answer and assign either 1, 2, or 3 to the variable kale_q2 below.

1. To let the reader know there may be no relationship between ingested flavor and fetal response.
2. To let the reader know that the relationship between ingested flavor and fetal response may be non-causal.
3. To let the reader know the relationship between ingested flavor and fetal response may be causal.

In [ ]:

kale_q2 = ...

In [ ]:

grader.check("q7_2")

8. Concussions in Athletes ⛹️‍♀️

The following is an excerpt from a news article on the effects of recent legislation intended to prevent concussions in athletes.

"Since 2014, all 50 states and the District of Columbia have passed laws to protect young athletes against traumatic brain injury (TBI). Washington State was the first in 2009.

Most of the laws require athletes with suspected concussions to stop playing until a doctor clears them to return. Coaches, players, and parents must also receive yearly education about concussions.

Between fall 2005 and spring 2016, student athletes reported about 2.7 million concussions. Of those, 89 percent were new and 11 percent were repeat injuries.

In 2005, nearly 135,000 initial concussions were reported. The number jumped to more than 360,000 by 2016.

After concussion laws were introduced, however, repeat injuries fell dramatically, from about 14 percent of all concussions in 2005 to roughly 7 percent in 2016."

Which of the following is the most likely explanation for the fact that initial concussions nearly tripled from 135,000 in 2005 to 360,000 in 2016? Choose the best answer and assign either 1, 2, 3, or 4 to the variable athletes below.

1. An increase in the danger of athletics.
2. An increase in the number of athletes.
3. An increase in awareness about concussions.
4. An increase in the population of the United States.

In [ ]:

athletes = ...

In [ ]:

grader.check("q8")

9. Randomized Controlled Experiments 🎲

A researcher wants to run a randomized controlled trial to answer one of the questions below. Which of these questions is she not able to answer via RCT? Choose the best answer and assign either 1, 2, 3, or 4 to the variable no_rct below.

1. Does attending a programming bootcamp cause students to get better grades in a programming course?
2. Does drinking cold water help with muscle recovery after a workout?
3. Does playing the lottery cause heart disease?
4. Does being convicted of a crime lower self-esteem?

In [ ]:

no_rct = ...

In [ ]:

grader.check("q9")

Finish Line: Almost there, but make sure to follow the steps below to submit! 🏁

1. Make sure to comment out the code in Question 3.1 that causes an error.
2. Select Kernel -> Restart & Run All to ensure that you have executed all cells, including the test cells.
3. Read through the notebook to make sure all cells ran and all tests passed.
4. Run the cell below to run all tests, and make sure that they all pass.
5. Download your notebook using File -> Download as -> Notebook (.ipynb), then upload your notebook to Gradescope.
For a lab, the grade you see on Gradescope is your score on the assignment. For a homework or a project, the grade you see on Gradescope is not your final score. We will run correctness tests after the assignment's due date has passed.

In [ ]:

grader.check_all()