GitHub Repository: dsc-courses/dsc10-2022-fa
Path: blob/main/labs/lab04/lab04.ipynb

Kernel: Python 3 (ipykernel)

Lab 4: DataFrames, Control Flow, and Probability

Due Saturday, October 22nd at 11:59PM

Welcome to Lab 4! This week, we will go over more DataFrame manipulation techniques, conditionals and iteration, and introduce the concept of randomness. This lab is due on Saturday, October 22nd at 11:59PM.

Refer to the following readings:

Grouping with subgroups (see BPD 11.4)
Merging DataFrames (see BPD 13)
Conditional statements (see CIT 9.1)
Iteration (see CIT 9.2)
Probability (see CIT 9.5)

First, set up the tests and imports by running the cell below.

In [ ]:

import numpy as np
import babypandas as bpd

# These lines set up graphing capabilities.
import matplotlib
import matplotlib.pyplot as plt
plt.style.use('ggplot')

import otter
grader = otter.Notebook()

%reload_ext pandas_tutor

1. California National Parks 🏞️ 🐻

In this question, we'll take a closer look at the DataFrame methods merge and groupby.

We will be working with two datasets, california_parks.csv (stored as parks) and california_parks_species.csv (stored as species), which provide information on California National Parks and the species of plants and animals found there, respectively. These are a subset of a larger dataset that the National Park Service provides. We've also created a third DataFrame, parks_species, that contains the number of species per park.

Run the cell below to load in our data.

In [ ]:

parks = bpd.read_csv("data/california_parks.csv")
species = bpd.read_csv("data/california_parks_species.csv")
parks_species = bpd.DataFrame().assign(
    count=species.groupby('Park Name').count().get('Category')
)

Right now, the information we have on each California National Park is split across two DataFrames. The parks DataFrame has the code, state, size, and location of each park, and the parks_species DataFrame contains the number of species at each park. Run the cells below to see both DataFrames.

In [ ]:

parks

In [ ]:

parks_species
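Before merging the park data, it may help to see merge on a small example. The sketch below uses pandas, on the assumption that the same merge arguments carry over to babypandas, and the fruit data is made up purely for illustration. It matches a column in the left DataFrame against the index of the right DataFrame, which is the situation you are in when one of the DataFrames came from a groupby.

```python
import pandas as pd

# Made-up toy data, unrelated to the lab's files.
fruits = pd.DataFrame({
    "Fruit": ["apple", "banana", "cherry"],
    "Color": ["red", "yellow", "red"],
})

# One row per fruit, with the fruit name stored in the index
# (similar to what a groupby produces).
prices = pd.DataFrame(
    {"price": [1, 7, 3]},
    index=["banana", "cherry", "apple"],
)

# Match the 'Fruit' column on the left with the index on the right.
merged = fruits.merge(prices, left_on="Fruit", right_index=True)
print(merged)
```

Each fruit appears once in both DataFrames, so the merged DataFrame has exactly one row per fruit.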

Question 1.1. Below, use the merge method to create a new DataFrame named parks_with_species, which will have the parks' existing information along with the number of species each has. Make sure the DataFrame only has one row per park. Your DataFrame should look like this:

	Park Code	Park Name	State	Acres	Latitude	Longitude	count
0	CHIS	Channel Islands National Park	CA	249561	34.01	-119.42	1885
1	JOTR	Joshua Tree National Park	CA	789745	33.79	-115.9	2294
2	LAVO	Lassen Volcanic National Park	CA	106372	40.49	-121.51	1797
3	PINN	Pinnacles National Park	CA	26606	36.48	-121.16	1416
4	REDW	Redwood National Park	CA	112512	41.3	-124	6310
5	SEKI	Sequoia and Kings Canyon National Parks	CA	865952	36.43	-118.68	1995
6	YOSE	Yosemite National Park	CA	761266	37.83	-119.5	2088

In [ ]:

parks_with_species = ...
parks_with_species

In [ ]:

grader.check("q1_1")

Now, let's take a look at the species DataFrame. Each park has a lot of different species, and each species varies in abundance at each park.

In [ ]:

species

Question 1.2. Using the groupby method, assign the variable species_abundance to a DataFrame that counts the species in each park at each level of abundance, by grouping on both 'Park Name' and 'Abundance'.

Hint: Reset the index and assign columns so that you have three columns: 'Park Name', 'Abundance', and 'Category'. The first few rows of your DataFrame should look like this:

	Park Name	Abundance	Category
0	Channel Islands National Park	Abundant	48
1	Channel Islands National Park	Common	228
2	Channel Islands National Park	Occasional	190
3	Channel Islands National Park	Rare	368
4	Channel Islands National Park	Uncommon	471
5	Channel Islands National Park	Unknown	173
6	Joshua Tree National Park	Abundant	37
7	Joshua Tree National Park	Common	543
8	Joshua Tree National Park	Occasional	84
9	Joshua Tree National Park	Rare	90
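As a reminder of how grouping with subgroups behaves, here is a small sketch on made-up data. It is written with pandas, on the assumption that groupby with a list of column names works the same way in babypandas, and the shelter data is hypothetical. After grouping, every remaining column holds the same counts, and reset_index moves the grouping columns out of the index.

```python
import pandas as pd

# Made-up toy data, unrelated to the lab's files.
pets = pd.DataFrame({
    "Shelter": ["North", "North", "North", "South", "South"],
    "Species": ["cat", "cat", "dog", "dog", "dog"],
    "Name": ["Ada", "Bo", "Cy", "Di", "Ed"],
})

# Group by two columns at once, count the rows in each subgroup,
# then turn the two-level index back into ordinary columns.
counts = pets.groupby(["Shelter", "Species"]).count().reset_index()
print(counts)
```

The 'Name' column of counts no longer contains names. It contains the number of rows in each (Shelter, Species) subgroup.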

In [ ]:

species_abundance = ...
species_abundance

In [ ]:

grader.check("q1_2")

2. Nachos 🧀 🌶️

In Python, Boolean values can either be True or False. We get Boolean values when using comparison operators, among which are < (less than), > (greater than), and == (equal to). For a more complete list, refer to this.

Run the cell below to see an example of a comparison operator in action.

In [ ]:

3 > 1 + 1

We can even assign the result of a comparison operation to a variable.

In [ ]:

result = 10 / 2 == 5
result

Arrays are compatible with comparison operators. The comparison is applied to each element, and the output is an array of Boolean values.

In [ ]:

np.array([1, 5, 7, 8, 3, -1]) > 3
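Since True counts as nonzero and False counts as zero, a Boolean array can be passed to np.count_nonzero to count how many elements satisfy a condition. Here is a small sketch with made-up temperatures (the values and variable names are for illustration only).

```python
import numpy as np

# Made-up temperatures, for illustration only.
temperatures = np.array([61, 75, 68, 80, 72])

# Element-wise comparison produces one Boolean per temperature.
is_warm = temperatures > 70

# Count how many of those Booleans are True.
number_warm = np.count_nonzero(is_warm)
number_warm
```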

Waiting on the dining table just for you is a hot bowl of nachos! Let's say that whenever you take a nacho, it will have cheese, salsa, both, or neither (just a plain tortilla chip).

Using the function call np.random.choice(array_name), let's simulate taking nachos from the bowl at random. Start by running the cell below several times, and observe how the results change.

In [ ]:

nachos = np.array(['cheese', 'salsa', 'both', 'neither'])
np.random.choice(nachos)
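np.random.choice also accepts a second argument that specifies how many items to draw, in which case it returns an array rather than a single value. The sketch below uses a made-up coin array; the seed is arbitrary and is only there so that rerunning the cell gives the same draws.

```python
import numpy as np

np.random.seed(0)  # arbitrary seed, only to make reruns repeatable

coin = np.array(["heads", "tails"])

# One draw: a single value.
one_flip = np.random.choice(coin)

# Five draws at once: an array of length 5.
# By default, draws are made with replacement.
five_flips = np.random.choice(coin, 5)
five_flips
```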

Assume we took ten nachos at random, and stored the results in an array called ten_nachos.

In [ ]:

ten_nachos = np.array(['neither', 'cheese', 'both', 'both', 'cheese', 'salsa', 'both', 'neither', 'cheese', 'both'])

Question 2.1. Find the number of nachos with only cheese using code (do not hardcode the answer).

Hint: Our solution involves a comparison operator and the np.count_nonzero function.

In [ ]:

number_cheese = ...
number_cheese

In [ ]:

grader.check("q2_1")

Conditional Statements

A conditional statement is made up of multiple lines of code that allow Python to choose from different alternatives based on whether some condition is true.

Here is a basic example.

def sign(x):
    if x > 0:
        return 'Positive'

If the input x is greater than 0, the function returns the string 'Positive'. Otherwise, no return statement is reached, so the function returns None.

If we want to test multiple conditions at once, we use the following general format.

if <if expression>:
    <if body>
elif <elif expression 0>:
    <elif body 0>
elif <elif expression 1>:
    <elif body 1>
...
else:
    <else body>

Only one of the bodies will ever be executed. Each if and elif (else-if) expression is evaluated and considered in order, starting at the top. As soon as a true value is found (i.e. once a condition is met), the corresponding body is executed, and the rest of the conditional statement is skipped. If none of the if or elif expressions are true, then the else body is executed. For more examples and explanation, refer to CIT 9.1.
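As an illustration of this format, here is one way the earlier sign example could be extended to cover every number. The function name describe_sign is our own choice for this sketch.

```python
def describe_sign(x):
    # Conditions are checked from top to bottom,
    # and only the first true one has its body run.
    if x > 0:
        return 'Positive'
    elif x < 0:
        return 'Negative'
    else:
        # Reached only when neither condition above is true.
        return 'Zero'

describe_sign(-3)
```

Because of the else, this version always returns a string, no matter what number is passed in.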

Question 2.2. Complete the following conditional statement so that the string 'More please' is assigned to say_please if the number of nachos with cheese in ten_nachos is less than 5.

In [ ]:

...
    say_please = 'More please'
    
say_please

In [ ]:

grader.check("q2_2")

Question 2.3. Write a function called nacho_reaction that returns a string based on the type of nacho passed in. From top to bottom, the conditions should correspond to: 'cheese', 'salsa', 'both', 'neither'.

In [ ]:

def nacho_reaction(nacho):
    ...
        return 'Cheesy!'
    # next condition should return 'Spicy!'
    ...
    # next condition should return 'Wow!'
    ...
    # next condition should return 'Meh.'
    ...

spicy_nacho = nacho_reaction('salsa')
spicy_nacho

In [ ]:

grader.check("q2_3")

Now consider the DataFrame ten_nachos_reactions defined below.

In [ ]:

ten_nachos_reactions = bpd.DataFrame().assign(Nachos=ten_nachos)
ten_nachos_reactions

Question 2.4. Add a column named 'Reactions' to the DataFrame ten_nachos_reactions that consists of reactions for each of the nachos in ten_nachos.

Hint: Use the apply method.
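As a refresher, apply calls a function once on each value in a column and collects the results in a new Series. The sketch below uses pandas with a made-up DataFrame and a made-up function called shout, on the assumption that get, apply, and assign behave the same way in babypandas.

```python
import pandas as pd

def shout(word):
    # Takes one value from the column and returns one value.
    return word.upper() + "!"

# Made-up toy data, for illustration only.
greetings = pd.DataFrame({"Word": ["hi", "hello", "hey"]})

# Note that we pass the function itself (shout), not a call to it (shout()).
greetings = greetings.assign(Loud=greetings.get("Word").apply(shout))
greetings
```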

In [ ]:

ten_nachos_reactions = ...
ten_nachos_reactions

In [ ]:

grader.check("q2_4")

Question 2.5. Using code, find the number of 'Meh.' reactions for the nachos in ten_nachos_reactions. Think about how you could find this either by using DataFrame methods or by using np.count_nonzero.

In [ ]:

number_meh_reactions = ...
number_meh_reactions

In [ ]:

grader.check("q2_5")

Question 2.6. Copy the expression in the cell below to the following cell, and change some of the ==s in the expression to something else (like < or >) so that should_be_true is True.

In [ ]:

should_be_true = number_cheese == number_meh_reactions == np.count_nonzero(ten_nachos == 'both')
should_be_true
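The expression above chains two comparisons together. In Python, a chained comparison such as a < b == c is evaluated as (a < b) and (b == c), so each operator compares only the two values next to it. Here is a small sketch with made-up numbers.

```python
# Made-up values, for illustration only.
a, b, c = 2, 5, 5

# A chained comparison...
chained = a < b == c

# ...means the same thing as joining each neighboring pair with `and`.
spelled_out = (a < b) and (b == c)

# Chaining == requires every neighboring pair to be equal.
all_equal = a == b == c

print(chained, spelled_out, all_equal)
```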

In [ ]:

should_be_true = ...
should_be_true

In [ ]:

grader.check("q2_6")

Question 2.7. Complete the function both_or_neither, which takes in a DataFrame of nachos with reactions (with the same column names as the ten_nachos_reactions DataFrame from Question 2.4) and returns 'Wow!' if there are more nachos with both cheese and salsa, or 'Meh.' if there are more nachos with neither. If there are an equal number of each, return 'Okay!'.

In [ ]:

def both_or_neither(nacho_df):
    nachos = ...
    number_both = ...
    number_neither = ...
    ...
        return 'Wow!'
    # The next condition should return 'Meh.'
    ...
    # The next condition should return 'Okay!'
    ...

# Below, we create a DataFrame with randomly-generated data and test your function on it.
# Do NOT change anything below this line.
# However, you may want to add a new cell and evaluate both_or_neither(ten_nachos_reactions) to see
# if your function behaves as expected.
np.random.seed(24)
many_nachos = bpd.DataFrame().assign(Nachos=np.random.choice(nachos, 250))
many_nachos = many_nachos.assign(Reactions=many_nachos.get("Nachos").apply(nacho_reaction))
result = both_or_neither(many_nachos)
result

In [ ]:

grader.check("q2_7")

3. Hungry Billy 🍗 🍕🍟

After a long day of class, Billy decides to go to Dirty Birds for dinner. Today's menu has Billy's four favorite foods: wings, pizza, fries, and mozzarella sticks. However, each dish has a 25% chance of running out before Billy can get to Dirty Birds.

Note: Use Python as your calculator. Your answers should be expressions (like 0.5 ** 2); don't simplify your answers using an outside calculator. Also, all of your answers should be given as decimals between 0 and 1, not percentages.
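To show what we mean by writing answers as expressions, here is a sketch for a different scenario: rolling a fair six-sided die three times. It assumes the rolls are independent, so probabilities multiply, and it uses the fact that the probability of an event not happening is 1 minus the probability that it happens. The variable names are our own.

```python
# Probability of rolling a six on a single roll of a fair die.
p_six = 1 / 6

# Independent rolls: multiply the probabilities.
p_three_sixes = p_six ** 3

# "At least one six" is the opposite of "no sixes at all".
p_no_six_in_three = (1 - p_six) ** 3
p_at_least_one_six = 1 - p_no_six_in_three

p_at_least_one_six
```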

Question 3.1. What is the probability that Billy will be able to eat wings at Dirty Birds?

In [ ]:

wings_prob = ...
wings_prob

In [ ]:

grader.check("q3_1")

Question 3.2. What is the probability that Billy will be able to eat all four of these foods at Dirty Birds?

In [ ]:

all_prob = ...
all_prob

In [ ]:

grader.check("q3_2")

Question 3.3. What is the probability that Dirty Birds will have run out of something (anything) before Billy can get there?

In [ ]:

something_is_out = ...
something_is_out

In [ ]:

grader.check("q3_3")

To make up for their unpredictable food supply, Dirty Birds decides to hold a contest for some free HDH Dining swag. There is a bag with three red marbles, three green marbles, and three blue marbles. Billy has to draw three marbles without replacement. In order to win, all three marbles Billy draws must be of different colors.

Question 3.4. What is the probability that Billy wins the contest?

Hint: If you're stuck, start by determining the probability that the second marble Billy draws is different from the first marble Billy draws.
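When drawing without replacement, the probabilities change after each draw because there are fewer items left. As a sketch in a different setting, consider drawing two cards from a standard 52-card deck with 4 aces and asking for the probability that both are aces. The variable names are our own.

```python
# First draw: 4 aces among 52 cards.
p_first_ace = 4 / 52

# Second draw, given the first was an ace: 3 aces among the 51 remaining cards.
p_second_ace_given_first = 3 / 51

# Multiply the probabilities of the successive draws.
p_two_aces = p_first_ace * p_second_ace_given_first
p_two_aces
```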

In [ ]:

winning_prob = ...
winning_prob

In [ ]:

grader.check("q3_4")

4. Iteration 🔂

Using a for loop, we can perform a task multiple times. This is known as iteration. Here, we'll simulate drawing different suits from a deck of cards. 🃏

In [ ]:

suits = np.array(['♣️', '♥️', '♠️', '♦️'])

draws = np.array([])

repetitions = 6

for i in np.arange(repetitions):
    draws = np.append(draws, np.random.choice(suits))

draws

Another use of iteration is to loop through a set of values. For instance, we can print out all of the colors of the rainbow. 🌈

In [ ]:

rainbow = np.array(["red", "orange", "yellow", "green", "blue", "indigo", "violet"])

for color in rainbow:
    print(color)

We can see that the indented part of the for loop, known as the body, is executed once for each item in rainbow. Note that the name color is arbitrary; we could replace both instances of color in the cell above with any valid variable name and the code would work the same.

We can also use a for loop to add to a variable in an iterative fashion. Here, we count the number of even numbers in an array of numbers. Each time we encounter an even number in num_array, we increase even_count by 1. To check if an individual number is even, we compute its remainder when divided by 2 using the % (modulus) operator.

In [ ]:

num_array = np.array([1, 3, 4, 7, 21, 23, 28, 28, 30])

even_count = 0

for i in num_array:
    if i % 2 == 0:
        even_count = even_count + 1
        
even_count
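One way to double-check a loop like this is to compute the same count with an array comparison and np.count_nonzero, as in Question 2.1. The sketch below repeats the loop and compares the two results; the name vectorized_count is our own.

```python
import numpy as np

num_array = np.array([1, 3, 4, 7, 21, 23, 28, 28, 30])

# Count with a for loop, one number at a time.
even_count = 0
for i in num_array:
    if i % 2 == 0:
        even_count = even_count + 1

# Count with an element-wise comparison on the whole array.
vectorized_count = np.count_nonzero(num_array % 2 == 0)

even_count == vectorized_count
```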

Question 4.1. Valentina is playing darts. 🎯 Her dartboard contains ten equal-sized zones with point values from 1 to 10. Write code using np.random.choice that simulates her total score after 1000 dart tosses.

In [ ]:

possible_point_values = ...
tosses = 1000

total_score = ...
for i in range(tosses):
    ...

total_score

In [ ]:

grader.check("q4_1")

Question 4.2. What is the average point value of a dart thrown by Valentina?

In [ ]:

average_score = ...
average_score

In [ ]:

grader.check("q4_2")

Question 4.3. In the following cell, we've loaded the text of Winnie-the-Pooh by A. A. Milne, the book we looked at in Homework 1. We've split the text into individual words, and stored these words in an array. Using a for loop, assign longer_than_four to the number of words in the novel that are more than 4 letters long. Look at CIT 9.2 if you get stuck.

Hint: You can find the number of letters in a word with the len function.
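To see len and split working together inside a loop, here is a sketch that solves a different problem: finding the length of the longest word in a made-up sentence. The sentence and variable names are for illustration only.

```python
# Made-up text, for illustration only.
sentence = "the quick brown fox jumps over the lazy dog"

# split() with no arguments breaks the string apart at whitespace.
words = sentence.split()

# Keep track of the largest length seen so far.
longest_length = 0
for word in words:
    if len(word) > longest_length:
        longest_length = len(word)

longest_length
```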

In [ ]:

winnie_string = open('data/winnie-the-pooh.txt', encoding='utf-8').read()
winnie_words = np.array(winnie_string.split())

...
        
longer_than_four

In [ ]:

grader.check("q4_3")

Finish Line 🏁

Congratulations! You are done with Lab 4.

To submit your assignment:

Select Kernel -> Restart & Run All to ensure that you have executed all cells, including the test cells.
Read through the notebook to make sure everything is fine and all tests passed.
Run the cell below to run all tests, and make sure that they all pass.
Download your notebook using File -> Download as -> Notebook (.ipynb), then upload your notebook to Gradescope.

In [ ]:

# For your convenience, you can run this cell to run all the tests at once!
grader.check_all()
