Lab 4: DataFrames, Control Flow, and Probability
Due Saturday, October 22nd at 11:59PM
Welcome to Lab 4! This week, we will go over more DataFrame manipulation techniques, conditionals and iteration, and introduce the concept of randomness. This lab is due on Saturday, October 22nd at 11:59PM.
Refer to the following readings:
Grouping with subgroups (see BPD 11.4)
Merging DataFrames (see BPD 13)
Conditional statements (see CIT 9.1)
Iteration (see CIT 9.2)
Probability (see CIT 9.5)
First, set up the tests and imports by running the cell below.
1. California National Parks 🏞️ 🐻
In this question, we'll take a closer look at the DataFrame methods merge and groupby.
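As a warm-up, here is a minimal sketch of merge on toy data that is unrelated to the lab's files. It assumes plain pandas (the lab's own imports are not shown here), and the names snacks, prices, and snacks_with_prices are hypothetical.

```python
import pandas as pd

# Hypothetical toy data, unrelated to the lab's parks files.
snacks = pd.DataFrame({
    'Code': ['A', 'B', 'C'],
    'Snack': ['Pretzels', 'Popcorn', 'Trail mix'],
})
prices = pd.DataFrame({
    'Code': ['B', 'C', 'A'],
    'Price': [2.5, 4.0, 1.5],
})

# merge pairs up rows whose 'Code' values agree, regardless of row order.
snacks_with_prices = snacks.merge(prices, on='Code')
print(snacks_with_prices)
```

Because each code appears exactly once in both DataFrames, the result has one row per snack.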
We will be working with two datasets, california_parks.csv (stored as parks) and california_parks_species.csv (stored as species), which provide information on California National Parks and the species of plants and animals found there, respectively. These are a subset of a larger dataset the National Park Service provides. We've also created a third DataFrame, parks_species, that contains the number of species per park.
Run the cell below to load in our data.
Right now, the information we have on each California National Park is split across two DataFrames. The parks DataFrame has the code, state, size, and location of each park, and the parks_species DataFrame contains the number of species at each park. Run the cells below to see both DataFrames.
Question 1.1. Below, use the merge method to create a new DataFrame named parks_with_species, which will have the parks' existing information along with the number of species each has. Make sure the DataFrame only has one row per park. Your DataFrame should look like this:
| | Park Code | Park Name | State | Acres | Latitude | Longitude | count |
|---|---|---|---|---|---|---|---|
| 0 | CHIS | Channel Islands National Park | CA | 249561 | 34.01 | -119.42 | 1885 |
| 1 | JOTR | Joshua Tree National Park | CA | 789745 | 33.79 | -115.9 | 2294 |
| 2 | LAVO | Lassen Volcanic National Park | CA | 106372 | 40.49 | -121.51 | 1797 |
| 3 | PINN | Pinnacles National Park | CA | 26606 | 36.48 | -121.16 | 1416 |
| 4 | REDW | Redwood National Park | CA | 112512 | 41.3 | -124 | 6310 |
| 5 | SEKI | Sequoia and Kings Canyon National Parks | CA | 865952 | 36.43 | -118.68 | 1995 |
| 6 | YOSE | Yosemite National Park | CA | 761266 | 37.83 | -119.5 | 2088 |
Now, let's take a look at the species DataFrame. Each park has a lot of different species, and each species varies in abundance at each park.
Question 1.2. Using the groupby method, assign the variable species_abundance to a DataFrame that classifies the parks by both Park Name and Abundance.
Hint: Reset the index and assign columns so that you have three columns: 'Park Name', 'Abundance', and 'Category'. The first few rows of your DataFrame should look like this:
| | Park Name | Abundance | Category |
|---|---|---|---|
| 0 | Channel Islands National Park | Abundant | 48 |
| 1 | Channel Islands National Park | Common | 228 |
| 2 | Channel Islands National Park | Occasional | 190 |
| 3 | Channel Islands National Park | Rare | 368 |
| 4 | Channel Islands National Park | Uncommon | 471 |
| 5 | Channel Islands National Park | Unknown | 173 |
| 6 | Joshua Tree National Park | Abundant | 37 |
| 7 | Joshua Tree National Park | Common | 543 |
| 8 | Joshua Tree National Park | Occasional | 84 |
| 9 | Joshua Tree National Park | Rare | 90 |
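Grouping with subgroups, the technique behind a table like the one above, can be sketched on unrelated toy data. This assumes plain pandas, and the DataFrame pets and its columns are hypothetical.

```python
import pandas as pd

# Hypothetical toy data, unrelated to the lab's species file.
pets = pd.DataFrame({
    'Shelter': ['North', 'North', 'North', 'South', 'South'],
    'Kind': ['cat', 'cat', 'dog', 'dog', 'dog'],
    'Name': ['Ada', 'Bo', 'Cy', 'Di', 'Ed'],
})

# Group by two columns; count() tallies the rows in each (Shelter, Kind) pair.
counts = pets.groupby(['Shelter', 'Kind']).count()

# The group labels become the index; reset_index turns them back into columns.
counts = counts.reset_index()
print(counts)
```

Only combinations that actually occur get a row, so there is no row for cats at the South shelter.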
2. Nachos 🧀 🌶️
In Python, Boolean values can either be True or False. We get Boolean values when using comparison operators, among which are < (less than), > (greater than), and == (equal to). For a more complete list, refer to this.
Run the cell below to see an example of a comparison operator in action.
We can even assign the result of a comparison operation to a variable.
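The two ideas above can be sketched as follows. The specific numbers and the variable name is_larger are assumptions, not the lab's own cells.

```python
# A comparison evaluates to a Boolean value.
print(3 > 1)        # True
print(3 == 1 + 1)   # False, since 1 + 1 is 2

# The result of a comparison can be stored in a variable.
is_larger = 10 > 7
print(is_larger)    # True
```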
Arrays are compatible with comparison operators. The output is an array of boolean values.
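A minimal sketch of comparing an array to a single value, using a hypothetical array named temperatures:

```python
import numpy as np

# Hypothetical example array.
temperatures = np.array([58, 72, 65, 80])

# The comparison is applied to each element separately.
is_warm = temperatures > 70
print(is_warm)   # [False  True False  True]
```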
Waiting on the dining table just for you is a hot bowl of nachos! Let's say that whenever you take a nacho, it will have cheese, salsa, both, or neither (just a plain tortilla chip).

Using the function call np.random.choice(array_name), let's simulate taking nachos from the bowl at random. Start by running the cell below several times, and observe how the results change.
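A sketch of what such a cell might look like, assuming the four nacho types are stored in an array named nachos (the name is an assumption):

```python
import numpy as np

# Assumed array holding the four nacho types described above.
nachos = np.array(['cheese', 'salsa', 'both', 'neither'])

# Each call picks one element uniformly at random, so repeated runs can differ.
print(np.random.choice(nachos))

# An optional second argument draws several at once (with replacement by default).
print(np.random.choice(nachos, 5))
```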
Assume we took ten nachos at random, and stored the results in an array called ten_nachos.
Question 2.1. Find the number of nachos with only cheese using code (do not hardcode the answer).
Hint: Our solution involves a comparison operator and the np.count_nonzero function.
Conditional Statements
A conditional statement is made up of multiple lines of code that allow Python to choose from different alternatives based on whether some condition is true.
Here is a basic example.
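A minimal sketch of such an example, assuming a function named sign (the name is hypothetical):

```python
# sign is an assumed name used for illustration.
def sign(x):
    if x > 0:
        return 'Positive'

print(sign(3))    # Positive
print(sign(-2))   # None, because no return statement runs
```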
The function works as follows: if the input x is greater than 0, it returns the string 'Positive'.
If we want to test multiple conditions at once, we use the following general format.
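As one concrete instance of that format, here is a sketch with a hypothetical function named describe_number that has one if, one elif, and one else:

```python
# describe_number is a hypothetical name used for illustration.
def describe_number(x):
    if x > 0:
        return 'Positive'
    elif x < 0:
        return 'Negative'
    else:
        return 'Zero'

print(describe_number(5))    # Positive
print(describe_number(-5))   # Negative
print(describe_number(0))    # Zero
```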
Only one of the bodies will ever be executed. Each if and elif (else-if) expression is evaluated and considered in order, starting at the top. As soon as a true value is found (i.e. once a condition is met), the corresponding body is executed, and the rest of the expression is skipped. If none of the if or elif expressions are true, then the else body is executed. For more examples and explanation, refer to CIT 9.1.
Question 2.2. Complete the following conditional statement so that the string 'More please' is assigned to say_please if the number of nachos with cheese in ten_nachos is less than 5.
Question 2.3. Write a function called nacho_reaction that returns a string based on the type of nacho passed in. From top to bottom, the conditions should correspond to: 'cheese', 'salsa', 'both', 'neither'.
Now consider the DataFrame ten_nachos_reactions defined below.
Question 2.4. Add a column named 'Reactions' to the DataFrame ten_nachos_reactions that consists of reactions for each of the nachos in ten_nachos.
Hint: Use the apply method.
Question 2.5. Using code, find the number of 'Meh.' reactions for the nachos in ten_nachos_reactions. Think about how you could find this both by using DataFrame methods and by using np.count_nonzero.
Question 2.6. Copy the expression in the cell below to the following cell, and change some of the ==s in the expression to something else (like < or >) so that should_be_true is True.
Question 2.7. Complete the function both_or_neither, which takes in a DataFrame of nachos with reactions (with the same column names as the ten_nachos_reactions DataFrame from Question 2.4) and returns 'Wow!' if there are more nachos with both cheese and salsa, or 'Meh.' if there are more nachos with neither. If there are an equal number of each, return 'Okay!'.
3. Hungry Billy 🍗 🍕🍟
After a long day of class, Billy decides to go to Dirty Birds for dinner. Today's menu has Billy's four favorite foods: wings, pizza, fries, and mozzarella sticks. However, each dish has a 25% chance of running out before Billy can get to Dirty Birds.
Note: Use Python as your calculator. Your answers should be expressions (like 0.5 ** 2); don't simplify your answers using an outside calculator. Also, all of your answers should be given as decimals between 0 and 1, not percentages.
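To illustrate the style of answer expected, here is a sketch for an unrelated, hypothetical scenario: three independent flips of a fair coin. The variable names are assumptions.

```python
# Hypothetical scenario: three independent flips of a fair coin.
p_heads = 0.5

# Multiplication rule: the chance that independent events all happen.
p_all_heads = p_heads ** 3

# Complement rule: "at least one tails" is the opposite of "all heads".
p_at_least_one_tails = 1 - p_all_heads

print(p_all_heads)            # 0.125
print(p_at_least_one_tails)   # 0.875
```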
Question 3.1. What is the probability that Billy will be able to eat wings at Dirty Birds?
Question 3.2. What is the probability that Billy will be able to eat all four of these foods at Dirty Birds?
Question 3.3. What is the probability that Dirty Birds will have run out of something (anything) before Billy can get there?
To make up for their unpredictable food supply, Dirty Birds decides to hold a contest for some free HDH Dining swag. There is a bag with three red marbles, three green marbles, and three blue marbles. Billy has to draw three marbles without replacement. In order to win, all three marbles Billy draws must be of different colors.
Question 3.4. What is the probability that Billy wins the contest?
Hint: If you're stuck, start by determining the probability that the second marble Billy draws is different from the first marble Billy draws.
4. Iteration 🔂
Using a for loop, we can perform a task multiple times. This is known as iteration. Here, we'll simulate drawing different suits from a deck of cards. 🃏
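A sketch of such a simulation, assuming the suits are stored in an array named suits and that three cards are drawn (both are assumptions):

```python
import numpy as np

# Assumed array of the four suits.
suits = np.array(['Clubs', 'Diamonds', 'Hearts', 'Spades'])

# The body runs once per value in np.arange(3), so three suits are drawn.
for i in np.arange(3):
    print(np.random.choice(suits))
```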
Another use of iteration is to loop through a set of values. For instance, we can print out all of the colors of the rainbow. 🌈
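A sketch of that loop, assuming rainbow is a list of the seven traditional color names:

```python
# Assumed contents of rainbow.
rainbow = ['red', 'orange', 'yellow', 'green', 'blue', 'indigo', 'violet']

# color takes on each value in rainbow, one at a time.
for color in rainbow:
    print(color)
```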
We can see that the indented part of the for loop, known as the body, is executed once for each item in rainbow. Note that the name color is arbitrary; we could replace both instances of color in the cell above with any valid variable name and the code would work the same.
We can also use a for loop to add to a variable in an iterative fashion. Here, we count the number of even numbers in an array of numbers. Each time we encounter an even number in num_array, we increase even_count by 1. To check if an individual number is even, we compute its remainder when divided by 2 using the % (modulus) operator.
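The counting pattern described above can be sketched as follows. The names num_array and even_count come from the text; the values in num_array and the loop variable num are assumptions.

```python
import numpy as np

# Hypothetical values for num_array.
num_array = np.array([1, 4, 7, 10, 12, 15])

even_count = 0
for num in num_array:
    # A remainder of 0 when dividing by 2 means num is even.
    if num % 2 == 0:
        even_count = even_count + 1

print(even_count)   # 3
```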
Question 4.1. Valentina is playing darts. 🎯 Her dartboard contains ten equal-sized zones with point values from 1 to 10. Write code using np.random.choice that simulates her total score after 1000 dart tosses.
Question 4.2. What is the average point value of a dart thrown by Valentina?
Question 4.3. In the following cell, we've loaded the text of Winnie-the-Pooh by A. A. Milne, the book we looked at in Homework 1. We've split the text into individual words, and stored these words in an array. Using a for loop, assign longer_than_four to the number of words in the novel that are more than 4 letters long. Look at CIT 9.2 if you get stuck.
Hint: You can find the number of letters in a word with the len function.
Finish Line 🏁
Congratulations! You are done with Lab 4.
To submit your assignment:
Select Kernel -> Restart & Run All to ensure that you have executed all cells, including the test cells.
Read through the notebook to make sure everything is fine and all tests passed.
Run the cell below to run all tests, and make sure that they all pass.
Download your notebook using File -> Download as -> Notebook (.ipynb), then upload your notebook to Gradescope.