GitHub Repository: dsc-courses/dsc10-2022-fa
Path: blob/main/labs/lab04/lab04.ipynb

Lab 4: DataFrames, Control Flow, and Probability

Due Saturday, October 22nd at 11:59PM

Welcome to Lab 4! This week, we will go over more DataFrame manipulation techniques, conditionals and iteration, and introduce the concept of randomness. This lab is due on Saturday, October 22nd at 11:59PM.

Refer to the following readings:

First, set up the tests and imports by running the cell below.

import numpy as np
import babypandas as bpd

# These lines set up graphing capabilities.
import matplotlib
import matplotlib.pyplot as plt
plt.style.use('ggplot')

import otter
grader = otter.Notebook()
%reload_ext pandas_tutor

1. California National Parks 🏞️ 🐻

In this question, we'll take a closer look at the DataFrame methods merge and groupby.

We will be working with two datasets, california_parks.csv (stored as parks) and california_parks_species.csv (stored as species), which provide information on California National Parks and the species of plants and animals found there, respectively. These are a subset of a larger dataset the National Parks Service provides. We've also created a third DataFrame, parks_species, that contains the number of species per park.

Run the cell below to load in our data.

parks = bpd.read_csv("data/california_parks.csv")
species = bpd.read_csv("data/california_parks_species.csv")
parks_species = bpd.DataFrame().assign(
    count=species.groupby('Park Name').count().get('Category')
)

Right now, the information we have on each California National Park is split across two DataFrames. The parks DataFrame has the code, state, size, and location of each park, and the parks_species DataFrame contains the number of species at each park. Run the cells below to see both DataFrames.

parks
parks_species
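
As a rough illustration of how merge can line up a column in one DataFrame with the index of another, here is a small sketch on made-up data. The cities and visits DataFrames and their values are hypothetical, and the sketch imports pandas on the assumption that babypandas' merge accepts the same left_on and right_index arguments.

```python
import pandas as pd  # assumption: pandas stands in for babypandas here

# Hypothetical toy data, not the lab's datasets.
cities = pd.DataFrame({
    'Code': ['SD', 'LA', 'SF'],
    'City': ['San Diego', 'Los Angeles', 'San Francisco'],
})

# A DataFrame whose index holds city names, similar to the result of a groupby.
visits = pd.DataFrame({'visits': [12, 7]}, index=['San Diego', 'San Francisco'])

# Match the 'City' column on the left with the index on the right.
merged = cities.merge(visits, left_on='City', right_index=True)
print(merged)
```

Rows without a match in both DataFrames (here, Los Angeles) are dropped, since merge keeps only matching rows by default.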

Question 1.1. Below, use the merge method to create a new DataFrame named parks_with_species, which will have the parks' existing information along with the number of species each has. Make sure the DataFrame only has one row per park. Your DataFrame should look like this:

   Park Code  Park Name                                State  Acres   Latitude  Longitude  count
0  CHIS       Channel Islands National Park            CA     249561  34.01     -119.42    1885
1  JOTR       Joshua Tree National Park                CA     789745  33.79     -115.9     2294
2  LAVO       Lassen Volcanic National Park            CA     106372  40.49     -121.51    1797
3  PINN       Pinnacles National Park                  CA     26606   36.48     -121.16    1416
4  REDW       Redwood National Park                    CA     112512  41.3      -124       6310
5  SEKI       Sequoia and Kings Canyon National Parks  CA     865952  36.43     -118.68    1995
6  YOSE       Yosemite National Park                   CA     761266  37.83     -119.5     2088

parks_with_species = ...
parks_with_species
grader.check("q1_1")

Now, let's take a look at the species DataFrame. Each park has a lot of different species, and each species varies in abundance at each park.

species

Question 1.2. Using the groupby method, assign the variable species_abundance to a DataFrame that classifies the parks by both Park Name and Abundance.

Hint: Reset the index and assign columns so that you have three columns: 'Park Name', 'Abundance', and 'Category'. The first few rows of your DataFrame should look like this:

   Park Name                      Abundance   Category
0  Channel Islands National Park  Abundant    48
1  Channel Islands National Park  Common      228
2  Channel Islands National Park  Occasional  190
3  Channel Islands National Park  Rare        368
4  Channel Islands National Park  Uncommon    471
5  Channel Islands National Park  Unknown     173
6  Joshua Tree National Park      Abundant    37
7  Joshua Tree National Park      Common      543
8  Joshua Tree National Park      Occasional  84
9  Joshua Tree National Park      Rare        90

species_abundance = ...
species_abundance
grader.check("q1_2")

2. Nachos 🧀 🌶️

In Python, Boolean values can either be True or False. We get Boolean values when using comparison operators, among which are < (less than), > (greater than), and == (equal to). For a more complete list, refer to this.

Run the cell below to see an example of a comparison operator in action.

3 > 1 + 1

We can even assign the result of a comparison operation to a variable.

result = 10 / 2 == 5
result

Arrays are compatible with comparison operators. The output is an array of boolean values.

np.array([1, 5, 7, 8, 3, -1]) > 3
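
To see how an array of Boolean values can be turned into a count, here is a minimal sketch. The temps array is made up for illustration. np.count_nonzero counts the True entries, because True is treated as nonzero and False as zero.

```python
import numpy as np

# Hypothetical temperatures, made up for illustration.
temps = np.array([68, 75, 81, 59, 90])

# Comparing an array to a number compares every element.
is_warm = temps > 70

# Each True counts as one nonzero value.
num_warm = np.count_nonzero(is_warm)
print(is_warm)
print(num_warm)  # 3
```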

Waiting on the dining table just for you is a hot bowl of nachos! Let's say that whenever you take a nacho, it will have cheese, salsa, both, or neither (just a plain tortilla chip).

Using the function call np.random.choice(array_name), let's simulate taking nachos from the bowl at random. Start by running the cell below several times, and observe how the results change.

nachos = np.array(['cheese', 'salsa', 'both', 'neither'])
np.random.choice(nachos)
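
np.random.choice can also take a second argument, the number of items to draw. The sketch below draws five nachos at once; the name five_nachos is only for illustration, and the output will differ from run to run.

```python
import numpy as np

nachos = np.array(['cheese', 'salsa', 'both', 'neither'])

# Draw five nachos at random. By default the draws are made with replacement,
# so the same kind of nacho can show up more than once.
five_nachos = np.random.choice(nachos, 5)
print(five_nachos)
```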

Assume we took ten nachos at random, and stored the results in an array called ten_nachos.

ten_nachos = np.array(['neither', 'cheese', 'both', 'both', 'cheese', 'salsa', 'both', 'neither', 'cheese', 'both'])

Question 2.1. Find the number of nachos with only cheese using code (do not hardcode the answer).

Hint: Our solution involves a comparison operator and the np.count_nonzero function.

number_cheese = ...
number_cheese
grader.check("q2_1")

Conditional Statements

A conditional statement is made up of multiple lines of code that allow Python to choose from different alternatives based on whether some condition is true.

Here is a basic example.

def sign(x):
    if x > 0:
        return 'Positive'

This function returns the string 'Positive' if the input x is greater than 0. For any other input, the body of the if statement does not run, so the function returns nothing (None).

If we want to test multiple conditions at once, we use the following general format.

if <if expression>:
    <if body>
elif <elif expression 0>:
    <elif body 0>
elif <elif expression 1>:
    <elif body 1>
...
else:
    <else body>

Only one of the bodies will ever be executed. Each if and elif (else-if) expression is evaluated and considered in order, starting at the top. As soon as a true value is found (i.e. once a condition is met), the corresponding body is executed, and the rest of the conditional statement is skipped. If none of the if or elif expressions are true, then the else body is executed. For more examples and explanation, refer to CIT 9.1.
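
As one possible illustration of the general format, the sketch below extends the earlier sign example so that every input gets a label. The name full_sign and the strings 'Negative' and 'Zero' are our own choices for this example.

```python
def full_sign(x):
    # Conditions are checked from top to bottom; the first true one wins.
    if x > 0:
        return 'Positive'
    elif x < 0:
        return 'Negative'
    else:
        # Reached only when neither condition above is true.
        return 'Zero'

print(full_sign(3), full_sign(-2), full_sign(0))
```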

Question 2.2. Complete the following conditional statement so that the string 'More please' is assigned to say_please if the number of nachos with cheese in ten_nachos is less than 5.

...
    say_please = 'More please'
say_please
grader.check("q2_2")

Question 2.3. Write a function called nacho_reaction that returns a string based on the type of nacho passed in. From top to bottom, the conditions should correspond to: 'cheese', 'salsa', 'both', 'neither'.

def nacho_reaction(nacho):
    ...
        return 'Cheesy!'
    # next condition should return 'Spicy!'
    ...
    # next condition should return 'Wow!'
    ...
    # next condition should return 'Meh.'
    ...

spicy_nacho = nacho_reaction('salsa')
spicy_nacho
grader.check("q2_3")

Now consider the DataFrame ten_nachos_reactions defined below.

ten_nachos_reactions = bpd.DataFrame().assign(Nachos=ten_nachos)
ten_nachos_reactions

Question 2.4. Add a column named 'Reactions' to the DataFrame ten_nachos_reactions that consists of reactions for each of the nachos in ten_nachos.

Hint: Use the apply method.

ten_nachos_reactions = ...
ten_nachos_reactions
grader.check("q2_4")

Question 2.5. Using code, find the number of 'Meh.' reactions for the nachos in ten_nachos_reactions. Think about how you could find this either by using DataFrame methods or by using np.count_nonzero.

number_meh_reactions = ...
number_meh_reactions
grader.check("q2_5")

Question 2.6. Copy the expression in the cell below to the following cell, and change some of the ==s in the expression to something else (like < or >) so that should_be_true is True.

should_be_true = number_cheese == number_meh_reactions == np.count_nonzero(ten_nachos == 'both')
should_be_true

should_be_true = ...
should_be_true
grader.check("q2_6")
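
The expression in Question 2.6 chains several comparisons together. In Python, a chained comparison like a == b == c is true only when every neighboring pair satisfies its comparison. The sketch below uses made-up numbers (a, b, and c are hypothetical) to show this, including a chain that mixes different operators.

```python
a, b, c = 3, 3, 5

# A chain is evaluated pair by pair and joined with "and".
chained = a == b == c
spelled_out = (a == b) and (b == c)

# Different operators can be mixed in one chain: (a == b) and (b < c).
mixed = a == b < c

print(chained, spelled_out, mixed)
```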

Question 2.7. Complete the function both_or_neither, which takes in a DataFrame of nachos with reactions (with the same column names as the ten_nachos_reactions DataFrame from Question 2.4) and returns 'Wow!' if there are more nachos with both cheese and salsa than with neither, or 'Meh.' if there are more nachos with neither than with both. If there are an equal number of each, return 'Okay!'.

def both_or_neither(nacho_df):
    nachos = ...
    number_both = ...
    number_neither = ...
    ...
        return 'Wow!'
    # The next condition should return 'Meh.'
    ...
    # The next condition should return 'Okay!'
    ...

# Below, we create a DataFrame with randomly-generated data and test your function on it.
# Do NOT change anything below this line.
# However, you may want to add a new cell and evaluate both_or_neither(ten_nachos_reactions) to see
# if your function behaves as expected.
np.random.seed(24)
many_nachos = bpd.DataFrame().assign(Nachos=np.random.choice(nachos, 250))
many_nachos = many_nachos.assign(Reactions=many_nachos.get("Nachos").apply(nacho_reaction))
result = both_or_neither(many_nachos)
result
grader.check("q2_7")

3. Hungry Billy 🍗 🍕🍟

After a long day of class, Billy decides to go to Dirty Birds for dinner. Today's menu has Billy's four favorite foods: wings, pizza, fries, and mozzarella sticks. However, each dish has a 25% chance of running out before Billy can get to Dirty Birds.

Note: Use Python as your calculator. Your answers should be expressions (like 0.5 ** 2); don't simplify your answers using an outside calculator. Also, all of your answers should be given as decimals between 0 and 1, not percentages.
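
To show what "Python as your calculator" can look like, here is a small sketch on a different, made-up scenario: a fair coin flipped three times, assuming the flips do not affect one another. The variable names are hypothetical.

```python
# Hypothetical setup: a fair coin, flipped three times, flips independent.
p_heads = 0.5

# Multiplication rule: the chance that all three independent flips are heads.
all_heads = p_heads ** 3

# Complement rule: "at least one tails" is the opposite of "all heads".
at_least_one_tails = 1 - all_heads

print(all_heads, at_least_one_tails)
```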

Question 3.1. What is the probability that Billy will be able to eat wings at Dirty Birds?

wings_prob = ...
wings_prob
grader.check("q3_1")

Question 3.2. What is the probability that Billy will be able to eat all four of these foods at Dirty Birds?

all_prob = ...
all_prob
grader.check("q3_2")

Question 3.3. What is the probability that Dirty Birds will have run out of something (anything) before Billy can get there?

something_is_out = ...
something_is_out
grader.check("q3_3")

To make up for their unpredictable food supply, Dirty Birds decides to hold a contest for some free HDH Dining swag. There is a bag with three red marbles, three green marbles, and three blue marbles. Billy has to draw three marbles without replacement. In order to win, all three marbles Billy draws must be of different colors.

Question 3.4. What is the probability that Billy wins the contest?

Hint: If you're stuck, start by determining the probability that the second marble Billy draws is different from the first marble Billy draws.

winning_prob = ...
winning_prob
grader.check("q3_4")
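
One way to sanity-check reasoning about draws without replacement is to simulate them. The sketch below uses a smaller, hypothetical bag (two red and two blue marbles, two draws) rather than the contest's bag, and compares a simulated estimate with the exact probability. The seed and the number of trials are arbitrary choices.

```python
import numpy as np

# Hypothetical bag, smaller than the one in the contest.
bag = np.array(['red', 'red', 'blue', 'blue'])

np.random.seed(10)  # arbitrary seed so the run is repeatable
trials = 10000
different = 0
for i in np.arange(trials):
    # replace=False means the same marble cannot be drawn twice.
    pair = np.random.choice(bag, 2, replace=False)
    if pair[0] != pair[1]:
        different = different + 1

estimate = different / trials

# Whatever the first marble is, 2 of the 3 remaining marbles are the other color.
exact = 2 / 3

print(estimate, exact)
```

The estimate should land close to the exact value, though it will not match it exactly.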

4. Iteration 🔂

Using a for loop, we can perform a task multiple times. This is known as iteration. Here, we'll simulate drawing different suits from a deck of cards. 🃏

suits = np.array(['♣️', '♥️', '♠️', '♦️'])
draws = np.array([])
repetitions = 6
for i in np.arange(repetitions):
    draws = np.append(draws, np.random.choice(suits))
draws
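
Note that the loop above reassigns draws on every pass. That is because np.append does not change the array it is given; it returns a new array. The short sketch below, with a hypothetical scores array, shows the difference.

```python
import numpy as np

scores = np.array([])

# np.append builds a new array; without reassignment, the result is thrown away.
np.append(scores, 5)
length_before = len(scores)  # still 0

# Reassigning keeps the new, longer array.
scores = np.append(scores, 5)
length_after = len(scores)  # now 1

print(length_before, length_after)
```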

Another use of iteration is to loop through a set of values. For instance, we can print out all of the colors of the rainbow. 🌈

rainbow = np.array(["red", "orange", "yellow", "green", "blue", "indigo", "violet"])
for color in rainbow:
    print(color)

We can see that the indented part of the for loop, known as the body, is executed once for each item in rainbow. Note that the name color is arbitrary; we could replace both instances of color in the cell above with any valid variable name and the code would work the same.

We can also use a for loop to add to a variable in an iterative fashion. Here, we count the number of even numbers in an array of numbers. Each time we encounter an even number in num_array, we increase even_count by 1. To check if an individual number is even, we compute its remainder when divided by 2 using the % (modulus) operator.

num_array = np.array([1, 3, 4, 7, 21, 23, 28, 28, 30])
even_count = 0
for i in num_array:
    if i % 2 == 0:
        even_count = even_count + 1
even_count
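
For comparison, the same count can be found without a loop by combining an array comparison with np.count_nonzero, as in the nachos section. This is only an alternative sketch; the names is_even and even_count_no_loop are our own.

```python
import numpy as np

num_array = np.array([1, 3, 4, 7, 21, 23, 28, 28, 30])

# % and == both work element by element on arrays.
is_even = num_array % 2 == 0

# Count the True values: 4, 28, 28, and 30 are even.
even_count_no_loop = np.count_nonzero(is_even)
print(even_count_no_loop)  # 4
```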

Question 4.1. Valentina is playing darts. 🎯 Her dartboard contains ten equal-sized zones with point values from 1 to 10. Write code using np.random.choice that simulates her total score after 1000 dart tosses.

possible_point_values = ...
tosses = 1000
total_score = ...
for i in range(tosses):
    ...
total_score
grader.check("q4_1")

Question 4.2. What is the average point value of a dart thrown by Valentina?

average_score = ...
average_score
grader.check("q4_2")

Question 4.3. In the following cell, we've loaded the text of Winnie-the-Pooh by A. A. Milne, the book we looked at in Homework 1. We've split the text into individual words, and stored these words in an array. Using a for loop, assign longer_than_four to the number of words in the novel that are more than 4 letters long. Look at CIT 9.2 if you get stuck.

Hint: You can find the number of letters in a word with the len function.
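
As a quick illustration of len inside a loop, the sketch below counts the words with at most two letters in a short, made-up array. The words array and the condition are hypothetical and differ from what the question asks for.

```python
import numpy as np

# Hypothetical sample of words, not the full text.
words = np.array(['a', 'bear', 'of', 'very', 'little', 'brain'])

short_count = 0
for word in words:
    # len gives the number of characters in the word.
    if len(word) <= 2:
        short_count = short_count + 1

print(short_count)  # 2
```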

winnie_string = open('data/winnie-the-pooh.txt', encoding='utf-8').read()
winnie_words = np.array(winnie_string.split())
...
longer_than_four
grader.check("q4_3")

Finish Line 🏁

Congratulations! You are done with Lab 4.

To submit your assignment:

  1. Select Kernel -> Restart & Run All to ensure that you have executed all cells, including the test cells.

  2. Read through the notebook to make sure everything is fine and all tests passed.

  3. Run the cell below to run all tests, and make sure that they all pass.

  4. Download your notebook using File -> Download as -> Notebook (.ipynb), then upload your notebook to Gradescope.

# For your convenience, you can run this cell to run all the tests at once!
grader.check_all()