GitHub Repository: dsc-courses/dsc10-2022-fa
Path: blob/main/homeworks/hw02/hw02.ipynb
³⁰⁵⁸ views

Kernel: Python 3 (ipykernel)

Homework 2: Arrays and DataFrames

Due Tuesday, October 11th at 11:59PM

Welcome to Homework 2! This week's homework will cover arrays and DataFrames in Python. You can find additional help on these topics in BPD 7-11 in the babypandas notes.

Instructions

This assignment is due Tuesday, October 11th at 11:59PM. You are given six slip days throughout the quarter to extend deadlines. See the syllabus for more details. With the exception of using slip days, late work will not be accepted unless you have made special arrangements with your instructor.

Important: For homeworks, the otter tests don't usually tell you that your answer is correct. More often, they help catch careless mistakes. It's up to you to ensure that your answer is correct. If you're not sure, ask someone (not for the answer, but for some guidance about your approach). These are great questions for office hours (see the schedule on the Calendar) or EdStem. Directly sharing answers is not okay, but discussing problems with the course staff or with other students is encouraged.

Please do not use for-loops for any questions in this homework. If you don't know what a for-loop is, don't worry – we haven't covered them yet. But if you do know what they are and are wondering why it's not OK to use them, it is because loops in Python are slow, and looping over arrays and DataFrames should usually be avoided.

In [ ]:

# Don't change this cell; just run it. Please don't import any additional packages.
import numpy as np
import babypandas as bpd

import otter
grader = otter.Notebook()

1. Arrays 🗃️

Question 1.1. Make an array called quirky_numbers containing the following numbers (in the given order):

The cube root of 31
9 radians, in degrees
$2^7 - 4^5$
The mathematical constant of $e$ over 4: ( $\frac{e}{4}$ )
The factorial of 5

Hint: Check out the functions constants available in the numpy module, which has been imported as np. If you're unsure of what function to use, a quick Google search should do the trick. Do not import math.

Note: In this problem, as with all others, we'll only check that your answer is correct. There may be several valid ways to produce the correct answer.

In [ ]:

quirky_numbers = ...
quirky_numbers

In [ ]:

grader.check("q1_1")

Question 1.2. Make an array called likes containing the following three strings:

'I like planting'
'my pets'
'and my friends!'

In [ ]:

likes = ...
likes

In [ ]:

grader.check("q1_2")

source

In Lecture 4, we looked at several string methods, like lower and replace. Strings have another method that we haven't looked at yet, called join. join takes one argument, an array of strings, and it returns a single string. Specifically, some_string.join(some_array) evaluates to a new string consisting of all of elements in some_array, with some_string inserted in between each element.

For example, '-'.join(np.array(['call', '858', '534', '2230'])) evaluates to 'call-858-534-2230'.

Question 1.3. Use the array likes and the method join to make two strings:

'I like planting, my pets, and my friends!' (call this one by_comma)
'I like planting my pets and my family!' (call this one by_space)

In [ ]:

by_comma = ...
by_space = ...

# Don't change the lines below.
print(by_comma)
print(by_space)

In [ ]:

grader.check("q1_3")

Now let's get some practice accessing individual elements of arrays. In Python (and in many programming languages), elements are accessed by integer position, with the position of the first element being zero. That's probably not the way you learned to count, so it's easy to get mixed up here. Be careful!

Question 1.4. The cell below creates an array of strings.

In [ ]:

some_strings = np.array(['🌻', '🌺', '🌼', '🌸', 'flower', 'plant', 'cat', '🐈', 'dog', '🐶'])
some_strings

What is the integer position of '🐈' in the array? You can just type in the answer, which should be of type int. This is a conceptual question, not a coding question.

In [ ]:

cat_emoji_position = ...
cat_emoji_position

In [ ]:

grader.check("q1_4")

Question 1.5. Suppose you have an array with 500 elements. What is the integer position of the fifth-last element in this array? You can just type in the answer, which should be of type int. This is a conceptual question, not a coding question.

Note: Your answer should be a positive integer!

In [ ]:

fifth_last_position = ...
fifth_last_position

In [ ]:

grader.check("q1_5")

Question 1.6. Suppose you have an array with 229 elements. At what integer position is the middle element of this array? You can just type in the answer, which should be of type int. This is a conceptual question, not a coding question.

Note: Again, your answer should be a positive integer!

In [ ]:

mid_position = ...
mid_position

In [ ]:

grader.check("q1_6")

By the way, it's also possible to use negative integer positions to access elements in an array, which can be easier than using positive integer positions sometimes. If a position is negative, you count from the end of the array rather than from the beginning. Position -1 corresponds to the last element, -2 corresponds to the second-last element, and so on. For instance, to find the third-last element of some_strings, we could use:

In [ ]:

some_strings[-3]

2. DSC 10 Enrollments 📈

It's time to get to know your classmates! The third column of the table below shows how many students in this quarter's offering of DSC 10 come from each of the seven colleges at UCSD. Looks like Sixth College is the most popular, with 93 students. The last column shows how many of these students are DSC majors. Of the 93 students from Sixth College, 54 are DSC majors.

For comparison's sake, we also have the corresponding data from the Fall 2021 offering of DSC 10. You can see that the class has grown quite a bit since last fall!

Throughout this problem, we'll assume that all students in DSC 10 come from one of the seven colleges in the table.

College	Fall 21 Students	Fall 21 DSC Major Students	Fall 22 Students	Fall 22 DSC Major Students
Seventh	28	19	37	21
Sixth	62	45	93	54
Roosevelt	22	9	67	39
Warren	31	19	61	40
Marshall	49	26	62	32
Muir	27	14	38	27
Revelle	30	12	52	31

In this question, we'll be working with the data from this table as arrays. Here are those arrays:

In [ ]:

students_21 = np.array([28, 62, 22, 31, 49, 27, 30])
students_21

In [ ]:

majors_21 = np.array([19, 45, 9, 19, 26, 14, 12])
majors_21

In [ ]:

students_22 = np.array([37, 93, 67, 61, 62, 38, 52])
students_22

In [ ]:

majors_22 = np.array([21, 54, 39, 40, 32, 27, 31])
majors_22

Remember, the numpy package (np for short) provides many handy functions for working with arrays. These are specifically designed to work with arrays and are faster than using Python's built-in functions.

Some frequently used array functions are np.min(), np.max(), np.sum(), np.abs(), and np.round(). There are many more, which you can browse by typing np. into a code cell and hitting the tab key, or by looking at the documentation.

Question 2.1. Assign enrolled_21 and enrolled_22 to the number of students that were enrolled in DSC 10 in Fall 2021 and Fall 2022, respectively.

In [ ]:

enrolled_21 = ...
enrolled_22 = ...

# Don't change the lines below.
print('Students in Fall 2021:', enrolled_21)
print('Students in Fall 2022:', enrolled_22)

In [ ]:

grader.check("q2_1")

Question 2.2. How many non-DSC major students from the Fall 2021 offering of DSC 10 came from each of the seven colleges? Your answer should be an array called non_majors_21, with the colleges in the same order as they appear in the table above. For instance, the first element of non_majors_21 should be the number of non-DSC majors in the Fall 2021 offering of DSC 10 who were in Seventh College.

Similarly, how many non-DSC major students from the Fall 2022 offering of DSC 10 came from each of the seven colleges? Your answer should be an array called non_majors_22, with the colleges in the same order as they appear in the table above.

In [ ]:

non_majors_21 = ...
non_majors_21

In [ ]:

grader.check("q2_2_1")

In [ ]:

non_majors_22 = ...
non_majors_22

In [ ]:

grader.check("q2_2_2")

Question 2.3. What percentage of Fall 2021 DSC 10 students from each college were DSC majors? Your answer should be an array called percent_majors_21, with the colleges in the same order as they appear in the table above, and with percentages rounded to two decimal places.

Similarly, what percentage of Fall 2022 DSC 10 students from each college were DSC majors? Your answer should be an array called percent_majors_22, with the colleges in the same order as they appear in the table above, and with percentages rounded to two decimal places.

In [ ]:

percent_majors_21 = ...
percent_majors_21

In [ ]:

grader.check("q2_3_1")

In [ ]:

percent_majors_22 = ...
percent_majors_22

In [ ]:

grader.check("q2_3_2")

Question 2.4. For each college, what is the absolute difference in the percentage of students enrolled in DSC 10 that are DSC majors from Fall 2021 to Fall 2022? Use percent_majors_21 and percent_majors_22 to create an array called abs_differences, with the colleges in the same order as they appear in the table above. Make sure the values in your answer are rounded to two decimal places.

Note: You don't need to round again.

In [ ]:

abs_differences = ...
abs_differences

In [ ]:

grader.check("q2_4")

For your convenience, we repeat the table from the start of the question below.

College	Fall 21 Students	Fall 21 DSC Major Students	Fall 22 Students	Fall 22 DSC Major Students
Seventh	28	19	37	21
Sixth	62	45	93	54
Roosevelt	22	9	67	39
Warren	31	19	61	40
Marshall	49	26	62	32
Muir	27	14	38	27
Revelle	30	12	52	31

Question 2.5. You might say that the most consistent college is the one with the smallest absolute difference in the percentage of students enrolled in DSC 10 that are DSC majors across the two years. Find the smallest value in the abs_differences array and save it as smallest_abs_diff. Referring back to the table, try to figure out which college that is. Assign most_consistent_college to the name of that college (as a string), exactly as it's displayed in the table.

Note: You can type the name of the college manually.

In [ ]:

smallest_abs_diff = ...
most_consistent_college = ...

In [ ]:

grader.check("q2_5")

3. Analyzing NBA Salaries 🏀

The National Basketball Association (NBA) is the premier men's basketball league in North America. The 2022-23 NBA regular season starts on October 18th, just a week after this homework is due!

The file nba_salaries.csv in the data/ directory contains salary information for players who played in the NBA at some point between 1985-2018. See below for a description of all the data we have available.

Column	Description
`'name'`	Player name
`'season'`	NBA season
`'salary'`	Salary (not adjusted for inflation)
`'team'`	Player's current team
`'position'`	Position(s) the player was in throughout his career
`'draft_pick'`	Overall draft rank (if drafted)
`'draft_year'`	Year drafted into the NBA (if drafted)

The file nba_17.csv in the data/ directory contains a subset of the data, for only the 2017-18 season.

Note that these player salaries are:

Not adjusted for inflation.
Not adjusted for the salary cap. The salary cap is the limit that an NBA team can pay its players in total. This limit gradually increased from around $35M in 2000 to around $156M in 2022, so players playing this season are likely making much more than those in our dataset!

Question 3.1. Read the file containing all salaries from 1985-2018 into a DataFrame called salaries. Read the file containing the 2017-18 salaries into a DataFrame called nba_17.

In [ ]:

salaries = ...
nba_17 = ...

In [ ]:

grader.check("q3_1")

Note: In Questions 3.2 to 3.6, you will use nba_17. From that point on, you will use salaries.

Question 3.2. Create a new DataFrame, nba_17_id, by setting the index of nba_17 to 'player_id'. Don't change nba_17.

In [ ]:

nba_17_id = ...
nba_17_id

In [ ]:

grader.check("q3_2")

You should think about why we've chosen to set the index to 'player_id' rather than 'name'.

Question 3.3. In the 2017-18 season, Stephen Curry of the Golden State Warriors set a record for the most three-pointers made in an NBA Finals game. Using DataFrame operations, assign curry_17 to his salary during the 2017-18 season. The 'player_id' for Stephen Curry is 'curryst01'.

In [ ]:

curry_17 = ...
curry_17

In [ ]:

grader.check("q3_3")

Question 3.4. Assign sixth_highest_salary to the sixth highest salary during the 2017-18 season. Assign sixth_player_name to the name of this player.

Don't type in the salary or player name by hand; get Python to extract this information for you.

In [ ]:

sixth_highest_salary = ...
sixth_player_name = ...

# Don't change the lines below.
print('Player:', sixth_player_name)
print('Salary:', sixth_highest_salary)

In [ ]:

grader.check("q3_4")

Question 3.5. Suppose we want to analyze the number of years of NBA experience for each player from the 2017-18 NBA season. We will define a player's longevity as follows:

ParseError: KaTeX parse error: Expected 'EOF', got '_' at position 38: …8 - \text{draft_̲year}

Starting with the nba_17_id DataFrame, create a new DataFrame called longevity that has an additional column called 'years_played' containing the longevity of each NBA player as a float, sorted so that the players with the most years played are listed first.

Note: For players who were not drafted, their draft_year entry will be NaN, so their longevity will be recorded as NaN, which is okay! NaN stands for "not a number."

In [ ]:

longevity = ...
longevity

In [ ]:

grader.check("q3_5")

Question 3.6. The 2017-18 NBA Finals featured the Golden State Warriors and Cleveland Cavaliers in their fourth straight Finals matchup. What proportion of players on the 2017-18 Warriors and Cavaliers rosters earned at least $15M in salary? Assign this value to prop_15m.

Hint: First make a combined roster with all players from both teams, and then find the required proportion.

In [ ]:

prop_15m = ...
prop_15m

In [ ]:

grader.check("q3_6")

For the remainder of this section, we will use the full salaries DataFrame.

Question 3.7. Among all the salaries for players on the Atlanta Hawks from 1985-2018, what is the median salary? Assign this value to hawks_median.

In [ ]:

hawks_median = ...
hawks_median

In [ ]:

grader.check("q3_7")

Question 3.8. Assign highest_salary_tm to the team in salaries that has the highest mean player salary.

Hint: Our solution for this question used only one line of code (thanks, groupby)!

In [ ]:

highest_salary_tm = ...
highest_salary_tm

In [ ]:

grader.check("q3_8")

Question 3.9. In the NBA, there are five positions: center, power forward, small forward, point guard, and shooting guard. However, more recently, positions have become less important and players can fall into multiple roles. For example, LeBron James usually plays as a small forward, but played as a point guard in the 2020-21 season, and has played as other positions before (look here for a history).

Of all of the players from 1985-2018, we want to examine only the power forwards. Note that we are only considering seasons in which the player was playing power forward. For example, if a player played power forward in 1999 and shooting guard in 2000, we would only consider their 1999 season. Assign pw_forwards to a DataFrame that includes all the power forwards from 1985-2018, sorted so the highest salaries appear first.

Hints:

You may want to use .str.contains; the Boolean Indexing section of the DSC 10 Reference Sheet shows you how to use it. Why are we using .str.contains and not simply checking for equality?
The positions are strings, so they may have inconsistencies in how they're capitalized. If we want to include 'power forward', 'Power Forward', 'PoWeR FoRwaRD', and any other variations in capitalization, what operation should we call on the positions first? (You may end up using .str twice!)
If you do this correctly, you'll see the same player appear twice in the first three rows! 🐐

In [ ]:

# For a multi-step problem like this one, it's helpful to define intermediate variables. 
# Feel free to do that here, or for any problem!

pw_forwards = ...
pw_forwards

In [ ]:

grader.check("q3_9")

4. Are You Scared Yet? Analyzing Horror Movies 🎃😱

Spooky season is upon us! In honor of All Hallows' Eve, we've provided a file in the data/ directory called horror_movies.csv that contains 464 movies, each with 10 columns (see the table below) that we'll use to generate some insights about the state of horror movies in recent years.

Column	Description
`'Title'`	Title of the movie
`'Country'`	Country the movie was originally released in
`'Maturity Rating'`	A rating given to the movie by the Motion Picture Association
`'Review Rating'`	The IMDB rating of the film
`'Language'`	The language the movie is in
`'Filming Locations'`	The location in which the movie was filmed
`'Budget'`	The total amount spent on the movie
`'Release Month'`	The month the movie was released
`'Release Day'`	The day the movie was released
`'Run Time'`	The length of the film in minutes

Question 4.1. Read the file containing all of the horror movies into a DataFrame called horror.

In [ ]:

horror = ...
horror

In [ ]:

grader.check("q4_1")

Question 4.2. Examine the columns available in horror and consider which would be the best choice of index for this DataFrame. Change the horror DataFrame so that it's indexed by the values in this column instead of the default index.

In [ ]:

horror = ...
horror

In [ ]:

grader.check("q4_2")

Note: If you were to run the cell where you set the index of horror again, you'd see an error message. Stop and think about why you'd run into an error. Once you've thought about it, click the thinking emoji below to see the reason for the error.

Why would there be an error? 🤔 There would be an error since you'd be trying to set the index of horror to a column that no longer exists in horror – the column wouldn't exist because it was converted to the index the first time you ran the cell (and the index is not a column)!

If you acually ran the cell twice and got an error message, don't worry. To get rid of it, re-run the cell in 4.1 where you defined the horror DataFrame, then run the cell in 4.2 just once, and you'll be good to go.

When you submit your work for autograding, the entire notebook will be run from start to finish. Each cell will run only once, so it's no problem if your code errors on the second run. In this case, it means you're doing something right!

Question 4.3. The Purge, released in 2013, has had a massive impact on pop culture. What is the 'Review Rating' of 'The Purge (2013)'? Assign the rating to the variable purge_rating.

In [ ]:

purge_rating = ...
purge_rating

In [ ]:

grader.check("q4_3")

Question 4.4. Even spooky movies have production costs. Assign most_expensive to the name of the movie with the largest budget (including the year in parentheses), and set the total budget of that movie to petrifying_pricetag.

In [ ]:

most_expensive = ...
petrifying_pricetag = ...
most_expensive, petrifying_pricetag

In [ ]:

grader.check("q4_4")

Question 4.5. Wow, that's a lot of money to spend on a horror movie, but how does that compare to the other movies included in the dataset? Compute the difference between the budget of the most expensive movie and the average movie budget, and assign the result to the variable above_average.

In [ ]:

above_average = ...
above_average

In [ ]:

grader.check("q4_5")

Question 4.6. What proportion of movies in our dataset were released in October? Set the proportion equal to october_prop.

In [ ]:

october_prop = ...
october_prop

In [ ]:

grader.check("q4_6")

Question 4.7. How many movies in our dataset were released on October 31st? Set the number equal to the variable halloween_count.

In [ ]:

halloween_count = ...
halloween_count

In [ ]:

grader.check("q4_7")

Question 4.8. Which movie titles contain the word 'zombie'? Create an array called zombie_movies containing the titles of all such movies. Then assign num_zombie_movies to the number of such movies.

Hints:

To convert a Series into an array, call the function np.array on the Series.
The movie names are all strings, so they may have inconsistencies in how they're capitalized. If we want to account for variations in capitalization, what operation should we call on the names first? (You may end up using .str twice!)

In [ ]:

zombie_movies = ...
num_zombie_movies = ...
num_zombie_movies

In [ ]:

grader.check("q4_8")

Question 4.9. What is the highest 'Review Rating' of a movie shot in a language other than English? Assign this rating to the variable foreign_rating.

Note: Some movies are shot in multiple languages, one of which might be English – for instance, the movie Devoured was filmed in 'English|Spanish'. For the purposes of this question, consider such movies to be shot in a language other than English.

In [ ]:

foreign_rating = ...
foreign_rating

In [ ]:

grader.check("q4_9")

Question 4.10. Wow, this was a long question, with 10 parts 😓. You know what's longer? The run time of the longest horror movie in the dataset 😱!

Assign longest_film_name to the name of the longest movie in the dataset (including the year in parentheses). Assign longest_film_length to the run time of this movie in minutes.

Don't type in the name or number of minutes by hand; get Python to extract this information for you.

In [ ]:

longest_film_name = ...
longest_film_length = ...

# Don't change the line below.
print('Longest film:', longest_film_name, '\nLength:', longest_film_length, 'minutes')

In [ ]:

grader.check("q4_10")

Finish Line 🏁

Congratulations! You are done with Homework 2.

To submit your assignment:

Select Kernel -> Restart & Run All to ensure that you have executed all cells, including the test cells.
Read through the notebook to make sure everything is fine and all tests passed.
Run the cell below to run all tests, and make sure that they all pass.
Download your notebook using File -> Download as -> Notebook (.ipynb), then upload your notebook to Gradescope.

In [ ]:

# For your convenience, you can run this cell to run all the tests at once!
grader.check_all()