Homework 2: Arrays and DataFrames
Due Tuesday, October 11th at 11:59PM
Welcome to Homework 2! This week's homework will cover arrays and DataFrames in Python. You can find additional help on these topics in BPD 7-11 in the babypandas notes.
Instructions
This assignment is due Tuesday, October 11th at 11:59PM. You are given six slip days throughout the quarter to extend deadlines. See the syllabus for more details. With the exception of using slip days, late work will not be accepted unless you have made special arrangements with your instructor.
Important: For homeworks, the otter tests don't usually tell you that your answer is correct. More often, they help catch careless mistakes. It's up to you to ensure that your answer is correct. If you're not sure, ask someone (not for the answer, but for some guidance about your approach). These are great questions for office hours (see the schedule on the Calendar) or EdStem. Directly sharing answers is not okay, but discussing problems with the course staff or with other students is encouraged.
Please do not use for-loops for any questions in this homework. If you don't know what a for-loop is, don't worry – we haven't covered them yet. But if you do know what they are and are wondering why it's not OK to use them, it is because loops in Python are slow, and looping over arrays and DataFrames should usually be avoided.
1. Arrays 🗃️
Question 1.1. Make an array called quirky_numbers containing the following numbers (in the given order):
The cube root of 31
9 radians, in degrees
The mathematical constant of over 4: ()
The factorial of 5
Hint: Check out the functions constants available in the numpy module, which has been imported as np. If you're unsure of what function to use, a quick Google search should do the trick. Do not import math.
Note: In this problem, as with all others, we'll only check that your answer is correct. There may be several valid ways to produce the correct answer.
Question 1.2. Make an array called likes containing the following three strings:
'I like planting''my pets''and my friends!'
In Lecture 4, we looked at several string methods, like lower and replace. Strings have another method that we haven't looked at yet, called join. join takes one argument, an array of strings, and it returns a single string. Specifically, some_string.join(some_array) evaluates to a new string consisting of all of elements in some_array, with some_string inserted in between each element.
For example, '-'.join(np.array(['call', '858', '534', '2230'])) evaluates to 'call-858-534-2230'.
Question 1.3. Use the array likes and the method join to make two strings:
'I like planting, my pets, and my friends!'(call this oneby_comma)'I like planting my pets and my family!'(call this oneby_space)
Now let's get some practice accessing individual elements of arrays. In Python (and in many programming languages), elements are accessed by integer position, with the position of the first element being zero. That's probably not the way you learned to count, so it's easy to get mixed up here. Be careful!
Question 1.4. The cell below creates an array of strings.
What is the integer position of '🐈' in the array? You can just type in the answer, which should be of type int. This is a conceptual question, not a coding question.
Question 1.5. Suppose you have an array with 500 elements. What is the integer position of the fifth-last element in this array? You can just type in the answer, which should be of type int. This is a conceptual question, not a coding question.
Note: Your answer should be a positive integer!
Question 1.6. Suppose you have an array with 229 elements. At what integer position is the middle element of this array? You can just type in the answer, which should be of type int. This is a conceptual question, not a coding question.
Note: Again, your answer should be a positive integer!
By the way, it's also possible to use negative integer positions to access elements in an array, which can be easier than using positive integer positions sometimes. If a position is negative, you count from the end of the array rather than from the beginning. Position -1 corresponds to the last element, -2 corresponds to the second-last element, and so on. For instance, to find the third-last element of some_strings, we could use:
2. DSC 10 Enrollments 📈
It's time to get to know your classmates! The third column of the table below shows how many students in this quarter's offering of DSC 10 come from each of the seven colleges at UCSD. Looks like Sixth College is the most popular, with 93 students. The last column shows how many of these students are DSC majors. Of the 93 students from Sixth College, 54 are DSC majors.
For comparison's sake, we also have the corresponding data from the Fall 2021 offering of DSC 10. You can see that the class has grown quite a bit since last fall!
Throughout this problem, we'll assume that all students in DSC 10 come from one of the seven colleges in the table.
| College | Fall 21 Students | Fall 21 DSC Major Students | Fall 22 Students | Fall 22 DSC Major Students |
|---|---|---|---|---|
| Seventh | 28 | 19 | 37 | 21 |
| Sixth | 62 | 45 | 93 | 54 |
| Roosevelt | 22 | 9 | 67 | 39 |
| Warren | 31 | 19 | 61 | 40 |
| Marshall | 49 | 26 | 62 | 32 |
| Muir | 27 | 14 | 38 | 27 |
| Revelle | 30 | 12 | 52 | 31 |
In this question, we'll be working with the data from this table as arrays. Here are those arrays:
Remember, the numpy package (np for short) provides many handy functions for working with arrays. These are specifically designed to work with arrays and are faster than using Python's built-in functions.
Some frequently used array functions are np.min(), np.max(), np.sum(), np.abs(), and np.round(). There are many more, which you can browse by typing np. into a code cell and hitting the tab key, or by looking at the documentation.
Question 2.1. Assign enrolled_21 and enrolled_22 to the number of students that were enrolled in DSC 10 in Fall 2021 and Fall 2022, respectively.
Question 2.2. How many non-DSC major students from the Fall 2021 offering of DSC 10 came from each of the seven colleges? Your answer should be an array called non_majors_21, with the colleges in the same order as they appear in the table above. For instance, the first element of non_majors_21 should be the number of non-DSC majors in the Fall 2021 offering of DSC 10 who were in Seventh College.
Similarly, how many non-DSC major students from the Fall 2022 offering of DSC 10 came from each of the seven colleges? Your answer should be an array called non_majors_22, with the colleges in the same order as they appear in the table above.
Question 2.3. What percentage of Fall 2021 DSC 10 students from each college were DSC majors? Your answer should be an array called percent_majors_21, with the colleges in the same order as they appear in the table above, and with percentages rounded to two decimal places.
Similarly, what percentage of Fall 2022 DSC 10 students from each college were DSC majors? Your answer should be an array called percent_majors_22, with the colleges in the same order as they appear in the table above, and with percentages rounded to two decimal places.
Question 2.4. For each college, what is the absolute difference in the percentage of students enrolled in DSC 10 that are DSC majors from Fall 2021 to Fall 2022? Use percent_majors_21 and percent_majors_22 to create an array called abs_differences, with the colleges in the same order as they appear in the table above. Make sure the values in your answer are rounded to two decimal places.
Note: You don't need to round again.
For your convenience, we repeat the table from the start of the question below.
| College | Fall 21 Students | Fall 21 DSC Major Students | Fall 22 Students | Fall 22 DSC Major Students |
|---|---|---|---|---|
| Seventh | 28 | 19 | 37 | 21 |
| Sixth | 62 | 45 | 93 | 54 |
| Roosevelt | 22 | 9 | 67 | 39 |
| Warren | 31 | 19 | 61 | 40 |
| Marshall | 49 | 26 | 62 | 32 |
| Muir | 27 | 14 | 38 | 27 |
| Revelle | 30 | 12 | 52 | 31 |
Question 2.5. You might say that the most consistent college is the one with the smallest absolute difference in the percentage of students enrolled in DSC 10 that are DSC majors across the two years. Find the smallest value in the abs_differences array and save it as smallest_abs_diff. Referring back to the table, try to figure out which college that is. Assign most_consistent_college to the name of that college (as a string), exactly as it's displayed in the table.
Note: You can type the name of the college manually.
3. Analyzing NBA Salaries 🏀
The National Basketball Association (NBA) is the premier men's basketball league in North America. The 2022-23 NBA regular season starts on October 18th, just a week after this homework is due!

The file nba_salaries.csv in the data/ directory contains salary information for players who played in the NBA at some point between 1985-2018. See below for a description of all the data we have available.
| Column | Description |
|---|---|
'name' | Player name |
'season' | NBA season |
'salary' | Salary (not adjusted for inflation) |
'team' | Player's current team |
'position' | Position(s) the player was in throughout his career |
'draft_pick' | Overall draft rank (if drafted) |
'draft_year' | Year drafted into the NBA (if drafted) |
The file nba_17.csv in the data/ directory contains a subset of the data, for only the 2017-18 season.
Note that these player salaries are:
Not adjusted for inflation.
Not adjusted for the salary cap. The salary cap is the limit that an NBA team can pay its players in total. This limit gradually increased from around $35M in 2000 to around $156M in 2022, so players playing this season are likely making much more than those in our dataset!
Question 3.1. Read the file containing all salaries from 1985-2018 into a DataFrame called salaries. Read the file containing the 2017-18 salaries into a DataFrame called nba_17.
Note: In Questions 3.2 to 3.6, you will use nba_17. From that point on, you will use salaries.
Question 3.2. Create a new DataFrame, nba_17_id, by setting the index of nba_17 to 'player_id'. Don't change nba_17.
You should think about why we've chosen to set the index to 'player_id' rather than 'name'.
Question 3.3. In the 2017-18 season, Stephen Curry of the Golden State Warriors set a record for the most three-pointers made in an NBA Finals game. Using DataFrame operations, assign curry_17 to his salary during the 2017-18 season. The 'player_id' for Stephen Curry is 'curryst01'.
Question 3.4. Assign sixth_highest_salary to the sixth highest salary during the 2017-18 season. Assign sixth_player_name to the name of this player.
Don't type in the salary or player name by hand; get Python to extract this information for you.
Question 3.5. Suppose we want to analyze the number of years of NBA experience for each player from the 2017-18 NBA season. We will define a player's longevity as follows:
ParseError: KaTeX parse error: Expected 'EOF', got '_' at position 38: …8 - \text{draft_̲year}Starting with the nba_17_id DataFrame, create a new DataFrame called longevity that has an additional column called 'years_played' containing the longevity of each NBA player as a float, sorted so that the players with the most years played are listed first.
Note: For players who were not drafted, their draft_year entry will be NaN, so their longevity will be recorded as NaN, which is okay! NaN stands for "not a number."
Question 3.6. The 2017-18 NBA Finals featured the Golden State Warriors and Cleveland Cavaliers in their fourth straight Finals matchup. What proportion of players on the 2017-18 Warriors and Cavaliers rosters earned at least $15M in salary? Assign this value to prop_15m.
Hint: First make a combined roster with all players from both teams, and then find the required proportion.
For the remainder of this section, we will use the full salaries DataFrame.
Question 3.7. Among all the salaries for players on the Atlanta Hawks from 1985-2018, what is the median salary? Assign this value to hawks_median.
Question 3.8. Assign highest_salary_tm to the team in salaries that has the highest mean player salary.
Hint: Our solution for this question used only one line of code (thanks, groupby)!
Question 3.9. In the NBA, there are five positions: center, power forward, small forward, point guard, and shooting guard. However, more recently, positions have become less important and players can fall into multiple roles. For example, LeBron James usually plays as a small forward, but played as a point guard in the 2020-21 season, and has played as other positions before (look here for a history).
Of all of the players from 1985-2018, we want to examine only the power forwards. Note that we are only considering seasons in which the player was playing power forward. For example, if a player played power forward in 1999 and shooting guard in 2000, we would only consider their 1999 season. Assign pw_forwards to a DataFrame that includes all the power forwards from 1985-2018, sorted so the highest salaries appear first.
Hints:
You may want to use
.str.contains; the Boolean Indexing section of the DSC 10 Reference Sheet shows you how to use it. Why are we using.str.containsand not simply checking for equality?The positions are strings, so they may have inconsistencies in how they're capitalized. If we want to include
'power forward','Power Forward','PoWeR FoRwaRD', and any other variations in capitalization, what operation should we call on the positions first? (You may end up using.strtwice!)If you do this correctly, you'll see the same player appear twice in the first three rows! 🐐
4. Are You Scared Yet? Analyzing Horror Movies 🎃😱

Spooky season is upon us! In honor of All Hallows' Eve, we've provided a file in the data/ directory called horror_movies.csv that contains 464 movies, each with 10 columns (see the table below) that we'll use to generate some insights about the state of horror movies in recent years.
| Column | Description |
|---|---|
'Title' | Title of the movie |
'Country' | Country the movie was originally released in |
'Maturity Rating' | A rating given to the movie by the Motion Picture Association |
'Review Rating' | The IMDB rating of the film |
'Language' | The language the movie is in |
'Filming Locations' | The location in which the movie was filmed |
'Budget' | The total amount spent on the movie |
'Release Month' | The month the movie was released |
'Release Day' | The day the movie was released |
'Run Time' | The length of the film in minutes |
Question 4.1. Read the file containing all of the horror movies into a DataFrame called horror.
Question 4.2. Examine the columns available in horror and consider which would be the best choice of index for this DataFrame. Change the horror DataFrame so that it's indexed by the values in this column instead of the default index.
Note: If you were to run the cell where you set the index of horror again, you'd see an error message. Stop and think about why you'd run into an error. Once you've thought about it, click the thinking emoji below to see the reason for the error.
horror to a column that no longer exists in horror – the column wouldn't exist because it was converted to the index the first time you ran the cell (and the index is not a column)!
If you acually ran the cell twice and got an error message, don't worry. To get rid of it, re-run the cell in 4.1 where you defined the horror DataFrame, then run the cell in 4.2 just once, and you'll be good to go.
When you submit your work for autograding, the entire notebook will be run from start to finish. Each cell will run only once, so it's no problem if your code errors on the second run. In this case, it means you're doing something right!
Question 4.3. The Purge, released in 2013, has had a massive impact on pop culture. What is the 'Review Rating' of 'The Purge (2013)'? Assign the rating to the variable purge_rating.
Question 4.4. Even spooky movies have production costs. Assign most_expensive to the name of the movie with the largest budget (including the year in parentheses), and set the total budget of that movie to petrifying_pricetag.
Question 4.5. Wow, that's a lot of money to spend on a horror movie, but how does that compare to the other movies included in the dataset? Compute the difference between the budget of the most expensive movie and the average movie budget, and assign the result to the variable above_average.
Question 4.6. What proportion of movies in our dataset were released in October? Set the proportion equal to october_prop.
Question 4.7. How many movies in our dataset were released on October 31st? Set the number equal to the variable halloween_count.
Question 4.8. Which movie titles contain the word 'zombie'? Create an array called zombie_movies containing the titles of all such movies. Then assign num_zombie_movies to the number of such movies.
Hints:
To convert a Series into an array, call the function
np.arrayon the Series.The movie names are all strings, so they may have inconsistencies in how they're capitalized. If we want to account for variations in capitalization, what operation should we call on the names first? (You may end up using
.strtwice!)
Question 4.9. What is the highest 'Review Rating' of a movie shot in a language other than English? Assign this rating to the variable foreign_rating.
Note: Some movies are shot in multiple languages, one of which might be English – for instance, the movie Devoured was filmed in 'English|Spanish'. For the purposes of this question, consider such movies to be shot in a language other than English.
Question 4.10. Wow, this was a long question, with 10 parts 😓. You know what's longer? The run time of the longest horror movie in the dataset 😱!
Assign longest_film_name to the name of the longest movie in the dataset (including the year in parentheses). Assign longest_film_length to the run time of this movie in minutes.
Don't type in the name or number of minutes by hand; get Python to extract this information for you.
Finish Line 🏁
Congratulations! You are done with Homework 2.
To submit your assignment:
Select
Kernel -> Restart & Run Allto ensure that you have executed all cells, including the test cells.Read through the notebook to make sure everything is fine and all tests passed.
Run the cell below to run all tests, and make sure that they all pass.
Download your notebook using
File -> Download as -> Notebook (.ipynb), then upload your notebook to Gradescope.
