GitHub Repository: dsc-courses/dsc10-2022-fa
Path: blob/main/homeworks/hw03/hw03.ipynb
³⁰⁵⁸ views

Kernel: Python 3 (ipykernel)

Homework 3: Data Visualization and Python Functions

Due Tuesday, October 18th at 11:59PM

Welcome to Homework 3! This week, we will cover DataFrame manipulations, making visualizations, and defining functions. You can find additional help on these topics in BPD 10-12 in the babypandas notes and CIT 7-7.3 in the textbook.

Instructions

This assignment is due on Tuesday, October 18th at 11:59PM. You are given six slip days throughout the quarter to extend deadlines. See the syllabus for more details. With the exception of using slip days, late work will not be accepted unless you have made special arrangements with your instructor.

Important: For homeworks, the otter tests don't usually tell you that your answer is correct. More often, they help catch careless mistakes. It's up to you to ensure that your answer is correct. If you're not sure, ask someone (not for the answer, but for some guidance about your approach). These are great questions for office hours (see the schedule on the Calendar) or EdStem. Directly sharing answers is not okay, but discussing problems with the course staff or with other students is encouraged. Please do not use for-loops for any questions in this homework. If you don't know what a for-loop is, don't worry – we haven't covered them yet. But if you do know what they are and are wondering why it's not OK to use them, it is because loops in Python are slow, and looping over arrays and DataFrames should usually be avoided.

In [ ]:

# Please don't change this cell, but do make sure to run it
import babypandas as bpd
import numpy as np

import matplotlib.pyplot as plt
plt.style.use('ggplot')

import otter
grader = otter.Notebook()

1. Winter is Coming ❄️ ⚔️

Game of Thrones is a hit fantasy-adventure series based on George R.R. Martin's book series, "A Song of Ice and Fire." The series is set in a fictional world, where powerful families, or "houses," fight for control of the Seven Kingdoms. In this question, we will investigate these battles. The data used in this question comes from Chris Albon's "The War of the Five Kings" dataset, which contains detailed information on all of the battles in the series.

The file named battles.csv in the data/ directory has a row for each battle, and the following columns.

Column	Description
`'name'`	The name of the battle.
`'year'`	The year in which the battle took place. The dataset contains battles from the years 298 to 300.
`'attacker_king'`	The name of the king who was on the attacking side of the battle.
`'defender_king'`	The name of the king who was on the defending side of the battle.
`'major_death'`	Indicates whether a major character died during the battle. Values of `1.0` mean that some major character died, `0.0` means not.
`'major_capture'`	Indicates whether a major character was captured during the battle. Values of `1.0` mean that some major character was captured, `0.0` means not.
`'attacker_size'`	Number of people in the attacking army.
`'defender_size'`	Number of people in the defending army.
`'summer'`	Indicates if the battle took place in summer. Values of `1.0` mean the battle was in summer, `0.0` means not.
`'location'`	The specific location where the battle took place.
`'region'`	The more general region where the battle took place.
`'attacker_win'`	Indicates whether the attacking army won the battle. Values of `1.0` mean the attackers won, `0.0` means they lost.

First, we'll read the data in as a DataFrame.

In [ ]:

battles = bpd.read_csv('data/battles.csv')
battles

You may notice the DataFrame has many NaN values. NaN means "not a number," and it's how babypandas handles missing values. You don't need to worry about these, as they won't affect any of the calculations. Python knows to ignore them.

Let's explore certain columns to get to know the data a little better. The .describe() method gives us some useful information about a column. Try it out on the name column.

In [ ]:

battles.get('name').describe()

We learn that this column has 38 values, all of which are unique, and as a result the most frequent name appears only once.

If we try this same command on the 'attacker_king' column, we'll see that although there are 36 values (and therefore 2 missing values), there are only 4 distinct values. There are many battles with the same 'attacker_king'. The most common 'attacker_king' is 'Joffrey/Tommen Baratheon', with 14 instances.

In [ ]:

battles.get('attacker_king').describe()

Question 1.1. The index of a DataFrame contains the label (or "name") of each row. Set the index of battles to whichever column makes the most sense. Make sure that you're actually changing the battles DataFrame!

In [ ]:

battles = ...
battles

In [ ]:

grader.check("q1_1")

Question 1.2. Assign weakest_attack to the name of the battle with the smallest attacking army and weakest_defense to the name of the battle smallest defending army.

Similarly, assign strongest_attack to the name of the battle with the largest attacking army and strongest_defense to the name of the battle with the largest defending army

In the case of a tie, choose any one of the battles included in the tie.

Hint: When sorting values, NaN is always sorted to the last position, no matter if you chose to sort in ascending or descending order. Since there are NaN values in this dataset, accessing the last index will probably give you an incorrect answer. Sort accordingly!

In [ ]:

weakest_attack = ...
print("Weakest attack:", weakest_attack)

strongest_attack = ...
print("Strongest attack:", strongest_attack)

weakest_defense = ...
print("Weakest defense:", weakest_defense)

strongest_defense = ...
print("Strongest defense:", strongest_defense)

In [ ]:

grader.check("q1_2")

Question 1.3. Leaders often like to choose where their battles are fought so they can gain an advantage over their opponent. Create a DataFrame named wester_river_north containing only the the battles from 'The Westerlands', 'The Riverlands', and 'The North' regions. All columns of battles should be included.

In [ ]:

wester_river_north = battles[(battles.get('region') == 'The Westerlands') | 
                             (battles.get('region') == 'The Riverlands') | 
                             ...
wester_river_north

In [ ]:

grader.check("q1_3")

Question 1.4. Make an appropriate plot that would help you answer the question,

Among battles from 'The Westerlands', 'The Riverlands', and 'The North' regions, do those with larger 'attacker_size' also have larger 'defender_size'?

You only need to make a plot, you don't need to answer the question above.

Notes:

You may notice that some battles are missing from your plot. This is to be expected – if a battle is missing data for 'attacker_size' or 'defender_size', it won't be displayed in your plot.
You're encouraged to add a title to your plot, using the title optional argument, and to change the axis labels using the plt.xlabel and plt.ylabel functions, but you're not required to.

In [ ]:

# Create your plot here
...

Question 1.5. Some characters in Game of Thrones have a large impact on the plot. When they die in battle, we record their deaths in the 'major_death' column, which has a value of 1.0 if a major death occurred in that battle and 0.0 otherwise.

Create a DataFrame named major_deaths, indexed by 'attacker_king'. This DataFrame should have just one column, named 'num_major', that contains the total number of major deaths that each attacker_king saw in their battles.

Hint: You will need to change the names of columns, which you can do using .assign and .drop. Instead of using .drop, you may want to use .get and pass in a list containing the name of a single column that you want to keep. This was done in Lecture 8.

In [ ]:

major_deaths = ...
major_deaths

In [ ]:

grader.check("q1_5")

Question 1.6. It turns out that there is at least one 'attacker_king' that never had any major deaths in their battles. Below, assign happy_kings to an array of the name(s) of these 'attacker_king'(s). Use Python to extract the name(s) for you; do not type in any name(s).

In [ ]:

happy_kings = ...
happy_kings

In [ ]:

grader.check("q1_6")

Question 1.7. Suppose that you are royalty in the Game of Thrones universe, and you want to conquer more land. You want to team up with the strongest kings, so you're trying to find out who can give you the best attacking army. You want to find armies that:

Are large.
Manage to get rid of their enemies' leaders often.
Win often.

Fortunately, these three properties correspond to three columns in battles. Create a DataFrame called mean_stats, indexed by 'attacker_king', that contains the mean 'attacker_size', 'major_death', and 'attacker_win' for each king. mean_stats should only have these three columns.

In [ ]:

mean_stats = ...
mean_stats

In [ ]:

grader.check("q1_7")

Question 1.8. While it might make for an entertaining show, it would be no fun to participate in a big battle. From the perspective of a soldier in battle, we'll say that a battle is considered "bad" if there were lots of attackers/defenders involved, if important people were captured, and if important people were killed. Additionally, we'll say that battles that take place in summer are considered worse, as the heat makes it harder for soldiers to carry heavy armor and equipment.

A battle's "badness rating" is a weighted average of these characteristics, with the weights as follows:

'attacker_size': 20%
'defender_size': 20%
'major_death': 30%
'major_capture': 20%
'summer': 10%

For example, 'Battle of the Whispering Wood' has:

an 'attacker_size' of 1875,
a 'defender_size' of 6000,
a 'major_death' of 1.0,
a 'major_capture' of 1.0, and
a 'summer' of 1.0,

so calculate_badness("Battle of the Whispering Wood") should return 1575.6.

Define a function called calculate_badness that takes in a battle's name and outputs the battle's "badness rating".

Note: Once you've implemented calculate_badness, you should verify that your function works as intended, both for the example above and for other battles. This is good practice in general!

In [ ]:

def calculate_badness(name):
    ...

In [ ]:

grader.check("q1_8")

Question 1.9. Use the calculate_badness function you've already written, along with the .apply method, to create an array called badness_array that contains the "badness rating" of each battle, in the same order as the rows of the battles DataFrame. Many battles may not have a "badness rating" if some data needed to calculate it is missing.

Hint: Note that the .apply method allows you to apply a function to any column in a DataFrame, but not to the index. Instead, try using .apply on a version of the DataFrame that has the index reset to its default.

In [ ]:

badness_array = ...
badness_array

In [ ]:

grader.check("q1_9")

Question 1.10. Create a DataFrame called with_badness that contains all the columns of battles plus one more called 'badness', containing the values in badness_array. Order the rows in descending order of 'badness'. Make sure with_badness has the same index as battles.

Then, save the name of the battle with the largest "badness rating" as worst_battle.

In [ ]:

with_badness = ...
worst_battle = ...

print('The battle with the largest "badness rating" is:', worst_battle)
with_badness

In [ ]:

grader.check("q1_10")

2. Game On! 🎮

The last section was about Game of Thrones. This section is also about games... video games! (Bad joke, we know 🙄.)

Here, we'll work with a dataset taken from Kaggle that contains sales information for thousands of video games, including some released this year. In the cell below, we load the dataset in as a DataFrame named video_games. Take some time to understand what each column of video_games represents, as we haven't provided you with a description of each column.

In [ ]:

# Run this cell to load the dataset.
video_games = bpd.read_csv('data/video_game_ratings.csv')
video_games

Question 2.1. If you look at the 'votes' column in the DataFrame, you'll notice that there are commas in some of the numbers. For example, in the second row of the DataFrame, the value in the votes column is 36,441. These commas indicate that the 'votes' column contains strings, not integers, since Python never displays integers with commas.

Write a function convert_votes_to_int that takes in a string v as input and outputs v as an integer, after removing any commas.

Then, use your function to update the 'votes' column in the video_games DataFrame so that it contains integers rather than strings. Make sure to "save" your changes in the video_games DataFrame!

Hint: You can use the .assign method to replace values in the 'votes' column, without having to create an additional column.

In [ ]:

def convert_votes_to_int(v):
    ...

In [ ]:

video_games = ...
video_games

In [ ]:

grader.check("q2_1")

Question 2.2. You are curious as to whether there is a relationship between the number of votes a game receives and the rating of the game. Create an appropriate plot that shows the relationship between these two variables.

Then use the plot you made to answer the following question:

What type of ratings do video games with a higher number of votes tend to have?

Assign an integer from 1 to 3 representing your answer to the variable q2_2.

Video games with a higher number of votes tend to have higher ratings.
Video games with a higher number of votes tend to have lower ratings.
There is no association between number of votes and rating.

In [ ]:

# Create your plot here
...

In [ ]:

q2_2 = ...
q2_2

In [ ]:

grader.check("q2_2")

Question 2.3. Assign most_common_genres to a DataFrame that contains the ten most common genres of video games, in descending order. The DataFrame should be indexed by 'genre' and have only one column, 'count', which is the number of video games in that genre.

Note: For this question, we will treat each video game as having only one genre. For example, 'Action, Adventure, Drama' is considered to be its own genre.

In [ ]:

most_common_genres = ...
most_common_genres

In [ ]:

grader.check("q2_3")

Question 2.4. Using the most_common_genres DataFrame you created in Question 2.3, create a horizontal bar chart that shows the distribution of video games into these ten genres. Make sure your plot has the most common genre as the top-most bar in the bar chart.

In [ ]:

# Create your plot here
...

Question 2.5. Assign the variable third_lowest to the genre of video games with the third lowest average rating (among all genres, not just the ones you looked at in Questions 2.3 and 2.4).

Do not manually type out your answer. Use babypandas methods to produce the answer.

Note: Again, we will consider a video game with multiple genres to have only one genre. For example, 'Action, Adventure, Drama' is considered to be its own genre.

In [ ]:

third_lowest = ...
third_lowest

In [ ]:

grader.check("q2_5")

Question 2.6. Create a histogram showing the distribution of video game ratings in the video_games DataFrame.

Remember to set density=True since we always use density histograms and ec='w' to make the separation of the bars more clear. You don't have to set the bins argument.

In [ ]:

# Create your plot here
...

Question 2.7. There is one director who has directed exactly 27 video games alone (with no-codirectors).

Below, assign director_of_27 to the name of this director. Do not manually type out the director's name. Instead, use babypandas methods to extract the name.

In [ ]:

director_of_27 = ...
director_of_27

In [ ]:

grader.check("q2_7")

3. The Best Invention of the 20th Century 🍜

Instant ramen was first invented by Momofuku Ando in 1958 to cure hunger during wartime. It started off with only one kind for its original purpose, but the instant ramen industry has expanded over the years, and now there are over 100 different kinds of instant ramen. At the turn of the millenium, 2000 Japanese citizens even ranked instant ramen as the best invention of the 20th century! Click here to learn more about the history of this quintessential college meal.

We have a dataset of instant ramen ratings from Kaggle. First, we'll read in the data from a CSV. There is no good index, so we will leave it unset.

In [ ]:

ramen_data = bpd.read_csv('data/ramen-rating.csv')
ramen_data

Notice that the 'Country' column contains a country code. We want to convert these country codes into actual country names that everyone can understand.

We'll use a Python dictionary to help us with this conversion. A dictionary is a simple way to map a unique key to a value. For example, the dictionary below maps course codes to course names.

In [ ]:

dsc_courses = {
    # key: value
    'DSC 10': 'Principles of Data Science',
    'DSC 20': 'Programming and Basic Data Structures for Data Science',
    'DSC 30': 'Data Structures and Algorithms for Data Science',
    'DSC 40A': 'Theoretical Foundations of Data Science I',
    'DSC 40B': 'Theoretical Foundations of Data Science II',
    'DSC 80': 'The Practice and Application of Data Science'
}

We can access the value corresponding to each key using bracket notation.

In [ ]:

dsc30_name = dsc_courses['DSC 30']
dsc30_name

Here, 'DSC 30' is the key and 'Data Structures and Algorithms for Data Science' is the value.

Let's use a dictionary to help us with our country code to country name conversion. Below is a dictionary containing country codes as keys and country names as values for each of the countries in our ramen dataset.

In [ ]:

# Run this cell, DO NOT change it.
country_codes = {
    'AU':'Australia',
    'BD':'Bangladesh', 
    'BR':'Brazil', 
    'KH':'Cambodia' , 
    'CA':'Canada', 
    'CN':'China',
    'CO':'Colombia', 
    'DXB':'Dubai' , 
    'EE':'Estonia' , 
    'FIJI':'Fiji', 
    'FI':'Finland' , 
    'DE':'Germany',
    'GHAN':'Ghana' , 
    'NL':'Holland', 
    'HK':'Hong Kong', 
    'HU':'Hungary', 
    'IN':'India', 
    'ID':'Indonesia',
    'JP':'Japan', 
    'MY':'Malaysia', 
    'MX':'Mexico', 
    'MM':'Myanmar', 
    'NP':'Nepal', 
    'AN':'Netherlands',
    'NG':'Nigeria', 
    'PK':'Pakistan', 
    'PH':'Philippines', 
    'PL':'Poland', 
    'SWK':'Sarawak',
    'SG':'Singapore', 
    'KOR':'South Korea', 
    'SE':'Sweden', 
    'TW':'Taiwan', 
    'TH':'Thailand', 
    'UK' :'United Kingdom' ,
    'USA':'United States', 
    'VN':'Vietnam' 
    }

Question 3.1. Using the dictionary country_codes, define a function named code_to_country that takes as input a country code and returns the corresponding country's name. This should only take one line of code.

Hints:

If you're stuck, take a look at the DSC 30 example above.
Once you've implemented code_to_country, you should verify that it works as intended by trying a few examples yourself. The provided tests will not do this for you.

In [ ]:

def code_to_country(code):
    ...

In [ ]:

grader.check("q3_1")

Question 3.2. Use your code_to_country function and the .apply method to convert all of the country codes in the 'Country' column of ramen_data into country names. Do this without creating an additional column or reordering the existing columns. Assign the resulting DataFrame to the variable name ramen.

In [ ]:

ramen = ...
ramen

In [ ]:

grader.check("q3_2")

🚨 Important: For the rest of the questions in this section, use the DataFrame ramen instead of ramen_data.

Question 3.3. Define a function named word_count that returns the number of words in a ramen's 'Variety'. It should take as input a string from the 'Variety' column and return the number of words in that string. We'll consider a piece of text to be a word if and only if it is separated from adjacent words by a space. For example:

word_count('Cup Noodles Chicken Vegetable') should return 4.
word_count('Tonkotsu-Shoyu Rich Pork Flavor Ramen') should return 5. Notice that 'Tonkotsu-Shoyu' counts as one word.

Hint: The string method .split will be helpful.

In [ ]:

def word_count(variety):
    ...
    
# Test cases for your own reference. Feel free to test out more!
print(word_count('Cup Noodles Chicken Vegetable'))  # Should print 4
print(word_count('Tonkotsu-Shoyu Rich Pork Flavor Ramen')) # Should print 5

In [ ]:

grader.check("q3_3")

Question 3.4. Create a DataFrame called with_word_count with columns from left to right 'Brand', 'Country', 'Variety' and 'Stars' and a new column 'Word_Count' that has the word count for each variety. Sort the DataFrame in descending order of 'Word_Count'.

Note: The 'Country' column should have full country names, not codes.

In [ ]:

with_word_count = ...
with_word_count

In [ ]:

grader.check("q3_4")

Question 3.5. How many words does the longest ramen 'Variety' have? Assign this number to most_ramen_words. How many words does the shortest ramen 'Variety' have? Assign this number to fewest_ramen_words. What is the absolute difference between these values? Assign this number to range_ramen_words.

In [ ]:

most_ramen_words = ...
fewest_ramen_words = ...
range_ramen_words = ...

print('Most ramen words', most_ramen_words)
print('Fewest ramen words:', fewest_ramen_words)
print('Range of ramen words:', range_ramen_words)

In [ ]:

grader.check("q3_5")

Question 3.6. Create a function named mean_word_count that takes as an input the name of a ramen brand and returns the average 'Word_Count' for all ramen varieties belonging to that brand.

In [ ]:

def mean_word_count(brand):
    ...

In [ ]:

grader.check("q3_6")

Question 3.7. Create a horizontal bar chart that displays the mean word count for all ramen brands that have more than 10 varieties. Sort the bars so the brands whose varieties have the most words on average appear at the very top, and those with the fewest words on average appear at the bottom.

Hints:

If you use .groupby more than once on the same DataFrame, the order of rows will be the same, even with different aggregation methods, as long as the column you group by is the same.
To get the bar chart to display nicely, try adjusting the optional figsize argument, as we did in Lecture 8.

In [ ]:

# Create your plot here
...

Question 3.8. Define a function named point_total that takes in a full country name and returns a point total for that country's ramen, according to the following scheme:

1 point for every variety of ramen that has at least 1 star and less than 3 stars,
2 points for every variety with at least 3 stars and less than 4 stars, and
3 points for every variety with at least 4 stars (and at most 5 stars, which is the maximum possible).

|Points Received | Stars (Condition)| | --- | --- | |1| $[1,3)$ | |2| $[3, 4)$ | |3| $[4,5]$ |

Hint: Make sure that your function works for countries that don't have varieties of ramen at every possible number of stars. If you aren't able to accomplish this using grouping, try another strategy! Remember, don't use a for-loop. There is a better solution using DataFrame manipulations.

In [ ]:

def point_total(country):
    ...

In [ ]:

grader.check("q3_8")

Question 3.9. Among the five countries listed below, which has the highest point total, using the points system from Question 3.8?

'United States'
'Canada'
'Sweden'
'China'
'Japan'

Save the name (not country code) of the country as country and the country's number of points as points. You can set the value of country and points by hand for this question based on the output of the function you just wrote, for various inputs.

In [ ]:

country = ...
points = ...

In [ ]:

grader.check("q3_9")

Note: While they don't serve instant ramen, a new ramen restaurant just opened on campus, named Underbelly. It's the second ramen restaurant on campus, joining Showa, which opened earlier this year. Try them out and see how many stars they deserve! ⭐

4. Final Stretch 🧘‍♀️

Suppose we have a DataFrame called data with two numerical columns, 'x' and 'y'. Consider the following scatter plot, which was generated by calling data.plot(kind='scatter', x='x', y='y'):

Now consider these two histograms:

Histogram A

Histogram B

Question 4.1. Which of the following lines of code generated Histogram A? Assign either 1, 2, 3, or 4 to which_code.

data.plot(kind='hist', density=False, y='x')
data.plot(kind='hist', density=False, y='y')
data.plot(kind='hist', density=True, y='x')
data.plot(kind='hist', density=True, y='y')

In [ ]:

which_code = ...

In [ ]:

grader.check("q4_1")

Question 4.2. Suppose we run this block of code:

new_data = bpd.DataFrame().assign(
    x = data.get('x') / 6,
    y = data.get('y')
)

We then run

new_data.plot(kind='hist', density=True, y='x').

How will this histogram look compared to the histogram created by

data.plot(kind='hist', density=True, y='x'),

assuming both histograms are drawn on the same axes? Assign histogram_difference to either 1, 2, 3, or 4, corresponding to your choice.

The new_data histogram will be wider and taller than the data histogram.
The new_data histogram will be wider and shorter than the data histogram.
The new_data histogram will be narrower and taller than the data histogram.
The new_data histogram will be narrower and shorter than the data histogram.

Hint: Look at the end of Lecture 8 for an example of two histograms drawn on the same axes.

In [ ]:

histogram_difference = ...

In [ ]:

grader.check("q4_2")

Question 4.3. Below, we show Histogram B again.

What percent of values in Histogram B are between -4 (inclusive) and -2 (exclusive)? While we cannot answer this question exactly since we do not know where the bins start and end, we can still approximate the answer. Assign the variable percent_between to a number 1 through 5, corresponding to the closest answer.

In [ ]:

percent_between = ...

In [ ]:

grader.check("q4_3")

Finish Line

To submit your assignment:

Select Kernel -> Restart & Run All to ensure that you have executed all cells, including the test cells.
Read through the notebook to make sure everything is fine and all tests passed.
Run the cell below to run all tests, and make sure that they all pass.
Download your notebook using File -> Download as -> Notebook (.ipynb), then upload your notebook to Gradescope.

In [ ]:

grader.check_all()