CoCalc -- notebook.ipynb

Real-time collaboration for Jupyter Notebooks, Linux Terminals, LaTeX, VS Code, R IDE, and more,
all in one place. Commercial Alternative to JupyterHub.

GitHub Repository: veeralakrishna/DataCamp-Project-Solutions-Python
Path: blob/master/A Visual History of Nobel Prize Winners/notebook.ipynb

Kernel: Python 3

1. The most Nobel of Prizes

The Nobel Prize is perhaps the world's best-known scientific award. Besides the honor, prestige, and substantial prize money, the recipient also gets a gold medal showing Alfred Nobel (1833 - 1896), who established the prize. Every year it's given to scientists and scholars in the categories chemistry, literature, physics, physiology or medicine, economics, and peace. The first Nobel Prize was handed out in 1901, and at that time the Prize was very Eurocentric and male-focused, but nowadays it's not biased in any way whatsoever. Surely. Right?

Well, we're going to find out! The Nobel Foundation has made a dataset available of all prize winners from the start of the prize, in 1901, to 2016. Let's load it in and take a look.

In [136]:

# Loading in required libraries
import pandas as pd                                  # for data manipulation
import seaborn as sns                                # for data visualization
# for working with arrays and numerical values
import numpy as np
import klib as kb                                    # for efficient data cleaning
import matplotlib.pyplot as plt                      # for plotting
from matplotlib.ticker import PercentFormatter       # for plotting percent

In [137]:

# Reading in the Nobel Prize data
nobel = pd.read_csv("datasets/nobel.csv")

Let's have an overview of our dataset

In [138]:

nobel.shape  # Returns the dimensions (rows and columns) of our dataset

(911, 18)

In [139]:

nobel.info()  # Display concise information about the dataframe

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 911 entries, 0 to 910
Data columns (total 18 columns):
 #   Column                Non-Null Count  Dtype 
---  ------                --------------  ----- 
 0   year                  911 non-null    int64 
 1   category              911 non-null    object
 2   prize                 911 non-null    object
 3   motivation            823 non-null    object
 4   prize_share           911 non-null    object
 5   laureate_id           911 non-null    int64 
 6   laureate_type         911 non-null    object
 7   full_name             911 non-null    object
 8   birth_date            883 non-null    object
 9   birth_city            883 non-null    object
 10  birth_country         885 non-null    object
 11  sex                   885 non-null    object
 12  organization_name     665 non-null    object
 13  organization_city     667 non-null    object
 14  organization_country  667 non-null    object
 15  death_date            593 non-null    object
 16  death_city            576 non-null    object
 17  death_country         582 non-null    object
dtypes: int64(2), object(16)
memory usage: 128.2+ KB

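As the output shows, the column data types are not appropriately represented: every text column is stored as a generic object. To address this, we can either convert each column by hand with pandas' conversion functions, or use the data_cleaning function from the 'klib' library, which converts all columns in one call. Note that the cleaned result is stored in a separate variable; the rest of the notebook continues with the original nobel DataFrame.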
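
For comparison, here is a minimal sketch of the manual route with plain pandas. The three-row sample frame is hypothetical and only mimics a few of the Nobel columns; the target dtypes are one reasonable choice, not the only one. Unlike the klib output below, where the date columns stay strings, this sketch also parses birth_date into real timestamps.

```python
import pandas as pd

# Hypothetical three-row sample shaped like a few of the Nobel columns.
sample = pd.DataFrame(
    {
        "year": [1901, 1903, 1911],
        "category": ["Physics", "Physics", "Chemistry"],
        "birth_date": ["1845-03-27", "1867-11-07", "1867-11-07"],
    }
)

converted = sample.assign(
    year=sample["year"].astype("int16"),              # small integers fit in 16 bits
    category=sample["category"].astype("category"),   # few distinct values
    birth_date=pd.to_datetime(sample["birth_date"]),  # timestamps instead of strings
)

print(converted.dtypes)
```

The manual route takes more typing, but it makes every conversion explicit, which can help when a column such as a date needs special handling.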

In [140]:

clean_col_names = kb.data_cleaning(nobel)
clean_col_names.info()

Shape of cleaned data: (911, 18) - Remaining NAs: 1912


Dropped rows: 0
     of which 0 duplicates. (Rows (first 150 shown): [])

Dropped columns: 0
     of which 0 single valued.     Columns: []
Dropped missing values: 0
Reduced memory by at least: 0.04 MB (-30.77%)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 911 entries, 0 to 910
Data columns (total 18 columns):
 #   Column                Non-Null Count  Dtype   
---  ------                --------------  -----   
 0   year                  911 non-null    int16   
 1   category              911 non-null    category
 2   prize                 911 non-null    string  
 3   motivation            823 non-null    string  
 4   prize_share           911 non-null    category
 5   laureate_id           911 non-null    int16   
 6   laureate_type         911 non-null    category
 7   full_name             911 non-null    string  
 8   birth_date            883 non-null    string  
 9   birth_city            883 non-null    string  
 10  birth_country         885 non-null    string  
 11  sex                   885 non-null    category
 12  organization_name     665 non-null    string  
 13  organization_city     667 non-null    string  
 14  organization_country  667 non-null    string  
 15  death_date            593 non-null    string  
 16  death_city            576 non-null    string  
 17  death_country         582 non-null    string  
dtypes: category(4), int16(2), string(12)
memory usage: 93.3 KB

Let's check for null values in our dataset

In [141]:

nobel.isnull().sum()

year                      0
category                  0
prize                     0
motivation               88
prize_share               0
laureate_id               0
laureate_type             0
full_name                 0
birth_date               28
birth_city               28
birth_country            26
sex                      26
organization_name       246
organization_city       244
organization_country    244
death_date              318
death_city              335
death_country           329
dtype: int64

As we can see, the dataset contains a significant number of null values, so let's handle them.

In [142]:

# Reload the dataset so the imputation starts from the raw data
nobel = pd.read_csv(
    "../A Visual History of Nobel Prize Winners/datasets/nobel.csv")

# Impute missing values in 'motivation', 'birth_city', 'birth_country', and 'sex'.
# Assigning the result back avoids the chained-assignment pitfalls of
# calling fillna(..., inplace=True) on a single column.
nobel["motivation"] = nobel["motivation"].fillna("No motivation provided")
nobel["birth_city"] = nobel["birth_city"].fillna("Unknown")
nobel["birth_country"] = nobel["birth_country"].fillna("Unknown")
nobel["sex"] = nobel["sex"].fillna("Unknown")

# Convert 'birth_date' to datetime type
nobel["birth_date"] = pd.to_datetime(nobel["birth_date"])

# Calculate the median birth year
median_birth_year = nobel["birth_date"].dropna().dt.year.median()

# Impute missing birth dates based on the median year
nobel["birth_date"] = nobel["birth_date"].fillna(
    pd.to_datetime(str(int(median_birth_year)) + "-01-01")
)

# Drop irrelevant columns
nobel.drop(
    [
        "death_date",
        "death_city",
        "death_country",
        "organization_name",
        "organization_city",
        "organization_country",
    ],
    axis=1,
    inplace=True,
)

nobel.isnull().sum()

year             0
category         0
prize            0
motivation       0
prize_share      0
laureate_id      0
laureate_type    0
full_name        0
birth_date       0
birth_city       0
birth_country    0
sex              0
dtype: int64

Explanation:

Imputation tailored to each column: "motivation" gets a placeholder text, and the categorical columns "birth_city," "birth_country," and "sex" are filled with "Unknown."
Dropping columns: the death and organization columns ("death_date," "death_city," "death_country," "organization_name," "organization_city," "organization_country") have the most missing values and are not used in the analysis below, so they were dropped.
Birth dates: missing values are replaced with January 1 of the median birth year of the laureates whose birth date is known. This is only a placeholder. Any age computed later from an imputed date is not meaningful, which matters for the age plots and for the oldest and youngest winners.
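
The median-year imputation described above can be sketched on a toy frame. The sample rows are hypothetical, and the birth_date_imputed flag is an addition made here for illustration (it is not part of the original notebook); it records which rows received a placeholder so they can be excluded later.

```python
import pandas as pd

# Hypothetical sample: two known birth dates and one missing value.
sample = pd.DataFrame(
    {"birth_date": pd.to_datetime(["1845-03-27", None, "1867-11-07"])}
)

# Remember which rows are about to receive a placeholder date.
sample["birth_date_imputed"] = sample["birth_date"].isna()

# Median of the known birth years (Series.median skips missing values).
median_year = int(sample["birth_date"].dt.year.median())

# Fill the gaps with January 1 of the median year.
placeholder = pd.Timestamp(year=median_year, month=1, day=1)
sample["birth_date"] = sample["birth_date"].fillna(placeholder)

print(sample)
```

With such a flag in place, a later step could filter on it before computing ages, instead of treating the placeholder as a real birth date.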

Let's start with our analysis

In [143]:

# Taking a look at the first several winners
nobel.head(6)

2. So, who gets the Nobel Prize?

Just looking at the first couple of prize winners, or Nobel laureates as they are also called, we already see a celebrity: Wilhelm Conrad Röntgen, the guy who discovered X-rays. And actually, we see that all of the winners in 1901 were guys that came from Europe. But that was back in 1901. Looking at all winners in the dataset, from 1901 to 2016, which sex and which country are the most commonly represented?

(For country, we will use the birth_country of the winner, as the organization_country is NaN for all shared Nobel Prizes.)

In [144]:

# Display the number of (possibly shared) Nobel Prizes handed out between 1901 and 2016
print(nobel["prize_share"].value_counts())

# Display the number of prizes won by male and female recipients.
print(nobel["sex"].value_counts())

# Display the number of prizes won by the top 10 nationalities.
nobel["birth_country"].value_counts().head(10)

prize_share
1/1    344
1/2    306
1/3    201
1/4     60
Name: count, dtype: int64
sex
Male       836
Female      49
Unknown     26
Name: count, dtype: int64

birth_country
United States of America    259
United Kingdom               85
Germany                      61
France                       51
Sweden                       29
Unknown                      26
Japan                        24
Canada                       18
Netherlands                  18
Italy                        17
Name: count, dtype: int64
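
Raw counts like the ones above are often easier to compare as shares. A minimal sketch, using a hypothetical four-value Series rather than the real column: passing normalize=True to value_counts returns proportions that sum to 1.

```python
import pandas as pd

# Hypothetical stand-in for a categorical column such as 'sex'.
sex = pd.Series(["Male", "Male", "Male", "Female"], name="sex")

# normalize=True divides each count by the total number of non-null values.
shares = sex.value_counts(normalize=True)

print(shares)
```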

3. USA dominance

Not so surprising perhaps: the most common Nobel laureate between 1901 and 2016 was a man born in the United States of America. But in 1901 all the winners were European. When did the USA start to dominate the Nobel Prize charts?

In [155]:

# Calculating the proportion of USA born winners per decade
nobel['usa_born_winner'] = nobel['birth_country'] == "United States of America"
nobel['decade'] = (np.floor(nobel['year'] / 10) * 10).astype(int)
prop_usa_winners = nobel.groupby('decade', as_index=False)[
    'usa_born_winner'].mean()

# Display the proportions of USA born winners per decade
display(prop_usa_winners)
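
To make the decade calculation concrete, here is a small sketch on hypothetical rows. It uses integer floor division, which gives the same buckets as the np.floor expression in the cell above, and relies on the fact that the mean of a boolean column is the share of True values.

```python
import pandas as pd

# Hypothetical rows: award year and whether the laureate was born in the USA.
sample = pd.DataFrame(
    {
        "year": [1901, 1909, 1910, 1935, 1939],
        "usa_born_winner": [False, False, True, True, False],
    }
)

# 1901 -> 1900, 1909 -> 1900, 1910 -> 1910, 1935 -> 1930, ...
sample["decade"] = sample["year"] // 10 * 10

# The mean of a boolean column is the proportion of True values per group.
prop = sample.groupby("decade", as_index=False)["usa_born_winner"].mean()

print(prop)
```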

4. USA dominance, visualized

A table is OK, but to see when the USA started to dominate the Nobel charts we need a plot!

In [146]:

# Setting the plotting theme
sns.set()
# and setting the size of all plots.
plt.rcParams['figure.figsize'] = [11, 7]

# Plotting USA born winners
ax = sns.lineplot(x=prop_usa_winners['decade'],
                  y=prop_usa_winners['usa_born_winner'])

# Adding %-formatting to the y-axis
ax.yaxis.set_major_formatter(PercentFormatter(1.0))

5. What is the gender of a typical Nobel Prize winner?

So the USA first became the dominating winner of the Nobel Prize in the 1930s and has kept the leading position ever since. But one group that was in the lead from the start, and never seems to let go, are men. Maybe it shouldn't come as a shock that there is some imbalance between how many male and female prize winners there are, but how significant is this imbalance? And is it better or worse within specific prize categories like physics, medicine, literature, etc.?

In [147]:

# Calculating the proportion of female laureates per decade and category
nobel['female_winner'] = nobel['sex'] == 'Female'
prop_female_winners = nobel.groupby(['decade', 'category'], as_index=False)[
    'female_winner'].mean()

# Plotting female winners per category, with % winners on the y-axis

ax = sns.lineplot(x='decade', y='female_winner',
                  hue='category', data=prop_female_winners)

ax.yaxis.set_major_formatter(PercentFormatter(1.0))
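
When several lines overlap in a plot like this, a wide table with one column per category can be easier to read. The sketch below is an optional alternative view on hypothetical rows, not part of the original analysis: it groups by both keys as in the cell above and then unstacks the category level into columns.

```python
import pandas as pd

# Hypothetical rows in the same shape as the notebook's columns.
sample = pd.DataFrame(
    {
        "decade": [1900, 1900, 1900, 1910],
        "category": ["Physics", "Physics", "Peace", "Chemistry"],
        "female_winner": [True, False, True, True],
    }
)

# Share of female winners per (decade, category), then one column per category.
table = (
    sample.groupby(["decade", "category"])["female_winner"]
    .mean()
    .unstack("category")
)

print(table)
```

Combinations with no prize in the sample show up as NaN, which makes gaps (such as a category that did not exist yet) visible at a glance.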

6. The first woman to win the Nobel Prize

The plot above is a bit messy as the lines are overplotting. But it does show some interesting trends and patterns. Overall the imbalance is pretty large with physics, economics, and chemistry having the largest imbalance. Medicine has a somewhat positive trend, and since the 1990s the literature prize is also now more balanced. The big outlier is the peace prize during the 2010s, but keep in mind that this just covers the years 2010 to 2016.

Given this imbalance, who was the first woman to receive a Nobel Prize? And in what category?

In [181]:

# Picking out the first woman to win a Nobel Prize:
# keep only female laureates, sort by year, and take the first row
first_female_winner = (
    nobel[nobel["sex"] == "Female"].sort_values(
        "year").reset_index(drop=True).iloc[0]
)

# Display the information of the first female winner
first_female_winner

year                                                            1903
category                                                     Physics
prize                                The Nobel Prize in Physics 1903
motivation         "in recognition of the extraordinary services ...
prize_share                                                      1/4
laureate_id                                                        6
laureate_type                                             Individual
full_name                                Marie Curie, née Sklodowska
birth_date                                       1867-11-07 00:00:00
birth_city                                                    Warsaw
birth_country                                Russian Empire (Poland)
sex                                                           Female
usa_born_winner                                                False
decade                                                          1900
female_winner                                                   True
age                                                               36
Name: 0, dtype: object

7. Repeat laureates

For most scientists/writers/activists a Nobel Prize would be the crowning achievement of a long career. But for some people, one is just not enough, and a few have gotten it more than once. Who are these lucky few? (Having won no Nobel Prize myself, I'll assume it's just about luck.)

In [149]:

# Selecting the laureates that have received 2 or more prizes.
nobel.groupby("full_name").filter(lambda group: len(group) >= 2)
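
As a small illustration of what the filter does, the sketch below applies it to three hypothetical rows and compares it with an equivalent boolean mask built from duplicated(keep=False). The names are made up; both approaches keep every row whose full_name occurs at least twice.

```python
import pandas as pd

# Hypothetical laureates: one name appears twice.
sample = pd.DataFrame(
    {
        "full_name": ["A. Example", "B. Example", "A. Example"],
        "year": [1903, 1905, 1911],
    }
)

# Approach used in the notebook: keep groups with two or more rows.
by_filter = sample.groupby("full_name").filter(lambda group: len(group) >= 2)

# Equivalent mask: keep=False marks every occurrence of a repeated name.
by_mask = sample[sample["full_name"].duplicated(keep=False)]

print(by_mask)
```

The mask version avoids calling a Python function once per group, which may matter on larger tables; on a dataset of this size either form is fine.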

8. How old are you when you get the prize?

The list of repeat winners contains some illustrious names! We again meet Marie Curie, who got the prize in physics for her research on radiation and in chemistry for isolating radium and polonium. John Bardeen got it twice in physics for transistors and superconductivity, Frederick Sanger got it twice in chemistry, and Linus Carl Pauling got it first in chemistry and later in peace for his work in promoting nuclear disarmament. We also learn that organizations can get the prize, as both the Red Cross and the UNHCR have gotten it twice.

But how old are you generally when you get the prize?

In [182]:

# Converting birth_date from String to datetime
nobel['birth_date'] = pd.to_datetime(nobel['birth_date'])

# Calculating the age of Nobel Prize winners
nobel['age'] = nobel['year'] - nobel['birth_date'].dt.year

# Plotting the age of Nobel Prize winners
# (the trailing semicolon keeps the returned object from being displayed)
sns.lmplot(x='year', y='age', data=nobel, lowess=True,
           aspect=2, line_kws={'color': 'black'});
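
The age calculation is a simple year difference, so it can be off by one when the birthday falls after the award date. The sketch below shows it on hypothetical rows and also illustrates one assumption-laden design choice: if a missing birth date is left as NaT instead of being filled with a placeholder, the resulting age is NaN and drops out of summaries on its own. The "Organization" label is assumed here for illustration.

```python
import pandas as pd

# Hypothetical rows; the organization has no birth date.
sample = pd.DataFrame(
    {
        "year": [1903, 1904, 2014],
        "laureate_type": ["Individual", "Organization", "Individual"],
        "birth_date": pd.to_datetime(["1867-11-07", None, "1997-07-12"]),
    }
)

# Age at the award: award year minus birth year (NaN where the date is missing).
sample["age"] = sample["year"] - sample["birth_date"].dt.year

# Restrict age statistics to individuals with a known birth date.
individuals = sample[sample["laureate_type"] == "Individual"]

print(individuals[["year", "age"]])
```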

9. Age differences between prize categories

The plot above shows us a lot! We see that people used to be around 55 when they received the prize, but nowadays the average is closer to 65. But there is a large spread in the laureates' ages, and while most are 50+, some are very young.

We also see that the density of points is much higher nowadays than in the early 1900s -- nowadays many more of the prizes are shared, and so there are many more winners. We also see that there was a disruption in awarded prizes around the Second World War (1939 - 1945).

Let's look at age trends within different prize categories.

In [151]:

# Same plot as above, but separate plots for each type of Nobel Prize
sns.lmplot(x='year', y='age', data=nobel, row='category',
           lowess=True, aspect=2, line_kws={'color': 'black'})

<seaborn.axisgrid.FacetGrid at 0x24a54272250>

10. Oldest and youngest winners

More plots with lots of exciting stuff going on! We see that winners of the chemistry, medicine, and physics prizes have gotten older over time. The trend is strongest for physics: the average age used to be below 50, and now it's almost 70. Literature and economics are more stable. We also see that economics is a newer category. But peace shows an opposite trend where winners are getting younger!

In the peace category we also see a winner around 2010 who seems exceptionally young. This raises the question: who are the oldest and youngest people ever to have won a Nobel Prize?

In [152]:

# The oldest winner of a Nobel Prize as of 2016
display(nobel.nlargest(1, "age"))

# The youngest winner of a Nobel Prize as of 2016
display(nobel.nsmallest(1, "age"))
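
A short sketch of how nlargest and nsmallest behave, on hypothetical rows. Two details are worth knowing: rows without an age are dropped explicitly here so they cannot be selected, and keep="all" returns every row tied for the extreme value instead of just the first one.

```python
import pandas as pd

# Hypothetical laureates; "D" has no known age and two rows tie for youngest.
sample = pd.DataFrame(
    {
        "full_name": ["A", "B", "C", "D"],
        "age": [17.0, 90.0, 17.0, None],
    }
)

known = sample.dropna(subset=["age"])

# Oldest laureate (first match only).
oldest = known.nlargest(1, "age")

# Youngest laureates, keeping all rows that share the minimum age.
youngest = known.nsmallest(1, "age", keep="all")

print(oldest)
print(youngest)
```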

11. You get a prize!

Hey! You get a prize for making it to the very end of this notebook! It might not be a Nobel Prize, but I made it myself in paint so it should count for something. But don't despair, Leonid Hurwicz was 90 years old when he got his prize, so it might not be too late for you. Who knows.

Before you leave, what was again the name of the youngest winner ever who in 2014 got the prize for "[her] struggle against the suppression of children and young people and for the right of all children to education"?

In [183]:

# The name of the youngest winner of the Nobel Prize as of 2016
youngest_winner = "Malala Yousafzai"
