Real-time collaboration for Jupyter Notebooks, Linux Terminals, LaTeX, VS Code, R IDE, and more,
all in one place. Commercial Alternative to JupyterHub.

GitHub Repository: DanielBarnes18/IBM-Data-Science-Professional-Certificate
Path: blob/main/08. Data Visualization with Python/02. Exploring the Canada Immigration Dataset.ipynb
Views: ⁵¹⁵⁶

Kernel: Python

Exploring the Canada Immigration Dataset with pandas

The first thing we'll do is import two key data analysis modules: pandas and numpy.

In [2]:

import numpy as np  # useful for many scientific computing in Python
import pandas as pd # primary data structure library

Let's download and import our primary Canadian Immigration dataset using pandas's read_csv() method.

The file was originally downloaded from 'https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMDeveloperSkillsNetwork-DV0101EN-SkillsNetwork/Data Files/Canada.xlsx', and then prepared in the previous notebook.

In [6]:

df_can = pd.read_csv("canada_immigration_data.csv")

Set the country name as index - useful for quickly looking up countries using .loc method

In [7]:

df_can.set_index('Country', inplace=True)
df_can.head()

pandas Intermediate: Indexing and Selection (slicing)

Select Column

There are two ways to filter on a column name:

Method 1: Quick and easy, but only works if the column name does NOT have spaces or special characters.

    df.column_name               # returns series

Method 2: More robust, and can filter on multiple columns.

    df['column']                  # returns series

    df[['column 1', 'column 2']]  # returns dataframe

Example: Let's try filtering on the list of countries ('Country').

In [18]:

df_can.Country  # returns a series

       Afghanistan
           Albania
           Algeria
    American Samoa
           Andorra
            ...      
        Viet Nam
  Western Sahara
           Yemen
          Zambia
        Zimbabwe
Name: Country, Length: 195, dtype: object

Let's try filtering on the list of countries ('Country') and the data for years: 1980 - 1985.

In [19]:

df_can[['Country', 1980, 1981, 1982, 1983, 1984, 1985]] # returns a dataframe
# notice that 'Country' is string, and the years are integers. 
# for the sake of consistency, we will convert all column names to string later on.

Select Row

There are main 2 ways to select rows:

    df.loc[label]    # filters by the labels of the index/column
    df.iloc[index]   # filters by the positions of the index/column

Before we proceed, notice that the default index of the dataset is a numeric range from 0 to 194. This makes it very difficult to do a query by a specific country. For example to search for data on Japan, we need to know the corresponding index value.

This can be fixed very easily by setting the 'Country' column as the index using set_index() method.

In [20]:

df_can.set_index('Country', inplace=True)
# tip: The opposite of set is reset. So to reset the index, we can use df_can.reset_index()

In [21]:

df_can.head(3)

In [22]:

# optional: to remove the name of the index
df_can.index.name = None

Example: Let's view the number of immigrants from Japan (row 87) for the following scenarios: 1. The full row data (all columns) 2. For year 2013 3. For years 1980 to 1985

In [23]:

# 1. the full row data (all columns)
df_can.loc['Japan']

Continent                 Asia
Region            Eastern Asia
DevName      Developed regions
                     701
                     756
                     598
                     309
                     246
                     198
                     248
                     422
                     324
                     494
                     379
                     506
                     605
                     907
                     956
                     826
                     994
                     924
                     897
                    1083
                    1010
                    1092
                     806
                     817
                     973
                    1067
                    1212
                    1250
                    1284
                    1194
                    1168
                    1265
                    1214
                     982
Total                    27707
Name: Japan, dtype: object

In [24]:

# alternate methods
df_can.iloc[87]

Continent                 Asia
Region            Eastern Asia
DevName      Developed regions
                     701
                     756
                     598
                     309
                     246
                     198
                     248
                     422
                     324
                     494
                     379
                     506
                     605
                     907
                     956
                     826
                     994
                     924
                     897
                    1083
                    1010
                    1092
                     806
                     817
                     973
                    1067
                    1212
                    1250
                    1284
                    1194
                    1168
                    1265
                    1214
                     982
Total                    27707
Name: Japan, dtype: object

In [25]:

df_can[df_can.index == 'Japan']

In [26]:

# 2. for year 2013
df_can.loc['Japan', 2013]

982

In [27]:

# alternate method
# year 2013 is the last column, with a positional index of 36
df_can.iloc[87, 36]

982

In [28]:

# 3. for years 1980 to 1985
df_can.loc['Japan', [1980, 1981, 1982, 1983, 1984, 1984]]

  701
  756
  598
  309
  246
  246
Name: Japan, dtype: object

In [29]:

# Alternative Method
df_can.iloc[87, [3, 4, 5, 6, 7, 8]]

  701
  756
  598
  309
  246
  198
Name: Japan, dtype: object

Column names that are integers (such as the years) might introduce some confusion. For example, when we are referencing the year 2013, one might confuse that when the 2013th positional index.

To avoid this ambuigity, let's convert the column names into strings: '1980' to '2013'.

In [30]:

df_can.columns = list(map(str, df_can.columns))
# [print (type(x)) for x in df_can.columns.values] #<-- uncomment to check type of column headers

Since we converted the years to string, let's declare a variable that will allow us to easily call upon the full range of years:

In [31]:

# useful for plotting later on
years = list(map(str, range(1980, 2014)))
years

['1980',
 '1981',
 '1982',
 '1983',
 '1984',
 '1985',
 '1986',
 '1987',
 '1988',
 '1989',
 '1990',
 '1991',
 '1992',
 '1993',
 '1994',
 '1995',
 '1996',
 '1997',
 '1998',
 '1999',
 '2000',
 '2001',
 '2002',
 '2003',
 '2004',
 '2005',
 '2006',
 '2007',
 '2008',
 '2009',
 '2010',
 '2011',
 '2012',
 '2013']

Filtering based on a criteria

To filter the dataframe based on a condition, we simply pass the condition as a boolean vector.

For example, Let's filter the dataframe to show the data on Asian countries (AreaName = Asia).

In [32]:

# 1. create the condition boolean series
condition = df_can['Continent'] == 'Asia'
print(condition)

Afghanistan        True
Albania           False
Algeria           False
American Samoa    False
Andorra           False
                  ...  
Viet Nam           True
Western Sahara    False
Yemen              True
Zambia            False
Zimbabwe          False
Name: Continent, Length: 195, dtype: bool

In [33]:

# 2. pass this condition into the dataFrame
df_can[condition]

In [34]:

# we can pass multiple criteria in the same line.
# let's filter for AreaNAme = Asia and RegName = Southern Asia

df_can[(df_can['Continent']=='Asia') & (df_can['Region']=='Southern Asia')]

# note: When using 'and' and 'or' operators, pandas requires we use '&' and '|' instead of 'and' and 'or'
# don't forget to enclose the two conditions in parentheses

Before we proceed: let's review the changes we have made to our dataframe.

In [35]:

print('data dimensions:', df_can.shape)
print(df_can.columns)
df_can.head(2)

data dimensions: (195, 38)
Index(['Continent', 'Region', 'DevName', '1980', '1981', '1982', '1983',
       '1984', '1985', '1986', '1987', '1988', '1989', '1990', '1991', '1992',
       '1993', '1994', '1995', '1996', '1997', '1998', '1999', '2000', '2001',
       '2002', '2003', '2004', '2005', '2006', '2007', '2008', '2009', '2010',
       '2011', '2012', '2013', 'Total'],
      dtype='object')

Visualizing Data using Matplotlib

Matplotlib: Standard Python Visualization Library

The primary plotting library we will explore in the course is Matplotlib. As mentioned on their website:

Matplotlib is a Python 2D plotting library which produces publication quality figures in a variety of hardcopy formats and interactive environments across platforms. Matplotlib can be used in Python scripts, the Python and IPython shell, the jupyter notebook, web application servers, and four graphical user interface toolkits.

If you are aspiring to create impactful visualization with python, Matplotlib is an essential tool to have at your disposal.

Matplotlib.Pyplot

One of the core aspects of Matplotlib is matplotlib.pyplot. It is Matplotlib's scripting layer which we studied in details in the videos about Matplotlib. Recall that it is a collection of command style functions that make Matplotlib work like MATLAB. Each pyplot function makes some change to a figure: e.g., creates a figure, creates a plotting area in a figure, plots some lines in a plotting area, decorates the plot with labels, etc. In this lab, we will work with the scripting layer to learn how to generate line plots. In future labs, we will get to work with the Artist layer as well to experiment first hand how it differs from the scripting layer.

Let's start by importing matplotlib and matplotlib.pyplot as follows:

In [36]:

# we are using the inline backend
%matplotlib inline 

import matplotlib as mpl
import matplotlib.pyplot as plt

*optional: check if Matplotlib is loaded.

In [37]:

print('Matplotlib version: ', mpl.__version__)  # >= 2.0.0

Matplotlib version:  3.3.4

*optional: apply a style to Matplotlib.

In [38]:

print(plt.style.available)
mpl.style.use(['ggplot']) # optional: for ggplot-like style

['Solarize_Light2', '_classic_test_patch', 'bmh', 'classic', 'dark_background', 'fast', 'fivethirtyeight', 'ggplot', 'grayscale', 'seaborn', 'seaborn-bright', 'seaborn-colorblind', 'seaborn-dark', 'seaborn-dark-palette', 'seaborn-darkgrid', 'seaborn-deep', 'seaborn-muted', 'seaborn-notebook', 'seaborn-paper', 'seaborn-pastel', 'seaborn-poster', 'seaborn-talk', 'seaborn-ticks', 'seaborn-white', 'seaborn-whitegrid', 'tableau-colorblind10']

Plotting in pandas

Fortunately, pandas has a built-in implementation of Matplotlib that we can use. Plotting in pandas is as simple as appending a .plot() method to a series or dataframe.

Documentation:

Line Pots (Series/Dataframe)

What is a line plot and why use it?

A line chart or line plot is a type of plot which displays information as a series of data points called 'markers' connected by straight line segments. It is a basic type of chart common in many fields. Use line plot when you have a continuous data set. These are best suited for trend-based visualizations of data over a period of time.

Let's start with a case study:

In 2010, Haiti suffered a catastrophic magnitude 7.0 earthquake. The quake caused widespread devastation and loss of life and aout three million people were affected by this natural disaster. As part of Canada's humanitarian effort, the Government of Canada stepped up its effort in accepting refugees from Haiti. We can quickly visualize this effort using a Line plot:

Question: Plot a line graph of immigration from Haiti using df.plot().

First, we will extract the data series for Haiti.

In [39]:

haiti = df_can.loc['Haiti', years] # passing in years 1980 - 2013 to exclude the 'total' column
haiti.head()

  1666
  3692
  3498
  2860
  1418
Name: Haiti, dtype: object

Next, we will plot a line plot by appending .plot() to the haiti dataframe.

In [40]:

haiti.plot()

<AxesSubplot:>

pandas automatically populated the x-axis with the index values (years), and the y-axis with the column values (population). However, notice how the years were not displayed because they are of type string. Therefore, let's change the type of the index values to integer for plotting.

Also, let's label the x and y axis using plt.title(), plt.ylabel(), and plt.xlabel() as follows:

In [41]:

haiti.index = haiti.index.map(int) # let's change the index values of Haiti to type integer for plotting
haiti.plot(kind='line')

plt.title('Immigration from Haiti')
plt.ylabel('Number of immigrants')
plt.xlabel('Years')

plt.show() # need this line to show the updates made to the figure

We can clearly notice how number of immigrants from Haiti spiked up from 2010 as Canada stepped up its efforts to accept refugees from Haiti. Let's annotate this spike in the plot by using the plt.text() method.

In [42]:

haiti.plot(kind='line')

plt.title('Immigration from Haiti')
plt.ylabel('Number of Immigrants')
plt.xlabel('Years')

# annotate the 2010 Earthquake. 
# syntax: plt.text(x, y, label)
plt.text(2000, 6000, '2010 Earthquake') # see note below

plt.show()

Quick note on x and y values in plt.text(x, y, label):

 Since the x-axis (years) is type 'integer', we specified x as a year. The y axis (number of immigrants) is type 'integer', so we can just specify the value y = 6000.

    plt.text(2000, 6000, '2010 Earthquake') # years stored as type int

If the years were stored as type 'string', we would need to specify x as the index position of the year. Eg 20th index is year 2000 since it is the 20th year with a base year of 1980.

    plt.text(20, 6000, '2010 Earthquake') # years stored as type int

We will cover advanced annotation methods in later modules.

We can easily add more countries to line plot to make meaningful comparisons immigration from different countries.

Question: Let's compare the number of immigrants from India and China from 1980 to 2013.

Step 1: Get the data set for China and India, and display the dataframe.

In [44]:

### type your answer here
df_CI = df_can.loc[['China','India'], years]
df_CI

Step 2: Plot graph. We will explicitly specify line plot by passing in kind parameter to plot().

In [45]:

### type your answer here
df_CI.plot(kind='line')

<AxesSubplot:>

That doesn't look right...

Recall that pandas plots the indices on the x-axis and the columns as individual lines on the y-axis. Since df_CI is a dataframe with the country as the index and years as the columns, we must first transpose the dataframe using transpose() method to swap the row and columns.

In [46]:

df_CI = df_CI.transpose()
df_CI.head()

pandas will auomatically graph the two countries on the same graph. Go ahead and plot the new transposed dataframe. Make sure to add a title to the plot and label the axes.

In [50]:

### type your answer here

df_CI.index = df_CI.index.map(int) # let's change the index values of df_CI to type integer for plotting
df_CI.plot(kind='line')
plt.title('Immigration from China and India')
plt.ylabel('Number of Immigrants')
plt.xlabel('Year')

plt.show()

From the above plot, we can observe that the China and India have very similar immigration trends through the years.

Note: How come we didn't need to transpose Haiti's dataframe before plotting (like we did for df_CI)?

That's because haiti is a series as opposed to a dataframe, and has the years as its indices as shown below.

print(type(haiti))
print(haiti.head(5))

class 'pandas.core.series.Series'
1980 1666
1981 3692
1982 3498
1983 2860
1984 1418
Name: Haiti, dtype: int64

Line plot is a handy tool to display several dependent variables against one independent variable. However, it is recommended that no more than 5-10 lines on a single graph; any more than that and it becomes difficult to interpret.

Question: Compare the trend of top 5 countries that contributed the most to immigration to Canada.

In [51]:

### type your answer here

#Step 1: Get the dataset. Recall that we created a Total column that calculates cumulative immigration by country. 
    #We will sort on this column to get our top 5 countries using pandas sort_values() method.
    
    inplace = True # paramemter saves the changes to the original df_can dataframe
    df_can.sort_values(by='Total', ascending=False, axis=0, inplace=True)

    # get the top 5 entries
    df_top5 = df_can.head(5)

    # transpose the dataframe
    df_top5 = df_top5[years].transpose() 

    print(df_top5)


    #Step 2: Plot the dataframe. To make the plot more readeable, we will change the size using the `figsize` parameter.
    df_top5.index = df_top5.index.map(int) # let's change the index values of df_top5 to type integer for plotting
    df_top5.plot(kind='line', figsize=(14, 8)) # pass a tuple (x, y) size



    plt.title('Immigration Trend of Top 5 Countries')
    plt.ylabel('Number of Immigrants')
    plt.xlabel('Years')


    plt.show()

      India  China  United Kingdom of Great Britain and Northern Ireland  \
 8880   5123                                              22045      
 8670   6682                                              24796      
 8147   3308                                              20620      
 7338   1863                                              10015      
 5704   1527                                              10170      
 4211   1816                                               9564      
 7150   1960                                               9470      
10189   2643                                              21337      
11522   2758                                              27359      
10343   4323                                              23795      
12041   8076                                              31668      
13734  14255                                              23380      
13673  10846                                              34123      
21496   9817                                              33720      
18620  13128                                              39231      
18489  14398                                              30145      
23859  19415                                              29322      
22268  20475                                              22965      
17241  21049                                              10367      
18974  30069                                               7045      
28572  35529                                               8840      
31223  36434                                              11728      
31889  31961                                               8046      
27155  36439                                               6797      
28235  36619                                               7533      
36210  42584                                               7258      
33848  33518                                               7140      
28742  27642                                               8216      
28261  30037                                               8979      
29456  29622                                               8876      
34235  30391                                               8724      
27509  28502                                               6204      
30933  33024                                               6195      
33087  34129                                               5827      

      Philippines  Pakistan  
       6051       978  
       5921       972  
       5249      1201  
       4562       900  
       3801       668  
       3150       514  
       4166       691  
       7360      1072  
       8639      1334  
      11865      2261  
      12509      2470  
      12718      3079  
      13670      4071  
      20479      4777  
      19532      4666  
      15864      4994  
      13692      9125  
      11549     13073  
       8735      9068  
       9734      9979  
      10763     15400  
      13836     16708  
      11707     15110  
      12758     13205  
      14004     13399  
      18139     14314  
      18400     13127  
      19837     10124  
      24887      8994  
      28573      7217  
      38617      6811  
      36765      7468  
      34315     11227  
      29544     12603  

Other Plots

There are many other plotting styles available other than the default Line plot, all of which can be accessed by passing kind keyword to plot(). The full list of available plots are as follows:

bar for vertical bar plots
barh for horizontal bar plots
hist for histogram
box for boxplot
kde or density for density plots
area for area plots
pie for pie plots
scatter for scatter plots
hexbin for hexbin plot

Real-time collaboration for Jupyter Notebooks, Linux Terminals, LaTeX, VS Code, R IDE, and more,
all in one place. Commercial Alternative to JupyterHub.

Exploring the Canada Immigration Dataset with pandas

pandas Intermediate: Indexing and Selection (slicing)

Select Column

Select Row

Filtering based on a criteria

Visualizing Data using Matplotlib

Matplotlib: Standard Python Visualization Library

Matplotlib.Pyplot

Plotting in pandas

Line Pots (Series/Dataframe)

Other Plots

Product

Resources

Company

Real-time collaboration for Jupyter Notebooks, Linux Terminals, LaTeX, VS Code, R IDE, and more, all in one place. Commercial Alternative to JupyterHub.

Exploring the Canada Immigration Dataset with pandas

pandas Intermediate: Indexing and Selection (slicing)

Select Column

Select Row

Filtering based on a criteria

Visualizing Data using Matplotlib

Matplotlib: Standard Python Visualization Library

Matplotlib.Pyplot

Plotting in pandas

Line Pots (Series/Dataframe)

Other Plots

Real-time collaboration for Jupyter Notebooks, Linux Terminals, LaTeX, VS Code, R IDE, and more,
all in one place. Commercial Alternative to JupyterHub.