GitHub Repository: suyashi29/python-su
Path: blob/master/Data Science using Python/Day 4 Data Visualization.ipynb

Kernel: Python 3 (ipykernel)

Data Visualization

Objectives

Create Data Visualization with Python
Use various Python libraries for visualization

https://mockaroo.com/ (for sample data generation)

The Dataset: Immigration to Canada from 1980 to 2013

In [1]:

import numpy as np  # numerical and scientific computing in Python
import pandas as pd # data structures (Series, DataFrame) and analysis tools
can=pd.read_csv("canadian_im.csv")

Let's view the top 2 rows of the dataset using the head() function.

In [2]:

can.head(2)
#can.tail(2)
# tip: You can specify the number of rows you'd like to see as follows: can.tail(2)

Out[2]:

When analyzing a dataset, it's always a good idea to start by getting basic information about your dataframe. The shape attribute gives its dimensions (rows, columns), and the info() method prints a short summary of the dataframe.

In [3]:

can.shape

Out[3]:

(195, 39)

Insight

We have immigration information for 195 countries from 1980 to 2013.

In [4]:

can.info()

Out[4]:

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 195 entries, 0 to 194
Data columns (total 39 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   Country    195 non-null    object
 1   Continent  195 non-null    object
 2   Region     195 non-null    object
 3   DevName    195 non-null    object
 4   1980       195 non-null    int64 
 5   1981       195 non-null    int64 
 6   1982       195 non-null    int64 
 7   1983       195 non-null    int64 
 8   1984       195 non-null    int64 
 9   1985       195 non-null    int64 
 10  1986       195 non-null    int64 
 11  1987       195 non-null    int64 
 12  1988       195 non-null    int64 
 13  1989       195 non-null    int64 
 14  1990       195 non-null    int64 
 15  1991       195 non-null    int64 
 16  1992       195 non-null    int64 
 17  1993       195 non-null    int64 
 18  1994       195 non-null    int64 
 19  1995       195 non-null    int64 
 20  1996       195 non-null    int64 
 21  1997       195 non-null    int64 
 22  1998       195 non-null    int64 
 23  1999       195 non-null    int64 
 24  2000       195 non-null    int64 
 25  2001       195 non-null    int64 
 26  2002       195 non-null    int64 
 27  2003       195 non-null    int64 
 28  2004       195 non-null    int64 
 29  2005       195 non-null    int64 
 30  2006       195 non-null    int64 
 31  2007       195 non-null    int64 
 32  2008       195 non-null    int64 
 33  2009       195 non-null    int64 
 34  2010       195 non-null    int64 
 35  2011       195 non-null    int64 
 36  2012       195 non-null    int64 
 37  2013       195 non-null    int64 
 38  Total      195 non-null    int64 
dtypes: int64(35), object(4)
memory usage: 59.5+ KB

To get the list of column headers, we can use the dataframe's columns attribute.

In [5]:

can.columns

Out[5]:

Index(['Country', 'Continent', 'Region', 'DevName', '1980', '1981', '1982',
       '1983', '1984', '1985', '1986', '1987', '1988', '1989', '1990', '1991',
       '1992', '1993', '1994', '1995', '1996', '1997', '1998', '1999', '2000',
       '2001', '2002', '2003', '2004', '2005', '2006', '2007', '2008', '2009',
       '2010', '2011', '2012', '2013', 'Total'],
      dtype='object')

Similarly, to get the list of indices, we use the index attribute.

In [6]:

can.index

Out[6]:

RangeIndex(start=0, stop=195, step=1)

To get the index and columns as lists, we can use the tolist() method.

In [7]:

can.columns.tolist()

Out[7]:

['Country',
 'Continent',
 'Region',
 'DevName',
 '1980',
 '1981',
 '1982',
 '1983',
 '1984',
 '1985',
 '1986',
 '1987',
 '1988',
 '1989',
 '1990',
 '1991',
 '1992',
 '1993',
 '1994',
 '1995',
 '1996',
 '1997',
 '1998',
 '1999',
 '2000',
 '2001',
 '2002',
 '2003',
 '2004',
 '2005',
 '2006',
 '2007',
 '2008',
 '2009',
 '2010',
 '2011',
 '2012',
 '2013',
 'Total']

In [8]:

can.index.tolist()

Out[8]:

In [9]:

print(type(can.columns.tolist()))
print(type(can.index.tolist()))

Out[9]:

<class 'list'>
<class 'list'>

Statistical Analysis

In [10]:

can.isnull().sum()

Out[10]:

Country      0
Continent    0
Region       0
DevName      0
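1980         0
1981         0
1982         0
1983         0
1984         0
1985         0
1986         0
1987         0
1988         0
1989         0
1990         0
1991         0
1992         0
1993         0
1994         0
1995         0
1996         0
1997         0
1998         0
1999         0
2000         0
2001         0
2002         0
2003         0
2004         0
2005         0
2006         0
2007         0
2008         0
2009         0
2010         0
2011         0
2012         0
2013         0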
Total        0
dtype: int64

Finally, let's view a quick summary of each column in our dataframe using the describe() method.

In [11]:

can.describe()

Out[11]:

In [12]:

can.describe(include="object")

Out[12]:

Insights

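195 countries in total
22 distinct regions; Western Asia is the most frequent value, meaning it contains the most countries (this says nothing about immigration volume)
About 78% of the countries are classified as Developing regions (a share of countries, not of immigrants)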
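
The summary from describe(include="object") counts how often each label occurs, so it describes countries rather than immigrants. A minimal sketch of the difference, using a small made-up table (the rows and numbers are hypothetical and are not taken from canadian_im.csv):

```python
import pandas as pd

# Hypothetical toy data: NOT the real immigration figures
toy = pd.DataFrame({
    "Country": ["A", "B", "C", "D"],
    "DevName": ["Developing regions", "Developing regions",
                "Developing regions", "Developed regions"],
    "Total": [100, 200, 300, 1400],
})

# Share of countries per development group (what describe() reflects)
country_share = toy["DevName"].value_counts(normalize=True)

# Share of immigrants per development group (needs the Total column)
immigrant_share = toy.groupby("DevName")["Total"].sum() / toy["Total"].sum()

print(country_share)
print(immigrant_share)
```

In this toy table three of four countries are developing, yet they account for a minority of the immigrants, so the two shares can differ a lot.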

pandas Intermediate: Indexing and Selection (slicing)

Select Column

There are two ways to filter on a column name:

Method 1: Quick and easy, but only works if the column name does NOT have spaces or special characters.

    df.column_name               # returns series

Method 2: More robust, and can filter on multiple columns.

    df['column']                  # returns series

    df[['column 1', 'column 2']]  # returns dataframe
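
The two methods can be compared on a small stand-in dataframe. This is only a sketch: the frame `toy` and its column 'Dev Name' are hypothetical and exist purely to show a name containing a space.

```python
import pandas as pd

# Hypothetical toy frame; 'Dev Name' deliberately contains a space
toy = pd.DataFrame({"Country": ["A", "B"],
                    "Dev Name": ["x", "y"],
                    "1980": [1, 2]})

by_attr = toy.Country              # Method 1: attribute access, returns a Series
by_key = toy["Dev Name"]           # Method 2: works even with a space in the name
subset = toy[["Country", "1980"]]  # Method 2 with a list, returns a DataFrame

print(type(by_attr).__name__, type(by_key).__name__, type(subset).__name__)
```

Attribute access also cannot reach a column such as '1980', because `toy.1980` is not valid Python syntax; the bracket form is the safer default for the year columns.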

Example: Let's try filtering on the list of countries ('Country').

In [13]:

can.Country  # returns a series

Out[13]:

0         Afghanistan
1             Albania
2             Algeria
3      American Samoa
4             Andorra
            ...      
190          Viet Nam
191    Western Sahara
192             Yemen
193            Zambia
194          Zimbabwe
Name: Country, Length: 195, dtype: object

In [14]:

a=can[["Country"]]
a

Out[14]:

Let's try filtering on the list of countries ('Country') and the data for years: 1980 - 1985.

In [15]:

can[['Country', '1980', '1981', '1982', '1983', '1984', '1985']] # returns a dataframe
# notice that all the column labels, including the years, are strings.

Out[15]:

Select Row

There are two main ways to select rows:

    df.loc[label]    # filters by the labels of the index/column
    df.iloc[index]   # filters by the positions of the index/column
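
A minimal sketch of the difference between label-based and position-based selection, assuming a hypothetical three-row frame `toy` indexed by string labels:

```python
import pandas as pd

# Hypothetical toy frame with string labels as the index
toy = pd.DataFrame({"1980": [10, 20, 30], "1981": [11, 21, 31]},
                   index=["A", "B", "C"])

by_label = toy.loc["B", "1981"]   # row label 'B', column label '1981'
by_position = toy.iloc[1, 1]      # second row, second column
label_slice = toy.loc["A":"B"]    # label slices include the end label
position_slice = toy.iloc[0:2]    # position slices exclude the end position

print(by_label, by_position, len(label_slice), len(position_slice))
```

Both lookups reach the same cell, and both slices return two rows even though their end points are written differently.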

Before we proceed, notice that the default index of the dataset is a numeric range from 0 to 194. This makes it very difficult to query a specific country. For example, to search for data on Japan, we need to know the corresponding index value.

This can be fixed very easily by setting the 'Country' column as the index using set_index() method.

In [18]:

can.iloc[3:6]

Out[18]:

Setting Country as index column

In [19]:

a1=[1,2,3]
a1[-1]
a2=pd.Series(a1)
a2

Out[19]:

0    1
1    2
2    3
dtype: int64

In [20]:

can.head()

Out[20]:

In [21]:

can.set_index('Country', inplace=True)
# tip: The opposite of set is reset. So to reset the index, we can use can.reset_index()
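
A short sketch of how set_index() and reset_index() relate, using a hypothetical two-row frame `toy`. Without inplace=True, both methods return a new dataframe and leave the original untouched:

```python
import pandas as pd

# Hypothetical toy frame standing in for the immigration table
toy = pd.DataFrame({"Country": ["A", "B"], "Total": [5, 7]})

indexed = toy.set_index("Country")  # returns a new frame; toy is unchanged
restored = indexed.reset_index()    # moves 'Country' back to a regular column

print(indexed.loc["B", "Total"])
print(restored.columns.tolist())
```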

In [22]:

can.tail(3)

Out[22]:

# optional: to remove the name of the index
can.index.name = None

Example: Let's view the number of immigrants from Japan (row 87) for the following scenarios:

1. The full row data (all columns)
2. For year 2013
3. For years 1980 to 1985

In [23]:

# 1. the full row data (all columns)
can.loc['Japan']

Out[23]:

Continent                 Asia
Region            Eastern Asia
DevName      Developed regions
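1980                       701
1981                       756
1982                       598
1983                       309
1984                       246
1985                       198
1986                       248
1987                       422
1988                       324
1989                       494
1990                       379
1991                       506
1992                       605
1993                       907
1994                       956
1995                       826
1996                       994
1997                       924
1998                       897
1999                      1083
2000                      1010
2001                      1092
2002                       806
2003                       817
2004                       973
2005                      1067
2006                      1212
2007                      1250
2008                      1284
2009                      1194
2010                      1168
2011                      1265
2012                      1214
2013                       982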
Total                    27707
Name: Japan, dtype: object

# alternate methods
can.iloc[87]

In [24]:

can[can.index == 'Japan']

Out[24]:

Quick Practice

How many immigrants came from India in 2000?
How many immigrants came from China in 1998?
How many immigrants came from India in each year from 1988 to 2000?

In [25]:

# 2. for year 2013
can.loc['Japan', '2013']

Out[25]:

982

# alternate method
# year 2013 is the second-to-last column (just before 'Total'), with a positional index of 36
can.iloc[87, 36]

In [26]:

# 3. for years 1980 to 1985
can.loc['Japan', ['1980', '1981', '1982', '1983', '1984', '1985']]

Out[26]:

1980    701
1981    756
1982    598
1983    309
1984    246
1985    198
Name: Japan, dtype: object

# Alternative Method
can.iloc[87, [3, 4, 5, 6, 7, 8]]

Column names that are integers (such as the years) can cause confusion. For example, a reference to the year 2013 could be mistaken for the 2013th positional index.

To avoid this ambiguity, let's make sure the column names are strings: '1980' to '2013'. Out[5] shows that they are already strings in this dataset, so the conversion below acts as a safeguard.

In [27]:

can.columns = list(map(str, can.columns))
# [print(type(x)) for x in can.columns.values] #<-- uncomment to check type of column headers
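
To see what the conversion does when the labels really are integers, here is a minimal sketch on a hypothetical one-row frame `toy` (the labels and values are made up for illustration):

```python
import pandas as pd

# Hypothetical toy frame whose year labels are integers
toy = pd.DataFrame([[1, 2, 3]], columns=[1980, 1981, "Total"])

before = [type(c).__name__ for c in toy.columns]
toy.columns = list(map(str, toy.columns))
after = [type(c).__name__ for c in toy.columns]

print(before)
print(after)
```

After the conversion every label is a str, so toy['1980'] is unambiguous and cannot be read as a position.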

Since we converted the years to string, let's declare a variable that will allow us to easily call upon the full range of years:

In [28]:

# useful for plotting later on
years = list(map(str, range(1980, 2014)))
years

Out[28]:

['1980',
 '1981',
 '1982',
 '1983',
 '1984',
 '1985',
 '1986',
 '1987',
 '1988',
 '1989',
 '1990',
 '1991',
 '1992',
 '1993',
 '1994',
 '1995',
 '1996',
 '1997',
 '1998',
 '1999',
 '2000',
 '2001',
 '2002',
 '2003',
 '2004',
 '2005',
 '2006',
 '2007',
 '2008',
 '2009',
 '2010',
 '2011',
 '2012',
 '2013']

Filtering based on a criterion

To filter the dataframe based on a condition, we simply pass the condition as a boolean vector.

For example, let's filter the dataframe to show the data on Asian countries (Continent = Asia).

In [29]:

# 1. create the condition boolean series
Asian_Info = can['Continent'] == 'Asia'
print(Asian_Info )

Out[29]:

Country
Afghanistan        True
Albania           False
Algeria           False
American Samoa    False
Andorra           False
                  ...  
Viet Nam           True
Western Sahara    False
Yemen              True
Zambia            False
Zimbabwe          False
Name: Continent, Length: 195, dtype: bool

In [30]:

# 2. pass this condition into the dataFrame
can[Asian_Info]

Out[30]:

QP

Create a condition to filter Africa data

In [31]:

# we can pass multiple criteria in the same line.
# let's filter for Continent = Asia and Region = Southern Asia
# & (AND) is True only when both conditions are True; | (OR) is True when at least one is True

can[(can['Continent']=='Asia') & (can['Region']=='Southern Asia')]

# note: When using 'and' and 'or' operators, pandas requires we use '&' and '|' instead of 'and' and 'or'
# don't forget to enclose the two conditions in parentheses

Out[31]:
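
The reason for '&' and '|' can be seen on a small example. This is a sketch on a hypothetical three-row frame `toy`; the last step deliberately triggers the error that Python's 'and' produces on a Series:

```python
import pandas as pd

# Hypothetical toy frame
toy = pd.DataFrame({"Continent": ["Asia", "Asia", "Africa"],
                    "Region": ["Southern Asia", "Eastern Asia", "Northern Africa"]},
                   index=["A", "B", "C"])

is_asia = toy["Continent"] == "Asia"
is_south = toy["Region"] == "Southern Asia"

both = toy[is_asia & is_south]    # element-wise AND
either = toy[is_asia | is_south]  # element-wise OR

try:
    toy[is_asia and is_south]     # 'and' needs a single True/False value
except ValueError as err:
    error_message = str(err)
    print("ValueError:", error_message)
```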

Before we proceed: let's review the changes we have made to our dataframe.

In [32]:

print('data dimensions:', can.shape)
print(can.columns)
can.head(2)

Out[32]:

data dimensions: (195, 38)
Index(['Continent', 'Region', 'DevName', '1980', '1981', '1982', '1983',
       '1984', '1985', '1986', '1987', '1988', '1989', '1990', '1991', '1992',
       '1993', '1994', '1995', '1996', '1997', '1998', '1999', '2000', '2001',
       '2002', '2003', '2004', '2005', '2006', '2007', '2008', '2009', '2010',
       '2011', '2012', '2013', 'Total'],
      dtype='object')

Visualizing Data using Matplotlib

Matplotlib: Standard Python Visualization Library

The primary plotting library we will explore in the course is Matplotlib. As mentioned on their website:

Matplotlib is a Python 2D plotting library which produces publication quality figures in a variety of hardcopy formats and interactive environments across platforms. Matplotlib can be used in Python scripts, the Python and IPython shell, the jupyter notebook, web application servers, and four graphical user interface toolkits.

If you are aspiring to create impactful visualization with python, Matplotlib is an essential tool to have at your disposal.

Matplotlib.Pyplot

One of the core aspects of Matplotlib is the matplotlib.pyplot module.

Let's start by importing matplotlib and matplotlib.pyplot as follows:

In [33]:

# we are using the inline backend
%matplotlib inline

import matplotlib as mpl
import matplotlib.pyplot as plt

Optional: check that Matplotlib is loaded.

In [34]:

print('Matplotlib version: ', mpl.__version__)  # >= 2.0.0

Out[34]:

Matplotlib version:  3.5.1

In [35]:

print(plt.style.available)
mpl.style.use(['ggplot']) # optional: for ggplot-like style

Out[35]:

['Solarize_Light2', '_classic_test_patch', '_mpl-gallery', '_mpl-gallery-nogrid', 'bmh', 'classic', 'dark_background', 'fast', 'fivethirtyeight', 'ggplot', 'grayscale', 'seaborn', 'seaborn-bright', 'seaborn-colorblind', 'seaborn-dark', 'seaborn-dark-palette', 'seaborn-darkgrid', 'seaborn-deep', 'seaborn-muted', 'seaborn-notebook', 'seaborn-paper', 'seaborn-pastel', 'seaborn-poster', 'seaborn-talk', 'seaborn-ticks', 'seaborn-white', 'seaborn-whitegrid', 'tableau-colorblind10']

Plotting in pandas

Fortunately, pandas has a built-in plotting interface that wraps Matplotlib. Plotting in pandas is as simple as appending a .plot() method to a series or dataframe.

Documentation:

Line Plots (Series/Dataframe)

What is a line plot and why use it?

A line chart or line plot is a type of plot which displays information as a series of data points called 'markers' connected by straight line segments. It is a basic type of chart common in many fields. Use line plot when you have a continuous data set. These are best suited for trend-based visualizations of data over a period of time.

Let's start with a case study:

In 2010, Haiti suffered a catastrophic magnitude 7.0 earthquake. The quake caused widespread devastation and loss of life, and about three million people were affected by this natural disaster. As part of Canada's humanitarian effort, the Government of Canada stepped up its effort in accepting refugees from Haiti. We can quickly visualize this effort using a line plot:

Question: Plot a line graph of immigration from Haiti using df.plot().

(Q.P - Plot a line graph of immigration from China using df.plot())

First, we will extract the data series for Haiti.

In [36]:

haiti = can.loc['Haiti', years] # passing in years 1980 - 2013 to exclude the 'total' column
haiti.head()

Out[36]:

1980    1666
1981    3692
1982    3498
1983    2860
1984    1418
Name: Haiti, dtype: object

Next, we will draw a line plot by appending .plot() to the haiti series.

In [37]:

haiti.plot()

Out[37]:

<AxesSubplot:>

pandas automatically populated the x-axis with the index values (years) and the y-axis with the series values (number of immigrants). However, notice that the years are not displayed properly because they are of type string. Therefore, let's change the type of the index values to integer for plotting.

Also, let's add a title and label the x and y axes using plt.title(), plt.ylabel(), and plt.xlabel() as follows:

In [38]:

haiti.index = haiti.index.map(int) # let's change the index values of Haiti to type integer for plotting
haiti.plot(kind='line')

plt.title('Immigration from Haiti')
plt.ylabel('Number of immigrants')
plt.xlabel('Years')

plt.show() # need this line to show the updates made to the figure

Out[38]:

We can clearly see that the number of immigrants from Haiti spiked from 2010 onward as Canada stepped up its efforts to accept refugees from Haiti. Let's annotate this spike in the plot by using the plt.text() function.

In [47]:

haiti.plot(kind='line', figsize=(16, 8))

plt.title('Immigration from Haiti')
plt.ylabel('Number of Immigrants')
plt.xlabel('Years')

# annotate the 2010 Earthquake. 
# syntax: plt.text(x, y, label)
plt.text(1985,3800, "peak of Immigration")
plt.text(2009, 6500, '2010 Earthquake') # see note below

plt.show()

Out[47]:

With just a few lines of code, you were able to quickly identify and visualize the spike in immigration!

Quick note on x and y values in plt.text(x, y, label):

Since the x-axis (years) is of type integer, we specify x as a year. The y-axis (number of immigrants) is also of type integer, so we can simply specify the value y = 6000.

    plt.text(2000, 6000, '2010 Earthquake') # years stored as type int

If the years were stored as type string, we would need to specify x as the index position of the year. E.g. position 20 corresponds to the year 2000, since it is 20 years after the base year of 1980.

    plt.text(20, 6000, '2010 Earthquake') # years stored as type str
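
A minimal sketch of the string-index case, assuming a hypothetical five-year series with made-up values. With string labels, pandas places the points at positions 0, 1, 2, ... on the x-axis, so the annotation is positioned by index rather than by year:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs outside a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical values; the index holds the years as strings
years_str = [str(y) for y in range(1980, 1985)]
series = pd.Series([5, 9, 7, 4, 6], index=years_str)

ax = series.plot(kind="line")
# x=2 is the third position, i.e. the label '1982'
label = plt.text(2, 7.5, "position 2 = 1982")

print(label.get_position())
```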

We will cover advanced annotation methods in later modules.

We can easily add more countries to the line plot to make meaningful comparisons of immigration from different countries.

Question: Let's compare the number of immigrants from India and China from 1980 to 2013.

Q.P: Compare immigrants from Sri Lanka and Bhutan from 1980 to 2013.

Step 1: Get the data set for China and India, and display the dataframe.

In [40]:

df_CI = can.loc[['India', 'China'], years]
df_CI

Out[40]:

Step 2: Plot graph. We will explicitly specify line plot by passing in kind parameter to plot().

In [41]:



df_CI.plot(kind='line')

Out[41]:

<AxesSubplot:xlabel='Country'>

That doesn't look right...

Recall that pandas plots the indices on the x-axis and the columns as individual lines on the y-axis. Since df_CI is a dataframe with the country as the index and years as the columns, we must first transpose the dataframe using transpose() method to swap the row and columns.

In [42]:

df_CI = df_CI.transpose()
df_CI.head(2)

Out[42]:

pandas will automatically graph the two countries on the same graph. Go ahead and plot the new transposed dataframe. Make sure to add a title to the plot and label the axes.

In [43]:



df_CI.index = df_CI.index.map(int) # let's change the index values of df_CI to type integer for plotting
df_CI.plot(kind='line',figsize=(15, 8))

plt.title('Immigrants from China and India')
plt.ylabel('Number of Immigrants')
plt.xlabel('Years')

plt.show()

Out[43]:

From the above plot, we can observe that China and India have very similar immigration trends through the years.

Note: How come we didn't need to transpose Haiti's dataframe before plotting (like we did for df_CI)?

That's because haiti is a series as opposed to a dataframe, and has the years as its indices as shown below.

print(type(haiti))
print(haiti.head(5))

<class 'pandas.core.series.Series'>
1980    1666
1981    3692
1982    3498
1983    2860
1984    1418
Name: Haiti, dtype: int64

A line plot is a handy tool to display several dependent variables against one independent variable. However, it is recommended to draw no more than 5-10 lines on a single graph; any more than that and it becomes difficult to interpret.

Question: Compare the trend of top 5 countries that contributed the most to immigration to Canada.

Q.P: Compare the trend of bottom 5 countries that contributed least to immigration to Canada.

In [48]:

# The correct answer is:
# Step 1: Get the dataset. Recall that the dataset includes a Total column holding cumulative immigration by country.
# We will sort on this column to get our top 5 countries using the pandas sort_values() method.

# inplace=True saves the changes to the original can dataframe
can.sort_values(by='Total', ascending=False, axis=0, inplace=True)

# get the top 5 entries
df_top5 = can.head(5)

# transpose the dataframe
df_top5 = df_top5[years].transpose() 

print(df_top5)

Out[48]:

Country  India  China  United Kingdom of Great Britain and Northern Ireland  \
1980      8880   5123                                                 22045
1981      8670   6682                                                 24796
1982      8147   3308                                                 20620
1983      7338   1863                                                 10015
1984      5704   1527                                                 10170
1985      4211   1816                                                  9564
1986      7150   1960                                                  9470
1987     10189   2643                                                 21337
1988     11522   2758                                                 27359
1989     10343   4323                                                 23795
1990     12041   8076                                                 31668
1991     13734  14255                                                 23380
1992     13673  10846                                                 34123
1993     21496   9817                                                 33720
1994     18620  13128                                                 39231
1995     18489  14398                                                 30145
1996     23859  19415                                                 29322
1997     22268  20475                                                 22965
1998     17241  21049                                                 10367
1999     18974  30069                                                  7045
2000     28572  35529                                                  8840
2001     31223  36434                                                 11728
2002     31889  31961                                                  8046
2003     27155  36439                                                  6797
2004     28235  36619                                                  7533
2005     36210  42584                                                  7258
2006     33848  33518                                                  7140
2007     28742  27642                                                  8216
2008     28261  30037                                                  8979
2009     29456  29622                                                  8876
2010     34235  30391                                                  8724
2011     27509  28502                                                  6204
2012     30933  33024                                                  6195
2013     33087  34129                                                  5827

Country  Philippines  Pakistan
1980            6051       978
1981            5921       972
1982            5249      1201
1983            4562       900
1984            3801       668
1985            3150       514
1986            4166       691
1987            7360      1072
1988            8639      1334
1989           11865      2261
1990           12509      2470
1991           12718      3079
1992           13670      4071
1993           20479      4777
1994           19532      4666
1995           15864      4994
1996           13692      9125
1997           11549     13073
1998            8735      9068
1999            9734      9979
2000           10763     15400
2001           13836     16708
2002           11707     15110
2003           12758     13205
2004           14004     13399
2005           18139     14314
2006           18400     13127
2007           19837     10124
2008           24887      8994
2009           28573      7217
2010           38617      6811
2011           36765      7468
2012           34315     11227
2013           29544     12603

In [49]:





# Step 2: Plot the dataframe. To make the plot more readable, we will change the size using the `figsize` parameter.
df_top5.index = df_top5.index.map(int) # let's change the index values of df_top5 to type integer for plotting
df_top5.plot(kind='line', figsize=(15, 8)) # pass a tuple (x, y) size



plt.title('Immigration Trend of Top 5 Countries')
plt.ylabel('Number of Immigrants')
plt.xlabel('Years')


plt.show()

Out[49]:
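
As an alternative to sorting the whole dataframe in place, pandas also offers DataFrame.nlargest(), which returns the top rows for a column and leaves the original order untouched. A sketch on a hypothetical four-row frame `toy` (the values are made up):

```python
import pandas as pd

# Hypothetical toy frame with two year columns and a Total column
toy = pd.DataFrame({"1980": [5, 1, 9, 3],
                    "1981": [6, 2, 8, 4],
                    "Total": [11, 3, 17, 7]},
                   index=["A", "B", "C", "D"])
toy_years = ["1980", "1981"]

top2 = toy.nlargest(2, "Total")              # two rows with the largest Total
top2_by_year = top2[toy_years].transpose()   # years as rows, ready for plotting

print(top2_by_year)
```

The matching method for the smallest values is DataFrame.nsmallest().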

Area Plots

Area plots are stacked by default. And to produce a stacked area plot, each column must be either all positive or all negative values (any NaN, i.e. not a number, values will default to 0). To produce an unstacked plot, set parameter stacked to value False.

In [50]:

can.sort_values(['Total'], ascending=False, axis=0, inplace=True)

# get the top 3 entries
df_top3 = can.head(3)

# transpose the dataframe
df_top3 = df_top3[years].transpose()

df_top3.head()

Out[50]:

In [51]:

# let's change the index values of df_top3 to type integer for plotting
df_top3.index = df_top3.index.map(int)
df_top3.plot(kind='area',
             stacked=False,
             figsize=(20, 10))  # pass a tuple (x, y) size

plt.title('Immigration Trend of Top 3 Countries')
plt.ylabel('Number of Immigrants')
plt.xlabel('Years')

plt.show()

Out[51]:

The unstacked plot has a default transparency (alpha value) at 0.5. We can modify this value by passing in the alpha parameter.
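
A minimal sketch of overriding the default transparency, assuming a hypothetical two-column frame `toy` with made-up values. Reading the alpha back from the filled regions relies on pandas drawing each area as a Matplotlib collection, which is an implementation detail rather than a documented guarantee:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs outside a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical values for two countries over three years
toy = pd.DataFrame({"A": [3, 5, 4], "B": [2, 6, 1]}, index=[1980, 1981, 1982])

ax = toy.plot(kind="area", stacked=False, alpha=0.25, figsize=(8, 4))
ax.set_title("Unstacked area plot with alpha=0.25")

# Each filled region is a collection on the Axes; read back its alpha
alphas = [c.get_alpha() for c in ax.collections]
print(alphas)
```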

Two types of plotting

Option 1: Scripting layer (procedural method) - using matplotlib.pyplot as 'plt'

You can use plt i.e. matplotlib.pyplot and add more elements by calling different methods procedurally; for example, plt.title(...) to add title or plt.xlabel(...) to add label to the x-axis.

# Option 1: This is what we have been using so far
df_top5.plot(kind='area', alpha=0.35, figsize=(20, 10)) 
plt.title('Immigration trend of top 5 countries')
plt.ylabel('Number of immigrants')
plt.xlabel('Years')

Option 2: Artist layer (object-oriented method) - using an Axes instance from Matplotlib (preferred)

You can use an Axes instance of your current plot and store it in a variable (eg. ax). You can add more elements by calling methods with a little change in syntax (by adding "set_" to the previous methods). For example, use ax.set_title() instead of plt.title() to add title, or ax.set_xlabel() instead of plt.xlabel() to add label to the x-axis.

This option sometimes is more transparent and flexible to use for advanced plots (in particular when having multiple plots, as you will see later).

In this course, we will stick to the scripting layer, except for some advanced visualizations where we will need to use the artist layer to manipulate advanced aspects of the plots.

In [53]:

# option 2: preferred option with more flexibility
ax = df_top3.plot(kind='area', alpha=0.9, figsize=(20, 10))

ax.set_title('Immigration Trend of Top 3 Countries')
ax.set_ylabel('Number of Immigrants')

ax.set_xlabel('Years')

Out[53]:

Text(0.5, 0, 'Years')
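
The flexibility of the artist layer shows most clearly with several plots in one figure. The following sketch assumes a hypothetical two-column frame `toy` and uses plt.subplots() to create two Axes, each of which is then configured through its own set_ methods:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs outside a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical values for two countries over three years
toy = pd.DataFrame({"A": [3, 5, 4], "B": [2, 6, 1]}, index=[1980, 1981, 1982])

fig, (ax_line, ax_area) = plt.subplots(1, 2, figsize=(12, 4))

toy.plot(kind="line", ax=ax_line)            # draw on the left Axes
ax_line.set_title("Line plot")
ax_line.set_xlabel("Years")

toy.plot(kind="area", alpha=0.5, ax=ax_area)  # draw on the right Axes
ax_area.set_title("Stacked area plot")
ax_area.set_xlabel("Years")

print(ax_line.get_title(), "|", ax_area.get_title())
```

With plt.title() only the currently active Axes would be affected, which is why the Axes methods are easier to reason about here.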

Question: Use the scripting layer to create a stacked area plot of the 5 countries that contributed the least to immigration to Canada from 1980 to 2013. Use a transparency value of 0.45.

In [ ]:

# The correct answer is:
# get the 5 countries with the least contribution
df_least5 = can.tail(5)

# transpose the dataframe
df_least5 = df_least5[years].transpose()
df_least5.head()

df_least5.index = df_least5.index.map(int) # let's change the index values of df_least5 to type integer for plotting
df_least5.plot(kind='area', alpha=0.45, figsize=(20, 10)) 

plt.title('Immigration Trend of 5 Countries with Least Contribution to Immigration')
plt.ylabel('Number of Immigrants')
plt.xlabel('Years')

plt.show()

Quick Practice

Question: Use the artist layer to create an unstacked area plot of the 5 countries that contributed the least to immigration to Canada from 1980 to 2013. Use a transparency value of 0.55.

In [ ]:

# get the 5 countries with the least contribution
df_least5 = can.tail(5)

# transpose the dataframe
df_least5 = df_least5[years].transpose()

df_least5.head()

df_least5.index = df_least5.index.map(int) # let's change the index values of df_least5 to type integer for plotting

ax = df_least5.plot(kind='area', alpha=0.55, stacked=False, figsize=(20, 10))

ax.set_title('Immigration Trend of 5 Countries with Least Contribution to Immigration')
ax.set_ylabel('Number of Immigrants')
ax.set_xlabel('Years')

Histograms

A histogram is a way of representing the frequency distribution of a numeric dataset. It partitions the x-axis into bins, assigns each data point in our dataset to a bin, and then counts the number of data points that have been assigned to each bin. So the y-axis is the frequency, or the number of data points in each bin. Note that we can change the bin size, and usually one needs to tweak it so that the distribution is displayed nicely.

Question: What is the frequency distribution of the number (population) of new immigrants from the various countries to Canada in 2013?

Before we proceed with creating the histogram plot, let's first examine the data split into intervals. To do this, we will use NumPy's histogram function to get the bin ranges and frequency counts as follows:


Example: binning age data (data range: 10-25), with the count in each bin in parentheses:

10-15: Children (100)
15-18: Teenagers (50)
18-20: Young adults (40)
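
The bin edges 10, 15, 18 and 20 from this example can be passed to np.histogram explicitly. A minimal sketch with a handful of hypothetical ages (the data is made up and much smaller than the counts quoted above):

```python
import numpy as np

# Hypothetical ages, chosen only to illustrate binning
ages = np.array([10, 11, 14, 15, 16, 17, 18, 19, 20])

# Edges 10, 15, 18, 20 define the bins [10, 15), [15, 18) and [18, 20]
count, bin_edges = np.histogram(ages, bins=[10, 15, 18, 20])

print(count)
print(bin_edges)
```

Every bin includes its left edge and excludes its right edge, except the last bin, which includes both. That is why the age 15 falls into the second bin while 20 is still counted in the third.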

In [54]:

# let's quickly view the 2013 data
can['2013'].head(10)

Out[54]:

Country
India                                                   33087
China                                                   34129
United Kingdom of Great Britain and Northern Ireland     5827
Philippines                                             29544
Pakistan                                                12603
United States of America                                 8501
Iran (Islamic Republic of)                              11291
Sri Lanka                                                2394
Republic of Korea                                        4509
Poland                                                    852
Name: 2013, dtype: int64

In [55]:

# np.histogram returns 2 values: the frequency count per bin and the bin edges
import numpy as np
count, bin_edges = np.histogram(can['2013'])

print(count) # frequency count
print(bin_edges) # bin ranges, default = 10 bins

Out[55]:

[178  11   1   2   0   0   0   0   1   2]
[    0.   3412.9  6825.8 10238.7 13651.6 17064.5 20477.4 23890.3 27303.2
 30716.1 34129. ]

By default, the `histogram` function breaks up the dataset into 10 bins. From the counts and bin edges above, we can see that in 2013:

178 countries contributed between 0 and 3412.9 immigrants
11 countries contributed between 3412.9 and 6825.8 immigrants
1 country contributed between 6825.8 and 10238.7 immigrants, and so on.

We can easily graph this distribution by passing kind='hist' to plot().

In [56]:

can['2013'].plot(kind='hist', figsize=(15, 6))

# add a title to the histogram
plt.title('Histogram of Immigration from 195 Countries in 2013')
# add y-label
plt.ylabel('Number of Countries')
# add x-label
plt.xlabel('Number of Immigrants')

plt.show()

Out[56]:

In the above plot, the x-axis represents the population range of immigrants in intervals of 3412.9. The y-axis represents the number of countries that contributed to the aforementioned population.

Notice that the x-axis tick labels do not match the bin edges. This can be fixed by passing in an xticks keyword that contains the list of bin edges, as follows:

In [57]:

# 'bin_edges' is an array of bin edges
count, bin_edges = np.histogram(can['2013'])

can['2013'].plot(kind='hist', figsize=(15, 5), xticks=bin_edges)

plt.title('Histogram of Immigration from 195 countries in 2013') # add a title to the histogram
plt.ylabel('Number of Countries') # add y-label
plt.xlabel('Number of Immigrants') # add x-label

plt.show()

Out[57]:

We can also plot multiple histograms on the same plot. For example, let's try to answer the following questions using a histogram.

Question: What is the immigration distribution for Denmark, Norway, and Sweden for years 1980 - 2013?

In [58]:

# let's quickly view the dataset 
can.loc[['Denmark', 'Norway', 'Sweden'], years]
# generate histogram
can.loc[['Denmark', 'Norway', 'Sweden'], years].plot.hist()

Out[58]:

<AxesSubplot:ylabel='Frequency'>

Instead of plotting the frequency distribution of immigration for the 3 countries, pandas plotted the frequency distribution for each of the `years`.

This can be easily fixed by first transposing the dataset, and then plotting as shown below.

In [59]:

# transpose dataframe
df_t = can.loc[['Denmark', 'Norway', 'Sweden'], years].transpose()
df_t.head()

Out[59]:

In [60]:

# generate histogram
df_t.plot(kind='hist', figsize=(15, 6))

plt.title('Histogram of Immigration from Denmark, Norway, and Sweden between 1980 - 2013')
plt.ylabel('Number of Years')
plt.xlabel('Number of Immigrants')

plt.show()

Out[60]:

Let's make a few modifications to improve the impact and aesthetics of the previous plot:

increase the number of bins to 15 by passing in the bins parameter;
make the bars slightly transparent (80% opaque) by passing in alpha=0.8;
label the x-axis with plt.xlabel;
change the colors of the plots by passing in the color parameter.

In [62]:

# let's get the x-tick values
count, bin_edges = np.histogram(df_t, 15)

# un-stacked histogram
df_t.plot(kind ='hist', 
          figsize=(12, 6),
          bins=15,
          alpha=0.8,
          xticks=bin_edges,
          color=['yellow', 'green', 'pink']
         )

plt.title('Histogram of Immigration from Denmark, Norway, and Sweden from 1980 - 2013')
plt.ylabel('Number of Years')
plt.xlabel('Number of Immigrants')

plt.show()

Out[62]:

If we do not want the plots to overlap each other, we can stack them using the stacked parameter. Let's also adjust the min and max x-axis limits to remove the extra gap on the edges of the plot. We can pass a tuple (min, max) using the xlim parameter, as shown below.

In [65]:

count, bin_edges = np.histogram(df_t, 15)
xmin = bin_edges[0] - 10   #  first bin edge is 31.0, subtracting a buffer of 10 for aesthetic purposes
xmax = bin_edges[-1] + 10  #  last bin edge is 308.0, adding a buffer of 10 for aesthetic purposes

# stacked Histogram
df_t.plot(kind='hist',
          figsize=(15, 10), 
          bins=15,
          xticks=bin_edges,
          color=['coral', 'darkred', 'mediumseagreen'],
          stacked=True,
          xlim=(xmin, xmax)
         )

plt.title('Histogram of Immigration from Denmark, Norway, and Sweden from 1980 - 2013')
plt.ylabel('Number of Years')
plt.xlabel('Number of Immigrants') 

plt.show()

Out[65]:
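The cells above get their tick positions from np.histogram(df_t, 15), which pools the values of all three countries and so gives one set of bin edges shared by every series. With shared edges, the per-country counts in each bin add up to the pooled count, which is the total bar height a stacked histogram shows. A minimal sketch on made-up numbers (the toy frame and its column names are hypothetical):

```python
import numpy as np
import pandas as pd

# Toy stand-in for df_t: one column per country, values are made up
toy = pd.DataFrame({"A": [1, 4, 9, 16], "B": [2, 3, 5, 8], "C": [10, 20, 30, 31]})

# Pool every value to get one set of 15 bins spanning the overall min and max
values = toy.to_numpy().ravel()
count, bin_edges = np.histogram(values, bins=15)

# Count each column separately, reusing the shared edges
per_column = [np.histogram(toy[c], bins=bin_edges)[0] for c in toy.columns]

# Summing the per-column counts recovers the pooled counts bin by bin
stacked_height = np.sum(per_column, axis=0)
print(np.array_equal(stacked_height, count))  # True
```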

Question: Write code to display the immigration distribution for Greece, Albania, and Bulgaria for the years 1980 - 2013. Use an overlapping plot with 15 bins and a transparency value of 0.35.

In [ ]:


# create a dataframe of the countries of interest (cof)
df_cof = can.loc[['Greece', 'Albania', 'Bulgaria'], years]

# transpose the dataframe
df_cof = df_cof.transpose()

# let's get the x-tick values
count, bin_edges = np.histogram(df_cof, 15)

# un-stacked histogram
df_cof.plot(kind='hist',
            figsize=(10, 6),
            bins=15,
            alpha=0.35,
            xticks=bin_edges,
            color=['coral', 'darkslateblue', 'mediumseagreen']
           )

plt.title('Histogram of Immigration from Greece, Albania, and Bulgaria from 1980 - 2013')
plt.ylabel('Number of Years')
plt.xlabel('Number of Immigrants')

plt.show()

Bar Charts

A bar plot is a way of representing data where the length of the bars represents the magnitude/size of the feature/variable. Bar graphs usually represent numerical and categorical variables grouped in intervals.

To create a bar plot, we can pass one of two arguments via the kind parameter in plot():

kind='bar' creates a vertical bar plot
kind='barh' creates a horizontal bar plot

Compare the number of Icelandic immigrants (country = 'Iceland') to Canada from year 1980 to 2013.

In [66]:

# step 1: get the data
df_iceland = can.loc['Iceland', years]
df_iceland.head()

Out[66]:

1980    17
1981    33
1982    10
1983     9
1984    13
Name: Iceland, dtype: object

In [67]:

# step 2: plot data
df_iceland.plot(kind='barh', figsize=(15, 7), color="darkgreen", alpha=0.90)
plt.xlabel('Number of immigrants') # with horizontal bars, the bar length along the x-axis is the count
plt.ylabel('Year') # with horizontal bars, each position on the y-axis is one year
plt.title('Icelandic immigrants to Canada from 1980 to 2013') # add title to the plot

plt.show()

Out[67]:

The bar plot above shows the number of Icelandic immigrants for each year. We can clearly see the impact of the financial crisis; the number of immigrants from Iceland to Canada started increasing rapidly after 2008.

Let's annotate this on the plot using the annotate method of the scripting layer or the pyplot interface. We will pass in the following parameters:

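s: str, the text of the annotation (this parameter is named text in recent Matplotlib releases; passing it positionally, as in the cell below, works either way).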
xy: Tuple specifying the (x,y) point to annotate (in this case, end point of arrow).
xytext: Tuple specifying the (x,y) point to place the text (in this case, start point of arrow).
xycoords: The coordinate system that xy is given in - 'data' uses the coordinate system of the object being annotated (default).
arrowprops: Takes a dictionary of properties to draw the arrow:
arrowstyle: Specifies the arrow style, '->' is standard arrow.
connectionstyle: Specifies the connection type. arc3 is a straight line.
color: Specifies color of arrow.
lw: Specifies the line width.

Read the Matplotlib documentation for more details on annotations: http://matplotlib.org/api/pyplot_api.html#matplotlib.pyplot.annotate.

In [68]:

df_iceland.plot(kind='bar', figsize=(15, 8), rot=90,color="pink")  # rotate the xticks(labelled points on x-axis) by 90 degrees

plt.xlabel('Year')
plt.ylabel('Number of Immigrants')
plt.title('Icelandic Immigrants to Canada from 1980 to 2013')

# Annotate arrow
plt.annotate('Increase Rate of Immigration',  # annotation text; pass '' for an arrow with no text
             xy=(33, 72),  # place head of the arrow at bar position 33 (year 2013), height 72
             xytext=(27, 20),  # place base of the arrow at bar position 27 (year 2007), height 20
             xycoords='data',  # will use the coordinate system of the object being annotated
             arrowprops=dict(arrowstyle='->', connectionstyle='arc3', color='darkred')
             )

plt.show()

Out[68]:
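The x-values passed to annotate above are bar positions, not years: in a pandas bar plot the bars sit at 0, 1, 2, ... in index order, and the year strings are only tick labels. A small sketch of the conversion, assuming `years` holds the strings '1980' to '2013' as elsewhere in this notebook (the helper name bar_position is hypothetical):

```python
# Assumption: the bars follow the order of `years`, the strings '1980' .. '2013'
years = list(map(str, range(1980, 2014)))

def bar_position(year):
    """Return the x-position of the bar for a given year (hypothetical helper)."""
    return years.index(str(year))

print(bar_position(1980))  # 0
print(bar_position(2007))  # 27
print(bar_position(2013))  # 33
```

Computing positions this way keeps the arrow attached to the intended bars if the range of years changes.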

Horizontal Bar Plot

Sometimes it is more practical to represent the data horizontally, especially if you need more room for labelling the bars. In horizontal bar graphs, the y-axis is used for labelling, and the length of bars on the x-axis corresponds to the magnitude of the variable being measured. As you will see, there is more room on the y-axis to label categorical variables.

Question: Using the scripting layer and the can dataset, create a horizontal bar plot showing the total number of immigrants to Canada from the top 15 countries, for the period 1980 - 2013. Label each country with the total immigrant count.

In [ ]:

### type your answer here


# sort by total immigration so the largest contributors come last
can.sort_values(by='Total', ascending=True, inplace=True)

# get top 15 countries
df_top15 = can['Total'].tail(15)
df_top15

Step 2: Plot data:

Use kind='barh' to generate a bar chart with horizontal bars.
Make sure to choose a good size for the plot, label your axes, and give the plot a title.
Loop through the countries and annotate the immigrant population using the annotate function of the scripting interface.

In [ ]:

df_top15.plot(kind='barh', figsize=(15, 15), color='green')
plt.xlabel('Number of Immigrants')
plt.title('Top 15 Countries Contributing to the Immigration to Canada between 1980 - 2013')

plt.show()
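The annotation loop described in Step 2 could look like the sketch below. It uses a small made-up Series in place of df_top15 (the country names and totals are hypothetical), and the label offset and colors are one choice among many rather than the notebook's own solution:

```python
import matplotlib.pyplot as plt
import pandas as pd

# Made-up totals standing in for df_top15 (not values from the dataset)
totals = pd.Series({"Country A": 1200, "Country B": 3400, "Country C": 5600})

ax = totals.plot(kind="barh", figsize=(8, 4), color="green")
ax.set_xlabel("Number of Immigrants")
ax.set_title("Totals with value labels (toy data)")

# Horizontal bars sit at y = 0, 1, 2, ... in index order
for position, value in enumerate(totals):
    ax.annotate(
        f"{value:,}",               # label text with a thousands separator
        xy=(value, position),       # anchor at the right end of the bar
        xytext=(-5, 0),             # shift the label 5 points back inside the bar
        textcoords="offset points",
        ha="right", va="center", color="white",
    )
```

In a notebook the figure is displayed when the cell finishes; in a script, call plt.show() afterwards. To apply it to the real data, replace totals with df_top15.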

Quick Practice:

Please compare Immigration for Japan and China
Create a viz to show immigration for bottom 5 countries
Compare immigration rate for Asia and S. Africa

In [ ]: