GitHub Repository: suyashi29/python-su
Path: blob/master/Data Science using Python/Day 4 Data Visualization.ipynb
Kernel: Python 3 (ipykernel)

Data Visualization

Objectives

  • Create Data Visualization with Python

  • Use various Python libraries for visualization

The Dataset: Immigration to Canada from 1980 to 2013

import numpy as np  # useful for scientific computing in Python
import pandas as pd  # primary data structure library

can = pd.read_csv("canadian_im.csv")

Let's view the top 2 rows of the dataset using the head() function.

can.head(2)
# can.tail(2)
# tip: you can specify the number of rows you'd like to see, e.g. can.tail(2)

When analyzing a dataset, it's always a good idea to start by getting basic information about your dataframe. We can do this by using the info() method.

This method can be used to get a short summary of the dataframe.

can.shape
(195, 39)

Insight

  • We have immigration information for 195 countries from 1980 to 2013

can.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 195 entries, 0 to 194
Data columns (total 39 columns):
 #   Column     Non-Null Count  Dtype
---  ------     --------------  -----
 0   Country    195 non-null    object
 1   Continent  195 non-null    object
 2   Region     195 non-null    object
 3   DevName    195 non-null    object
 4   1980       195 non-null    int64
 5   1981       195 non-null    int64
 6   1982       195 non-null    int64
 7   1983       195 non-null    int64
 8   1984       195 non-null    int64
 9   1985       195 non-null    int64
 10  1986       195 non-null    int64
 11  1987       195 non-null    int64
 12  1988       195 non-null    int64
 13  1989       195 non-null    int64
 14  1990       195 non-null    int64
 15  1991       195 non-null    int64
 16  1992       195 non-null    int64
 17  1993       195 non-null    int64
 18  1994       195 non-null    int64
 19  1995       195 non-null    int64
 20  1996       195 non-null    int64
 21  1997       195 non-null    int64
 22  1998       195 non-null    int64
 23  1999       195 non-null    int64
 24  2000       195 non-null    int64
 25  2001       195 non-null    int64
 26  2002       195 non-null    int64
 27  2003       195 non-null    int64
 28  2004       195 non-null    int64
 29  2005       195 non-null    int64
 30  2006       195 non-null    int64
 31  2007       195 non-null    int64
 32  2008       195 non-null    int64
 33  2009       195 non-null    int64
 34  2010       195 non-null    int64
 35  2011       195 non-null    int64
 36  2012       195 non-null    int64
 37  2013       195 non-null    int64
 38  Total      195 non-null    int64
dtypes: int64(35), object(4)
memory usage: 59.5+ KB

To get the list of column headers we can call upon the data frame's columns instance variable.

can.columns
Index(['Country', 'Continent', 'Region', 'DevName', '1980', '1981', '1982', '1983', '1984', '1985', '1986', '1987', '1988', '1989', '1990', '1991', '1992', '1993', '1994', '1995', '1996', '1997', '1998', '1999', '2000', '2001', '2002', '2003', '2004', '2005', '2006', '2007', '2008', '2009', '2010', '2011', '2012', '2013', 'Total'], dtype='object')

Similarly, to get the list of indices we use the .index instance variable.

can.index
RangeIndex(start=0, stop=195, step=1)

To get the index and columns as lists, we can use the tolist() method.

can.columns.tolist()
['Country', 'Continent', 'Region', 'DevName', '1980', '1981', '1982', '1983', '1984', '1985', '1986', '1987', '1988', '1989', '1990', '1991', '1992', '1993', '1994', '1995', '1996', '1997', '1998', '1999', '2000', '2001', '2002', '2003', '2004', '2005', '2006', '2007', '2008', '2009', '2010', '2011', '2012', '2013', 'Total']
can.index.tolist()
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100, 101, 102, 103, 104, 105, 106, 107, 108, 109, 110, 111, 112, 113, 114, 115, 116, 117, 118, 119, 120, 121, 122, 123, 124, 125, 126, 127, 128, 129, 130, 131, 132, 133, 134, 135, 136, 137, 138, 139, 140, 141, 142, 143, 144, 145, 146, 147, 148, 149, 150, 151, 152, 153, 154, 155, 156, 157, 158, 159, 160, 161, 162, 163, 164, 165, 166, 167, 168, 169, 170, 171, 172, 173, 174, 175, 176, 177, 178, 179, 180, 181, 182, 183, 184, 185, 186, 187, 188, 189, 190, 191, 192, 193, 194]
print(type(can.columns.tolist()))
print(type(can.index.tolist()))
<class 'list'>
<class 'list'>

Statistical Analysis

can.isnull().sum()
Country      0
Continent    0
Region       0
DevName      0
1980         0
1981         0
1982         0
1983         0
1984         0
1985         0
1986         0
1987         0
1988         0
1989         0
1990         0
1991         0
1992         0
1993         0
1994         0
1995         0
1996         0
1997         0
1998         0
1999         0
2000         0
2001         0
2002         0
2003         0
2004         0
2005         0
2006         0
2007         0
2008         0
2009         0
2010         0
2011         0
2012         0
2013         0
Total        0
dtype: int64

Finally, let's view a quick summary of each column in our dataframe using the describe() method.

can.describe()
can.describe(include="object")

Insights

  • Total 195 Countries

  • 22 different Regions, and the highest immigration to Canada is from Western Asia

  • 78% of immigration to Canada is from Developing Regions


pandas Intermediate: Indexing and Selection (slicing)

Select Column

There are two ways to filter on a column name:

Method 1: Quick and easy, but only works if the column name does NOT have spaces or special characters.

df.column_name # returns series

Method 2: More robust, and can filter on multiple columns.

df['column'] # returns series
df[['column 1', 'column 2']] # returns dataframe
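
The two access styles can be compared on a tiny stand-in frame. This is only a sketch: the frame `df` below is built by hand (using the 1980 and 1981 figures for India and China that appear later in this notebook) and is not the full dataset.

```python
import pandas as pd

# Hand-built stand-in for the immigration table (illustrative only).
df = pd.DataFrame({
    "Country": ["India", "China"],
    "1980": [8880, 5123],
    "1981": [8670, 6682],
})

by_attr = df.Country                 # Method 1: attribute access, returns a Series
by_key = df["Country"]               # Method 2: bracket access, returns a Series
as_frame = df[["Country"]]           # list of names, returns a one-column DataFrame
two_cols = df[["Country", "1980"]]   # several columns at once, returns a DataFrame

print(type(by_key).__name__, type(as_frame).__name__)

# A name such as 1980 is not a valid Python identifier, so df.1980 is a
# SyntaxError; year columns must be selected with brackets.
print(df["1980"].tolist())
```

Bracket access is the safer default here because every year column has a name that attribute access cannot express.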

Example: Let's try filtering on the list of countries ('Country').

can.Country # returns a series
0         Afghanistan
1             Albania
2             Algeria
3      American Samoa
4             Andorra
            ...
190          Viet Nam
191    Western Sahara
192             Yemen
193            Zambia
194          Zimbabwe
Name: Country, Length: 195, dtype: object
a = can[["Country"]]  # double brackets return a dataframe
a

Let's try filtering on the list of countries ('Country') and the data for years: 1980 - 1985.

can[['Country', '1980', '1981', '1982', '1983', '1984', '1985']]  # returns a dataframe
# notice that 'Country' is a string, and the years are strings too.

Select Row

There are two main ways to select rows:

df.loc[label]    # filters by the labels of the index/column
df.iloc[index]   # filters by the positions of the index/column

Before we proceed, notice that the default index of the dataset is a numeric range from 0 to 194. This makes it very difficult to do a query by a specific country. For example to search for data on Japan, we need to know the corresponding index value.

This can be fixed very easily by setting the 'Country' column as the index using set_index() method.
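
As a rough sketch of what changes once a column becomes the index, the snippet below uses a hand-built three-row frame (values taken from outputs shown elsewhere in this notebook) rather than the real `can` dataframe. The names `df`, `indexed` and `restored` are illustrative.

```python
import pandas as pd

# Hand-built stand-in frame with a default 0..n-1 index.
df = pd.DataFrame({
    "Country": ["Haiti", "Iceland", "Japan"],
    "1980": [1666, 17, 701],
    "1981": [3692, 33, 756],
})

# Without inplace=True, set_index returns a new frame and leaves df untouched.
indexed = df.set_index("Country")

japan_by_label = indexed.loc["Japan"]    # label-based lookup
japan_by_position = indexed.iloc[2]      # position-based lookup (third row)

# reset_index moves 'Country' back into an ordinary column.
restored = indexed.reset_index()

print(japan_by_label)
print(list(restored.columns))
```

With the country as the index, a lookup no longer depends on knowing the row's position.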

can.iloc[3:6]

Setting Country as index column

a1 = [1, 2, 3]
a1[-1]
a2 = pd.Series(a1)
a2
0    1
1    2
2    3
dtype: int64
can.head()
can.set_index('Country', inplace=True)
# tip: the opposite of set is reset, so to undo this we can use can.reset_index()
can.tail(3)
# optional: to remove the name of the index, use can.index.name = None

Example: Let's view the number of immigrants from Japan (row 87) for the following scenarios:

  • 1. The full row data (all columns)

  • 2. For year 2013

  • 3. For years 1980 to 1985

# 1. the full row data (all columns)
can.loc['Japan']
Continent                 Asia
Region            Eastern Asia
DevName      Developed regions
1980                       701
1981                       756
1982                       598
1983                       309
1984                       246
1985                       198
1986                       248
1987                       422
1988                       324
1989                       494
1990                       379
1991                       506
1992                       605
1993                       907
1994                       956
1995                       826
1996                       994
1997                       924
1998                       897
1999                      1083
2000                      1010
2001                      1092
2002                       806
2003                       817
2004                       973
2005                      1067
2006                      1212
2007                      1250
2008                      1284
2009                      1194
2010                      1168
2011                      1265
2012                      1214
2013                       982
Total                    27707
Name: Japan, dtype: object
# alternate methods
can.iloc[87]
can[can.index == 'Japan']

Quick Practice

  • How many immigrants came from India in 2000?

  • How many immigrants from China in 1998?

  • Number of immigrants from India between 1988-2000?

# 2. for year 2013
can.loc['Japan', '2013']
982
# alternate method
# year 2013 is the second-to-last column (just before 'Total'), with a positional index of 36
can.iloc[87, 36]
# 3. for years 1980 to 1985
can.loc['Japan', ['1980', '1981', '1982', '1983', '1984', '1985']]
1980    701
1981    756
1982    598
1983    309
1984    246
1985    198
Name: Japan, dtype: object
# alternative method
can.iloc[87, [3, 4, 5, 6, 7, 8]]
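
Hard-coded positions such as 87 and 36 are easy to get wrong. One possible way to avoid them is to look the positions up with `Index.get_loc`. The sketch below uses a hand-built two-row frame, so its positions differ from those in `can`; the `Total` values here are just the sums of the two years shown.

```python
import pandas as pd

# Hand-built stand-in; 'Total' is the sum of the two year columns only.
df = pd.DataFrame(
    {
        "Continent": ["Asia", "Europe"],
        "1980": [701, 17],
        "1981": [756, 33],
        "Total": [1457, 50],
    },
    index=["Japan", "Iceland"],
)

row_pos = df.index.get_loc("Japan")     # position of the row label
col_pos = df.columns.get_loc("1981")    # position of the column label

value_by_position = df.iloc[row_pos, col_pos]
value_by_label = df.loc["Japan", "1981"]

print(row_pos, col_pos, value_by_position)
```

Both lookups return the same cell; the label-based form stays valid if rows or columns are later reordered.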

Column names that are integers (such as the years) might introduce some confusion. For example, when we reference the year 2013, one might confuse it with the 2013th positional index.

To avoid this ambiguity, let's convert the column names into strings: '1980' to '2013'. In this copy of the dataset the year columns are already strings (see the can.columns output above), so the conversion is a harmless safeguard.

can.columns = list(map(str, can.columns))
# [print(type(x)) for x in can.columns.values]  # <-- uncomment to check the type of the column headers

Since we converted the years to string, let's declare a variable that will allow us to easily call upon the full range of years:

# useful for plotting later on
years = list(map(str, range(1980, 2014)))
years
['1980', '1981', '1982', '1983', '1984', '1985', '1986', '1987', '1988', '1989', '1990', '1991', '1992', '1993', '1994', '1995', '1996', '1997', '1998', '1999', '2000', '2001', '2002', '2003', '2004', '2005', '2006', '2007', '2008', '2009', '2010', '2011', '2012', '2013']

Filtering based on a criterion

To filter the dataframe based on a condition, we simply pass the condition as a boolean vector.

For example, let's filter the dataframe to show the data on Asian countries (Continent = Asia).

# 1. create the condition boolean series
Asian_Info = can['Continent'] == 'Asia'
print(Asian_Info)
Country
Afghanistan        True
Albania           False
Algeria           False
American Samoa    False
Andorra           False
                  ...
Viet Nam           True
Western Sahara    False
Yemen              True
Zambia            False
Zimbabwe          False
Name: Continent, Length: 195, dtype: bool
# 2. pass this condition into the dataframe
can[Asian_Info]

QP

Create a condition to filter Africa data

# we can pass multiple criteria in the same line
# let's filter for Continent = Asia and Region = Southern Asia
# & is AND: True only when both conditions are True
# | is OR: True when at least one condition is True (1,0 -> 1; 0,1 -> 1; 1,1 -> 1)
can[(can['Continent'] == 'Asia') & (can['Region'] == 'Southern Asia')]
# note: when combining conditions, pandas requires '&' and '|' instead of 'and' and 'or'
# don't forget to enclose each condition in parentheses
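
The element-wise operators can be tried on a small hand-built frame. This is a sketch under stated assumptions: the four rows and their Continent/Region labels are illustrative and only mimic the layout of `can`.

```python
import pandas as pd

# Hand-built stand-in with the same two categorical columns as 'can'.
df = pd.DataFrame(
    {
        "Continent": ["Asia", "Asia", "Europe", "Latin America and the Caribbean"],
        "Region": ["Southern Asia", "Eastern Asia", "Northern Europe", "Caribbean"],
    },
    index=["India", "Japan", "Iceland", "Haiti"],
)

is_asia = df["Continent"] == "Asia"
is_southern = df["Region"] == "Southern Asia"

both = df[is_asia & is_southern]                        # AND: both masks True
either = df[is_asia | (df["Continent"] == "Europe")]    # OR: at least one True
not_asia = df[~is_asia]                                 # NOT: invert the mask

print(both.index.tolist())
print(either.index.tolist())
print(not_asia.index.tolist())
```

Storing each mask in a named variable first keeps the combined filter readable and avoids forgetting the parentheses.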

Before we proceed: let's review the changes we have made to our dataframe.

print('data dimensions:', can.shape)
print(can.columns)
can.head(2)
data dimensions: (195, 38)
Index(['Continent', 'Region', 'DevName', '1980', '1981', '1982', '1983', '1984', '1985', '1986', '1987', '1988', '1989', '1990', '1991', '1992', '1993', '1994', '1995', '1996', '1997', '1998', '1999', '2000', '2001', '2002', '2003', '2004', '2005', '2006', '2007', '2008', '2009', '2010', '2011', '2012', '2013', 'Total'], dtype='object')

Visualizing Data using Matplotlib

Matplotlib: Standard Python Visualization Library

The primary plotting library we will explore in the course is Matplotlib. As mentioned on their website:

Matplotlib is a Python 2D plotting library which produces publication quality figures in a variety of hardcopy formats and interactive environments across platforms. Matplotlib can be used in Python scripts, the Python and IPython shell, the jupyter notebook, web application servers, and four graphical user interface toolkits.

If you are aspiring to create impactful visualization with python, Matplotlib is an essential tool to have at your disposal.

Matplotlib.Pyplot

One of the core aspects of Matplotlib is matplotlib.pyplot

Let's start by importing matplotlib and matplotlib.pyplot as follows:

# we are using the inline backend
%matplotlib inline

import matplotlib as mpl
import matplotlib.pyplot as plt

*optional: check if Matplotlib is loaded.

print('Matplotlib version: ', mpl.__version__) # >= 2.0.0
Matplotlib version: 3.5.1
print(plt.style.available)
mpl.style.use(['ggplot'])  # optional: for ggplot-like style
['Solarize_Light2', '_classic_test_patch', '_mpl-gallery', '_mpl-gallery-nogrid', 'bmh', 'classic', 'dark_background', 'fast', 'fivethirtyeight', 'ggplot', 'grayscale', 'seaborn', 'seaborn-bright', 'seaborn-colorblind', 'seaborn-dark', 'seaborn-dark-palette', 'seaborn-darkgrid', 'seaborn-deep', 'seaborn-muted', 'seaborn-notebook', 'seaborn-paper', 'seaborn-pastel', 'seaborn-poster', 'seaborn-talk', 'seaborn-ticks', 'seaborn-white', 'seaborn-whitegrid', 'tableau-colorblind10']

Plotting in pandas

Fortunately, pandas has built-in plotting methods that wrap Matplotlib. Plotting in pandas is as simple as appending a .plot() method to a series or dataframe.

Documentation:

Line Plots (Series/Dataframe)

What is a line plot and why use it?

A line chart or line plot is a type of plot which displays information as a series of data points called 'markers' connected by straight line segments. It is a basic type of chart common in many fields. Use line plot when you have a continuous data set. These are best suited for trend-based visualizations of data over a period of time.

Let's start with a case study:

In 2010, Haiti suffered a catastrophic magnitude 7.0 earthquake. The quake caused widespread devastation and loss of life, and about three million people were affected by this natural disaster. As part of Canada's humanitarian effort, the Government of Canada stepped up its effort in accepting refugees from Haiti. We can quickly visualize this effort using a line plot:

Question: Plot a line graph of immigration from Haiti using df.plot().

(Q.P - Plot a line graph of immigration from China using df.plot())

First, we will extract the data series for Haiti.

haiti = can.loc['Haiti', years]  # passing in years 1980 - 2013 to exclude the 'Total' column
haiti.head()
1980    1666
1981    3692
1982    3498
1983    2860
1984    1418
Name: Haiti, dtype: object

Next, we will plot a line plot by appending .plot() to the haiti series.

haiti.plot()
<AxesSubplot:>
Image in a Jupyter notebook

pandas automatically populated the x-axis with the index values (years), and the y-axis with the column values (population). However, notice how the years were not displayed because they are of type string. Therefore, let's change the type of the index values to integer for plotting.

Also, let's label the x and y axis using plt.title(), plt.ylabel(), and plt.xlabel() as follows:

haiti.index = haiti.index.map(int)  # let's change the index values of Haiti to type integer for plotting
haiti.plot(kind='line')

plt.title('Immigration from Haiti')
plt.ylabel('Number of immigrants')
plt.xlabel('Years')

plt.show()  # need this line to show the updates made to the figure
Image in a Jupyter notebook

We can clearly see how the number of immigrants from Haiti spiked from 2010 onwards as Canada stepped up its efforts to accept refugees from Haiti. Let's annotate this spike in the plot by using the plt.text() method.

haiti.plot(kind='line', figsize=(16, 8))

plt.title('Immigration from Haiti')
plt.ylabel('Number of Immigrants')
plt.xlabel('Years')

# annotate the 2010 Earthquake
# syntax: plt.text(x, y, label)
plt.text(1985, 3800, "peak of Immigration")
plt.text(2009, 6500, '2010 Earthquake')  # see note below

plt.show()
Image in a Jupyter notebook

With just a few lines of code, you were able to quickly identify and visualize the spike in immigration!

Quick note on x and y values in plt.text(x, y, label):

Since the x-axis (years) is of type integer, we specify x as a year. The y-axis (number of immigrants) is also of type integer, so we can just specify a value such as y = 6000.
plt.text(2000, 6000, '2010 Earthquake') # years stored as type int
If the years were stored as type string, we would need to specify x as the index position of the year. E.g. index position 20 is the year 2000, since it is 20 years after the base year of 1980.
plt.text(20, 6000, '2010 Earthquake') # years stored as type str
We will cover advanced annotation methods in later modules.
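
The difference between the two coordinate conventions can be checked side by side. The sketch below is an illustration, not part of the original notebook: it plots the five Haiti values shown earlier twice, once with an integer index and once with a string index, and labels the same point in each. The `matplotlib.use("Agg")` line is an assumption that lets the snippet run outside a notebook; drop it when using `%matplotlib inline`.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless backend so the sketch runs anywhere
import matplotlib.pyplot as plt
import pandas as pd

values = [1666, 3692, 3498, 2860, 1418]  # Haiti, 1980-1984, from the output above

int_series = pd.Series(values, index=[1980, 1981, 1982, 1983, 1984])
str_series = pd.Series(values, index=["1980", "1981", "1982", "1983", "1984"])

fig, (ax_int, ax_str) = plt.subplots(1, 2, figsize=(12, 4))

# Integer index: the x-axis is in years, so x is given as the year itself.
int_series.plot(kind="line", ax=ax_int)
label_int = ax_int.text(1981, 3692, "1981 value")

# String index: pandas places the points at positions 0, 1, 2, ...,
# so x is given as the position of the year in the index.
str_series.plot(kind="line", ax=ax_str)
label_str = ax_str.text(1, 3692, "1981 value")

print(ax_int.get_lines()[0].get_xdata())
print(ax_str.get_lines()[0].get_xdata())
```

Printing the x data of each line shows the year values in the first case and plain positions in the second, which is why the same label needs different x coordinates.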

We can easily add more countries to the line plot to make meaningful comparisons of immigration from different countries.

Question: Let's compare the number of immigrants from India and China from 1980 to 2013.

Q.P: Compare immigrants from Sri Lanka and Bhutan from 1980 to 2013.

Step 1: Get the data set for China and India, and display the dataframe.

df_CI = can.loc[['India', 'China'], years]
df_CI

Step 2: Plot graph. We will explicitly specify line plot by passing in kind parameter to plot().

df_CI.plot(kind='line')
<AxesSubplot:xlabel='Country'>
Image in a Jupyter notebook

That doesn't look right...

Recall that pandas plots the indices on the x-axis and the columns as individual lines on the y-axis. Since df_CI is a dataframe with the country as the index and years as the columns, we must first transpose the dataframe using transpose() method to swap the row and columns.
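
To see the effect of the transpose without the full dataset, here is a minimal sketch on a hand-built frame. It uses the 1980 to 1982 figures for India and China from the top-5 output later in this notebook; the names `df_ci` and `df_ci_t` are illustrative.

```python
import pandas as pd

years = ["1980", "1981", "1982"]

# Countries as rows, years as columns: the layout of the main table.
df_ci = pd.DataFrame(
    [[8880, 8670, 8147], [5123, 6682, 3308]],
    index=["India", "China"],
    columns=years,
)

# After transposing, years become the index (x-axis) and each country
# becomes a column (one line per country when plotted).
df_ci_t = df_ci.transpose()
df_ci_t.index = df_ci_t.index.map(int)  # integer years for the x-axis

print(df_ci.shape, df_ci_t.shape)
print(df_ci_t)
```

Calling `df_ci_t.plot(kind='line')` on the transposed frame would then draw one line per country across the years.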

df_CI = df_CI.transpose()
df_CI.head(2)

pandas will automatically graph the two countries on the same graph. Go ahead and plot the new transposed dataframe. Make sure to add a title to the plot and label the axes.

df_CI.index = df_CI.index.map(int)  # let's change the index values of df_CI to type integer for plotting
df_CI.plot(kind='line', figsize=(15, 8))

plt.title('Immigrants from China and India')
plt.ylabel('Number of Immigrants')
plt.xlabel('Years')

plt.show()
Image in a Jupyter notebook


From the above plot, we can observe that China and India have very similar immigration trends through the years.

Note: How come we didn't need to transpose Haiti's dataframe before plotting (like we did for df_CI)?

That's because haiti is a series as opposed to a dataframe, and has the years as its indices as shown below.

print(type(haiti))
print(haiti.head(5))

<class 'pandas.core.series.Series'>
1980 1666
1981 3692
1982 3498
1983 2860
1984 1418
Name: Haiti, dtype: int64

A line plot is a handy tool to display several dependent variables against one independent variable. However, it is recommended to plot no more than 5-10 lines on a single graph; any more than that and it becomes difficult to interpret.

Question: Compare the trend of top 5 countries that contributed the most to immigration to Canada.

Q.P: Compare the trend of bottom 5 countries that contributed least to immigration to Canada.

# The correct answer is:
# Step 1: Get the dataset. Recall that the Total column holds cumulative immigration by country.
# We will sort on this column to get our top 5 countries using the pandas sort_values() method.
# The inplace=True parameter saves the changes to the original can dataframe.
can.sort_values(by='Total', ascending=False, axis=0, inplace=True)

# get the top 5 entries
df_top5 = can.head(5)

# transpose the dataframe
df_top5 = df_top5[years].transpose()
print(df_top5)
Country  India  China  United Kingdom of Great Britain and Northern Ireland  \
1980      8880   5123                                              22045
1981      8670   6682                                              24796
1982      8147   3308                                              20620
1983      7338   1863                                              10015
1984      5704   1527                                              10170
1985      4211   1816                                               9564
1986      7150   1960                                               9470
1987     10189   2643                                              21337
1988     11522   2758                                              27359
1989     10343   4323                                              23795
1990     12041   8076                                              31668
1991     13734  14255                                              23380
1992     13673  10846                                              34123
1993     21496   9817                                              33720
1994     18620  13128                                              39231
1995     18489  14398                                              30145
1996     23859  19415                                              29322
1997     22268  20475                                              22965
1998     17241  21049                                              10367
1999     18974  30069                                               7045
2000     28572  35529                                               8840
2001     31223  36434                                              11728
2002     31889  31961                                               8046
2003     27155  36439                                               6797
2004     28235  36619                                               7533
2005     36210  42584                                               7258
2006     33848  33518                                               7140
2007     28742  27642                                               8216
2008     28261  30037                                               8979
2009     29456  29622                                               8876
2010     34235  30391                                               8724
2011     27509  28502                                               6204
2012     30933  33024                                               6195
2013     33087  34129                                               5827

Country  Philippines  Pakistan
1980            6051       978
1981            5921       972
1982            5249      1201
1983            4562       900
1984            3801       668
1985            3150       514
1986            4166       691
1987            7360      1072
1988            8639      1334
1989           11865      2261
1990           12509      2470
1991           12718      3079
1992           13670      4071
1993           20479      4777
1994           19532      4666
1995           15864      4994
1996           13692      9125
1997           11549     13073
1998            8735      9068
1999            9734      9979
2000           10763     15400
2001           13836     16708
2002           11707     15110
2003           12758     13205
2004           14004     13399
2005           18139     14314
2006           18400     13127
2007           19837     10124
2008           24887      8994
2009           28573      7217
2010           38617      6811
2011           36765      7468
2012           34315     11227
2013           29544     12603
# Step 2: Plot the dataframe. To make the plot more readable, we will change the size using the `figsize` parameter.
df_top5.index = df_top5.index.map(int)  # let's change the index values of df_top5 to type integer for plotting
df_top5.plot(kind='line', figsize=(15, 8))  # pass a tuple (x, y) size

plt.title('Immigration Trend of Top 5 Countries')
plt.ylabel('Number of Immigrants')
plt.xlabel('Years')

plt.show()
Image in a Jupyter notebook

Area Plots

Area plots are stacked by default. And to produce a stacked area plot, each column must be either all positive or all negative values (any NaN, i.e. not a number, values will default to 0). To produce an unstacked plot, set parameter stacked to value False.

can.sort_values(['Total'], ascending=False, axis=0, inplace=True)

# get the top 3 entries
df_top3 = can.head(3)

# transpose the dataframe
df_top3 = df_top3[years].transpose()
df_top3.head()
# let's change the index values of df_top3 to type integer for plotting
df_top3.index = df_top3.index.map(int)
df_top3.plot(kind='area', stacked=False, figsize=(20, 10))  # pass a tuple (x, y) size

plt.title('Immigration Trend of Top 3 Countries')
plt.ylabel('Number of Immigrants')
plt.xlabel('Years')

plt.show()
Image in a Jupyter notebook

The unstacked plot has a default transparency (alpha value) at 0.5. We can modify this value by passing in the alpha parameter.

Two types of plotting

  • Option 1: Scripting layer (procedural method) - using matplotlib.pyplot as 'plt'

You can use plt i.e. matplotlib.pyplot and add more elements by calling different methods procedurally; for example, plt.title(...) to add title or plt.xlabel(...) to add label to the x-axis.

# Option 1: This is what we have been using so far
df_top5.plot(kind='area', alpha=0.35, figsize=(20, 10))
plt.title('Immigration trend of top 5 countries')
plt.ylabel('Number of immigrants')
plt.xlabel('Years')

  • Option 2: Artist layer (Object oriented method) - using an Axes instance from Matplotlib (preferred)

You can use an Axes instance of your current plot and store it in a variable (eg. ax). You can add more elements by calling methods with a little change in syntax (by adding "set_" to the previous methods). For example, use ax.set_title() instead of plt.title() to add title, or ax.set_xlabel() instead of plt.xlabel() to add label to the x-axis.

This option sometimes is more transparent and flexible to use for advanced plots (in particular when having multiple plots, as you will see later).

In this course, we will stick to the scripting layer, except for some advanced visualizations where we will need to use the artist layer to manipulate advanced aspects of the plots.

# option 2: preferred option with more flexibility
ax = df_top3.plot(kind='area', alpha=0.9, figsize=(20, 10))

ax.set_title('Immigration Trend of Top 3 Countries')
ax.set_ylabel('Number of Immigrants')
ax.set_xlabel('Years')
Text(0.5, 0, 'Years')
Image in a Jupyter notebook
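
The artist layer becomes most useful when a figure holds more than one plot. The following is a minimal sketch, assuming a hand-built frame (three years of the India and China figures shown earlier) and the hypothetical names `ax_left` and `ax_right`. It draws a stacked and an unstacked area plot side by side and labels each through its own Axes. The `matplotlib.use("Agg")` line is only there so the snippet runs outside a notebook.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless backend; not needed with %matplotlib inline
import matplotlib.pyplot as plt
import pandas as pd

# Hand-built stand-in: years as the index, one column per country.
df = pd.DataFrame(
    {"India": [8880, 8670, 8147], "China": [5123, 6682, 3308]},
    index=[1980, 1981, 1982],
)

# One figure, two Axes; each Axes is configured independently.
fig, (ax_left, ax_right) = plt.subplots(1, 2, figsize=(12, 4))

df.plot(kind="area", stacked=True, ax=ax_left)
ax_left.set_title("Stacked")

df.plot(kind="area", stacked=False, alpha=0.5, ax=ax_right)
ax_right.set_title("Unstacked")

for ax in (ax_left, ax_right):
    ax.set_xlabel("Years")
    ax.set_ylabel("Number of Immigrants")

fig.tight_layout()
```

With `plt.title(...)` only the most recently active Axes would be labelled, whereas `ax.set_title(...)` makes the target explicit.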

Question: Use the scripting layer to create a stacked area plot of the 5 countries that contributed the least to immigration to Canada from 1980 to 2013. Use a transparency value of 0.45.

# The correct answer is:
# get the 5 countries with the least contribution
df_least5 = can.tail(5)

# transpose the dataframe
df_least5 = df_least5[years].transpose()
df_least5.head()

df_least5.index = df_least5.index.map(int)  # let's change the index values of df_least5 to type integer for plotting
df_least5.plot(kind='area', alpha=0.45, figsize=(20, 10))

plt.title('Immigration Trend of 5 Countries with Least Contribution to Immigration')
plt.ylabel('Number of Immigrants')
plt.xlabel('Years')

plt.show()

Quick Practice

Question: Use the artist layer to create an unstacked area plot of the 5 countries that contributed the least to immigration to Canada from 1980 to 2013. Use a transparency value of 0.55.

# get the 5 countries with the least contribution
df_least5 = can.tail(5)

# transpose the dataframe
df_least5 = df_least5[years].transpose()
df_least5.head()

df_least5.index = df_least5.index.map(int)  # let's change the index values of df_least5 to type integer for plotting

ax = df_least5.plot(kind='area', alpha=0.55, stacked=False, figsize=(20, 10))

ax.set_title('Immigration Trend of 5 Countries with Least Contribution to Immigration')
ax.set_ylabel('Number of Immigrants')
ax.set_xlabel('Years')

Histograms

A histogram is a way of representing the frequency distribution of numeric dataset. The way it works is it partitions the x-axis into bins, assigns each data point in our dataset to a bin, and then counts the number of data points that have been assigned to each bin. So the y-axis is the frequency or the number of data points in each bin. Note that we can change the bin size and usually one needs to tweak it so that the distribution is displayed nicely.
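
The binning procedure described above can be traced by hand on a small array. This sketch uses made-up numbers (not the immigration data) so that every count is easy to verify.

```python
import numpy as np

data = np.array([1, 2, 2, 3, 5, 8, 9, 10])  # made-up values for illustration

# Three equal-width bins between min (1) and max (10): edges 1, 4, 7, 10.
counts, bin_edges = np.histogram(data, bins=3)
print(counts)      # [4 1 3]
print(bin_edges)   # [ 1.  4.  7. 10.]

# Bin edges can also be given explicitly; every bin is half-open [a, b)
# except the last one, which also includes its right edge.
custom_counts, custom_edges = np.histogram(data, bins=[0, 5, 10])
print(custom_counts)  # [4 4]
```

The counts always add up to the number of data points that fall inside the overall range, which is a quick sanity check when tuning the number of bins.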

Question: What is the frequency distribution of the number (population) of new immigrants from the various countries to Canada in 2013?

Before we proceed with creating the histogram plot, let's first examine the data split into intervals. To do this, we will use NumPy's histogram method to get the bin ranges and frequency counts as follows:

Data range: 10-25
  • 10-15: Child (100)
  • 15-18: Teenagers (50)
  • 18-20: Young (40)
# let's quickly view the 2013 data
can['2013'].head(10)
Country
India                                                   33087
China                                                   34129
United Kingdom of Great Britain and Northern Ireland     5827
Philippines                                             29544
Pakistan                                                12603
United States of America                                 8501
Iran (Islamic Republic of)                              11291
Sri Lanka                                                2394
Republic of Korea                                        4509
Poland                                                    852
Name: 2013, dtype: int64
# np.histogram returns 2 values: the frequency counts and the bin edges
import numpy as np

count, bin_edges = np.histogram(can['2013'])
print(count)  # frequency count
print(bin_edges)  # bin ranges, default = 10 bins
[178  11   1   2   0   0   0   0   1   2]
[    0.   3412.9  6825.8 10238.7 13651.6 17064.5 20477.4 23890.3 27303.2 30716.1 34129. ]

By default, the histogram method breaks up the dataset into 10 bins. The output above summarizes the bin ranges and the frequency distribution of immigration in 2013. We can see that in 2013:

  • 178 countries contributed between 0 and 3412.9 immigrants

  • 11 countries contributed between 3412.9 and 6825.8 immigrants

  • 1 country contributed between 6825.8 and 10238.7 immigrants, and so on.

We can easily graph this distribution by passing kind='hist' to plot().

can['2013'].plot(kind='hist', figsize=(15, 6))

# add a title to the histogram
plt.title('Histogram of Immigration from 195 Countries in 2013')
# add y-label
plt.ylabel('Number of Countries')
# add x-label
plt.xlabel('Number of Immigrants')

plt.show()
Image in a Jupyter notebook
In the above plot, the x-axis represents the population range of immigrants in intervals of 3412.9. The y-axis represents the number of countries that contributed to the aforementioned population.

Notice that the x-axis labels do not match the bin edges. This can be fixed by passing in an xticks keyword that contains the list of bin edges, as follows:

# 'bin_edges' is a list of bin intervals
count, bin_edges = np.histogram(can['2013'])

can['2013'].plot(kind='hist', figsize=(15, 5), xticks=bin_edges)

plt.title('Histogram of Immigration from 195 countries in 2013')  # add a title to the histogram
plt.ylabel('Number of Countries')  # add y-label
plt.xlabel('Number of Immigrants')  # add x-label

plt.show()
Image in a Jupyter notebook

We can also plot multiple histograms on the same plot. For example, let's try to answer the following questions using a histogram.

Question: What is the immigration distribution for Denmark, Norway, and Sweden for years 1980 - 2013?

# let's quickly view the dataset
can.loc[['Denmark', 'Norway', 'Sweden'], years]

# generate histogram
can.loc[['Denmark', 'Norway', 'Sweden'], years].plot.hist()
<AxesSubplot:ylabel='Frequency'>
Image in a Jupyter notebook

Instead of plotting the frequency distribution of immigration for the 3 countries, pandas plotted the frequency distribution for each of the years.

This can be easily fixed by first transposing the dataset, and then plotting as shown below.

# transpose dataframe
df_t = can.loc[['Denmark', 'Norway', 'Sweden'], years].transpose()
df_t.head()
# generate histogram
df_t.plot(kind='hist', figsize=(15, 6))

plt.title('Histogram of Immigration from Denmark, Norway, and Sweden between 1980 - 2013')
plt.ylabel('Number of Years')
plt.xlabel('Number of Immigrants')

plt.show()
Image in a Jupyter notebook

Let's make a few modifications to improve the impact and aesthetics of the previous plot:

  • increase the number of bins to 15 by passing in the bins parameter;

  • set the opacity by passing in the alpha parameter (the code below uses alpha=0.8, i.e. 20% transparency);

  • label the x-axis with plt.xlabel();

  • change the colors of the plots by passing in the color parameter.

# let's get the x-tick values
count, bin_edges = np.histogram(df_t, 15)

# un-stacked histogram
df_t.plot(kind='hist',
          figsize=(12, 6),
          bins=15,
          alpha=0.8,
          xticks=bin_edges,
          color=['yellow', 'green', 'pink']
          )

plt.title('Histogram of Immigration from Denmark, Norway, and Sweden from 1980 - 2013')
plt.ylabel('Number of Years')
plt.xlabel('Number of Immigrants')

plt.show()
Image in a Jupyter notebook

If we do not want the plots to overlap each other, we can stack them using the stacked parameter. Let's also adjust the min and max x-axis labels to remove the extra gap on the edges of the plot. We can pass a tuple (min, max) using the xlim parameter, as shown below.

count, bin_edges = np.histogram(df_t, 15)
xmin = bin_edges[0] - 10   # first bin value is 31.0, adding buffer of 10 for aesthetic purposes
xmax = bin_edges[-1] + 10  # last bin value is 308.0, adding buffer of 10 for aesthetic purposes

# stacked histogram
df_t.plot(kind='hist',
          figsize=(15, 10),
          bins=15,
          xticks=bin_edges,
          color=['coral', 'darkred', 'mediumseagreen'],
          stacked=True,
          xlim=(xmin, xmax)
          )

plt.title('Histogram of Immigration from Denmark, Norway, and Sweden from 1980 - 2013')
plt.ylabel('Number of Years')
plt.xlabel('Number of Immigrants')

plt.show()
Image in a Jupyter notebook

Question: Write code to display the immigration distribution for Greece, Albania, and Bulgaria for the years 1980 - 2013. Use an overlapping plot with 15 bins and a transparency value of 0.35.

# create a dataframe of the countries of interest (cof)
df_cof = can.loc[['Greece', 'Albania', 'Bulgaria'], years]

# transpose the dataframe
df_cof = df_cof.transpose()

# let's get the x-tick values
count, bin_edges = np.histogram(df_cof, 15)

# un-stacked histogram
df_cof.plot(kind='hist',
            figsize=(10, 6),
            bins=15,
            alpha=0.35,
            xticks=bin_edges,
            color=['coral', 'darkslateblue', 'mediumseagreen']
            )

plt.title('Histogram of Immigration from Greece, Albania, and Bulgaria from 1980 - 2013')
plt.ylabel('Number of Years')
plt.xlabel('Number of Immigrants')

plt.show()

Bar Charts

A bar plot is a way of representing data where the length of the bars represents the magnitude/size of the feature/variable. Bar graphs usually represent numerical and categorical variables grouped in intervals.

To create a bar plot, we can pass one of two arguments via the kind parameter in plot():

  • kind='bar' creates a vertical bar plot

  • kind='barh' creates a horizontal bar plot

Compare the number of Icelandic immigrants (country = 'Iceland') to Canada from year 1980 to 2013.

# step 1: get the data
df_iceland = can.loc['Iceland', years]
df_iceland.head()
1980    17
1981    33
1982    10
1983     9
1984    13
Name: Iceland, dtype: object
# step 2: plot data
df_iceland.plot(kind='barh', figsize=(15, 7), color="darkgreen", alpha=0.90)

plt.xlabel('Number of immigrants')  # in a horizontal bar plot the x-axis shows the values
plt.ylabel('Year')  # and the y-axis shows the categories (years)
plt.title('Icelandic immigrants to Canada from 1980 to 2013')  # add title to the plot

plt.show()
Image in a Jupyter notebook

The bar plot above shows the total number of immigrants broken down by each year. We can clearly see the impact of the financial crisis; the number of immigrants to Canada started increasing rapidly after 2008.

Let's annotate this on the plot using the annotate method of the scripting layer or the pyplot interface. We will pass in the following parameters:

  • text: str, the text of the annotation (this parameter was named s in Matplotlib versions before 3.3).

  • xy: Tuple specifying the (x,y) point to annotate (in this case, end point of arrow).

  • xytext: Tuple specifying the (x,y) point to place the text (in this case, start point of arrow).

  • xycoords: The coordinate system that xy is given in - 'data' uses the coordinate system of the object being annotated (default).

  • arrowprops: Takes a dictionary of properties to draw the arrow:

      • arrowstyle: Specifies the arrow style, '->' is standard arrow.

      • connectionstyle: Specifies the connection type. arc3 is a straight line.

      • color: Specifies color of arrow.

      • lw: Specifies the line width.

Read the Matplotlib documentation for more details on annotations: http://matplotlib.org/api/pyplot_api.html#matplotlib.pyplot.annotate.
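
The parameters listed above can be seen together in a short, self-contained sketch. It uses only the five Iceland values shown earlier (1980 to 1984) and an illustrative label; the coordinates are bar positions (0, 1, 2, ...) on the x-axis and immigrant counts on the y-axis. The `matplotlib.use("Agg")` line is an assumption so the snippet runs outside a notebook.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless backend; not needed with %matplotlib inline
import matplotlib.pyplot as plt
import pandas as pd

# Iceland, 1980-1984, from the df_iceland.head() output above.
iceland = pd.Series([17, 33, 10, 9, 13],
                    index=["1980", "1981", "1982", "1983", "1984"])

ax = iceland.plot(kind="bar", rot=90)

note = ax.annotate(
    "Highest of these five years",   # annotation text (illustrative)
    xy=(1, 33),                      # arrow head: bar position 1 (1981), value 33
    xytext=(3, 28),                  # text and arrow base: bar position 3 (1983)
    xycoords="data",                 # xy is given in data coordinates
    arrowprops=dict(arrowstyle="->", connectionstyle="arc3",
                    color="darkred", lw=2),
)

ax.set_xlabel("Year")
ax.set_ylabel("Number of Immigrants")
```

Because the bars of a pandas bar plot sit at integer positions, the x coordinates passed to `annotate` are positions in the index rather than year values.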

df_iceland.plot(kind='bar', figsize=(15, 8), rot=90, color="pink")  # rotate the xticks (labelled points on x-axis) by 90 degrees

plt.xlabel('Year')
plt.ylabel('Number of Immigrants')
plt.title('Icelandic Immigrants to Canada from 1980 to 2013')

# Annotate arrow
plt.annotate('Increase Rate of Immigration',  # annotation text; pass '' for no text
             xy=(33, 72),  # place head of the arrow at bar position 33 (year 2013), pop 72
             xytext=(27, 20),  # place base of the arrow at bar position 27 (year 2007), pop 20
             xycoords='data',  # will use the coordinate system of the object being annotated
             arrowprops=dict(arrowstyle='->', connectionstyle='arc3', color='darkred')
             )

plt.show()
Image in a Jupyter notebook

Horizontal Bar Plot

Sometimes it is more practical to represent the data horizontally, especially if you need more room for labelling the bars. In horizontal bar graphs, the y-axis is used for labelling, and the length of bars on the x-axis corresponds to the magnitude of the variable being measured. As you will see, there is more room on the y-axis to label categorical variables.

Question: Using the scripting layer and the can dataset, create a horizontal bar plot showing the total number of immigrants to Canada from the top 15 countries, for the period 1980 - 2013. Label each country with the total immigrant count.

### type your answer here
can.sort_values(by='Total', ascending=True, inplace=True)

# get top 15 countries
df_top15 = can['Total'].tail(15)
df_top15

Step 2: Plot data:

  1. Use kind='barh' to generate a bar chart with horizontal bars.

  2. Make sure to choose a good size for the plot and to label your axes and to give the plot a title.

  3. Loop through the countries and annotate the immigrant population using the annotate function of the scripting interface.

df_top15.plot(kind='barh', figsize=(15, 15), color='green')
plt.xlabel('Number of Immigrants')
plt.title('Top 15 Countries Contributing to the Immigration to Canada between 1980 - 2013')
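
Step 3 asks for a label on each bar, which the cell above does not yet do. One possible way to write that loop is sketched below on a made-up three-country series (the names and totals are placeholders, not values from the dataset). The same loop should carry over to `df_top15`. The `matplotlib.use("Agg")` line is an assumption so the snippet runs outside a notebook.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless backend; not needed with %matplotlib inline
import matplotlib.pyplot as plt
import pandas as pd

# Placeholder totals, sorted ascending so the largest bar ends up on top.
totals = pd.Series(
    {"Country A": 12000, "Country B": 27707, "Country C": 45000}
).sort_values()

ax = totals.plot(kind="barh", figsize=(8, 4), color="green")
ax.set_xlabel("Number of Immigrants")
ax.set_title("Totals by country (placeholder data)")

labels = []
for position, (country, value) in enumerate(totals.items()):
    # Bars sit at y = 0, 1, 2, ...; put the label just past the end of each bar.
    labels.append(
        ax.annotate(
            f"{int(value):,}",           # e.g. 45000 -> '45,000'
            xy=(value, position),        # end of the bar, in data coordinates
            xytext=(5, 0),               # shift 5 points to the right
            textcoords="offset points",
            va="center",
        )
    )
```

Using `textcoords="offset points"` keeps the gap between bar and label constant regardless of the scale of the x-axis.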

Quick Practice:

  • Please compare Immigration for Japan and China

  • Create a viz to show immigration for bottom 5 countries

  • Compare immigration rate for Asia and S. Africa