GitHub Repository: suyashi29/python-su
Path: blob/master/Data Visualization using Python/1.Introduction-to-Matplotlib-and-Line-Plots.jupyterlite.ipynb

Kernel: Python 3 (ipykernel)

Data Visualization

Objectives

Create Data Visualization with Python
Use various Python libraries for visualization

The Dataset: Immigration to Canada from 1980 to 2013

In [1]:

import numpy as np  # useful for scientific computing in Python
import pandas as pd # primary data structure library
can=pd.read_csv("https://raw.githubusercontent.com/suyashi29/python-su/refs/heads/master/Data%20Visualization%20using%20Python/canadian_im.csv")

Let's view the top 2 rows of the dataset using the head() function.

In [2]:

can.head(2)
# tip: You can specify the number of rows you'd like to see, e.g. can.head(10); can.tail(2) shows the last 2 rows instead

Out[2]:

When analyzing a dataset, it's always a good idea to start by getting basic information about your dataframe. We can do this by using the info() method.

This method can be used to get a short summary of the dataframe.

In [3]:

can.shape

Out[3]:

(195, 39)

We have 195 rows and 39 columns

In [4]:

can.info()

Out[4]:

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 195 entries, 0 to 194
Data columns (total 39 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   Country    195 non-null    object 
 1   Continent  195 non-null    object 
 2   Region     195 non-null    object 
 3   DevName    179 non-null    object 
 4   1980       195 non-null    int64  
 5   1981       195 non-null    int64  
 6   1982       195 non-null    int64  
 7   1983       193 non-null    float64
 8   1984       195 non-null    int64  
 9   1985       195 non-null    int64  
 10  1986       195 non-null    int64  
 11  1987       195 non-null    int64  
 12  1988       195 non-null    int64  
 13  1989       195 non-null    int64  
 14  1990       195 non-null    int64  
 15  1991       195 non-null    int64  
 16  1992       195 non-null    int64  
 17  1993       195 non-null    int64  
 18  1994       195 non-null    int64  
 19  1995       195 non-null    int64  
 20  1996       195 non-null    int64  
 21  1997       195 non-null    int64  
 22  1998       195 non-null    int64  
 23  1999       195 non-null    int64  
 24  2000       195 non-null    int64  
 25  2001       195 non-null    int64  
 26  2002       195 non-null    int64  
 27  2003       195 non-null    int64  
 28  2004       195 non-null    int64  
 29  2005       195 non-null    int64  
 30  2006       195 non-null    int64  
 31  2007       195 non-null    int64  
 32  2008       195 non-null    int64  
 33  2009       195 non-null    int64  
 34  2010       195 non-null    int64  
 35  2011       195 non-null    int64  
 36  2012       195 non-null    int64  
 37  2013       195 non-null    int64  
 38  Total      195 non-null    int64  
dtypes: float64(1), int64(34), object(4)
memory usage: 59.5+ KB

To get the list of column headers we can call upon the data frame's columns instance variable.

In [5]:

can.columns

Out[5]:

Index(['Country', 'Continent', 'Region', 'DevName', '1980', '1981', '1982',
       '1983', '1984', '1985', '1986', '1987', '1988', '1989', '1990', '1991',
       '1992', '1993', '1994', '1995', '1996', '1997', '1998', '1999', '2000',
       '2001', '2002', '2003', '2004', '2005', '2006', '2007', '2008', '2009',
       '2010', '2011', '2012', '2013', 'Total'],
      dtype='object')

Similarly, to get the list of indices we use the .index instance variable.

In [6]:

can.index

Out[6]:

RangeIndex(start=0, stop=195, step=1)

Note: By default, the index and columns instance variables are NOT of type list.

To get the index and columns as lists, we can use the tolist() method.

In [7]:

can.columns.tolist()

Out[7]:

['Country',
 'Continent',
 'Region',
 'DevName',
 '1980',
 '1981',
 '1982',
 '1983',
 '1984',
 '1985',
 '1986',
 '1987',
 '1988',
 '1989',
 '1990',
 '1991',
 '1992',
 '1993',
 '1994',
 '1995',
 '1996',
 '1997',
 '1998',
 '1999',
 '2000',
 '2001',
 '2002',
 '2003',
 '2004',
 '2005',
 '2006',
 '2007',
 '2008',
 '2009',
 '2010',
 '2011',
 '2012',
 '2013',
 'Total']

In [8]:

can.index.tolist()

Out[8]:

In [9]:

print(type(can.columns.tolist()))
print(type(can.index.tolist()))

Out[9]:

<class 'list'>
<class 'list'>

To view the dimensions of the dataframe, we use its shape instance variable.

Note: The main types stored in pandas objects are float, int, bool, datetime64[ns], datetime64[ns, tz], timedelta64[ns], category, and object (string). In addition, these dtypes have item sizes, e.g. int64 and int32.
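
The info() output above lists 1983 as float64 while the other year columns are int64. A likely reason is the two missing values in that column: a NumPy-backed integer column cannot hold NaN, so pandas stores it as float64. Here is a minimal sketch on a made-up frame (the values are illustrative and not taken from the dataset):

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical): '1983' contains a missing value, '1984' does not.
toy = pd.DataFrame({
    "Country": ["A", "B", "C"],
    "1983": [69, np.nan, 2860],
    "1984": [63, 5, 1418],
})

print(toy.dtypes)

# The column with a NaN is stored as float64; the complete one stays int64.
dtype_1983 = toy["1983"].dtype
dtype_1984 = toy["1984"].dtype
```

This would also explain why values such as 69.0 show up with a decimal point in the 1983 column later on.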

The CSV loaded above appears to be already cleaned, so the usual preparation steps are only summarized here and not run:

Unnecessary columns can be removed with the pandas drop() method.

Columns can be renamed so that they make sense with the rename() method, by passing in a dictionary of old and new names.

A 'Total' column, which sums the immigrants by country over the entire period 1980 - 2013, is already present in the file (see the info() output above).
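
Since the CSV used here already contains the cleaned columns, these steps are not run in this notebook. As a rough sketch of what they could look like, here is a toy frame. The raw column names ('Type', 'OdName', 'AreaName') are assumptions for illustration and may not match the original source file.

```python
import pandas as pd

# Hypothetical raw frame; these column names are assumed for illustration.
raw = pd.DataFrame({
    "Type": ["Immigrants", "Immigrants"],
    "OdName": ["Haiti", "India"],
    "AreaName": ["Latin America and the Caribbean", "Asia"],
    "1980": [1666, 8880],
    "1981": [3692, 8670],
})

# 1. drop a column we do not need (axis=1 means "columns")
clean = raw.drop(["Type"], axis=1)

# 2. rename columns by passing a dict of {old name: new name}
clean = clean.rename(columns={"OdName": "Country", "AreaName": "Continent"})

# 3. add a 'Total' column: sum the year columns row by row
year_cols = ["1980", "1981"]
clean["Total"] = clean[year_cols].sum(axis=1)

print(clean)
```

Both drop() and rename() return a new dataframe by default, which is why the result is assigned back to clean.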

We can check how many null values we have in each column of the dataset as follows:

In [10]:

can.isnull().sum()

Out[10]:

Country       0
Continent     0
Region        0
DevName      16
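1980          0
1981          0
1982          0
1983          2
1984          0
1985          0
1986          0
1987          0
1988          0
1989          0
1990          0
1991          0
1992          0
1993          0
1994          0
1995          0
1996          0
1997          0
1998          0
1999          0
2000          0
2001          0
2002          0
2003          0
2004          0
2005          0
2006          0
2007          0
2008          0
2009          0
2010          0
2011          0
2012          0
2013          0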
Total         0
dtype: int64

Finally, let's view a quick summary of each column in our dataframe using the describe() method.

In [11]:

can.describe()

Out[11]:

In [12]:

can.describe(include="object")

Out[12]:

pandas Intermediate: Indexing and Selection (slicing)

Select Column

There are two ways to filter on a column name:

Method 1: Quick and easy, but only works if the column name does NOT have spaces or special characters.

    df.column_name               # returns series

Method 2: More robust, and can filter on multiple columns.

    df['column']                  # returns series

    df[['column 1', 'column 2']]  # returns dataframe
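
To make the difference between the two methods concrete, here is a small sketch on a made-up frame. The 'Dev Name' column (with a space) is hypothetical and only there to show where Method 1 stops working.

```python
import pandas as pd

# Toy frame (hypothetical values and column names)
toy = pd.DataFrame({
    "Country": ["Haiti", "India"],
    "Dev Name": ["Developing regions", "Developing regions"],
    "1980": [1666, 8880],
})

by_attr = toy.Country                 # Method 1: attribute access -> Series
by_key = toy["Dev Name"]              # Method 2: works despite the space -> Series
subset = toy[["Country", "1980"]]     # list of names -> DataFrame

# toy.1980 would be a SyntaxError, so '1980' must be selected with brackets
print(type(by_attr).__name__, type(by_key).__name__, type(subset).__name__)
```

Because the year columns in our dataset start with a digit, Method 2 is the one we need for them.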

Example: Let's try filtering on the list of countries ('Country').

In [13]:

can.Country  # returns a series

Out[13]:

       Afghanistan
           Albania
           Algeria
    American Samoa
           Andorra
            ...      
        Viet Nam
  Western Sahara
           Yemen
          Zambia
        Zimbabwe
Name: Country, Length: 195, dtype: object

Let's try filtering on the list of countries ('Country') and the data for years: 1986 - 1990.

In [14]:

can[['Country', '1986', '1987', '1988', '1989', '1990']] # returns a dataframe
# notice that the year labels are quoted: in this CSV they are strings, just like 'Country'.
# a later cell converts all column names to string anyway, to be safe.

Out[14]:

Select Row

There are two main ways to select rows:

    df.loc[label]    # filters by the labels of the index/column
    df.iloc[index]   # filters by the positions of the index/column
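
As a quick illustration of the difference, here is a sketch on a toy frame that is already indexed by country name (the numbers are made up). It also shows one detail worth knowing when slicing: .loc includes the end label, while .iloc excludes the end position.

```python
import pandas as pd

# Toy frame (hypothetical values), indexed by country name
toy = pd.DataFrame(
    {"1980": [1, 80, 0], "1981": [0, 67, 1], "1982": [0, 71, 0]},
    index=["Albania", "Algeria", "Andorra"],
)

by_label = toy.loc["Algeria"]        # row whose index label is 'Algeria'
by_position = toy.iloc[1]            # second row (positions start at 0)

# Slices differ: .loc includes the end label, .iloc excludes the end position.
loc_slice = toy.loc["Albania":"Algeria"]   # 2 rows
iloc_slice = toy.iloc[0:1]                 # 1 row

print(by_label.equals(by_position), len(loc_slice), len(iloc_slice))
```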

Before we proceed, notice that the default index of the dataset is a numeric range from 0 to 194. This makes it very difficult to do a query by a specific country. For example to search for data on Japan, we need to know the corresponding index value.

This can be fixed very easily by setting the 'Country' column as the index using set_index() method.

Setting Country as index column

In [15]:

can.set_index('Country', inplace=True)
# tip: The opposite of set is reset. So to reset the index, we can use can.reset_index()

In [16]:

can.head(3)

Out[16]:

In [17]:

# optional: to remove the name of the index
can.index.name = None

Example: Let's view the number of immigrants from Algeria (position 2, i.e. the third row) for the following scenarios:

1. The full row data (all columns)
2. For year 2005
3. For years 1980 to 1985

In [18]:

# 1. the full row data (all columns)
can.loc['Algeria']

Out[18]:

Continent                Africa
Region          Northern Africa
DevName      Developing regions
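1980                         80
1981                         67
1982                         71
1983                       69.0
1984                         63
1985                         44
1986                         69
1987                        132
1988                        242
1989                        434
1990                        491
1991                        872
1992                        795
1993                        717
1994                        595
1995                       1106
1996                       2054
1997                       1842
1998                       2292
1999                       2389
2000                       2867
2001                       3418
2002                       3406
2003                       3072
2004                       3616
2005                       3626
2006                       4807
2007                       3623
2008                       4005
2009                       5393
2010                       4752
2011                       4325
2012                       3774
2013                       4331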
Total                     69439
Name: Algeria, dtype: object

In [19]:

# alternate methods
can.iloc[2]

Out[19]:

Continent                Africa
Region          Northern Africa
DevName      Developing regions
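1980                         80
1981                         67
1982                         71
1983                       69.0
1984                         63
1985                         44
1986                         69
1987                        132
1988                        242
1989                        434
1990                        491
1991                        872
1992                        795
1993                        717
1994                        595
1995                       1106
1996                       2054
1997                       1842
1998                       2292
1999                       2389
2000                       2867
2001                       3418
2002                       3406
2003                       3072
2004                       3616
2005                       3626
2006                       4807
2007                       3623
2008                       4005
2009                       5393
2010                       4752
2011                       4325
2012                       3774
2013                       4331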
Total                     69439
Name: Algeria, dtype: object

In [20]:

can[can.index == 'Algeria']

Out[20]:

In [21]:

# 2. for year 2005
can.loc['Algeria', '2005']

Out[21]:

3626

In [22]:

# 3. for years 1980 to 1985
can.loc['Algeria', ['1980', '1981', '1982', '1983', '1984', '1985']]

Out[22]:

1980      80
1981      67
1982      71
1983    69.0
1984      63
1985      44
Name: Algeria, dtype: object

Column names that are integers (such as years) can cause confusion. For example, when we reference the year 2013, it could be mistaken for the 2013th positional index.

To avoid this ambiguity, let's make sure all column names are strings: '1980' to '2013'.

In [23]:

can.columns = list(map(str, can.columns))
# [print(type(x)) for x in can.columns.values] #<-- uncomment to check type of column headers
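
To see why integer column labels can be confusing, consider this small sketch. The toy frame below deliberately uses integer labels, unlike our CSV, where the labels are already strings.

```python
import pandas as pd

# Toy frame (hypothetical) whose column labels are integers, not strings.
toy = pd.DataFrame([[80, 67, 71]], index=["Algeria"], columns=[1980, 1981, 1982])

by_label = toy[1981]            # looks up the *label* 1981, not position 1981
by_position = toy.iloc[:, 1]    # position 1 happens to be the same column

# After converting the labels to strings, a label can no longer be mistaken for a position.
toy.columns = list(map(str, toy.columns))
as_string = toy["1981"]

print(by_label.iloc[0], by_position.iloc[0], as_string.iloc[0])
```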

Since we converted the years to string, let's declare a variable that will allow us to easily call upon the full range of years:

In [24]:

# useful for plotting later on
years = list(map(str, range(1980, 2014)))
years

Out[24]:

['1980',
 '1981',
 '1982',
 '1983',
 '1984',
 '1985',
 '1986',
 '1987',
 '1988',
 '1989',
 '1990',
 '1991',
 '1992',
 '1993',
 '1994',
 '1995',
 '1996',
 '1997',
 '1998',
 '1999',
 '2000',
 '2001',
 '2002',
 '2003',
 '2004',
 '2005',
 '2006',
 '2007',
 '2008',
 '2009',
 '2010',
 '2011',
 '2012',
 '2013']

Filtering based on a criterion

To filter the dataframe based on a condition, we simply pass the condition as a boolean vector.

For example, let's filter the dataframe to show the data on African countries (Continent = Africa).

In [25]:

# 1. create the condition boolean series
condition = can['Continent'] == 'Africa'
print(condition)

Out[25]:

Afghanistan       False
Albania           False
Algeria            True
American Samoa    False
Andorra           False
                  ...  
Viet Nam          False
Western Sahara     True
Yemen             False
Zambia             True
Zimbabwe           True
Name: Continent, Length: 195, dtype: bool

In [26]:

condition2 = can['DevName'] == 'Developed regions'
condition2

Out[26]:

Afghanistan       False
Albania            True
Algeria           False
American Samoa    False
Andorra           False
                  ...  
Viet Nam          False
Western Sahara    False
Yemen             False
Zambia            False
Zimbabwe          False
Name: DevName, Length: 195, dtype: bool

In [27]:

# 2. pass this condition into the dataFrame
a=can[condition2]
a

Out[27]:

In [28]:

# we can pass multiple criteria in the same line.
# let's filter for Continent = Asia and Region = Southern Asia

can[(can['Continent']=='Asia') & (can['Region']=='Southern Asia')] 

# note: to combine conditions, pandas requires the operators '&' and '|' instead of the keywords 'and' and 'or'
# don't forget to enclose each condition in parentheses

Out[28]:
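
The note about '&' and '|' can be tried out on a toy frame. The sketch below uses made-up rows and also shows '~' (NOT), which is not used elsewhere in this notebook. The parentheses matter because '&' and '|' bind more tightly than '=='.

```python
import pandas as pd

# Toy frame (hypothetical rows) with the same column names as our dataset
toy = pd.DataFrame(
    {
        "Continent": ["Asia", "Asia", "Africa", "Europe"],
        "Region": ["Southern Asia", "Eastern Asia", "Northern Africa", "Southern Europe"],
    },
    index=["India", "China", "Algeria", "Albania"],
)

is_asia = toy["Continent"] == "Asia"
is_south_asia = toy["Region"] == "Southern Asia"

both = toy[is_asia & is_south_asia]                       # AND: both conditions hold
either = toy[is_asia | (toy["Continent"] == "Africa")]    # OR: at least one holds
not_asia = toy[~is_asia]                                  # NOT: invert the mask

# Plain `and` does not work on a Series: its truth value is ambiguous.
try:
    toy[is_asia and is_south_asia]
    and_failed = False
except ValueError as err:
    and_failed = True
    print("ValueError:", err)

print(list(both.index), list(either.index), list(not_asia.index))
```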

Before we proceed: let's review the changes we have made to our dataframe.

In [29]:

print('data dimensions:', can.shape)
print(can.columns)
can.head(2)

Out[29]:

data dimensions: (195, 38)
Index(['Continent', 'Region', 'DevName', '1980', '1981', '1982', '1983',
       '1984', '1985', '1986', '1987', '1988', '1989', '1990', '1991', '1992',
       '1993', '1994', '1995', '1996', '1997', '1998', '1999', '2000', '2001',
       '2002', '2003', '2004', '2005', '2006', '2007', '2008', '2009', '2010',
       '2011', '2012', '2013', 'Total'],
      dtype='object')

Visualizing Data using Matplotlib

Matplotlib: Standard Python Visualization Library

The primary plotting library we will explore in the course is Matplotlib. As mentioned on their website:

Matplotlib is a Python 2D plotting library which produces publication quality figures in a variety of hardcopy formats and interactive environments across platforms. Matplotlib can be used in Python scripts, the Python and IPython shell, the jupyter notebook, web application servers, and four graphical user interface toolkits.

If you are aspiring to create impactful visualizations with Python, Matplotlib is an essential tool to have at your disposal.

Matplotlib.Pyplot

One of the core aspects of Matplotlib is matplotlib.pyplot.

Let's start by importing matplotlib and matplotlib.pyplot as follows:

In [30]:

# we are using the inline backend
%matplotlib inline 

import matplotlib as mpl
import matplotlib.pyplot as plt

*optional: check if Matplotlib is loaded.

In [31]:

print('Matplotlib version: ', mpl.__version__)  # >= 2.0.0

Out[31]:

Matplotlib version:  3.7.2

*optional: apply a style to Matplotlib.

In [32]:

print(plt.style.available)
mpl.style.use(['ggplot']) # optional: for ggplot-like style

Out[32]:

['Solarize_Light2', '_classic_test_patch', '_mpl-gallery', '_mpl-gallery-nogrid', 'bmh', 'classic', 'dark_background', 'fast', 'fivethirtyeight', 'ggplot', 'grayscale', 'seaborn-v0_8', 'seaborn-v0_8-bright', 'seaborn-v0_8-colorblind', 'seaborn-v0_8-dark', 'seaborn-v0_8-dark-palette', 'seaborn-v0_8-darkgrid', 'seaborn-v0_8-deep', 'seaborn-v0_8-muted', 'seaborn-v0_8-notebook', 'seaborn-v0_8-paper', 'seaborn-v0_8-pastel', 'seaborn-v0_8-poster', 'seaborn-v0_8-talk', 'seaborn-v0_8-ticks', 'seaborn-v0_8-white', 'seaborn-v0_8-whitegrid', 'tableau-colorblind10']

Plotting in pandas

Fortunately, pandas has built-in plotting that wraps Matplotlib. Plotting in pandas is as simple as calling the .plot() method on a series or dataframe.
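
As a minimal sketch of what this means in practice: .plot() hands back a regular Matplotlib Axes object, so the usual Matplotlib methods can be used on the result. The series below is a toy example, and the Agg backend line is an assumption for running outside a notebook (it can be dropped when using %matplotlib inline).

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch also runs without a display
import matplotlib.pyplot as plt
import pandas as pd

# Toy series: index -> x-axis, values -> y-axis
s = pd.Series([1666, 3692, 3498], index=[1980, 1981, 1982], name="Haiti")

ax = s.plot(kind="line")          # returns a Matplotlib Axes object
ax.set_title("Toy line plot")     # so regular Matplotlib methods apply
ax.set_xlabel("Years")

print(type(ax))
plt.close("all")
```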


Line Plots (Series/Dataframe)

What is a line plot and why use it?

A line chart or line plot is a type of plot which displays information as a series of data points called 'markers' connected by straight line segments. It is a basic type of chart common in many fields. Use a line plot when you have a continuous data set. Line plots are best suited for trend-based visualizations of data over a period of time.

Let's start with a case study:

In 2010, Haiti suffered a catastrophic magnitude 7.0 earthquake. The quake caused widespread devastation and loss of life, and about three million people were affected by this natural disaster. As part of Canada's humanitarian effort, the Government of Canada stepped up its effort in accepting refugees from Haiti. We can quickly visualize this effort using a line plot:

Question: Plot a line graph of immigration from Haiti using df.plot().

First, we will extract the data series for Haiti.

In [33]:

haiti = can.loc['Haiti', years] # passing in years 1980 - 2013 to exclude the 'total' column
haiti.head()

Out[33]:

1980      1666
1981      3692
1982      3498
1983    2860.0
1984      1418
Name: Haiti, dtype: object

Next, we will plot a line plot by appending .plot() to the haiti series.

In [34]:

haiti.plot()

Out[34]:

<Axes: >

pandas automatically populated the x-axis with the index values (years), and the y-axis with the column values (number of immigrants). However, notice how the years were not displayed because they are of type string. Therefore, let's change the type of the index values to integer for plotting.

Also, let's add a title and label the x and y axes using plt.title(), plt.ylabel(), and plt.xlabel() as follows:

In [35]:

haiti.index = haiti.index.map(int) # let's change the index values of Haiti to type integer for plotting
haiti.plot(kind='line')

plt.title('Immigration from Haiti')
plt.ylabel('Number of immigrants')
plt.xlabel('Years')

plt.show() # need this line to show the updates made to the figure

Out[35]:

We can clearly see how the number of immigrants from Haiti spiked from 2010 as Canada stepped up its efforts to accept refugees from Haiti. Let's annotate this spike in the plot by using the plt.text() method.

In [36]:

haiti.plot(kind='line')

plt.title('Immigration from Haiti')
plt.ylabel('Number of Immigrants')
plt.xlabel('Years')

# annotate the 2010 Earthquake. 
# syntax: plt.text(x, y, label)
plt.text(2009, 6500, '2010 Earthquake') 

plt.show()

Out[36]:

With just a few lines of code, you were able to quickly identify and visualize the spike in immigration!

Quick note on x and y values in plt.text(x, y, label):

 Since the x-axis (years) is type 'integer', we specified x as a year. The y-axis (number of immigrants) is also type 'integer', so we can specify y directly, here y = 6500.

    plt.text(2009, 6500, '2010 Earthquake') # years stored as type int

If the years were stored as type 'string', we would need to specify x as the index position of the year. E.g. position 29 is year 2009, since it is 29 years after the base year 1980.

    plt.text(29, 6500, '2010 Earthquake') # years stored as type string

We will cover advanced annotation methods in later modules.
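
The difference between the two cases can be checked by looking at the x coordinates pandas hands to Matplotlib. The sketch below uses two toy series with made-up values. It relies on the observed pandas behaviour of plotting a string index against positions 0, 1, 2, ..., so treat it as an illustration rather than a guarantee across versions.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless run; not needed with %matplotlib inline
import matplotlib.pyplot as plt
import pandas as pd

values = [10, 20, 30, 40]  # made-up numbers

# String index: pandas plots against positions 0, 1, 2, ...
s_str = pd.Series(values, index=["1980", "1981", "1982", "1983"])
ax_str = s_str.plot(kind="line")
x_str = list(ax_str.lines[0].get_xdata())
ax_str.text(2, 30, "label at position 2")   # i.e. the year '1982'

plt.figure()  # start a new figure so the second series gets its own axes

# Integer index: the years themselves are the x coordinates
s_int = pd.Series(values, index=[1980, 1981, 1982, 1983])
ax_int = s_int.plot(kind="line")
x_int = list(ax_int.lines[0].get_xdata())
ax_int.text(1982, 30, "label at year 1982")

print(x_str, x_int)
plt.close("all")
```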

We can easily add more countries to the line plot to make meaningful comparisons of immigration from different countries.

Question: Let's compare the number of immigrants from India and China from 1980 to 2013.

Step 1: Get the data set for China and India, and display the dataframe.

In [37]:


df_CI = can.loc[['India', 'China'], years]
df_CI

Out[37]:

Step 2: Plot the graph. We will explicitly specify a line plot by passing the kind parameter to plot().

In [38]:


df_CI.plot(kind='line')

Out[38]:

<Axes: >

That doesn't look right...

Recall that pandas plots the indices on the x-axis and the columns as individual lines on the y-axis. Since df_CI is a dataframe with the country as the index and years as the columns, we must first transpose the dataframe using the transpose() method to swap the rows and columns.

In [39]:

df_CI = df_CI.transpose()
df_CI.head()

Out[39]:

pandas will automatically graph the two countries on the same graph. Go ahead and plot the new transposed dataframe. Make sure to add a title to the plot and label the axes.

In [40]:



df_CI.index = df_CI.index.map(int) # let's change the index values of df_CI to type integer for plotting
df_CI.plot(kind='line',figsize=(15, 8))

plt.title('Immigrants from China and India')
plt.ylabel('Number of Immigrants')
plt.xlabel('Years')

plt.show()

Out[40]:

From the above plot, we can observe that China and India have very similar immigration trends through the years.

Note: How come we didn't need to transpose Haiti's dataframe before plotting (like we did for df_CI)?

That's because haiti is a series as opposed to a dataframe, and has the years as its indices as shown below.

print(type(haiti))
print(haiti.head(5))

<class 'pandas.core.series.Series'>
1980 1666
1981 3692
1982 3498
1983 2860
1984 1418
Name: Haiti, dtype: int64
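
For a dataframe, the rule "one line per column" can be verified by counting the lines on the resulting axes. The sketch below uses a toy frame shaped like df_CI (countries as rows, years as columns) with only three years. The Agg backend line is an assumption for running outside a notebook.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless run; not needed with %matplotlib inline
import matplotlib.pyplot as plt
import pandas as pd

# Toy frame shaped like df_CI: countries as rows, years as columns
toy = pd.DataFrame(
    [[8880, 8670, 8147], [5123, 6682, 3308]],
    index=["India", "China"],
    columns=[1980, 1981, 1982],
)

ax_wide = toy.plot(kind="line")        # one line per year column
n_wide = len(ax_wide.lines)

ax_long = toy.T.plot(kind="line")      # after transposing: one line per country
n_long = len(ax_long.lines)

print(n_wide, n_long)
plt.close("all")
```

With the full 34 years, the untransposed frame would draw 34 lines across just two x positions, which is why the first attempt did not look right.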

A line plot is a handy tool to display several dependent variables against one independent variable. However, it is recommended to plot no more than 5-10 lines on a single graph; any more than that and it becomes difficult to interpret.

Question: Compare the trend of top 5 countries that contributed the most to immigration to Canada.

In [41]:

# One possible solution:
# Step 1: Get the dataset. Recall that the Total column holds cumulative immigration by country.
# We will sort on this column to get our top 5 countries using the pandas sort_values() method.

# inplace=True saves the changes to the original can dataframe
can.sort_values(by='Total', ascending=False, axis=0, inplace=True)

# get the top 5 entries
df_top5 = can.head(5)

# transpose the dataframe
df_top5 = df_top5[years].transpose() 

print(df_top5)

Out[41]:

        India    China  United Kingdom of Great Britain and Northern Ireland  \
1980   8880.0   5123.0                                               22045.0
1981   8670.0   6682.0                                               24796.0
1982   8147.0   3308.0                                               20620.0
1983   7338.0   1863.0                                               10015.0
1984   5704.0   1527.0                                               10170.0
1985   4211.0   1816.0                                                9564.0
1986   7150.0   1960.0                                                9470.0
1987  10189.0   2643.0                                               21337.0
1988  11522.0   2758.0                                               27359.0
1989  10343.0   4323.0                                               23795.0
1990  12041.0   8076.0                                               31668.0
1991  13734.0  14255.0                                               23380.0
1992  13673.0  10846.0                                               34123.0
1993  21496.0   9817.0                                               33720.0
1994  18620.0  13128.0                                               39231.0
1995  18489.0  14398.0                                               30145.0
1996  23859.0  19415.0                                               29322.0
1997  22268.0  20475.0                                               22965.0
1998  17241.0  21049.0                                               10367.0
1999  18974.0  30069.0                                                7045.0
2000  28572.0  35529.0                                                8840.0
2001  31223.0  36434.0                                               11728.0
2002  31889.0  31961.0                                                8046.0
2003  27155.0  36439.0                                                6797.0
2004  28235.0  36619.0                                                7533.0
2005  36210.0  42584.0                                                7258.0
2006  33848.0  33518.0                                                7140.0
2007  28742.0  27642.0                                                8216.0
2008  28261.0  30037.0                                                8979.0
2009  29456.0  29622.0                                                8876.0
2010  34235.0  30391.0                                                8724.0
2011  27509.0  28502.0                                                6204.0
2012  30933.0  33024.0                                                6195.0
2013  33087.0  34129.0                                                5827.0

      Philippines  Pakistan
1980       6051.0     978.0
1981       5921.0     972.0
1982       5249.0    1201.0
1983       4562.0     900.0
1984       3801.0     668.0
1985       3150.0     514.0
1986       4166.0     691.0
1987       7360.0    1072.0
1988       8639.0    1334.0
1989      11865.0    2261.0
1990      12509.0    2470.0
1991      12718.0    3079.0
1992      13670.0    4071.0
1993      20479.0    4777.0
1994      19532.0    4666.0
1995      15864.0    4994.0
1996      13692.0    9125.0
1997      11549.0   13073.0
1998       8735.0    9068.0
1999       9734.0    9979.0
2000      10763.0   15400.0
2001      13836.0   16708.0
2002      11707.0   15110.0
2003      12758.0   13205.0
2004      14004.0   13399.0
2005      18139.0   14314.0
2006      18400.0   13127.0
2007      19837.0   10124.0
2008      24887.0    8994.0
2009      28573.0    7217.0
2010      38617.0    6811.0
2011      36765.0    7468.0
2012      34315.0   11227.0
2013      29544.0   12603.0

In [42]:





#Step 2: Plot the dataframe. To make the plot more readable, we will change the size using the `figsize` parameter.
df_top5.index = df_top5.index.map(int) # let's change the index values of df_top5 to type integer for plotting
df_top5.plot(kind='line', figsize=(15, 8)) # pass a tuple (x, y) size


plt.title('Immigration Trend of Top 5 Countries')
plt.ylabel('Number of Immigrants')
plt.xlabel('Years')


plt.show()

Out[42]: