Data Visualization
Objectives
Create Data Visualization with Python
Use various Python libraries for visualization
https://mockaroo.com/ (for sample data generation)
Let's view the top 2 rows of the dataset using the head() function.
When analyzing a dataset, it's always a good idea to start by getting basic information about your dataframe. We can do this by using the info() method.
This method can be used to get a short summary of the dataframe.
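As a minimal sketch of these two calls, the snippet below builds a tiny stand-in dataframe. The column names follow the ones mentioned in this notebook, but the rows and values are illustrative placeholders, not the real dataset.

```python
import pandas as pd

# Hypothetical miniature stand-in for the immigration dataset (illustrative values only).
df = pd.DataFrame({
    "Country": ["Haiti", "India", "Japan"],
    "AreaName": ["Latin America and the Caribbean", "Asia", "Asia"],
    1980: [1666, 8880, 701],
    1981: [3692, 8670, 756],
})

top_rows = df.head(2)   # first two rows of the dataframe
print(top_rows)

df.info()               # prints column dtypes, non-null counts, and memory usage
```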
Insight
We have immigration information for 195 countries from 1980 to 2013.
To get the list of column headers, we can call upon the dataframe's columns instance variable.
Similarly, to get the list of indices, we use the .index instance variable.
To get the index and columns as lists, we can use the tolist() method.
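A short sketch of these attributes on a toy dataframe (the two rows are placeholders, not the real data):

```python
import pandas as pd

# Hypothetical two-row stand-in for the dataset.
df = pd.DataFrame({"Country": ["Haiti", "India"], 1980: [1666, 8880]})

print(df.columns)   # pandas Index holding the column labels
print(df.index)     # pandas RangeIndex holding the row labels

column_list = df.columns.tolist()   # plain Python list of column labels
index_list = df.index.tolist()      # plain Python list of row labels
```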
Statistical Analysis
Finally, let's view a quick summary of each column in our dataframe using the describe() method.
Insights
Total 195 Countries
22 different regions, and the highest immigration to Canada is from Western Asia
78% of immigration to Canada is from developing regions
Select Column
There are two ways to filter on a column name:
Method 1: Quick and easy, but only works if the column name does NOT have spaces or special characters.
Method 2: More robust, and can filter on multiple columns.
Example: Let's try filtering on the list of countries ('Country').
Let's try filtering on the list of countries ('Country') and the data for years: 1980 - 1985.
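The two methods can be sketched as follows. The dataframe is a toy stand-in with only two year columns, so the year range here is shorter than 1980 - 1985.

```python
import pandas as pd

# Hypothetical stand-in: year columns are integers, as in the raw dataset.
df = pd.DataFrame({
    "Country": ["Haiti", "India", "Japan"],
    1980: [1666, 8880, 701],
    1981: [3692, 8670, 756],
})

countries_attr = df.Country            # Method 1: attribute access, returns a Series
countries = df["Country"]              # Method 2 with a single label, also a Series
subset = df[["Country", 1980, 1981]]   # Method 2 with a list of labels, returns a DataFrame
```

Note that Method 1 cannot reach the year columns at all, since `df.1980` is not valid Python syntax.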
Select Row
There are two main ways to select rows:
Before we proceed, notice that the default index of the dataset is a numeric range from 0 to 194. This makes it very difficult to query by a specific country. For example, to search for data on Japan, we need to know the corresponding index value.
This can be fixed very easily by setting the 'Country' column as the index using the set_index() method.
Setting Country as index column
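A minimal sketch of this step on a toy dataframe (values are placeholders):

```python
import pandas as pd

# Hypothetical stand-in with the default 0..n-1 index.
df = pd.DataFrame({
    "Country": ["Haiti", "India", "Japan"],
    1980: [1666, 8880, 701],
})

df = df.set_index("Country")          # returns a new dataframe indexed by country name
japan_1980 = df.loc["Japan", 1980]    # rows can now be looked up by name
```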
Example: Let's view the number of immigrants from Japan (row 87) for the following scenarios:
1. The full row data (all columns)
2. For year 2013
3. For years 1980 to 1985
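The three scenarios can be sketched with label-based selection (.loc) and position-based selection (.iloc). In this toy dataframe Japan sits at position 2 rather than 87, only three year columns exist, and all values are placeholders.

```python
import pandas as pd

# Hypothetical stand-in, already indexed by country; year columns are still integers.
df = pd.DataFrame(
    {1980: [1666, 8880, 701], 1981: [3692, 8670, 756], 2013: [4152, 33087, 982]},
    index=pd.Index(["Haiti", "India", "Japan"], name="Country"),
)

# 1. The full row data (all columns)
full_row = df.loc["Japan"]     # by label
same_row = df.iloc[2]          # by position

# 2. For year 2013
in_2013 = df.loc["Japan", 2013]   # row label, column label
in_2013_pos = df.iloc[2, 2]       # row position, column position

# 3. For a range of years (1980 to 1981 here)
early_years = df.loc["Japan", [1980, 1981]]
```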
Quick Practice
How many immigrants came from India in 2000?
How many immigrants came from China in 1998?
How many immigrants came from India between 1988 and 2000?
Column names that are integers (such as the years) might introduce some confusion. For example, when we are referencing the year 2013, one might confuse that with the 2013th positional index.
To avoid this ambiguity, let's convert the column names into strings: '1980' to '2013'.
Since we converted the years to string, let's declare a variable that will allow us to easily call upon the full range of years:
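One way to sketch both steps, using a toy one-row dataframe (the variable name years is an assumption for illustration):

```python
import pandas as pd

# Hypothetical stand-in with integer year columns.
df = pd.DataFrame({1980: [1666], 1981: [3692]}, index=["Haiti"])

df.columns = list(map(str, df.columns))      # 1980 -> '1980', 1981 -> '1981'

# Full range of years as strings, handy for selecting all year columns at once.
years = list(map(str, range(1980, 2014)))
```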
Filtering based on a criterion
To filter the dataframe based on a condition, we simply pass the condition as a boolean vector.
For example, let's filter the dataframe to show the data on Asian countries (AreaName = Asia).
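A small sketch of boolean filtering on a toy dataframe (rows and values are placeholders):

```python
import pandas as pd

# Hypothetical stand-in indexed by country.
df = pd.DataFrame(
    {
        "AreaName": ["Latin America and the Caribbean", "Asia", "Asia"],
        "1980": [1666, 8880, 701],
    },
    index=["Haiti", "India", "Japan"],
)

condition = df["AreaName"] == "Asia"   # boolean Series, one True/False per row
asia = df[condition]                   # keeps only the rows where the condition is True
```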
QP
Create a condition to filter Africa data
Before we proceed: let's review the changes we have made to our dataframe.
Matplotlib: Standard Python Visualization Library
The primary plotting library we will explore in the course is Matplotlib. As mentioned on their website:
Matplotlib is a Python 2D plotting library which produces publication quality figures in a variety of hardcopy formats and interactive environments across platforms. Matplotlib can be used in Python scripts, the Python and IPython shell, the jupyter notebook, web application servers, and four graphical user interface toolkits.
If you are aspiring to create impactful visualizations with Python, Matplotlib is an essential tool to have at your disposal.
Matplotlib.Pyplot
One of the core aspects of Matplotlib is matplotlib.pyplot.
Let's start by importing matplotlib and matplotlib.pyplot as follows:
*optional: check if Matplotlib is loaded.
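A sketch of the import and the optional check. The aliases mpl and plt are the common convention, assumed here.

```python
import matplotlib as mpl
import matplotlib.pyplot as plt

# Optional: confirm that Matplotlib is loaded by printing its version.
print("Matplotlib version:", mpl.__version__)
```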
Plotting in pandas
Fortunately, pandas has a built-in plotting interface on top of Matplotlib that we can use. Plotting in pandas is as simple as appending a .plot() method to a series or dataframe.
Documentation:
What is a line plot and why use it?
A line chart or line plot is a type of plot which displays information as a series of data points called 'markers' connected by straight line segments. It is a basic type of chart common in many fields. Use a line plot when you have a continuous data set. Line plots are best suited for trend-based visualizations of data over a period of time.
Let's start with a case study:
In 2010, Haiti suffered a catastrophic magnitude 7.0 earthquake. The quake caused widespread devastation and loss of life, and about three million people were affected by this natural disaster. As part of Canada's humanitarian effort, the Government of Canada stepped up its effort in accepting refugees from Haiti. We can quickly visualize this effort using a line plot:
Question: Plot a line graph of immigration from Haiti using df.plot().
(Q.P - Plot a line graph of immigration from China using df.plot())
First, we will extract the data series for Haiti.
Next, we will plot a line plot by appending .plot() to the haiti series.
pandas automatically populated the x-axis with the index values (years), and the y-axis with the column values (population). However, notice how the years were not displayed because they are of type string. Therefore, let's change the type of the index values to integer for plotting.
Also, let's add a title and label the x and y axes using plt.title(), plt.ylabel(), and plt.xlabel() as follows:
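A sketch of the index conversion and the labelled plot. Only the first five Haiti values printed later in this notebook are used, and the title and label texts are assumptions.

```python
import matplotlib as mpl
mpl.use("Agg")  # headless backend so this runs without a display; not needed in a notebook
import matplotlib.pyplot as plt
import pandas as pd

# First five Haiti values as shown in this notebook, indexed by string years.
haiti = pd.Series(
    [1666, 3692, 3498, 2860, 1418],
    index=["1980", "1981", "1982", "1983", "1984"],
    name="Haiti",
)

haiti.index = haiti.index.map(int)   # convert the string years to integers for plotting
ax = haiti.plot(kind="line")

plt.title("Immigration from Haiti")
plt.ylabel("Number of immigrants")
plt.xlabel("Years")
```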
We can clearly see how the number of immigrants from Haiti spiked from 2010 onward as Canada stepped up its efforts to accept refugees from Haiti. Let's annotate this spike in the plot by using the plt.text() method.
With just a few lines of code, you were able to quickly identify and visualize the spike in immigration!
Quick note on x and y values in plt.text(x, y, label):
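As a minimal illustration, x and y in plt.text() are given in data coordinates, so with an integer year index x is a year and y is a number of immigrants. The snippet uses only the five Haiti values printed in this notebook, and the label text and position are assumptions.

```python
import matplotlib as mpl
mpl.use("Agg")  # headless backend so this runs without a display; not needed in a notebook
import matplotlib.pyplot as plt
import pandas as pd

# First five Haiti values as shown in this notebook, with integer years as the index.
haiti = pd.Series([1666, 3692, 3498, 2860, 1418],
                  index=[1980, 1981, 1982, 1983, 1984], name="Haiti")

ax = haiti.plot(kind="line")

# x=1981 is a year on the x-axis, y=3700 is a count on the y-axis.
label = plt.text(1981, 3700, "1981 peak")
```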
We can easily add more countries to the line plot to make meaningful comparisons of immigration from different countries.
Question: Let's compare the number of immigrants from India and China from 1980 to 2013.
Q.P: Compare immigrants from Sri Lanka and Bhutan from 1980 to 2013.
Step 1: Get the data set for China and India, and display the dataframe.
Step 2: Plot graph. We will explicitly specify a line plot by passing in the kind parameter to plot().
That doesn't look right...
Recall that pandas plots the indices on the x-axis and the columns as individual lines on the y-axis. Since df_CI is a dataframe with the country as the index and years as the columns, we must first transpose the dataframe using the transpose() method to swap the rows and columns.
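A sketch of the transpose step on a toy df_CI. The three year columns and their values are placeholders, not the real figures.

```python
import matplotlib as mpl
mpl.use("Agg")  # headless backend so this runs without a display; not needed in a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical df_CI: countries as the index, string years as the columns.
df_CI = pd.DataFrame(
    {"1980": [5123, 8880], "1981": [6682, 8670], "1982": [3308, 8147]},
    index=["China", "India"],
)

df_CI = df_CI.transpose()             # years become the index, countries the columns
df_CI.index = df_CI.index.map(int)    # integer years so the x-axis is numeric

ax = df_CI.plot(kind="line")          # one line per column, i.e. per country
plt.title("Immigrants from China and India")
plt.ylabel("Number of immigrants")
plt.xlabel("Years")
```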
pandas will automatically graph the two countries on the same graph. Go ahead and plot the new transposed dataframe. Make sure to add a title to the plot and label the axes.
From the above plot, we can observe that China and India have very similar immigration trends through the years.
Note: How come we didn't need to transpose Haiti's dataframe before plotting (like we did for df_CI)?
That's because haiti is a series as opposed to a dataframe, and has the years as its indices as shown below.
<class 'pandas.core.series.Series'>
1980 1666
1981 3692
1982 3498
1983 2860
1984 1418
Name: Haiti, dtype: int64
A line plot is a handy tool to display several dependent variables against one independent variable. However, it is recommended to plot no more than 5-10 lines on a single graph; any more than that and it becomes difficult to interpret.
Question: Compare the trend of top 5 countries that contributed the most to immigration to Canada.
Q.P: Compare the trend of bottom 5 countries that contributed least to immigration to Canada.
Area Plots
Area plots are stacked by default. To produce a stacked area plot, each column must contain either all positive or all negative values (any NaN, i.e. not a number, values will default to 0). To produce an unstacked plot, set the stacked parameter to False.
The unstacked plot has a default transparency (alpha value) at 0.5. We can modify this value by passing in the alpha parameter.
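The stacked and alpha parameters can be sketched as follows. The dataframe name df_top and its values are placeholders.

```python
import matplotlib as mpl
mpl.use("Agg")  # headless backend so this runs without a display; not needed in a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical transposed data: years as the index, one column per country.
df_top = pd.DataFrame(
    {"India": [8880, 8670, 8147], "China": [5123, 6682, 3308]},
    index=[1980, 1981, 1982],
)

# Stacked is the default for area plots.
ax_stacked = df_top.plot(kind="area")

# Unstacked, overriding the default transparency of 0.5.
ax_unstacked = df_top.plot(kind="area", stacked=False, alpha=0.25)
```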
Two types of plotting
Option 1: Scripting layer (procedural method) - using matplotlib.pyplot as 'plt'
You can use plt, i.e. matplotlib.pyplot, and add more elements by calling different methods procedurally; for example, plt.title(...) to add a title or plt.xlabel(...) to add a label to the x-axis.
Option 2: Artist layer (object-oriented method) - using an Axes instance from Matplotlib (preferred)
You can use an Axes instance of your current plot and store it in a variable (e.g. ax). You can add more elements by calling methods with a little change in syntax (by adding "set_" to the previous methods). For example, use ax.set_title() instead of plt.title() to add a title, or ax.set_xlabel() instead of plt.xlabel() to add a label to the x-axis.
This option sometimes is more transparent and flexible to use for advanced plots (in particular when having multiple plots, as you will see later).
In this course, we will stick to the scripting layer, except for some advanced visualizations where we will need to use the artist layer to manipulate advanced aspects of the plots.
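To make the difference concrete, here is the same plot labelled both ways. The data and the label texts are placeholders.

```python
import matplotlib as mpl
mpl.use("Agg")  # headless backend so this runs without a display; not needed in a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical data: years as the index, one column per country.
df = pd.DataFrame({"India": [8880, 8670, 8147]}, index=[1980, 1981, 1982])

# Option 1: scripting layer, pyplot functions act on the current axes.
df.plot(kind="line")
plt.title("Scripting layer")
plt.xlabel("Years")
ax1 = plt.gca()   # grab the current axes only to inspect it afterwards

# Option 2: artist layer, keep the Axes returned by plot() and call its set_ methods.
ax2 = df.plot(kind="line")
ax2.set_title("Artist layer")
ax2.set_xlabel("Years")
```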
Question: Use the scripting layer to create a stacked area plot of the 5 countries that contributed the least to immigration to Canada from 1980 to 2013. Use a transparency value of 0.45.
Quick Practice
Question: Use the artist layer to create an unstacked area plot of the 5 countries that contributed the least to immigration to Canada from 1980 to 2013. Use a transparency value of 0.55.
Histograms
A histogram is a way of representing the frequency distribution of a numeric dataset. It works by partitioning the x-axis into bins, assigning each data point in the dataset to a bin, and then counting the number of data points that have been assigned to each bin. So the y-axis is the frequency, or the number of data points, in each bin. Note that we can change the bin size, and usually one needs to tweak it so that the distribution is displayed nicely.
Question: What is the frequency distribution of the number (population) of new immigrants from the various countries to Canada in 2013?
Before we proceed with creating the histogram plot, let's first examine the data split into intervals. To do this, we will use NumPy's histogram function to get the bin ranges and frequency counts as follows:
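A sketch of this call on a handful of placeholder values standing in for the 2013 column:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the 2013 column (one value per country).
values_2013 = pd.Series([0, 10, 25, 40, 55, 70, 85, 100])

# np.histogram returns the frequency counts and the bin edges (10 equal-width bins by default).
count, bin_edges = np.histogram(values_2013)

print(count)       # number of values falling in each bin
print(bin_edges)   # 11 edges delimiting the 10 bins
```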
By default, the histogram function breaks up the dataset into 10 bins. The figure below summarizes the bin ranges and the frequency distribution of immigration in 2013. We can see that in 2013:
178 countries contributed between 0 and 3412.9 immigrants
11 countries contributed between 3412.9 and 6825.8 immigrants
1 country contributed between 6825.8 and 10238.7 immigrants, and so on.
We can easily graph this distribution by passing kind='hist' to plot().
In the above plot, the x-axis represents the population range of immigrants in intervals of 3412.9. The y-axis represents the number of countries that contributed to the aforementioned population.
Notice that the x-axis labels do not match the bin edges. This can be fixed by passing in an xticks keyword that contains the list of bin edges, as follows:
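A sketch of aligning the tick labels with the bins, again on placeholder values:

```python
import matplotlib as mpl
mpl.use("Agg")  # headless backend so this runs without a display; not needed in a notebook
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Hypothetical stand-in for the 2013 column.
values_2013 = pd.Series([0, 10, 25, 40, 55, 70, 85, 100])

count, bin_edges = np.histogram(values_2013)

# Passing the bin edges as xticks puts one tick label at every bin boundary.
ax = values_2013.plot(kind="hist", xticks=bin_edges)
plt.title("Histogram of immigration in 2013")
plt.ylabel("Number of countries")
plt.xlabel("Number of immigrants")
```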
We can also plot multiple histograms on the same plot. For example, let's try to answer the following questions using a histogram.
Question: What is the immigration distribution for Denmark, Norway, and Sweden for years 1980 - 2013?
Instead of plotting the population frequency distribution for the 3 countries, pandas plotted the population frequency distribution for the years.
This can be easily fixed by first transposing the dataset, and then plotting as shown below.
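A sketch of the fix on a toy dataframe. The name df_dns and all values are placeholders.

```python
import matplotlib as mpl
mpl.use("Agg")  # headless backend so this runs without a display; not needed in a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical data: countries as the index, string years as the columns.
df_dns = pd.DataFrame(
    {"1980": [272, 116, 281], "1981": [293, 77, 308], "1982": [299, 106, 222]},
    index=["Denmark", "Norway", "Sweden"],
)

# pandas draws one histogram per column, so the countries must be the columns.
df_t = df_dns.transpose()

ax = df_t.plot(kind="hist")
plt.title("Immigration from Denmark, Norway, and Sweden")
plt.ylabel("Number of years")
plt.xlabel("Number of immigrants")
```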
Let's make a few modifications to improve the impact and aesthetics of the previous plot:
increase the number of bins to 15 by passing in the bins parameter;
set transparency to 60% by passing in the alpha parameter;
label the x-axis by passing in the xlabel parameter;
change the colors of the plots by passing in the color parameter.
If we do not want the plots to overlap each other, we can stack them using the stacked parameter. Let's also adjust the min and max x-axis labels to remove the extra gap on the edges of the plot. We can pass a tuple (min, max) using the xlim parameter, as shown below.
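These options can be combined as in the sketch below. The data, the colors, and the padding of 10 around the first and last bin edges are assumptions chosen for illustration, and the x-axis is labelled with plt.xlabel() here.

```python
import matplotlib as mpl
mpl.use("Agg")  # headless backend so this runs without a display; not needed in a notebook
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Hypothetical transposed data: years as the index, one column per country.
df_t = pd.DataFrame(
    {
        "Denmark": [272, 293, 299, 106, 93],
        "Norway": [116, 77, 106, 51, 31],
        "Sweden": [281, 308, 222, 176, 128],
    },
    index=["1980", "1981", "1982", "1983", "1984"],
)

# Bin edges over all values, used to derive the x-axis limits.
count, bin_edges = np.histogram(df_t.to_numpy(), 15)
xmin = bin_edges[0] - 10    # small padding before the first bin
xmax = bin_edges[-1] + 10   # small padding after the last bin

ax = df_t.plot(
    kind="hist",
    bins=15,
    alpha=0.6,
    stacked=True,
    xlim=(xmin, xmax),
    color=["coral", "darkslateblue", "mediumseagreen"],
)
plt.xlabel("Number of immigrants")
plt.ylabel("Number of years")
```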
Question: Write code to display the immigration distribution for Greece, Albania, and Bulgaria for years 1980 - 2013. Use an overlapping plot with 15 bins and a transparency value of 0.35.
Bar Charts
A bar plot is a way of representing data where the length of the bars represents the magnitude/size of the feature/variable. Bar graphs usually represent numerical and categorical variables grouped in intervals.
To create a bar plot, we can pass one of two arguments via the kind parameter in plot():
kind='bar' creates a vertical bar plot
kind='barh' creates a horizontal bar plot
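A side-by-side sketch of the two kinds. The series values are placeholders, and the two-panel layout created with plt.subplots() is only used here to keep both plots on separate axes.

```python
import matplotlib as mpl
mpl.use("Agg")  # headless backend so this runs without a display; not needed in a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical yearly counts for one country.
iceland = pd.Series([17, 33, 10], index=[1980, 1981, 1982], name="Iceland")

fig, (ax_v, ax_h) = plt.subplots(1, 2, figsize=(10, 4))

iceland.plot(kind="bar", ax=ax_v)    # vertical bars: bar height encodes the value
iceland.plot(kind="barh", ax=ax_h)   # horizontal bars: bar width encodes the value

ax_v.set_title("kind='bar'")
ax_h.set_title("kind='barh'")
```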
Compare the number of Icelandic immigrants (country = 'Iceland') to Canada from year 1980 to 2013.
The bar plot above shows the total number of immigrants broken down by each year. We can clearly see the impact of the financial crisis; the number of immigrants to Canada started increasing rapidly after 2008.
Let's annotate this on the plot using the annotate method of the scripting layer or the pyplot interface. We will pass in the following parameters:
s: str, the text of annotation (this first argument is named text in Matplotlib 3.3 and later).
xy: Tuple specifying the (x,y) point to annotate (in this case, end point of arrow).
xytext: Tuple specifying the (x,y) point to place the text (in this case, start point of arrow).
xycoords: The coordinate system that xy is given in - 'data' uses the coordinate system of the object being annotated (default).
arrowprops: Takes a dictionary of properties to draw the arrow:
arrowstyle: Specifies the arrow style, '->' is standard arrow.
connectionstyle: Specifies the connection type. arc3 is a straight line.
color: Specifies color of arrow.
lw: Specifies the line width.
Read the Matplotlib documentation for more details on annotations: http://matplotlib.org/api/pyplot_api.html#matplotlib.pyplot.annotate.
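A sketch of an arrow annotation on a bar plot. The values, arrow coordinates, and color are placeholders. The annotation text is passed positionally, which sidesteps the s/text naming difference between Matplotlib versions.

```python
import matplotlib as mpl
mpl.use("Agg")  # headless backend so this runs without a display; not needed in a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical yearly counts for one country.
iceland = pd.Series([17, 33, 10, 39, 72], index=[2008, 2009, 2010, 2011, 2012], name="Iceland")

ax = iceland.plot(kind="bar")

# On a bar plot the x data coordinates are bar positions (0, 1, 2, ...), not the year labels.
arrow = plt.annotate(
    "",                  # no text, draw the arrow only
    xy=(4, 70),          # end point of the arrow
    xytext=(1, 20),      # start point of the arrow
    xycoords="data",     # interpret xy in the data coordinate system
    arrowprops=dict(arrowstyle="->", connectionstyle="arc3", color="blue", lw=2),
)
```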
Horizontal Bar Plot
Sometimes it is more practical to represent the data horizontally, especially if you need more room for labelling the bars. In horizontal bar graphs, the y-axis is used for labelling, and the length of bars on the x-axis corresponds to the magnitude of the variable being measured. As you will see, there is more room on the y-axis to label categorical variables.
Question: Using the scripting layer and the df_can dataset, create a horizontal bar plot showing the total number of immigrants to Canada from the top 15 countries, for the period 1980 - 2013. Label each country with the total immigrant count.
Step 2: Plot data:
Use kind='barh' to generate a bar chart with horizontal bars.
Make sure to choose a good size for the plot, to label your axes, and to give the plot a title.
Loop through the countries and annotate the immigrant population using the annotate function of the scripting interface.
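A sketch of the plotting and annotation loop, using three made-up totals instead of the real top 15. The country names, values, colors, and text alignment are all placeholders.

```python
import matplotlib as mpl
mpl.use("Agg")  # headless backend so this runs without a display; not needed in a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical totals, sorted ascending so the largest bar ends up at the top.
totals = pd.Series({"Country C": 1200, "Country B": 3400, "Country A": 5600})

ax = totals.plot(kind="barh", figsize=(8, 4), color="steelblue")
plt.xlabel("Number of immigrants")
plt.title("Total immigrants by country (illustrative values)")

# Bars sit at y positions 0, 1, 2, ...; place each label just inside the end of its bar.
for position, value in enumerate(totals):
    plt.annotate(
        f"{int(value):,}",        # thousands separator, e.g. 5,600
        xy=(value, position),
        ha="right",
        va="center",
        color="white",
    )
```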
Quick Practice:
Compare immigration from Japan and China
Create a visualization to show immigration for the bottom 5 countries
Compare immigration rate for Asia and S. Africa