suyashi29
GitHub Repository: suyashi29/python-su
Path: blob/master/Data Visualization using Python/4.2 Advanced Viz Using matplotlib and Seaborne Waffle Chart and Regression Plots.ipynb
Kernel: Python 3 (ipykernel)

Waffle Charts and Regression Plots

Objectives

  • Create Waffle charts

  • Create regression plots with Seaborn library

Exploring Datasets with pandas and Matplotlib

Toolkits: The course relies heavily on pandas and NumPy for data wrangling, analysis, and visualization. The primary plotting library we will explore in the course is Matplotlib.

Dataset: Immigration to Canada from 1980 to 2013 - International migration flows to and from selected countries - The 2015 revision, from the United Nations' website

The dataset contains annual data on the flows of international migrants as recorded by the countries of destination. The data presents both inflows and outflows according to the place of birth, citizenship or place of previous / next residence both for foreigners and nationals. In this lab, we will focus on the Canadian Immigration data.

Data Preparation

Import Primary Modules:

import numpy as np  # useful for scientific computing in Python
import pandas as pd  # primary data structure library
from PIL import Image  # converting images into arrays

can = pd.read_csv("canadian_im.csv")

Download the Canadian Immigration dataset and read it into a pandas dataframe.

Let's take a look at the first five items in our dataset

can.head()

Let's find out how many entries there are in our dataset

# print the dimensions of the dataframe
print(can.shape)
(195, 39)

Clean up data. We will make some modifications to the original dataset to make it easier to create our visualizations. Refer to the labs "Introduction to Matplotlib and Line Plots" and "Area Plots, Histograms, and Bar Plots" for a detailed description of this preprocessing.

# for the sake of consistency, make all column labels of type string
can.columns = list(map(str, can.columns))

# set the country name as index - useful for quickly looking up countries using the .loc method
can.set_index('Country', inplace=True)

# years that we will be using in this lesson - useful for plotting later on
years = list(map(str, range(1980, 2014)))

# add total column, summing only the year columns so that text columns are excluded
can['Total'] = can[years].sum(axis=1)

print('data dimensions:', can.shape)
data dimensions: (195, 38)
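To make the clean-up steps concrete, here is a minimal sketch on a tiny made-up table (the country names and counts below are hypothetical, not taken from the dataset). It shows how setting the index enables .loc lookups and how restricting the sum to the year columns keeps text columns out of the total.

```python
import pandas as pd

# Hypothetical miniature of the immigration table (values are made up)
toy = pd.DataFrame({
    "Country": ["A", "B"],
    "Continent": ["Europe", "Asia"],
    "1980": [10, 20],
    "1981": [30, 40],
})

toy.set_index("Country", inplace=True)      # country names become row labels
toy_years = ["1980", "1981"]                # the numeric year columns only
toy["Total"] = toy[toy_years].sum(axis=1)   # row-wise sum over the years

print(toy.loc["B", "Total"])  # 60
```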

Visualizing Data using Matplotlib

Import and setup matplotlib:

%matplotlib inline

import matplotlib as mpl
import matplotlib.pyplot as plt
import matplotlib.patches as mpatches  # needed for waffle charts

mpl.style.use('ggplot')  # optional: for ggplot-like style

# check for latest version of Matplotlib
print('Matplotlib version: ', mpl.__version__)  # >= 2.0.0
Matplotlib version: 3.5.1

Waffle Charts

A waffle chart is an interesting visualization that is normally created to display progress toward goals. It is commonly an effective option when you are trying to add interesting visualization features to a visual that consists mainly of cells, such as an Excel dashboard.

Let's revisit the previous case study about Denmark, Norway, and Sweden.

# let's create a new dataframe for these three countries
df_dsn = can.loc[['Denmark', 'Norway', 'Sweden'], :]

# let's take a look at our dataframe
df_dsn

Unfortunately, unlike in R, waffle charts are not built into any of the Python visualization libraries. Therefore, we will learn how to create them from scratch.

Step 1. The first step in creating a waffle chart is determining the proportion of each category with respect to the total.

# compute the proportion of each category with respect to the total
total_values = df_dsn['Total'].sum()
category_proportions = df_dsn['Total'] / total_values

# print out proportions
pd.DataFrame({"Category Proportion": category_proportions})

Step 2. The second step is defining the overall size of the waffle chart.

width = 40   # width of chart
height = 10  # height of chart

total_num_tiles = width * height  # total number of tiles

print(f'Total number of tiles is {total_num_tiles}.')
Total number of tiles is 400.

Step 3. The third step is using the proportion of each category to determine its respective number of tiles.

# compute the number of tiles for each category
tiles_per_category = (category_proportions * total_num_tiles).round().astype(int)

# print out number of tiles per category
pd.DataFrame({"Number of tiles": tiles_per_category})

Based on the calculated proportions, Denmark will occupy 129 tiles of the waffle chart, Norway will occupy 77 tiles, and Sweden will occupy 194 tiles.
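One caveat with plain rounding: the rounded tile counts are not guaranteed to add up to the total number of tiles (here they happen to, since 129 + 77 + 194 = 400). The sketch below uses the largest-remainder method, which is a different technique from the rounding used in this lab, to guarantee an exact total. The function name allocate_tiles and the sample values are illustrative assumptions.

```python
import numpy as np

def allocate_tiles(values, total_num_tiles):
    """Split total_num_tiles across categories in proportion to values.

    Uses the largest-remainder method so the counts always sum exactly
    to total_num_tiles (plain rounding can be off by a tile or two).
    """
    values = np.asarray(values, dtype=float)
    exact = values / values.sum() * total_num_tiles   # fractional tile shares
    tiles = np.floor(exact).astype(int)               # start from the floor
    leftover = total_num_tiles - tiles.sum()          # tiles still unassigned
    # hand the leftover tiles to the largest fractional remainders
    order = np.argsort(-(exact - tiles), kind="stable")
    tiles[order[:leftover]] += 1
    return tiles

# Three equal categories: round() would give 33 + 33 + 33 = 99 tiles
print(allocate_tiles([1, 1, 1], 100).tolist())   # [34, 33, 33]
```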

Step 4. The fourth step is creating a matrix that resembles the waffle chart and populating it.

# initialize the waffle chart as an empty matrix
waffle_chart = np.zeros((height, width), dtype=np.uint)

# define indices to loop through waffle chart
category_index = 0
tile_index = 0

# populate the waffle chart
for col in range(width):
    for row in range(height):
        tile_index += 1

        # if the number of tiles populated so far exceeds the tiles allocated to the categories seen so far...
        if tile_index > sum(tiles_per_category[0:category_index]):
            # ...proceed to the next category
            category_index += 1

        # set the class value to an integer, which increases with class
        waffle_chart[row, col] = category_index

print('Waffle chart populated!')
Waffle chart populated!
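The nested loop fills the grid column by column. As an alternative sketch (not the approach used in this lab), the same layout can be built without an explicit loop by repeating each category label and reshaping in column-major order. The helper name build_waffle is an assumption, and it expects tile counts that sum to height * width.

```python
import numpy as np

def build_waffle(tiles_per_category, height, width):
    """Return a (height, width) grid of category labels 1..n, filled column by column."""
    counts = np.asarray(tiles_per_category, dtype=int)
    if counts.sum() != height * width:
        raise ValueError("tile counts must sum to height * width")
    labels = np.repeat(np.arange(1, len(counts) + 1), counts)  # 1,1,...,2,2,...
    return labels.reshape((height, width), order="F")           # column-major fill

grid = build_waffle([3, 2, 1], height=2, width=3)
print(grid.tolist())   # [[1, 1, 2], [1, 2, 3]]
```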

Let's take a peek at what the matrix looks like.

waffle_chart
array([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3], [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3], [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3], [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3], [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3], [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3], [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3], [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3], [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3], [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3]], dtype=uint32)

As expected, the matrix consists of three categories and the total number of each category's instances matches the total number of tiles allocated to each category.
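This claim can be checked programmatically rather than by eye. The sketch below counts how often each label appears using np.unique; the small grid is a stand-in for waffle_chart, so the numbers are illustrative.

```python
import numpy as np

# Stand-in for waffle_chart: a small grid of category labels
sample_grid = np.array([[1, 1, 2],
                        [1, 2, 3]])

labels, counts = np.unique(sample_grid, return_counts=True)
tiles_found = {int(label): int(count) for label, count in zip(labels, counts)}
print(tiles_found)   # {1: 3, 2: 2, 3: 1}
```

Applied to the waffle_chart matrix above, the counts should come out as 129, 77, and 194.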

Step 5. Map the waffle chart matrix into a visual.

# instantiate a new figure object
fig = plt.figure()

# use matshow to display the waffle chart
colormap = plt.cm.coolwarm
plt.matshow(waffle_chart, cmap=colormap)
plt.colorbar()
plt.show()
<Figure size 432x288 with 0 Axes>
Image in a Jupyter notebook

Step 6. Prettify the chart.

# instantiate a new figure object
fig = plt.figure()

# use matshow to display the waffle chart
colormap = plt.cm.coolwarm
plt.matshow(waffle_chart, cmap=colormap)
plt.colorbar()

# get the axis
ax = plt.gca()

# set minor ticks
ax.set_xticks(np.arange(-.5, (width), 1), minor=True)
ax.set_yticks(np.arange(-.5, (height), 1), minor=True)

# add gridlines based on minor ticks
ax.grid(which='minor', color='w', linestyle='-', linewidth=2)

plt.xticks([])
plt.yticks([])
plt.show()
<Figure size 432x288 with 0 Axes>
Image in a Jupyter notebook
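Why the minor ticks start at -0.5: matshow centers each cell on integer coordinates, so the cell borders fall on half-integers. The sketch below only computes those border positions for a small hypothetical grid; the variable names are illustrative.

```python
import numpy as np

grid_width, grid_height = 4, 2   # hypothetical small grid

# cell centers are at 0, 1, 2, ...; borders sit half a cell to either side
x_borders = np.arange(-0.5, grid_width, 1)
y_borders = np.arange(-0.5, grid_height, 1)

print(x_borders.tolist())   # [-0.5, 0.5, 1.5, 2.5, 3.5]
print(y_borders.tolist())   # [-0.5, 0.5, 1.5]
```

A grid that is n cells wide therefore needs n + 1 border positions, which is what np.arange(-.5, width, 1) produces.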

Step 7. Create a legend and add it to the chart.

# instantiate a new figure object
fig = plt.figure()

# use matshow to display the waffle chart
colormap = plt.cm.coolwarm
plt.matshow(waffle_chart, cmap=colormap)
plt.colorbar()

# get the axis
ax = plt.gca()

# set minor ticks
ax.set_xticks(np.arange(-.5, (width), 1), minor=True)
ax.set_yticks(np.arange(-.5, (height), 1), minor=True)

# add gridlines based on minor ticks
ax.grid(which='minor', color='w', linestyle='-', linewidth=2)

plt.xticks([])
plt.yticks([])

# compute cumulative sum of individual categories to match color schemes between chart and legend
values_cumsum = np.cumsum(df_dsn['Total'])
total_values = values_cumsum.iloc[-1]

# create legend (use .iloc for positional access on the country-indexed Series)
legend_handles = []
for i, category in enumerate(df_dsn.index.values):
    label_str = category + ' (' + str(df_dsn['Total'].iloc[i]) + ')'
    color_val = colormap(float(values_cumsum.iloc[i]) / total_values)
    legend_handles.append(mpatches.Patch(color=color_val, label=label_str))

# add legend to chart
plt.legend(handles=legend_handles,
           loc='lower center',
           ncol=len(df_dsn.index.values),
           bbox_to_anchor=(0., -0.2, 0.95, .1)
          )
plt.show()
<Figure size 432x288 with 0 Axes>
Image in a Jupyter notebook

And there you go! What a good looking delicious waffle chart, don't you think?
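One thing to watch for: matshow scales the labels 1..n linearly onto the colormap, while the legend above picks its colors from cumulative proportions, so the legend shades can differ from the tile shades. Below is a minimal sketch of one way to get identical colors, assuming the labels are the integers 1..n as in this lab. The helper name category_colors is hypothetical.

```python
from matplotlib import cm
from matplotlib.colors import Normalize

def category_colors(num_categories, colormap):
    """Return the RGBA color matshow would use for each label 1..num_categories."""
    # same linear min-to-max scaling that matshow applies by default
    norm = Normalize(vmin=1, vmax=num_categories)
    return [colormap(norm(label)) for label in range(1, num_categories + 1)]

colors = category_colors(3, cm.coolwarm)
# labels 1, 2, 3 land on the start, middle, and end of the colormap
print(colors[0] == cm.coolwarm(0.0), colors[2] == cm.coolwarm(1.0))   # True True
```

These colors can then be passed to mpatches.Patch(color=...) when building the legend handles.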

Now it would be very inefficient to repeat these seven steps every time we wish to create a waffle chart. So let's combine all seven steps into one function called create_waffle_chart. This function takes the following parameters as input:

  1. categories: Unique categories or classes in dataframe.

  2. values: Values corresponding to categories or classes.

  3. height: Defined height of waffle chart.

  4. width: Defined width of waffle chart.

  5. colormap: Colormap class

  6. value_sign: In order to make our function more generalizable, we will add this parameter to address signs that could be associated with a value such as %, $, and so on. value_sign has a default value of empty string.

def create_waffle_chart(categories, values, height, width, colormap, value_sign=''):
    values = list(values)  # allow positional access whether values is a list or a pandas Series

    # compute the proportion of each category with respect to the total
    total_values = sum(values)
    category_proportions = [(float(value) / total_values) for value in values]

    # compute the total number of tiles
    total_num_tiles = width * height
    print('Total number of tiles is', total_num_tiles)

    # compute the number of tiles for each category
    tiles_per_category = [round(proportion * total_num_tiles) for proportion in category_proportions]

    # print out number of tiles per category
    for i, tiles in enumerate(tiles_per_category):
        print(categories[i] + ': ' + str(tiles))

    # initialize the waffle chart as an empty matrix
    waffle_chart = np.zeros((height, width))

    # define indices to loop through waffle chart
    category_index = 0
    tile_index = 0

    # populate the waffle chart
    for col in range(width):
        for row in range(height):
            tile_index += 1

            # if the number of tiles populated so far exceeds the tiles
            # allocated to the categories seen so far...
            if tile_index > sum(tiles_per_category[0:category_index]):
                # ...proceed to the next category
                category_index += 1

            # set the class value to an integer, which increases with class
            waffle_chart[row, col] = category_index

    # instantiate a new figure object
    fig = plt.figure()

    # use matshow to display the waffle chart with the colormap passed in
    plt.matshow(waffle_chart, cmap=colormap)
    plt.colorbar()

    # get the axis
    ax = plt.gca()

    # set minor ticks
    ax.set_xticks(np.arange(-.5, (width), 1), minor=True)
    ax.set_yticks(np.arange(-.5, (height), 1), minor=True)

    # add gridlines based on minor ticks
    ax.grid(which='minor', color='w', linestyle='-', linewidth=2)

    plt.xticks([])
    plt.yticks([])

    # compute cumulative sum of individual categories to match color schemes between chart and legend
    values_cumsum = np.cumsum(values)
    total_values = values_cumsum[-1]

    # create legend
    legend_handles = []
    for i, category in enumerate(categories):
        if value_sign == '%':
            label_str = category + ' (' + str(values[i]) + value_sign + ')'
        else:
            label_str = category + ' (' + value_sign + str(values[i]) + ')'

        color_val = colormap(float(values_cumsum[i]) / total_values)
        legend_handles.append(mpatches.Patch(color=color_val, label=label_str))

    # add legend to chart
    plt.legend(
        handles=legend_handles,
        loc='lower center',
        ncol=len(categories),
        bbox_to_anchor=(0., -0.2, 0.95, .1)
    )
    plt.show()

Now to create a waffle chart, all we have to do is call the function create_waffle_chart. Let's define the input parameters:

width = 40   # width of chart
height = 10  # height of chart

categories = df_dsn.index.values  # categories
values = df_dsn['Total']          # corresponding values of categories

colormap = plt.cm.coolwarm  # color map class

And now let's call our function to create a waffle chart.

create_waffle_chart(categories, values, height, width, colormap)
Total number of tiles is 400
Denmark: 129
Norway: 77
Sweden: 194
<Figure size 432x288 with 0 Axes>
Image in a Jupyter notebook
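The value_sign parameter only affects the legend labels: a percent sign is appended after the value, while any other sign (such as $) is placed before it. That rule can be isolated in a small helper. The name format_label is hypothetical and the values are made up for illustration.

```python
def format_label(category, value, value_sign=''):
    """Build a legend label such as 'Sweden (45%)' or 'Sweden ($45)'."""
    if value_sign == '%':
        return f"{category} ({value}{value_sign})"   # percent goes after the value
    return f"{category} ({value_sign}{value})"       # other signs go before it

print(format_label('Sweden', 45, '%'))   # Sweden (45%)
print(format_label('Sweden', 45, '$'))   # Sweden ($45)
print(format_label('Sweden', 45))        # Sweden (45)
```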

There seems to be a new Python package for generating waffle charts called PyWaffle, but it looks like the repository is still being built. But feel free to check it out and play with it.

Regression Plots

Seaborn is a Python visualization library based on matplotlib. It provides a high-level interface for drawing attractive statistical graphics. You can learn more about seaborn, and about its regression plots in particular, in the seaborn documentation.

In the lab "Pie Charts, Box Plots, Scatter Plots, and Bubble Plots", we learned how to create a scatter plot and then fit a regression line. It took ~20 lines of code to create the scatter plot along with the regression fit. In this final section, we will explore seaborn and see how efficiently it creates regression lines and fits.

Let's first install seaborn

# install seaborn
# !pip3 install seaborn

# import library
import seaborn as sns

print('Seaborn installed and imported!')
Seaborn installed and imported!

Create a new dataframe that stores the total number of landed immigrants to Canada per year from 1980 to 2013.

# we can use the sum() method to get the total population per year
df_tot = pd.DataFrame(can[years].sum(axis=0))

# change the years to type float (useful for regression later on)
df_tot.index = list(map(float, df_tot.index))

# reset the index to put it back in as a column in the df_tot dataframe
df_tot.reset_index(inplace=True)

# rename columns
df_tot.columns = ['year', 'total']

# view the final dataframe
df_tot.head()
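A minimal sketch of the same reshaping on a tiny made-up table may help (the numbers are hypothetical): summing down the rows gives one total per year column, and reset_index turns the year labels into an ordinary column.

```python
import pandas as pd

# Hypothetical two-country table with string year columns
mini = pd.DataFrame(
    {"1980": [10, 20], "1981": [30, 40]},
    index=["A", "B"],
)

mini_tot = pd.DataFrame(mini[["1980", "1981"]].sum(axis=0))   # one row per year
mini_tot.index = [float(year) for year in mini_tot.index]     # '1980' -> 1980.0
mini_tot.reset_index(inplace=True)                            # year becomes a column
mini_tot.columns = ["year", "total"]

print(mini_tot["year"].tolist())    # [1980.0, 1981.0]
print(mini_tot["total"].tolist())   # [30, 70]
```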

With seaborn, generating a regression plot is as simple as calling the regplot function.

sns.regplot(x='year', y='total', data=df_tot)
<AxesSubplot:xlabel='year', ylabel='total'>
Image in a Jupyter notebook
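regplot draws a scatter plot and overlays a fitted linear regression line (with a shaded confidence band by default). As a rough sketch of the line itself, a least-squares fit can be computed with np.polyfit. This is not seaborn's internal code, and the data points below are made up.

```python
import numpy as np

# Hypothetical data lying exactly on total = 2 * year + 1
year = np.array([0.0, 1.0, 2.0, 3.0])
total = 2.0 * year + 1.0

# least-squares straight line: returns the slope first, then the intercept
slope, intercept = np.polyfit(year, total, deg=1)
print(round(float(slope), 6), round(float(intercept), 6))   # 2.0 1.0
```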

This is not magic; it is seaborn! You can also customize the color of the scatter plot and regression line. Let's change the color to green.

sns.regplot(x='year', y='total', data=df_tot, color='green')
plt.show()
Image in a Jupyter notebook

You can always customize the marker shape, so instead of circular markers, let's use +.

ax = sns.regplot(x='year', y='total', data=df_tot, color='green', marker='+')
plt.show()
Image in a Jupyter notebook

Let's enlarge the plot a little so that it is easier on the eyes.

plt.figure(figsize=(15, 10))
sns.regplot(x='year', y='total', data=df_tot, color='green', marker='+')
plt.show()
Image in a Jupyter notebook

And let's increase the size of markers so they match the new size of the figure, and add a title and x- and y-labels.

plt.figure(figsize=(15, 10))
ax = sns.regplot(x='year', y='total', data=df_tot, color='green', marker='+', scatter_kws={'s': 200})

ax.set(xlabel='Year', ylabel='Total Immigration')  # add x- and y-labels
ax.set_title('Total Immigration to Canada from 1980 - 2013')  # add title
plt.show()
Image in a Jupyter notebook

And finally increase the font size of the tickmark labels, the title, and the x- and y-labels so they don't feel left out!

plt.figure(figsize=(15, 10))

sns.set(font_scale=1.5)

ax = sns.regplot(x='year', y='total', data=df_tot, color='green', marker='+', scatter_kws={'s': 200})
ax.set(xlabel='Year', ylabel='Total Immigration')
ax.set_title('Total Immigration to Canada from 1980 - 2013')
plt.show()
Image in a Jupyter notebook

Amazing! A complete scatter plot with a regression fit with 5 lines of code only. Isn't this really amazing?

If you are not a big fan of the purple background, you can easily change the style to a white plain background.

plt.figure(figsize=(15, 10))

sns.set(font_scale=1.5)
sns.set_style('ticks')  # change to a plain white background

ax = sns.regplot(x='year', y='total', data=df_tot, color='green', marker='+', scatter_kws={'s': 200})
ax.set(xlabel='Year', ylabel='Total Immigration')
ax.set_title('Total Immigration to Canada from 1980 - 2013')
plt.show()
Image in a Jupyter notebook

Or to a white background with gridlines.

plt.figure(figsize=(15, 10))

sns.set(font_scale=1.5)
sns.set_style('whitegrid')

ax = sns.regplot(x='year', y='total', data=df_tot, color='green', marker='+', scatter_kws={'s': 200})
ax.set(xlabel='Year', ylabel='Total Immigration')
ax.set_title('Total Immigration to Canada from 1980 - 2013')
plt.show()
Image in a Jupyter notebook

Question: Use seaborn to create a scatter plot with a regression line to visualize the total immigration from Denmark, Sweden, and Norway to Canada from 1980 to 2013.

# create df_countries dataframe
df_countries = can.loc[['Denmark', 'Norway', 'Sweden'], years].transpose()

# create df_total by summing across three countries for each year
df_total = pd.DataFrame(df_countries.sum(axis=1))

# reset index in place
df_total.reset_index(inplace=True)

# rename columns
df_total.columns = ['year', 'total']

# change column year from string to int to create scatter plot
df_total['year'] = df_total['year'].astype(int)

# define figure size
plt.figure(figsize=(15, 10))

# define background style and font size
sns.set(font_scale=1.5)
sns.set_style('whitegrid')

# generate plot and add title and axes labels
ax = sns.regplot(x='year', y='total', data=df_total, color='green', marker='+', scatter_kws={'s': 200})
ax.set(xlabel='Year', ylabel='Total Immigration')
ax.set_title('Total Immigration from Denmark, Sweden, and Norway to Canada from 1980 - 2013')
Text(0.5, 1.0, 'Total Immigration from Denmark, Sweden, and Norway to Canada from 1980 - 2013')
Image in a Jupyter notebook