Advanced Visualizations - Waffle Charts, Word Clouds, and Regression Plots
Exploring Datasets with pandas and Matplotlib
Toolkits: The course heavily relies on pandas and Numpy for data wrangling, analysis, and visualization. The primary plotting library we will explore in the course is Matplotlib.
Dataset: Immigration to Canada from 1980 to 2013 - International migration flows to and from selected countries - The 2015 revision from the United Nations' website
The dataset contains annual data on the flows of international migrants as recorded by the countries of destination. The data presents both inflows and outflows according to the place of birth, citizenship or place of previous / next residence both for foreigners and nationals. In this lab, we will focus on the Canadian Immigration data.
The first thing we'll do is import two key data analysis modules: pandas and numpy. We will also import the image module to convert images into arrays.
Let's download and import our primary Canadian Immigration dataset using the pandas read_csv() method.
The file was originally downloaded from 'https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMDeveloperSkillsNetwork-DV0101EN-SkillsNetwork/Data Files/Canada.xlsx', and then prepared in the previous notebook.
Set the country name as the index, which is useful for quickly looking up countries with the .loc method.
Make a list of the years from 1980 to 2013.
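The preparation steps above can be sketched as follows. The original code cells are not reproduced here, so this is only an illustration: the tiny DataFrame, its placeholder country names, and its yearly counts are all hypothetical stand-ins for the prepared Canada file.

```python
import pandas as pd

# Years 1980..2013 as strings (range() excludes its upper bound, 2014)
years = list(map(str, range(1980, 2014)))

# Hypothetical stand-in for the prepared dataset: two placeholder countries
# with invented yearly counts. The real lab loads the Canada file instead.
rows = {
    "Country": ["Country A", "Country B"],
    **{year: [10 + i, 20 + 2 * i] for i, year in enumerate(years)},
}
df_can = pd.DataFrame(rows)

# Use the country name as the index so rows can be looked up with .loc
df_can.set_index("Country", inplace=True)

# Total per country over all years
df_can["Total"] = df_can[years].sum(axis=1)

print(df_can.loc["Country A", ["1980", "2013", "Total"]])
```

Keeping the year labels as strings means the same list can be used directly to select the year columns, as in df_can[years].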
Import and set up Matplotlib:
Let's revisit the previous case study about Denmark, Norway, and Sweden.
Unfortunately, unlike in R, waffle charts are not built into any of the Python visualization libraries. Therefore, we will learn how to create them from scratch.
Step 1. The first step in creating a waffle chart is determining the proportion of each category with respect to the total.
Step 2. The second step is defining the overall size of the waffle chart.
Step 3. The third step is using the proportion of each category to determine its respective number of tiles.
Based on the calculated proportions, Denmark will occupy 129 tiles of the waffle chart, Norway will occupy 77 tiles, and Sweden will occupy 194 tiles.
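Steps 1 to 3 can be sketched in a few lines. The per-country totals and the 40 x 10 grid below are assumptions chosen for illustration so that the result reproduces the tile counts quoted above (129 + 77 + 194 = 400 tiles); they are not taken from the text.

```python
# Assumed totals per country (illustrative values, see the note above)
totals = {"Denmark": 3901, "Norway": 2327, "Sweden": 5866}

# Step 1: proportion of each category with respect to the total
total_values = sum(totals.values())
proportions = {country: value / total_values for country, value in totals.items()}

# Step 2: overall size of the waffle chart (assumed 40 columns x 10 rows)
width = 40
height = 10
total_num_tiles = width * height

# Step 3: number of tiles per category, rounded to the nearest whole tile
tiles_per_category = {
    country: round(proportion * total_num_tiles)
    for country, proportion in proportions.items()
}

for country, tiles in tiles_per_category.items():
    print(f"{country}: {tiles}")
```

Rounded counts are not guaranteed to add up to the grid size for every dataset, so it is worth checking the sum before building the matrix.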
Step 4. The fourth step is creating a matrix that resembles the waffle chart and populating it.
Let's take a peek at what the matrix looks like.
As expected, the matrix consists of three categories and the total number of each category's instances matches the total number of tiles allocated to each category.
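One possible way to build and check such a matrix is shown below. It uses a vectorized NumPy construction rather than an explicit loop, labels categories 0, 1, 2, and assumes a 40 x 10 grid filled column by column; all of these are choices made for this sketch.

```python
import numpy as np

width, height = 40, 10               # assumed grid size (400 tiles)
tiles_per_category = [129, 77, 194]  # Denmark, Norway, Sweden

# One label per tile: 129 zeros, then 77 ones, then 194 twos
labels = np.repeat(np.arange(len(tiles_per_category)), tiles_per_category)

# Reshape so that consecutive labels run down each column, left to right
waffle_chart = labels.reshape(width, height).T  # shape: (height, width)

# Verify that each category received its allotted number of tiles
counts = np.bincount(waffle_chart.ravel())
print(waffle_chart.shape, counts.tolist())
```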
Step 5. Map the waffle chart matrix into a visual.
Step 6. Prettify the chart.
Step 7. Create a legend and add it to the chart.
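Steps 5 to 7 might look like the following sketch. The colormap (coolwarm), figure size, and legend placement are assumptions, and the Agg backend line is only there so the snippet also runs without a display; it is not needed inside a notebook.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless run; omit inside a notebook
import matplotlib.patches as mpatches
import matplotlib.pyplot as plt
import numpy as np

categories = ["Denmark", "Norway", "Sweden"]
tiles_per_category = [129, 77, 194]
width, height = 40, 10  # assumed grid size
waffle_chart = (
    np.repeat(np.arange(len(categories)), tiles_per_category)
    .reshape(width, height)
    .T
)

# Step 5: map the matrix into a visual
colormap = plt.cm.coolwarm
fig, ax = plt.subplots(figsize=(12, 4))
ax.matshow(waffle_chart, cmap=colormap)

# Step 6: prettify with white grid lines on the tile boundaries, no ticks
ax.set_xticks(np.arange(-0.5, width, 1), minor=True)
ax.set_yticks(np.arange(-0.5, height, 1), minor=True)
ax.grid(which="minor", color="w", linestyle="-", linewidth=2)
ax.set_xticks([])
ax.set_yticks([])
ax.tick_params(which="both", length=0)

# Step 7: legend; matshow scales labels 0..n-1 onto the colormap range 0..1
n = len(categories)
handles = [
    mpatches.Patch(color=colormap(i / (n - 1)), label=f"{name} ({count})")
    for i, (name, count) in enumerate(zip(categories, tiles_per_category))
]
legend = ax.legend(handles=handles, loc="lower center", ncol=n,
                   bbox_to_anchor=(0.5, -0.3))
```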
And there you go! What a good-looking, delicious waffle chart, don't you think?
Now it would be very inefficient to repeat these seven steps every time we wish to create a waffle chart. So let's combine all seven steps into one function called create_waffle_chart. This function takes the following parameters as input:
categories: Unique categories or classes in dataframe.
values: Values corresponding to categories or classes.
height: Defined height of waffle chart.
width: Defined width of waffle chart.
colormap: Colormap class
value_sign: In order to make our function more generalizable, we will add this parameter to address signs that could be associated with a value such as %, $, and so on. value_sign has a default value of empty string.
Now to create a waffle chart, all we have to do is call the function create_waffle_chart. Let's define the input parameters:
And now let's call our function to create a waffle chart.
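A function with the parameters listed above could be sketched as follows. This is not necessarily the lab's implementation: the return value, the rounding correction, the figure size, and the input values in the example call are all assumptions made for illustration.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless run; omit inside a notebook
import matplotlib.patches as mpatches
import matplotlib.pyplot as plt
import numpy as np


def create_waffle_chart(categories, values, height, width, colormap, value_sign=""):
    """Draw a waffle chart and return (fig, ax, tiles_per_category)."""
    values = list(values)
    total_values = sum(values)
    total_tiles = width * height

    # Steps 1-3: proportions -> number of tiles per category
    tiles = [round(v / total_values * total_tiles) for v in values]
    # Rounding can leave the counts a tile or two off the grid size. Giving
    # the difference to the largest category is a choice made for this sketch.
    tiles[tiles.index(max(tiles))] += total_tiles - sum(tiles)

    # Step 4: matrix of category indices, filled column by column
    waffle = np.repeat(np.arange(len(tiles)), tiles).reshape(width, height).T

    # Step 5: map the matrix into a visual
    n = len(tiles)
    vmax = max(n - 1, 1)  # avoids division by zero for a single category
    fig, ax = plt.subplots(figsize=(12, 4))
    ax.matshow(waffle, cmap=colormap, vmin=0, vmax=vmax)

    # Step 6: white grid lines between tiles, no ticks
    ax.set_xticks(np.arange(-0.5, width, 1), minor=True)
    ax.set_yticks(np.arange(-0.5, height, 1), minor=True)
    ax.grid(which="minor", color="w", linestyle="-", linewidth=2)
    ax.set_xticks([])
    ax.set_yticks([])
    ax.tick_params(which="both", length=0)

    # Step 7: legend; a "%" sign goes after the value, other signs before it
    handles = []
    for i, (category, value) in enumerate(zip(categories, values)):
        if value_sign == "%":
            label = f"{category} ({value}{value_sign})"
        else:
            label = f"{category} ({value_sign}{value})"
        handles.append(mpatches.Patch(color=colormap(i / vmax), label=label))
    ax.legend(handles=handles, loc="lower center", ncol=n,
              bbox_to_anchor=(0.5, -0.3))
    return fig, ax, tiles


# Example call with assumed totals for the three countries
fig, ax, tiles = create_waffle_chart(
    categories=["Denmark", "Norway", "Sweden"],
    values=[3901, 2327, 5866],
    height=10,
    width=40,
    colormap=plt.cm.coolwarm,
)
print(tiles)
```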
There seems to be a new Python package for generating waffle charts called PyWaffle, but it looks like the repository is still being built. Feel free to check it out and play with it.
Luckily, a Python package already exists for generating word clouds. The package, called word_cloud, was developed by Andreas Mueller. You can learn more about the package by following this link.
Let's use this package to learn how to generate a word cloud for a given text document.
First, let's install the package.
Word clouds are commonly used to perform high-level analysis and visualization of text data. Accordingly, let's digress from the immigration dataset and work with an example that involves analyzing text data. Let's try to analyze a short novel written by Lewis Carroll titled Alice's Adventures in Wonderland. Let's go ahead and download a .txt file of the novel.
Next, let's use the stopwords that we imported from word_cloud. We use the function set to remove any redundant stopwords.
Create a word cloud object and generate a word cloud. For simplicity, let's generate a word cloud using only the first 2000 words in the novel.
Awesome! Now that the word cloud is created, let's visualize it.
Interesting! So in the first 2000 words in the novel, the most common words are Alice, said, little, Queen, and so on. Let's resize the cloud so that we can see the less frequent words a little better.
Much better! However, said isn't really an informative word. So let's add it to our stopwords and re-generate the cloud.
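The word cloud steps so far can be sketched like this. It assumes the word_cloud package is installed (it is imported as wordcloud). The short text is an invented stand-in for the first 2000 words of the novel, and collocations=False is an addition in this sketch so that only single words are counted.

```python
from wordcloud import WordCloud, STOPWORDS

# Invented stand-in text; the lab uses the first 2000 words of the novel
text = (
    "Alice said nothing. The Queen said little. Alice thought the little "
    "Rabbit said too much, said Alice, and the Queen said so too."
)

# set() removes any duplicate stopwords; "said" is added as an extra one
stopwords = set(STOPWORDS)
stopwords.add("said")

alice_wc = WordCloud(
    background_color="white",
    max_words=2000,
    stopwords=stopwords,
    collocations=False,  # assumption for this sketch: count single words only
)
alice_wc.generate(text)

# words_ maps each kept word to its frequency relative to the top word
print(sorted(alice_wc.words_, key=alice_wc.words_.get, reverse=True))
```

To display the result, the generated object can be passed to plt.imshow(alice_wc, interpolation="bilinear"), followed by plt.axis("off"). Resizing the cloud is then a matter of creating a larger Matplotlib figure first.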
Excellent! This looks really interesting! Another cool thing you can implement with the word_cloud package is superimposing the words onto a mask of any shape. Let's use a mask of Alice and her rabbit. We already created the mask for you, so let's go ahead and download it and call it alice_mask.png.
Let's take a look at what the mask looks like.
Shaping the word cloud according to the mask is straightforward using the word_cloud package. For simplicity, we will continue using the first 2000 words in the novel.
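A minimal sketch of masking follows. A synthetic circular mask built with NumPy stands in for alice_mask.png (with the real file, the array would come from np.array(Image.open(...))), and the text is an invented stand-in for the novel. In word_cloud masks, pure white entries (255) are treated as masked out.

```python
import numpy as np
from wordcloud import WordCloud, STOPWORDS

# Invented stand-in text; the lab uses the first 2000 words of the novel
text = (
    "Alice said nothing. The Queen said little. Alice thought the little "
    "Rabbit said too much, said Alice, and the Queen said so too."
)

# Synthetic mask: words may only be drawn inside a circle of radius 130.
# White (255) marks the blocked area, 0 marks the drawable area.
size = 300
rows, cols = np.ogrid[:size, :size]
outside = (rows - size // 2) ** 2 + (cols - size // 2) ** 2 > 130 ** 2
mask = (outside * 255).astype(np.uint8)

masked_wc = WordCloud(
    background_color="white",
    max_words=2000,
    mask=mask,
    stopwords=set(STOPWORDS),
)
masked_wc.generate(text)

# The rendered image takes its size from the mask
image = masked_wc.to_array()
print(image.shape)
```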
Really impressive!
Unfortunately, our immigration data does not have any text data, but where there is a will there is a way. Let's generate sample text data from our immigration dataset, say text data of 90 words.
Let's recall what our data looks like.
And what was the total immigration from 1980 to 2013?
Using countries with single-word names, let's duplicate each country's name based on how much they contribute to the total immigration.
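The idea of repeating each single-word country name in proportion to its share of total immigration can be sketched as below. The country names and totals are hypothetical placeholders, not figures from the dataset, and integer division is used so that fractional repeats are always rounded down.

```python
# Hypothetical totals per country (placeholder names and invented numbers)
country_totals = {
    "Alphaland": 500,
    "Betaland": 300,
    "Gammaland": 150,
    "Delta Republic": 50,  # multi-word name, skipped below
}

total_immigration = sum(country_totals.values())
max_words = 90  # size of the sample text

word_string = ""
for country, total in country_totals.items():
    # Keep only single-word names so each name stays one token in the cloud
    if " " not in country:
        repeat_num_times = total * max_words // total_immigration
        word_string += (country + " ") * repeat_num_times

words = word_string.split()
print(len(words), words[:3])
```

Because multi-word names are skipped and repeat counts are rounded down, the generated text can end up with fewer than max_words words.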
We are not dealing with any stopwords here, so there is no need to pass them when creating the word cloud.
According to the above word cloud, it looks like the majority of the people who immigrated came from one of the 15 countries displayed by the word cloud. One cool visual that you could build is a word cloud that uses the map of Canada as a mask, so that the words are superimposed on the map. That would be an interesting visual to build!
In the lab Pie Charts, Box Plots, Scatter Plots, and Bubble Plots, we learned how to create a scatter plot and then fit a regression line. It took ~20 lines of code to create the scatter plot along with the regression fit. In this final section, we will explore seaborn and see how efficient it is to create regression lines and fits using this library!
Let's first install seaborn.
Create a new dataframe that stores the total number of landed immigrants to Canada per year from 1980 to 2013.
With seaborn, generating a regression plot is as simple as calling the regplot function.
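A sketch of building the yearly totals and calling regplot is given below. The small df_can with placeholder countries and invented counts is a hypothetical stand-in for the real dataset, and the column names year and total are assumptions.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless run; omit inside a notebook
import pandas as pd
import seaborn as sns

years = list(map(str, range(1980, 2014)))

# Hypothetical stand-in for the immigration data (invented counts)
df_can = pd.DataFrame(
    {year: [100 + 3 * i, 50 + 2 * i] for i, year in enumerate(years)},
    index=["Country A", "Country B"],
)

# Sum over all countries to get one total per year
df_tot = pd.DataFrame(df_can[years].sum(axis=0))

# The year labels are strings; regplot needs numeric x values
df_tot.index = [int(year) for year in df_tot.index]

# Turn the index into a regular column and name both columns
df_tot.reset_index(inplace=True)
df_tot.columns = ["year", "total"]

# Scatter plot plus fitted regression line in a single call
ax = sns.regplot(x="year", y="total", data=df_tot)
```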
This is not magic; it is seaborn! You can also customize the color of the scatter plot and regression line. Let's change the color to green.
You can always customize the marker shape, so instead of circular markers, let's use +.
Let's blow up the plot a little so that it is easier on the eye.
And let's increase the size of markers so they match the new size of the figure, and add a title and x- and y-labels.
And finally increase the font size of the tickmark labels, the title, and the x- and y-labels so they don't feel left out!
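The customizations described above (color, marker, figure size, marker size, title, axis labels, and font scale) can be combined as in this sketch. The yearly totals are invented, and the figure size, marker size, font scale, and label strings are assumptions.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless run; omit inside a notebook
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Hypothetical yearly totals (invented, perfectly linear for simplicity)
df_tot = pd.DataFrame({"year": range(1980, 2014)})
df_tot["total"] = 1000 + 50 * (df_tot["year"] - 1980)

# Larger fonts for tick labels, title, and axis labels
sns.set(font_scale=1.5)

# A bigger figure, green "+" markers, and larger markers to match
fig, ax = plt.subplots(figsize=(15, 10))
sns.regplot(
    x="year",
    y="total",
    data=df_tot,
    color="green",
    marker="+",
    scatter_kws={"s": 200},  # marker size, passed through to the scatter call
    ax=ax,
)

ax.set(xlabel="Year", ylabel="Total Immigration")
ax.set_title("Total Immigration to Canada from 1980 - 2013")
```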
Amazing! A complete scatter plot with a regression fit in only 5 lines of code. Isn't this really amazing?
If you are not a big fan of the purple background, you can easily change the style to a white plain background.
Or to a white background with gridlines.
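Switching the background can be sketched with seaborn's set_style. This assumes the "ticks" style for the plain white background and "whitegrid" for the white background with gridlines. The helper name regplot_with_style and the data are hypothetical.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless run; omit inside a notebook
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Hypothetical yearly totals (invented values)
df_tot = pd.DataFrame({"year": range(1980, 2014)})
df_tot["total"] = 1000 + 50 * (df_tot["year"] - 1980)


def regplot_with_style(style):
    """Hypothetical helper: apply a seaborn style, then draw the plot."""
    sns.set_style(style)  # must be set before the axes are created
    fig, ax = plt.subplots(figsize=(15, 10))
    sns.regplot(x="year", y="total", data=df_tot, color="green",
                marker="+", scatter_kws={"s": 200}, ax=ax)
    ax.set(xlabel="Year", ylabel="Total Immigration")
    ax.set_title("Total Immigration to Canada from 1980 - 2013")
    return ax


ax_plain = regplot_with_style("ticks")      # plain white background
ax_grid = regplot_with_style("whitegrid")   # white background with gridlines
```

The style only affects axes created after set_style is called, which is why the helper sets it before creating the figure.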
Question: Use seaborn to create a scatter plot with a regression line to visualize the total immigration from Denmark, Sweden, and Norway to Canada from 1980 to 2013.