Book a Demo!
CoCalc Logo Icon
StoreFeaturesDocsShareSupportNewsAboutPoliciesSign UpSign In
YStrano
GitHub Repository: YStrano/DataScience_GA
Path: blob/master/april_18/lessons/lesson-05/code/Data Visualization Lab.ipynb
1904 views
Kernel: Python 2

Famous Datasets

There are a number of datasets that any data scientist will be familiar with. We're going to use several today to practice data visualization.

Seaborn also includes some additional datasets.

import os import matplotlib.pyplot as plt import numpy as np import pandas as pd from scipy import stats import seaborn as sns # Make plots larger plt.rcParams['figure.figsize'] = (10, 6)

First let's try a some function plotting.

# Plot two normal distributions domain = np.arange(-20, 20, 0.1) values = stats.norm(3, 5).pdf(domain) plt.plot(domain, values, color='r', linewidth=2) plt.fill_between(domain, 0, values, color='r', alpha=0.3) values = stats.norm(4, 2).pdf(domain) plt.plot(domain, values, color='b', linewidth=2) plt.ylabel("Probability") plt.title("Two Normal Distributions") plt.show()

Read the lines in the previous example closely and make sure you understand what each line is doing. It's ok if you don't understand all the arguments, you'll pick those up as we go. Remember that you can also shift-tab inside a function to see all the argument options.

If you want your plots to pop out so you can resize them, use %matplotlib to undo the effect of %matplotlib inline

%matplotlib # Plot two normal distributions domain = np.arange(-20, 20, 0.1) values = stats.norm(3, 5).pdf(domain) plt.plot(domain, values, color='r', linewidth=2) plt.fill_between(domain, 0, values, color='r', alpha=0.3) values = stats.norm(4, 2).pdf(domain) plt.plot(domain, values, color='b', linewidth=2) plt.ylabel("Probability") plt.title("Two Normal Distributions") plt.show()
Using matplotlib backend: Qt4Agg
%matplotlib # Load the Iris Dataset df = pd.read_csv(os.path.join("datasets", "iris.data"), sep=',') df.head()
Using matplotlib backend: Qt4Agg
  • Which columns are categorical, which are continuous?

  • Let's use seaborn's pairplot to get a quick look at the data

# This special jupyter command causes plots to render in the notebook %matplotlib inline sns.pairplot(df)
<seaborn.axisgrid.PairGrid at 0x7f21032b2250>
Image in a Jupyter notebook
# Seaborn can also color the data by category: sns.pairplot(df, hue="species")
<seaborn.axisgrid.PairGrid at 0x7f21029b9d50>
Image in a Jupyter notebook

Matplotlib

Seaborn makes nice plots but offers less control over the results versus matplotlib (on which seaborn is based). Let's look at the following example.

# Make a scatter plot plt.scatter(df["petal_length"], df["sepal_length"]) plt.xlabel("Petal Length") plt.ylabel("Sepal Length") plt.ylim(3, 9) plt.title("Iris Data Set") plt.show()
Image in a Jupyter notebook

Note that our axes are more nicely labeled when we manually set the names. Matplotlib has a ton of customizability: you can change point shapes and sizes, colors, axes ranges, font sizes, and just about anything else.

We can also invoke matplotlib via pandas directly from the data frame.

df.plot.scatter("petal_length", "sepal_length")
<matplotlib.axes._subplots.AxesSubplot at 0x7f20fbac0c10>
Image in a Jupyter notebook

Exercise 1

We can make histograms in several ways. Make a histogram of "Sepal Width" from the Iris data set:

  • Using matplotlib's plt.hist

  • Using pandas df.plot.hist

  • Using seaborn's distplot

Seaborn automatically includes a curve fit, which is sometimes not wanted. Look up the keyword argument to turn off the curve fit. Also checkout this example from wikipedia.

Categorical Data

Seaborn has some nice functions to plot categorical data

sns.boxplot(x="species", y="sepal_length", data=df)

Exercise 2

Read the seaborn page on categorical data above and make the following plots:

  • sepal_width by category with a boxplot and a swarmplot

  • petal_length by category with a violinplot and a swarmplot

Time Series plots

Matplotlib and Seaborn can make some nice plots associated to time series data. For example, we can make plots of running. The following data contains the monthly price of the ETF VTI (a stock market index fund) over time

df = pd.read_csv(os.path.join("datasets", "vti.csv")) df.sort_values(by="Date", inplace=True) df["Date"] = pd.to_datetime(df["Date"], format='%Y-%m-%d') df.head()
plt.plot(df["Date"], df["Open"], label="Open") plt.plot(df["Date"], df["Close"], label="Close") plt.xlabel("Date") plt.ylabel("Price") plt.title("VTI Monthly Prices") plt.legend() plt.show()
plt.plot(df["Date"], df["Open"] - df["Close"], label="Close-Open") plt.plot(df["Date"], df["High"] - df["Low"], label="High-Low") plt.xlabel("Date") plt.ylabel("Price") plt.title("VTI Monthly Prices") plt.show()

Exercise

Make a plot that is composed of two plots, vertically stacked of:

  • The closing price

  • The volume

You can do this with matplotlib's gridspec.

# Fill in the details import matplotlib.gridspec as gridspec gs = gridspec.GridSpec(2, 1) # rows and columns ax1 = plt.subplot(gs[0, 0]) ax2 = plt.subplot(gs[1, 0]) ax1.plot() ax2.plot()

You can also use multiple y-axes as follows:

# Fill in the details, see http://matplotlib.org/examples/api/two_scales.html fig, ax1 = plt.subplots() ax1.plot() ax1.set_ylabel("Closing Price") ax2 = ax1.twinx() ax2.plot() ax2.set_ylabel('Volume') ax2.set_xlabel('Date') plt.show()

We can also easily make smoothed curves by computing means over moving windows.

%matplotlib inline rolling_mean = df["Open"].rolling(window=10).mean() plt.plot(range(len(rolling_mean)), rolling_mean) plt.title("Smoothed VTI Price Data") plt.show()

Compare to the visualizations here. You can always put more work into a visualization's aesthetics, so focus on accuracy and proper labelling at first.

Error bars and filled plots

Often we want to indicate that our data is noisy or contains measurement error. Let's construct a dataset.

# Check: do you understand this code? import numpy as np from scipy import stats import random data = [] for i in range(50): m = random.randint(5 + i, 15 + i) s = random.randint(4, 8) dist = stats.norm(m, s) draws = dist.rvs(30) data.append([np.mean(draws), np.std(draws)]) df = pd.DataFrame(data, columns=["Mean", "Std"]) df.head()
# Now we can plot with error bars plt.errorbar(range(len(df)), df["Mean"], yerr=df["Std"]) plt.title("Error Bar Example")
# Confidence interval: 68% plt.errorbar(range(len(df)), df["Mean"]) lower = df["Mean"] - df["Std"] upper = df["Mean"] + df["Std"] plt.fill_between(range(len(df)), lower, upper, alpha=0.5) plt.title("CI Example")

Exercise

  • Modify the previous example to a 95% confidence interval (two standard deviations).

  • Try to make a similar plot with the Mauna Loa atmospheric carbon data set "co2_mm_mlo.txt"

import pandas as pd columns = ["year", "month", "decimal_date", "average", "iffrdcgfffffffffffffffffffffnterpolated", "trend", "days"] # df = pd.read_csv(os.path.join("datasets", "co2_mm_mlo.txt"), comment="#", # delim_whitespace=True, names=columns) df = pd.read_csv("datasets/co2_mm_mlo.txt", comment="#", delim_whitespace=True, names=columns) df.head()

Exercises

For each of the remaining data sets:

Work through the following exercises:

  • Make a pairplot on a subset of four categories (if possible). Use the vars=["column1", "columnb", ..] to prevent seaborn from making too many plots

  • Pick two continuous variables and make a scatter plot with matplotlib, a density plot with seaborn, and a joint plot with seaborn

  • If there are any categorical variables, make boxplots and violin plots for each of the categorical variables

  • Make at least one plot that has dual-axes or two stacked plots

Feel free to try to make any other plots that might seem interesting! If so please share with the class.

Exercises

Pick one of the datasets available here, such as the exoplanets dataset planets.csv or the diet and exercise data set exercise.csv (or another). Practice the plots you learned above and try to make an awesome plot.

If you need some ideas on different types of plots, checkout:

Bokeh

Bokeh is another visualization library. There are many example notebooks -- pick one and work through it.