Famous Datasets
There are a number of datasets that any data scientist will be familiar with. We're going to use several today to practice data visualization.
Iris Dataset: This data set was made famous by the statistician R. A. Fisher in 1936, using measurements collected by the botanist Edgar Anderson. First read about the data set and its features.
Abalone Dataset: Various measurements of a type of sea snail
Boston Housing: Housing prices along with various supplemental data, such as local crime rates
Seaborn also includes some additional datasets.
First let's try some function plotting.
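A minimal sketch of function plotting, assuming NumPy and matplotlib are installed. The choice of sin and cos, the sampling density, and the labels are illustrative assumptions.

```python
import numpy as np
import matplotlib.pyplot as plt

# Sample the interval [0, 2*pi] at 200 evenly spaced points.
x = np.linspace(0, 2 * np.pi, 200)

# Create one figure containing a single set of axes.
fig, ax = plt.subplots(figsize=(6, 4))

# Draw each function as a line; the label is what the legend displays.
ax.plot(x, np.sin(x), label="sin(x)")
ax.plot(x, np.cos(x), linestyle="--", label="cos(x)")

ax.set_xlabel("x")
ax.set_ylabel("f(x)")
ax.set_title("Two functions on one set of axes")
ax.legend()
```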
Read the lines in the previous example closely and make sure you understand what each line is doing. It's OK if you don't understand all the arguments; you'll pick those up as we go. Remember that you can also press Shift-Tab inside a function call to see all the argument options.
If you want your plots to pop out so you can resize them, use %matplotlib to undo the effect of %matplotlib inline.
Which columns are categorical, which are continuous?
Let's use seaborn's pairplot to get a quick look at the data
Matplotlib
Seaborn makes nice plots but offers less control over the results than matplotlib (on which seaborn is built). Let's look at the following example.
Note that our axes are more nicely labeled when we manually set the names. Matplotlib has a ton of customizability: you can change point shapes and sizes, colors, axes ranges, font sizes, and just about anything else.
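To illustrate a few of those knobs, here is a hedged sketch of a scatter plot with manually set labels, marker shape, point size, color, and axis range. The values are synthetic stand-ins rather than real Iris measurements, and the specific styling choices are arbitrary.

```python
import numpy as np
import matplotlib.pyplot as plt

# Synthetic stand-in values (NOT real Iris measurements).
rng = np.random.default_rng(1)
sepal_length = rng.normal(5.8, 0.8, 50)
petal_length = rng.normal(3.8, 1.7, 50)

fig, ax = plt.subplots(figsize=(6, 4))

# marker sets the point shape, s the point area, c the color,
# and alpha the transparency.
ax.scatter(sepal_length, petal_length, marker="^", s=40,
           c="tab:green", alpha=0.7)

# Axis names, font sizes, and ranges are all set by hand.
ax.set_xlabel("Sepal length (cm)", fontsize=12)
ax.set_ylabel("Petal length (cm)", fontsize=12)
ax.set_xlim(3, 9)
ax.set_title("Manually labeled scatter plot")
```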
We can also invoke matplotlib via pandas directly from the data frame.
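A small sketch of the pandas route, assuming pandas and matplotlib are installed. The data frame holds synthetic stand-in values; the column names are assumptions modeled on the Iris data.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in data frame (NOT real Iris measurements).
rng = np.random.default_rng(2)
df = pd.DataFrame({
    "sepal_length": rng.normal(5.8, 0.8, 50),
    "sepal_width": rng.normal(3.0, 0.4, 50),
})

# DataFrame.plot.scatter draws with matplotlib and returns the Axes,
# so the usual matplotlib customization still applies afterwards.
ax = df.plot.scatter(x="sepal_length", y="sepal_width",
                     title="Scatter via pandas")
ax.set_xlabel("Sepal length (cm)")
```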
Exercise 1
We can make histograms in several ways. Make a histogram of "Sepal Width" from the Iris data set:
Using matplotlib's plt.hist
Using pandas' df.plot.hist
Using seaborn's distplot
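Seaborn's distplot automatically overlays a fitted density curve (a kernel density estimate), which is sometimes not wanted. Look up the keyword argument that turns it off. Note that distplot is deprecated as of seaborn 0.11 in favor of histplot and displot. Also check out this example from Wikipedia.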
Categorical Data
Seaborn has some nice functions to plot categorical data
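One such function is stripplot, which draws every observation as a point within its category. The sketch below assumes seaborn is installed and uses a synthetic stand-in data frame (not real Iris measurements) with assumed column names.

```python
import numpy as np
import pandas as pd
import seaborn as sns

# Synthetic stand-in data: one categorical and one continuous column.
rng = np.random.default_rng(3)
df = pd.DataFrame({
    "species": np.repeat(["setosa", "versicolor", "virginica"], 20),
    "petal_width": np.concatenate([
        rng.normal(0.25, 0.1, 20),
        rng.normal(1.3, 0.2, 20),
        rng.normal(2.0, 0.3, 20),
    ]),
})

# The categorical column goes on x and the continuous one on y.
ax = sns.stripplot(data=df, x="species", y="petal_width")
ax.set_ylabel("Petal width (cm)")
```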
Exercise 2
Read the seaborn page on categorical data above and make the following plots:
sepal_width by category with a boxplot and a swarmplot
petal_length by category with a violinplot and a swarmplot
Time Series plots
Matplotlib and seaborn can make some nice plots for time series data. For example, we can plot running averages. The following data contains the monthly price of the ETF VTI (a stock market index fund) over time.
Exercise
Make a figure composed of two vertically stacked plots:
The closing price
The volume
You can do this with matplotlib's gridspec.
You can also use multiple y-axes as follows:
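A hedged sketch of the dual-axis approach using matplotlib's twinx. The two series are synthetic and do not represent real VTI prices or volumes; the variable names are assumptions for illustration.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Synthetic monthly series (NOT real VTI prices or volumes).
rng = np.random.default_rng(4)
dates = pd.date_range("2015-01-01", periods=36, freq="MS")
close = 100 + np.cumsum(rng.normal(0.5, 2.0, 36))
volume = rng.integers(1_000_000, 5_000_000, 36)

fig, ax_price = plt.subplots(figsize=(8, 4))
ax_price.plot(dates, close, color="tab:blue")
ax_price.set_xlabel("Date")
ax_price.set_ylabel("Close", color="tab:blue")

# twinx() creates a second y-axis that shares the same x-axis,
# so two series with very different scales fit on one plot.
ax_volume = ax_price.twinx()
ax_volume.plot(dates, volume, color="tab:orange", linestyle=":")
ax_volume.set_ylabel("Volume", color="tab:orange")
```

Coloring each y-axis label to match its line makes it easier to tell which scale belongs to which series.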
We can also easily make smoothed curves by computing means over moving windows.
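For instance, pandas provides rolling windows on a Series. The sketch below uses a synthetic monthly series (not real VTI prices), and the 6-month window length is an arbitrary assumption.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Synthetic monthly price series (NOT real VTI data).
rng = np.random.default_rng(5)
dates = pd.date_range("2015-01-01", periods=60, freq="MS")
price = pd.Series(100 + np.cumsum(rng.normal(0.5, 3.0, 60)), index=dates)

# Mean over a sliding 6-month window. The first 5 values are NaN
# because the window is not yet full.
smoothed = price.rolling(window=6).mean()

fig, ax = plt.subplots(figsize=(8, 4))
ax.plot(price.index, price, alpha=0.5, label="Monthly price")
ax.plot(smoothed.index, smoothed, linewidth=2, label="6-month moving mean")
ax.set_xlabel("Date")
ax.set_ylabel("Price")
ax.legend()
```

A wider window gives a smoother curve but reacts more slowly to changes in the underlying series.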
Compare to the visualizations here. You can always put more work into a visualization's aesthetics, so focus on accuracy and proper labelling at first.
Error bars and filled plots
Often we want to indicate that our data is noisy or contains measurement error. Let's construct a dataset.
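One way to do this, sketched under stated assumptions: the underlying signal is sin(x), the noise is Gaussian with an assumed known standard deviation sigma, and both error bars and the shaded band show one standard deviation.

```python
import numpy as np
import matplotlib.pyplot as plt

# Synthetic noisy observations of a known curve.
rng = np.random.default_rng(6)
x = np.linspace(0, 10, 25)
sigma = 0.3  # assumed known measurement noise (standard deviation)
y_true = np.sin(x)
y_obs = y_true + rng.normal(0, sigma, x.size)

fig, (ax_err, ax_fill) = plt.subplots(1, 2, figsize=(10, 4), sharey=True)

# Error bars: one standard deviation above and below each observation.
ax_err.errorbar(x, y_obs, yerr=sigma, fmt="o", capsize=3)
ax_err.set_title("Error bars (1 sd)")

# Filled band: shade the region within one standard deviation of the curve.
ax_fill.plot(x, y_true, color="tab:blue")
ax_fill.fill_between(x, y_true - sigma, y_true + sigma, alpha=0.3)
ax_fill.set_title("Filled band (1 sd)")
```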
Exercise
Modify the previous example to show an approximate 95% confidence interval (two standard deviations).
Try to make a similar plot with the Mauna Loa atmospheric carbon data set "co2_mm_mlo.txt"
Exercises
For each of the remaining data sets:
Abalone Dataset: Various measurements of a type of sea snail
Boston Housing: Housing prices along with various supplemental data, such as local crime rates
Work through the following exercises:
Make a pairplot on a subset of four variables (if possible). Use the vars=["column1", "column2", ...] keyword argument to prevent seaborn from making too many plots
Pick two continuous variables and make a scatter plot with matplotlib, a density plot with seaborn, and a joint plot with seaborn
If there are any categorical variables, make boxplots and violin plots for each of the categorical variables
Make at least one plot that has dual-axes or two stacked plots
Feel free to try any other plots that seem interesting! If you do, please share them with the class.
Exercises
Pick one of the datasets available here, such as the exoplanets dataset planets.csv or the diet and exercise data set exercise.csv (or another). Practice the plots you learned above and try to make an awesome plot.
If you need some ideas on different types of plots, check out:
Bokeh
Bokeh is another visualization library. There are many example notebooks -- pick one and work through it.