Real-time collaboration for Jupyter Notebooks, Linux Terminals, LaTeX, VS Code, R IDE, and more,
all in one place. Commercial Alternative to JupyterHub.
Math 480 - Homework 6
Due 6pm on May 13, 2016
There are 5 problems. All problems have equal weight.
There are 4 pandas.
Problem 1 -- Your CSV file
(1.a) Search for a CSV dataset online (google: "[a keyword] filetype:csv") and load it into pandas. Make sure it contains at least one column with numbers!
(1.b) Load the file as a Pandas dataframe, and compute the sum, mean, max, min, etc. of columns with numbers (use the describe method on a dataframe).
(1.c) Use a command from the Pandas visualization tools to draw at least one plot that illustrates your data.
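As a minimal sketch of the workflow in Problem 1, the following reads CSV text into a dataframe and summarizes its numeric columns. The CSV contents and the column names (city, population, area_km2) are made up for illustration; your own file will look different.

```python
import io

import pandas as pd

# Hypothetical CSV contents, standing in for a file found online.
csv_text = """city,population,area_km2
A,1000,10.0
B,2500,20.5
C,4000,15.5
"""

# read_csv accepts a path, a URL, or a file-like object such as StringIO
df = pd.read_csv(io.StringIO(csv_text))

# describe() reports count, mean, std, min, quartiles and max
# for the numeric columns only (by default)
summary = df.describe()
print(summary)

# Individual statistics are also available as methods on a column
total_population = df["population"].sum()
mean_population = df["population"].mean()
print(total_population, mean_population)
```

With matplotlib installed, a call such as df.plot.scatter(x="area_km2", y="population") would be one way to draw a plot from the same dataframe.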
Problem 2 -- Creating/Importing Different types of files
This problem is very similar to Problem 1, but with more (and smaller) files of different types. Pandas can import many types of files, including CSV files, Excel spreadsheets, and much more.
(2.a) Find or create small example files (each should have at least 3 rows) in any way you want:
prob2.csv -- a CSV file
prob2.json -- a JSON file (hint: you can make JSON files using the json Python module)
prob2.xlsx -- an Excel spreadsheet (hint: use Google Docs to make one)
prob2.h5 -- an HDF file (hint: create such a file using pandas; e.g., see HDFStore docs)
(2.b) Read each of the files above in as Pandas data frames, compute summary statistics about them (with describe), and draw one plot (of your choosing) to illustrate something about the data.
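A rough sketch of creating and reading two of these file types, assuming a small made-up table. The records, the temporary directory, and the variable names are illustrative only. The Excel and HDF cases follow the same pattern with pd.read_excel and pd.read_hdf, but they rely on optional packages (such as openpyxl and PyTables), so they are left out here.

```python
import json
import os
import tempfile

import pandas as pd

# Hypothetical records; any small table with at least 3 rows would do.
records = [
    {"name": "a", "x": 1, "y": 2.5},
    {"name": "b", "x": 2, "y": 3.5},
    {"name": "c", "x": 3, "y": 6.0},
]

# Write into a temporary directory so the sketch leaves no files behind
workdir = tempfile.mkdtemp()
csv_path = os.path.join(workdir, "prob2.csv")
json_path = os.path.join(workdir, "prob2.json")

# Create the JSON file with the standard json module
with open(json_path, "w") as f:
    json.dump(records, f)

# Create the CSV file from a dataframe
pd.DataFrame(records).to_csv(csv_path, index=False)

# Read both files back as dataframes
from_csv = pd.read_csv(csv_path)
from_json = pd.read_json(json_path)

print(from_csv.describe())
print(from_json.describe())
```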
Problem 3 -- Sunspot activity
Let sunspots be the sunactivity dataframe (defined below for you).
(3.a) For how many years was the activity ? (Hint: how to get from a list/array of objects to the number of elements in that list/array?)
(3.b) Plot a histogram of all activity values beginning with the year 1900.
(3.c) Which year(s) had the highest activity?
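The techniques behind Problem 3 (counting rows that satisfy a condition, filtering by year, and locating a maximum) can be sketched on a tiny stand-in dataframe. The values, the threshold, and the column names YEAR and SUNACTIVITY are assumptions made for illustration; check the columns of the real sunspots dataframe before reusing any of this.

```python
import pandas as pd

# Hypothetical stand-in for the sunspots dataframe; the column names
# YEAR and SUNACTIVITY and all values are assumptions.
sunspots = pd.DataFrame({
    "YEAR": [1898, 1899, 1900, 1901, 1902],
    "SUNACTIVITY": [26.7, 12.1, 9.5, 2.7, 5.0],
})

threshold = 9.0  # illustrative value only

# A comparison yields a boolean Series; summing it counts the True entries
n_years = int((sunspots["SUNACTIVITY"] >= threshold).sum())

# Boolean indexing keeps only the rows from 1900 onward
since_1900 = sunspots[sunspots["YEAR"] >= 1900]

# Comparing against the maximum selects every row that attains it,
# so ties are not silently dropped
is_peak = sunspots["SUNACTIVITY"] == sunspots["SUNACTIVITY"].max()
peak_years = sunspots.loc[is_peak, "YEAR"].tolist()

print(n_years, len(since_1900), peak_years)
```

With matplotlib installed, since_1900["SUNACTIVITY"].plot.hist() would be one way to draw a histogram of the filtered values.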
Problem 4 -- Iris flowers


All statistics students learn about the extremely famous iris dataset! It lists the various sizes of petals and tries to classify them.
(4.a) Load the iris data set and use describe to see basic statistics about it. Hint:
from statsmodels import datasets
iris = datasets.get_rdataset("iris").data
(4.b) Plot all of the sepal (length, width) pairs in a scatterplot, and then plot the petal (length, width) pairs in another scatterplot.
(4.c) Compute the average petal width for each of the "species"-categories.
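A per-category average like the one asked for in (4.c) is typically computed with groupby. The sketch below uses a tiny made-up table rather than the real iris data; the column names Species and Petal.Width follow the R dataset, and the measurements are invented.

```python
import pandas as pd

# Made-up measurements; only the column names mirror the iris dataset.
flowers = pd.DataFrame({
    "Species": ["setosa", "setosa", "virginica", "virginica"],
    "Petal.Width": [0.2, 0.4, 2.0, 2.5],
})

# Group the rows by category, then average one numeric column per group
mean_width = flowers.groupby("Species")["Petal.Width"].mean()
print(mean_width)
```

The result is a Series indexed by the category values, so mean_width["setosa"] gives the average for that group.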
Problem 5 -- Pivot Tables

Pandas has a very powerful pd.pivot_table function. See also http://pbpython.com/pandas-pivot-table-explained.html
Load the miles per gallon data set, which has both numerical and categorical columns:
You will then compute pivot tables, where you aggregate columns of your choice by sum, mean, min or max by category.
(5.a) Create a pandas data frame using the pd.pivot_table command that tells you the average "cty" and "hwy" (city and highway miles per gallon) for each manufacturer.
(5.b) Has the average city mileage improved from 1999 to 2008? Has the average highway mileage improved from 1999 to 2008?
(5.c) Create a scatterplot of pairs (displ, hwy) for all cars in 1999, and another scatterplot for all cars in 2008. Roughly speaking, if you increase the car's displacement, does the highway gas mileage go up or down?
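To illustrate how pd.pivot_table aggregates numeric columns by category, here is a minimal sketch on a four-row made-up table. The column names match those mentioned in the problem (manufacturer, year, cty, hwy), but the rows are invented and are not taken from the real miles per gallon data.

```python
import pandas as pd

# Invented rows; only the column names follow the problem statement.
cars = pd.DataFrame({
    "manufacturer": ["audi", "audi", "ford", "ford"],
    "year": [1999, 2008, 1999, 2008],
    "cty": [18, 20, 14, 16],
    "hwy": [26, 28, 18, 22],
})

# index picks the category to group by, values the columns to aggregate,
# and aggfunc the aggregation (sum, mean, min, max, ...)
by_maker = pd.pivot_table(
    cars, index="manufacturer", values=["cty", "hwy"], aggfunc="mean"
)

# The same call grouped by year instead of manufacturer
by_year = pd.pivot_table(
    cars, index="year", values=["cty", "hwy"], aggfunc="mean"
)

print(by_maker)
print(by_year)
```

Each result is itself a dataframe, indexed by the chosen category, so a single cell can be read with by_year.loc[1999, "cty"].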