Real-time collaboration for Jupyter Notebooks, Linux Terminals, LaTeX, VS Code, R IDE, and more,
all in one place. Commercial Alternative to JupyterHub.
Real-time collaboration for Jupyter Notebooks, Linux Terminals, LaTeX, VS Code, R IDE, and more,
all in one place. Commercial Alternative to JupyterHub.
Math 480 - Homework 6
Due 6pm on May 13, 2016
There are 5 problems. All problems have equal weight.
Problem 1 -- Your CSV file
(1.a) Search for a CSV dataset online (google: "[a keyword] filetype:csv") and try to load it into pandas. Make sure, it contains at least one column with numbers!
(1.b) Load the file as a Pandas dataframe, and compute the sum, mean, max, min, etc. of columns with numbers (use the describe method on a dataframe).
(1.c) Use a command from the Pandas visualization tools to draw at least one plot that illustrates your data.
10 Points Total
Part A (2 points total)
All or nothing, there should be data loaded into pandas via csv.
Part B (4 points total)
Award:
- 4 points for using describe
- 3 points if the manually computed the values that describe should give.
- 2 points if their csv file does not load but they used describe correctly
Part C (4 points total)
Award:
- 4 points for plotting their data.
- 2 points if they plotted correctly but their csv file did not load.
(1.a) -- make some notes about how you did it here.
FID | PARK_ID | FACILITYID | |
---|---|---|---|
count | 20.00000 | 20.000000 | 20.000000 |
mean | 9.50000 | 869.750000 | 28009.900000 |
std | 5.91608 | 456.136623 | 6.348643 |
min | 0.00000 | 9.000000 | 28000.000000 |
25% | 4.75000 | 646.750000 | 28004.750000 |
50% | 9.50000 | 867.500000 | 28009.500000 |
75% | 14.25000 | 1157.500000 | 28015.250000 |
max | 19.00000 | 1717.000000 | 28020.000000 |
Problem 2 -- Creating/Importing Different types of files
This problem is very similar to problem 1, but with more file types (and they are smaller). Pandas can import many types of files, including CSV files, excel spreadsheets, and much more.
(2.a) Find or create small example files (each should have at least 3 rows) in any way you want:
prob2.csv
-- a CSV fileprob2.json
-- a JSON file (hint: you can make json files using the json Python module)prob2.xls
-- an excel spreadsheet (hint: use google docs to make one)prob2.h5
-- an HDF file (hint: create such a file using pandas; e.g., see HDFStore docs)
(2.b) Read each of the files above in as Pandas data frames, compute summary statistics about them (with describe), and draw one plot (of your choosing) to illustrate something about the data.
(2.a)
Explain how you got (or created) your data:
When you're done there should be files data.csv, data.json, data.xls, and data.h5 in the same directory as this worksheet.
10 Points Total
Part A (1 point for CSV, JSON, and xlsx. 2 points for h5; 5 points total)
Award (for each file type):
- Full points if they have the file and explained how they got it. (code to make the file counts as it's self explanitory)
- Half points if they're missing one of the requirements for the full point.
Part B (1 point per file type import and 1 point for a plot. 5 points total)
Award: (per file type).
- 1 point if they used read_XXXX(file_name) correctly and used describe.
- 0.5 points if did not use describe()
- 0 points if they did not read from file (eg. read from a variable)
Solution
Unnamed: 0 | age | |
---|---|---|
count | 5.000000 | 5.000000 |
mean | 2.000000 | 45.400000 |
std | 1.581139 | 18.460769 |
min | 0.000000 | 24.000000 |
25% | 1.000000 | 36.000000 |
50% | 2.000000 | 42.000000 |
75% | 3.000000 | 52.000000 |
max | 4.000000 | 73.000000 |
assignee | comments | id | number | |
---|---|---|---|---|
count | 0.0 | 5.000000 | 5.000000e+00 | 5.000000 |
mean | NaN | 2.600000 | 1.537063e+08 | 13117.400000 |
std | NaN | 4.722288 | 6.382248e+04 | 2.073644 |
min | NaN | 0.000000 | 1.536583e+08 | 13115.000000 |
25% | NaN | 0.000000 | 1.536615e+08 | 13116.000000 |
50% | NaN | 1.000000 | 1.536700e+08 | 13117.000000 |
75% | NaN | 1.000000 | 1.537364e+08 | 13119.000000 |
max | NaN | 11.000000 | 1.538051e+08 | 13120.000000 |
Time | 12:30 - 2:30 | 12:30 - 2:30.1 | 11:45 - 12:45 | 12:30 - 2:30.2 | 12:30 - 2:30.3 | |
---|---|---|---|---|---|---|
count | 2 | 4 | 4 | 3 | 4 | 4 |
unique | 2 | 4 | 4 | 3 | 4 | 4 |
top | Location | April 4th | April 5th | MGH | April 16th | Ryan |
freq | 1 | 1 | 1 | 1 | 1 | 1 |
A | B | C | |
---|---|---|---|
count | 8.000000 | 8.000000 | 8.000000 |
mean | 0.049540 | -0.545917 | 0.454165 |
std | 1.143521 | 1.066463 | 0.622934 |
min | -2.242685 | -2.021255 | -0.334077 |
25% | -0.398481 | -1.279958 | -0.038889 |
50% | 0.236777 | -0.678556 | 0.421265 |
75% | 0.883022 | 0.246429 | 0.913359 |
max | 1.321158 | 1.150036 | 1.432707 |

Let sunspots
be the sunactivity dataframe from the lecture.
(3.a) For how many years was the activity ? (Hint: how to get from a list/array of objects to the number of elements in that list/array?)
(3.b) Plot a histogram of all activity values beginning with the year 1900.
(3.c) Which year(s) had the highest activity?
10 Points Total
Part A (4 points)
Award:
- 4 points if they 'just got the answer' using Pandas functionality (should be short)
- 2 points if they used something like a for loop
- 0 points if they got the wrong answer
Part B (4 pointsl)
Award:
- 4 points for the correct histogram
- 2 points if they made a histogram but with the wrong subset of data
- 0 points for no histogram
Part C (2 pointsl)
Award: (per file type).
- 2 points if they gave the correct answer with relatively short code
- 1 point if they gave the correct answer but for looped through the data
- 0 points if they gave the wrong answer
SUNACTIVITY | |
---|---|
YEAR | |
1957.0 | 190.2 |
Problem 4 -- Iris flowers


All statstic students learn about the extremely famous iris dataset! It lists the various sizes of petals and tries to classify them.
(4.a) Load the iris data set and use describe to see basic statistics about it. Hint: from statsmodels import datasets; iris = datasets.get_rdataset("iris").data
(4.b) Plot all of the sepal (length, width) pairs in a scatterplot, and then plot the petal (length, width) pairs in another scatterplot.
(4.c) Compute the average petal width for each of the "species"-categories.
10 Points Total
Part A (1 points)
Award:
- 1 point for doing iris.describe()
Part B (2 points per plot, 4 points total)
Award: (per plot)
- 2 points for the correct plot with a simple solution (about 1 or 2 lines)
- 1 point for the correct plot with a more involved solution
- 0 points the wrong plot
Part C (5 points)
Award:
- 5 points for finding the correct average petal width for each species.
- 4 points for the correct widths without labeling the species.
- 2 points for the wrong answer but a reasonable attempt.
Sepal.Length | Sepal.Width | Petal.Length | Petal.Width | |
---|---|---|---|---|
count | 150.000000 | 150.000000 | 150.000000 | 150.000000 |
mean | 5.843333 | 3.057333 | 3.758000 | 1.199333 |
std | 0.828066 | 0.435866 | 1.765298 | 0.762238 |
min | 4.300000 | 2.000000 | 1.000000 | 0.100000 |
25% | 5.100000 | 2.800000 | 1.600000 | 0.300000 |
50% | 5.800000 | 3.000000 | 4.350000 | 1.300000 |
75% | 6.400000 | 3.300000 | 5.100000 | 1.800000 |
max | 7.900000 | 4.400000 | 6.900000 | 2.500000 |
Petal.Length | Petal.Width | Sepal.Length | Sepal.Width | ||
---|---|---|---|---|---|
Species | |||||
setosa | count | 50.000000 | 50.000000 | 50.000000 | 50.000000 |
mean | 1.462000 | 0.246000 | 5.006000 | 3.428000 | |
std | 0.173664 | 0.105386 | 0.352490 | 0.379064 | |
min | 1.000000 | 0.100000 | 4.300000 | 2.300000 | |
25% | 1.400000 | 0.200000 | 4.800000 | 3.200000 | |
50% | 1.500000 | 0.200000 | 5.000000 | 3.400000 | |
75% | 1.575000 | 0.300000 | 5.200000 | 3.675000 | |
max | 1.900000 | 0.600000 | 5.800000 | 4.400000 | |
versicolor | count | 50.000000 | 50.000000 | 50.000000 | 50.000000 |
mean | 4.260000 | 1.326000 | 5.936000 | 2.770000 | |
std | 0.469911 | 0.197753 | 0.516171 | 0.313798 | |
min | 3.000000 | 1.000000 | 4.900000 | 2.000000 | |
25% | 4.000000 | 1.200000 | 5.600000 | 2.525000 | |
50% | 4.350000 | 1.300000 | 5.900000 | 2.800000 | |
75% | 4.600000 | 1.500000 | 6.300000 | 3.000000 | |
max | 5.100000 | 1.800000 | 7.000000 | 3.400000 | |
virginica | count | 50.000000 | 50.000000 | 50.000000 | 50.000000 |
mean | 5.552000 | 2.026000 | 6.588000 | 2.974000 | |
std | 0.551895 | 0.274650 | 0.635880 | 0.322497 | |
min | 4.500000 | 1.400000 | 4.900000 | 2.200000 | |
25% | 5.100000 | 1.800000 | 6.225000 | 2.800000 | |
50% | 5.550000 | 2.000000 | 6.500000 | 3.000000 | |
75% | 5.875000 | 2.300000 | 6.900000 | 3.175000 | |
max | 6.900000 | 2.500000 | 7.900000 | 3.800000 |
Problem 5 -- Pivot Tables

Pandas has a very powerful pd.pivot_table
function. See also http://pbpython.com/pandas-pivot-table-explained.html
Load the miles per gallon data set, which has both numerical and categorical columns:
You will then compute pivot tables, where you aggregate columns of your choice by sum, mean, min or max by category.
(5.a) Create a pandas data frame using the
pd.pivot_table
command that tells you the average "cty" and "hwy" (city and highway miles per gallon) for each manufacturer?(5.b) Has the average city mileage improved from 1999 to 2008? Has the average highway mileage improved from 1999 to 2008?
(5.c) Create a scatterplot of pairs (displ, hwy) for all cars in 1999, and another scatter plot for all cars in 2008. Roughly speaking, if you increase the card displacement, does the highway gas mileage go up or down?
10 Points Total
Deduct 2 points from each part where they use an involved solution (eg. for-looping) For parts A and B, deduct 1 point if they did not use a pivot table.Part A (4 points)
Award:
- 4 points for a correct table showing just the relevant, requested information.
- 3 points for a correct table showing extra information
Part B (4 points total)
Award:
- 4 points for a correct table showing just the relevant, requested information.
- 3 points for a correct table showing extra information
Part C (1 points per plot, 2 points total)
No pivot table is required for this part.
Award: (per plot)
- 1 point for the correct plot with a simple solution (about 1 or 2 lines)
- 0.5 points for the correct plot with a more involved solution
- 0 points the wrong plot
manufacturer | model | displ | year | cyl | trans | drv | cty | hwy | fl | class | |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | audi | a4 | 1.8 | 1999 | 4 | auto(l5) | f | 18 | 29 | p | compact |
1 | audi | a4 | 1.8 | 1999 | 4 | manual(m5) | f | 21 | 29 | p | compact |
2 | audi | a4 | 2.0 | 2008 | 4 | manual(m6) | f | 20 | 31 | p | compact |
3 | audi | a4 | 2.0 | 2008 | 4 | auto(av) | f | 21 | 30 | p | compact |
4 | audi | a4 | 2.8 | 1999 | 6 | auto(l5) | f | 16 | 26 | p | compact |
manufacturer | audi | chevrolet | dodge | ford | honda | hyundai | jeep | land rover | lincoln | mercury | nissan | pontiac | subaru | toyota | volkswagen |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
cty | 17.611111 | 15.000000 | 13.135135 | 14.00 | 24.444444 | 18.642857 | 13.500 | 11.5 | 11.333333 | 13.25 | 18.076923 | 17.0 | 19.285714 | 18.529412 | 20.925926 |
hwy | 26.444444 | 21.894737 | 17.945946 | 19.36 | 32.555556 | 26.857143 | 17.625 | 16.5 | 17.000000 | 18.00 | 24.615385 | 26.4 | 25.571429 | 24.911765 | 29.222222 |
year | 1999 | 2008 |
---|---|---|
cty | 17.017094 | 16.700855 |
hwy | 23.427350 | 23.452991 |