Project: Math 152 - Winter 2017

Path: lectures / 2017-02-15 / 2017-02-15.sagews

Views: ¹⁶⁶⁵⁹

%md
# Math 152: Intro to Mathematical Software

<img src="https://natgeoeducationblog.files.wordpress.com/2016/09/embarrassment-of-pandas.jpg" class="pull-right" width=300>
### 2017-02-15
### Kiran Kedlaya; University of California, San Diego
### adapted from lectures by William Stein, University of Washington

### **Lecture 16: Pandas (part 2)**

Math 152: Intro to Mathematical Software

### 2017-02-15 ### Kiran Kedlaya; University of California, San Diego ### adapted from lectures by William Stein, University of Washington

Lecture 16: Pandas (part 2)

Announcements:

Peer reviews due Thursday 8pm. Request from the graders: please place all feedback in a file called "grades.txt", to make it easier to find. (If one of your assigned folders is empty, please still make "grades.txt" and report that you found an empty folder.)
Next Monday is a University holiday, so no lecture, sections, or office hours. Zonglin's office hours are rescheduled to Tuesday 10-12 (and also replace my usual Tuesday office hour).

Today: more pandas...

getting data into SMC
Series
DataFrames

%auto
import pandas as pd
pd.__version__
import numpy as np
import matplotlib.pyplot as plt
import matplotlib
matplotlib.style.use('ggplot')
%default_mode python    # avoid Sage data types like Integer, which cause trouble (due to the Sage preparser).
%typeset_mode True

u'0.21.0'

1. Getting data into SMC

WARNING: SMC has absolutely no direct access to the files on your laptop. It is a remote computer account running in the cloud. That said, if you use pandas/python, etc, on your own computer at some point, it will of course have direct access. (Maybe do a live demo of python get-pip.py then pip install --user pandas on my Linux laptop.)
+New, paste the link into the box, click "Download from Internet".
+New, drag and drop, upload a file.
Use wget or curl from the terminal or in a %sh cell.
Use scp or rsync (see project settings... SSH button) from your computer to the project.
Note: the last two methods require your project to have an Internet upgrade. For this course, that is provided by UCSD.

The %sh command simulates a Unix (or Mac) terminal from within a worksheet. As long as a command isn't interactive, it should work for you.

%sh
wget http://spatialkeydocs.s3.amazonaws.com/FL_insurance_sample.csv.zip

--2017-02-15 22:14:38--  http://spatialkeydocs.s3.amazonaws.com/FL_insurance_sample.csv.zip
Resolving spatialkeydocs.s3.amazonaws.com (spatialkeydocs.s3.amazonaws.com)... 54.231.40.83
Connecting to spatialkeydocs.s3.amazonaws.com (spatialkeydocs.s3.amazonaws.com)|54.231.40.83|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 866400 (846K) [application/zip]
Saving to: ‘FL_insurance_sample.csv.zip’

FL_insurance_sample 100%[=====================>] 846.09K  2.07MB/s   in 0.4s   

2017-02-15 22:14:39 (2.07 MB/s) - ‘FL_insurance_sample.csv.zip’ saved [866400/866400]

%sh
ls

2017-02-15-140805        FL_insurance_sample.csv.zip  index.html   nyt.png
2017-02-15.sagews        __MACOSX                     index.shtml
FL_insurance_sample.csv  animals.csv                  nikolai.jpg

%sh
unzip FL_insurance_sample.csv.zip

Archive:  FL_insurance_sample.csv.zip
  inflating: FL_insurance_sample.csv  
   creating: __MACOSX/
  inflating: __MACOSX/._FL_insurance_sample.csv  

df = pd.read_csv('FL_insurance_sample.csv')
df.head(10)

	policyID	statecode	county	eq_site_limit	hu_site_limit	fl_site_limit	fr_site_limit	tiv_2011	tiv_2012	eq_site_deductible	hu_site_deductible	point_latitude	point_longitude	line	construction	point_granularity
0	119736	FL	CLAY COUNTY	498960.0	498960.00	498960.0	498960.0	498960.00	792148.90	0.0	9979.2	30.102261	-81.711777	Residential	Masonry	1
1	448094	FL	CLAY COUNTY	1322376.3	1322376.30	1322376.3	1322376.3	1322376.30	1438163.57	0.0	0.0	30.063936	-81.707664	Residential	Masonry	3
2	206893	FL	CLAY COUNTY	190724.4	190724.40	190724.4	190724.4	190724.40	192476.78	0.0	0.0	30.089579	-81.700455	Residential	Wood	1
3	333743	FL	CLAY COUNTY	0.0	79520.76	0.0	0.0	79520.76	86854.48	0.0	0.0	30.063236	-81.707703	Residential	Wood	3
4	172534	FL	CLAY COUNTY	0.0	254281.50	0.0	254281.5	254281.50	246144.49	0.0	0.0	30.060614	-81.702675	Residential	Wood	1
5	785275	FL	CLAY COUNTY	0.0	515035.62	0.0	0.0	515035.62	884419.17	0.0	0.0	30.063236	-81.707703	Residential	Masonry	3
6	995932	FL	CLAY COUNTY	0.0	19260000.00	0.0	0.0	19260000.00	20610000.00	0.0	0.0	30.102226	-81.713882	Commercial	Reinforced Concrete	1
7	223488	FL	CLAY COUNTY	328500.0	328500.00	328500.0	328500.0	328500.00	348374.25	0.0	16425.0	30.102217	-81.707146	Residential	Wood	1
8	433512	FL	CLAY COUNTY	315000.0	315000.00	315000.0	315000.0	315000.00	265821.57	0.0	15750.0	30.118774	-81.704613	Residential	Wood	1
9	142071	FL	CLAY COUNTY	705600.0	705600.00	705600.0	705600.0	705600.00	1010842.56	14112.0	35280.0	30.100628	-81.703751	Residential	Masonry	1

df.describe()

	policyID	eq_site_limit	hu_site_limit	fl_site_limit	fr_site_limit	tiv_2011	tiv_2012	eq_site_deductible	hu_site_deductible	fl_site_deductible	fr_site_deductible	point_latitude	point_longitude	point_granularity
count	36634.000000	3.663400e+04	3.663400e+04	3.663400e+04	3.663400e+04	3.663400e+04	3.663400e+04	3.663400e+04	3.663400e+04	36634.000000	36634.000000	36634.000000	36634.000000	36634.000000
mean	548661.937189	7.314780e+05	2.074348e+06	6.646009e+05	9.911724e+05	2.172875e+06	2.571004e+06	7.787908e+02	7.037985e+03	192.453210	26.483594	28.087477	-81.903565	1.640907
std	259912.926494	1.752300e+07	1.964150e+07	1.746114e+07	1.767178e+07	2.015015e+07	2.036777e+07	4.787979e+04	9.582593e+04	8037.713716	4702.740932	1.647734	1.809066	1.070561
min	100074.000000	0.000000e+00	0.000000e+00	0.000000e+00	0.000000e+00	9.000000e+01	7.337000e+01	0.000000e+00	0.000000e+00	0.000000	0.000000	24.547514	-87.447290	1.000000
25%	323647.750000	0.000000e+00	4.248106e+04	0.000000e+00	0.000000e+00	4.562415e+04	5.401351e+04	0.000000e+00	0.000000e+00	0.000000	0.000000	26.471996	-82.439331	1.000000
50%	548525.000000	0.000000e+00	1.926911e+05	0.000000e+00	0.000000e+00	2.021051e+05	2.416310e+05	0.000000e+00	0.000000e+00	0.000000	0.000000	28.057077	-81.585700	1.000000
75%	774252.000000	0.000000e+00	9.163381e+05	0.000000e+00	5.741010e+04	9.465821e+05	1.128871e+06	0.000000e+00	0.000000e+00	0.000000	0.000000	29.623005	-80.372910	3.000000
max	999971.000000	2.160000e+09	2.160000e+09	2.160000e+09	2.160000e+09	2.160000e+09	1.701000e+09	6.273765e+06	7.380000e+06	450000.000000	900000.000000	30.989820	-80.033257	7.000000

`pd.Series`

http://pandas.pydata.org/pandas-docs/version/0.18.1/dsintro.html#series

Datasets are built out of pd.Series objects, which are like a List of values, but with a specific type (like floating point number, integer, string, ...) and an index. More precisely, a pd.Series is a "One-dimensional ndarray with axis labels"
Larger pd.DataFrame objects are built out of several such series. Each of them is a column, must have the same length, and each column has a name.

Examples:

names = pd.Series(["Bob", "Marlene", "Joe", "Jane"])
age   = pd.Series([55, 18, 71, 22])

names

      Bob
  Marlene
      Joe
     Jane
dtype: object

age

  55
  18
  71
  22
dtype: int64

s = pd.Series(['john', 'private', 389, 'nathan'], index=['name', 'rank', 'number', 'name'])
s

name         john
rank      private
number        389
name       nathan
dtype: object

s['rank']

private

s['name']

name      john
name    nathan
dtype: object

t = pd.Series([39, 'sergeant'], index=['number', 'rank'])
s + t

name                  NaN
name                  NaN
number                428
rank      privatesergeant
dtype: object

Indexing...

age

  55
  18
  71
  22
dtype: int64

age[2]

\displaystyle 71

age[:-1]

  55
  18
  71
dtype: int64

age[age > 30]

0    55
2    71
dtype: int64

age[age > 30][2]

\displaystyle 71

age[age > 30][age < 60]

0    55
dtype: int64

... so a pd.Series is much like a Python dict except you can do arithmetic, there is an order, you can have the same key multiple times, indexing is amazing. And everything is super fast.

Exercise right now!:

Make a Python dict d of your choosing, then try doing pd.Series(d). It should work.
Make up 3 different pd.Series objects with various choices of values and index.
Try adding them together and see what happens.

"50 to 80 percent of time is spent wading through the tedium of the first two steps – acquiring and wrangling data – before even getting to the real work of analysis and insight." -- New York Times

`pd.DataFrame`

Documentation: http://pandas.pydata.org/pandas-docs/version/0.18.1/dsintro.html#dataframe

DataFrame = "a 2-dimensional labeled data structure with columns of potentially different types."

Think of it as:

a spreadsheet, or
a SQL table, or
a dict of pd.Series objects.

Now, we build a pd.DataFrame:


names = pd.Series(["William", "Marlene", "Jon", "Jane"])
age   = pd.Series([42, 68, 19, 22])

# We are literally making this data frame from a dictionary of series objects!
# (This gives us no control over the order of the columns though...)
people = pd.DataFrame({"name": names, "age":  age})
people

	age	name
0	42	William
1	68	Marlene
2	19	Jon
3	22	Jane

# Explicit control of column order:
pd.DataFrame({"name": names, "age":  age}, columns=['name', 'age', 'location'])

	name	age	location
0	William	42	NaN
1	Marlene	68	NaN
2	Jon	19	NaN
3	Jane	22	NaN

The bold numbers or values on the left are called the "index".

The index is used to uniquely identify a row.

Use people.ix[i] to get the row with given index!

This isn't obvious. You just have to memorize it. Learn it now.

# get the row with index "1" -- which
people.ix[1]

age          68
name    Marlene
Name: 1, dtype: object

Of course, the rows don't have to be indexed by numbers -- anything hashable is allowed.

d = pd.DataFrame({'col1':{'name':'william', 'rank':'private'},
                  'col2':{'name':'jon', 'rank':'general', 'ser':5077}})
d

	col1	col2
name	william	jon
rank	private	general
ser	NaN	5077

d.ix['rank']

col1    private
col2    general
Name: rank, dtype: object

d.ix['ser']

col1     NaN
col2    5077
Name: ser, dtype: object

** Exercise right now!** Make a data frame for which there are two rows with the same index i, showing that the index need not be unique! [Hint: we made such a series above]

What happens when you type d.ix[i]?

d = pd.DataFrame({'col1':{'name':'william', 'rank':'private', 'i': 1, 'i':2},
                  'col2':{'name':'jon', 'rank':'general', 'ser':5077, 'i':1, 'i':3}})
d

	col1	col2
i	2	3
name	william	jon
rank	private	general
ser	NaN	5077

s.ix['name']

name      john
name    nathan
dtype: object

NOTE: Besides .ix, there are also other properties for various purposes. Read more about them here:

.ix, .loc, .iloc: http://pandas.pydata.org/pandas-docs/stable/indexing.html#different-choices-for-indexing

s
d = pd.DataFrame({'a': s,'b': s})
d
d.ix['name']

name         john
rank      private
number        389
name       nathan
dtype: object

	a	b
name	john	john
rank	private	private
number	389	389
name	nathan	nathan

	a	b
name	john	john
name	nathan	nathan

d = pd.DataFrame({'col1':{'name':'william', 'rank':'private'},
                  'col2':{'name':'jon', 'rank':'general', 'ser':5077}})
d

	col1	col2
name	william	jon
rank	private	general
ser	NaN	5077

d.loc[['name', 'rank']]

	col1	col2
name	william	jon
rank	private	general

d.iloc[1]

col1    private
col2    general
Name: rank, dtype: object

d.iloc[0]

col1    william
col2        jon
Name: name, dtype: object

d.iloc[-1]

col1     NaN
col2    5077
Name: ser, dtype: object

Question: How to get all rows starting from index 1 up to the end?

Answer: slicing.

people = pd.DataFrame({"name": names, "age":  age})
people

	age	name
0	42	William
1	68	Marlene
2	19	Jon
3	22	Jane

people[:2]

	age	name
0	42	William
1	68	Marlene

people[1:3]

	age	name
1	68	Marlene
2	19	Jon

people[::2]

	age	name
0	42	William
2	19	Jon

CSV

CSV is a very simple column-oriented data format, where a text file contains rows and a comma is a delimiter for the columns.

see animals.csv

The first row contains the header, describing the columns.

Pandas then imports this file via pd.read_csv() (the argument of this function could not only be a filename, but also a URL link 'http://...' pointint to a csv file)

open("animals.csv",'w').write(
"""name,type,tail,legs,age,color
joe,elephant,Y,4,16,gray
tim,dog,Y,4,7,golden
susan,cat,Y,4,4,black
frank,fish,N,0,1,green
olivia,spider,N,6,.5,black
minki,cat,Y,4,6,white""")

%sh
cat animals.csv

name,type,tail,legs,age,color
joe,elephant,Y,4,16,gray
tim,dog,Y,4,7,golden
susan,cat,Y,4,4,black
frank,fish,N,0,1,green
olivia,spider,N,6,.5,black
minki,cat,Y,4,6,white

animals = pd.read_csv("animals.csv")
animals

	name	type	tail	legs	age	color
0	joe	elephant	Y	4	16.0	gray
1	tim	dog	Y	4	7.0	golden
2	susan	cat	Y	4	4.0	black
3	frank	fish	N	0	1.0	green
4	olivia	spider	N	6	0.5	black
5	minki	cat	Y	4	6.0	white

# access columns
animals.name

     joe
     tim
   susan
   frank
  olivia
   minki
Name: name, dtype: object

The power of Pandas is, that you can start asking questions about this data, filter it, compare it with other datasets, or even picture it in a plot very easily.

animals.describe()

	legs	age
count	6.000000	6.000000
mean	3.666667	5.750000
std	1.966384	5.654644
min	0.000000	0.500000
25%	4.000000	1.750000
50%	4.000000	5.000000
75%	4.000000	6.750000
max	6.000000	16.000000

# average age?
animals.age.mean()

\displaystyle 5.75

# which ones are younger than 6?
animals[animals.age < 6]

	name	type	tail	legs	age	color
2	susan	cat	Y	4	4.0	black
3	frank	fish	N	0	1.0	green
4	olivia	spider	N	6	0.5	black

animals.ix[animals.legs == 4]

	name	type	tail	legs	age	color
0	joe	elephant	Y	4	16.0	gray
1	tim	dog	Y	4	7.0	golden
2	susan	cat	Y	4	4.0	black
5	minki	cat	Y	4	6.0	white

Question: which animal(s) has/have the largest number of legs?

animals.name.ix[animals.legs == animals.legs.max()]

4    olivia
Name: name, dtype: object

animals.legs

  4
  4
  4
  0
  6
  4
Name: legs, dtype: int64

animals.legs.plot.hist()

Plotting

In a related library "statsmodels", there are also various datasets included. Try plotting one of them (where it makes sense)

%auto
from statsmodels import datasets

x = datasets.statecrime
d = x.load_pandas().data # accesses the pandas dataframe
d.head(5) # .head or .tail give you the n-th rows at top or at the bottom

	violent	murder	hs_grad	poverty	single	white	urban
state
Alabama	459.9	7.1	82.1	17.5	29.0	70.0	48.65
Alaska	632.6	3.2	91.4	9.0	25.5	68.3	44.46
Arizona	423.2	5.5	84.2	16.5	25.7	80.0	80.07
Arkansas	530.3	6.3	82.4	18.8	26.3	78.4	39.54
California	473.4	5.4	80.6	14.2	27.8	62.7	89.73

d.violent.plot.hist()