Contact
CoCalc Logo Icon
StoreFeaturesDocsShareSupport News AboutSign UpSign In
| Download
Views: 16659
%md # Math 152: Intro to Mathematical Software <img src="https://natgeoeducationblog.files.wordpress.com/2016/09/embarrassment-of-pandas.jpg" class="pull-right" width=300> ### 2017-02-15 ### Kiran Kedlaya; University of California, San Diego ### adapted from lectures by William Stein, University of Washington ### **Lecture 16: Pandas (part 2)**

Math 152: Intro to Mathematical Software

### 2017-02-15 ### Kiran Kedlaya; University of California, San Diego ### adapted from lectures by William Stein, University of Washington

Lecture 16: Pandas (part 2)

Announcements:

  1. Peer reviews due Thursday 8pm. Request from the graders: please place all feedback in a file called "grades.txt", to make it easier to find. (If one of your assigned folders is empty, please still make "grades.txt" and report that you found an empty folder.)

  2. Next Monday is a University holiday, so no lecture, sections, or office hours. Zonglin's office hours are rescheduled to Tuesday 10-12 (and also replace my usual Tuesday office hour).

Today: more pandas...

  • getting data into SMC

  • Series

  • DataFrames

%auto import pandas as pd pd.__version__ import numpy as np import matplotlib.pyplot as plt import matplotlib matplotlib.style.use('ggplot') %default_mode python # avoid Sage data types like Integer, which cause trouble (due to the Sage preparser). %typeset_mode True
u'0.21.0'

1. Getting data into SMC

  • WARNING: SMC has absolutely no direct access to the files on your laptop. It is a remote computer account running in the cloud. That said, if you use pandas/python, etc, on your own computer at some point, it will of course have direct access. (Maybe do a live demo of python get-pip.py then pip install --user pandas on my Linux laptop.)

  • +New, paste the link into the box, click "Download from Internet".

  • +New, drag and drop, upload a file.

  • Use wget or curl from the terminal or in a %sh cell.

  • Use scp or rsync (see project settings... SSH button) from your computer to the project.

  • Note: the last two methods require your project to have an Internet upgrade. For this course, that is provided by UCSD.

The %sh command simulates a Unix (or Mac) terminal from within a worksheet. As long as a command isn't interactive, it should work for you.

%sh wget http://spatialkeydocs.s3.amazonaws.com/FL_insurance_sample.csv.zip
--2017-02-15 22:14:38-- http://spatialkeydocs.s3.amazonaws.com/FL_insurance_sample.csv.zip Resolving spatialkeydocs.s3.amazonaws.com (spatialkeydocs.s3.amazonaws.com)... 54.231.40.83 Connecting to spatialkeydocs.s3.amazonaws.com (spatialkeydocs.s3.amazonaws.com)|54.231.40.83|:80... connected. HTTP request sent, awaiting response... 200 OK Length: 866400 (846K) [application/zip] Saving to: ‘FL_insurance_sample.csv.zip’ FL_insurance_sample 100%[=====================>] 846.09K 2.07MB/s in 0.4s 2017-02-15 22:14:39 (2.07 MB/s) - ‘FL_insurance_sample.csv.zip’ saved [866400/866400]
%sh ls
2017-02-15-140805        FL_insurance_sample.csv.zip  index.html   nyt.png
2017-02-15.sagews        __MACOSX                     index.shtml
FL_insurance_sample.csv  animals.csv                  nikolai.jpg
%sh unzip FL_insurance_sample.csv.zip
Archive: FL_insurance_sample.csv.zip inflating: FL_insurance_sample.csv creating: __MACOSX/ inflating: __MACOSX/._FL_insurance_sample.csv
df = pd.read_csv('FL_insurance_sample.csv') df.head(10)
policyID statecode county eq_site_limit hu_site_limit fl_site_limit fr_site_limit tiv_2011 tiv_2012 eq_site_deductible hu_site_deductible fl_site_deductible fr_site_deductible point_latitude point_longitude line construction point_granularity
0 119736 FL CLAY COUNTY 498960.0 498960.00 498960.0 498960.0 498960.00 792148.90 0.0 9979.2 0.0 0 30.102261 -81.711777 Residential Masonry 1
1 448094 FL CLAY COUNTY 1322376.3 1322376.30 1322376.3 1322376.3 1322376.30 1438163.57 0.0 0.0 0.0 0 30.063936 -81.707664 Residential Masonry 3
2 206893 FL CLAY COUNTY 190724.4 190724.40 190724.4 190724.4 190724.40 192476.78 0.0 0.0 0.0 0 30.089579 -81.700455 Residential Wood 1
3 333743 FL CLAY COUNTY 0.0 79520.76 0.0 0.0 79520.76 86854.48 0.0 0.0 0.0 0 30.063236 -81.707703 Residential Wood 3
4 172534 FL CLAY COUNTY 0.0 254281.50 0.0 254281.5 254281.50 246144.49 0.0 0.0 0.0 0 30.060614 -81.702675 Residential Wood 1
5 785275 FL CLAY COUNTY 0.0 515035.62 0.0 0.0 515035.62 884419.17 0.0 0.0 0.0 0 30.063236 -81.707703 Residential Masonry 3
6 995932 FL CLAY COUNTY 0.0 19260000.00 0.0 0.0 19260000.00 20610000.00 0.0 0.0 0.0 0 30.102226 -81.713882 Commercial Reinforced Concrete 1
7 223488 FL CLAY COUNTY 328500.0 328500.00 328500.0 328500.0 328500.00 348374.25 0.0 16425.0 0.0 0 30.102217 -81.707146 Residential Wood 1
8 433512 FL CLAY COUNTY 315000.0 315000.00 315000.0 315000.0 315000.00 265821.57 0.0 15750.0 0.0 0 30.118774 -81.704613 Residential Wood 1
9 142071 FL CLAY COUNTY 705600.0 705600.00 705600.0 705600.0 705600.00 1010842.56 14112.0 35280.0 0.0 0 30.100628 -81.703751 Residential Masonry 1
df.describe()
policyID eq_site_limit hu_site_limit fl_site_limit fr_site_limit tiv_2011 tiv_2012 eq_site_deductible hu_site_deductible fl_site_deductible fr_site_deductible point_latitude point_longitude point_granularity
count 36634.000000 3.663400e+04 3.663400e+04 3.663400e+04 3.663400e+04 3.663400e+04 3.663400e+04 3.663400e+04 3.663400e+04 36634.000000 36634.000000 36634.000000 36634.000000 36634.000000
mean 548661.937189 7.314780e+05 2.074348e+06 6.646009e+05 9.911724e+05 2.172875e+06 2.571004e+06 7.787908e+02 7.037985e+03 192.453210 26.483594 28.087477 -81.903565 1.640907
std 259912.926494 1.752300e+07 1.964150e+07 1.746114e+07 1.767178e+07 2.015015e+07 2.036777e+07 4.787979e+04 9.582593e+04 8037.713716 4702.740932 1.647734 1.809066 1.070561
min 100074.000000 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 9.000000e+01 7.337000e+01 0.000000e+00 0.000000e+00 0.000000 0.000000 24.547514 -87.447290 1.000000
25% 323647.750000 0.000000e+00 4.248106e+04 0.000000e+00 0.000000e+00 4.562415e+04 5.401351e+04 0.000000e+00 0.000000e+00 0.000000 0.000000 26.471996 -82.439331 1.000000
50% 548525.000000 0.000000e+00 1.926911e+05 0.000000e+00 0.000000e+00 2.021051e+05 2.416310e+05 0.000000e+00 0.000000e+00 0.000000 0.000000 28.057077 -81.585700 1.000000
75% 774252.000000 0.000000e+00 9.163381e+05 0.000000e+00 5.741010e+04 9.465821e+05 1.128871e+06 0.000000e+00 0.000000e+00 0.000000 0.000000 29.623005 -80.372910 3.000000
max 999971.000000 2.160000e+09 2.160000e+09 2.160000e+09 2.160000e+09 2.160000e+09 1.701000e+09 6.273765e+06 7.380000e+06 450000.000000 900000.000000 30.989820 -80.033257 7.000000

pd.Series

http://pandas.pydata.org/pandas-docs/version/0.18.1/dsintro.html#series


  • Datasets are built out of pd.Series objects, which are like a List of values, but with a specific type (like floating point number, integer, string, ...) and an index. More precisely, a pd.Series is a "One-dimensional ndarray with axis labels"

  • Larger pd.DataFrame objects are built out of several such series. Each of them is a column, must have the same length, and each column has a name.

Examples:

names = pd.Series(["Bob", "Marlene", "Joe", "Jane"]) age = pd.Series([55, 18, 71, 22])
names
0 Bob 1 Marlene 2 Joe 3 Jane dtype: object
age
0 55 1 18 2 71 3 22 dtype: int64
s = pd.Series(['john', 'private', 389, 'nathan'], index=['name', 'rank', 'number', 'name']) s
name john rank private number 389 name nathan dtype: object
s['rank']
private
s['name']
name john name nathan dtype: object
t = pd.Series([39, 'sergeant'], index=['number', 'rank']) s + t
name NaN name NaN number 428 rank privatesergeant dtype: object

Indexing...

age
0 55 1 18 2 71 3 22 dtype: int64
age[2]
71\displaystyle 71
age[:-1]
0 55 1 18 2 71 dtype: int64
age[age > 30]
0 55 2 71 dtype: int64
age[age > 30][2]
71\displaystyle 71
age[age > 30][age < 60]
0 55 dtype: int64

... so a pd.Series is much like a Python dict except you can do arithmetic, there is an order, you can have the same key multiple times, indexing is amazing. And everything is super fast.

Exercise right now!:

  • Make a Python dict d of your choosing, then try doing pd.Series(d). It should work.

  • Make up 3 different pd.Series objects with various choices of values and index.

  • Try adding them together and see what happens.

"50 to 80 percent of time is spent wading through the tedium of the first two steps – acquiring and wrangling data – before even getting to the real work of analysis and insight." -- New York Times

pd.DataFrame

Documentation: http://pandas.pydata.org/pandas-docs/version/0.18.1/dsintro.html#dataframe

DataFrame = "a 2-dimensional labeled data structure with columns of potentially different types."

Think of it as:

  • a spreadsheet, or

  • a SQL table, or

  • a dict of pd.Series objects.

Now, we build a pd.DataFrame:

names = pd.Series(["William", "Marlene", "Jon", "Jane"]) age = pd.Series([42, 68, 19, 22]) # We are literally making this data frame from a dictionary of series objects! # (This gives us no control over the order of the columns though...) people = pd.DataFrame({"name": names, "age": age}) people
age name
0 42 William
1 68 Marlene
2 19 Jon
3 22 Jane
# Explicit control of column order: pd.DataFrame({"name": names, "age": age}, columns=['name', 'age', 'location'])
name age location
0 William 42 NaN
1 Marlene 68 NaN
2 Jon 19 NaN
3 Jane 22 NaN

The bold numbers or values on the left are called the "index".

The index is used to uniquely identify a row.

Use people.ix[i] to get the row with given index!

This isn't obvious. You just have to memorize it. Learn it now.

# get the row with index "1" -- which people.ix[1]
age 68 name Marlene Name: 1, dtype: object

Of course, the rows don't have to be indexed by numbers -- anything hashable is allowed.

d = pd.DataFrame({'col1':{'name':'william', 'rank':'private'}, 'col2':{'name':'jon', 'rank':'general', 'ser':5077}}) d
col1 col2
name william jon
rank private general
ser NaN 5077
d.ix['rank']
col1 private col2 general Name: rank, dtype: object
d.ix['ser']
col1 NaN col2 5077 Name: ser, dtype: object

** Exercise right now!** Make a data frame for which there are two rows with the same index i, showing that the index need not be unique! [Hint: we made such a series above]

What happens when you type d.ix[i]?

d = pd.DataFrame({'col1':{'name':'william', 'rank':'private', 'i': 1, 'i':2}, 'col2':{'name':'jon', 'rank':'general', 'ser':5077, 'i':1, 'i':3}}) d
col1 col2
i 2 3
name william jon
rank private general
ser NaN 5077
s.ix['name']
name john name nathan dtype: object

NOTE: Besides .ix, there are also other properties for various purposes. Read more about them here:

s d = pd.DataFrame({'a': s,'b': s}) d d.ix['name']
name john rank private number 389 name nathan dtype: object
a b
name john john
rank private private
number 389 389
name nathan nathan
a b
name john john
name nathan nathan
d = pd.DataFrame({'col1':{'name':'william', 'rank':'private'}, 'col2':{'name':'jon', 'rank':'general', 'ser':5077}}) d
col1 col2
name william jon
rank private general
ser NaN 5077
d.loc[['name', 'rank']]
col1 col2
name william jon
rank private general
d.iloc[1]
col1 private col2 general Name: rank, dtype: object
d.iloc[0]
col1 william col2 jon Name: name, dtype: object
d.iloc[-1]
col1 NaN col2 5077 Name: ser, dtype: object

Question: How to get all rows starting from index 1 up to the end?

Answer: slicing.

people = pd.DataFrame({"name": names, "age": age}) people
age name
0 42 William
1 68 Marlene
2 19 Jon
3 22 Jane
people[:2]
age name
0 42 William
1 68 Marlene
people[1:3]
age name
1 68 Marlene
2 19 Jon
people[::2]
age name
0 42 William
2 19 Jon

CSV

CSV is a very simple column-oriented data format, where a text file contains rows and a comma is a delimiter for the columns.

see animals.csv

The first row contains the header, describing the columns.

Pandas then imports this file via pd.read_csv() (the argument of this function could not only be a filename, but also a URL link 'http://...' pointint to a csv file)

open("animals.csv",'w').write( """name,type,tail,legs,age,color joe,elephant,Y,4,16,gray tim,dog,Y,4,7,golden susan,cat,Y,4,4,black frank,fish,N,0,1,green olivia,spider,N,6,.5,black minki,cat,Y,4,6,white""")
%sh cat animals.csv
name,type,tail,legs,age,color joe,elephant,Y,4,16,gray tim,dog,Y,4,7,golden susan,cat,Y,4,4,black frank,fish,N,0,1,green olivia,spider,N,6,.5,black minki,cat,Y,4,6,white
animals = pd.read_csv("animals.csv") animals
name type tail legs age color
0 joe elephant Y 4 16.0 gray
1 tim dog Y 4 7.0 golden
2 susan cat Y 4 4.0 black
3 frank fish N 0 1.0 green
4 olivia spider N 6 0.5 black
5 minki cat Y 4 6.0 white
# access columns animals.name
0 joe 1 tim 2 susan 3 frank 4 olivia 5 minki Name: name, dtype: object

The power of Pandas is, that you can start asking questions about this data, filter it, compare it with other datasets, or even picture it in a plot very easily.

animals.describe()
legs age
count 6.000000 6.000000
mean 3.666667 5.750000
std 1.966384 5.654644
min 0.000000 0.500000
25% 4.000000 1.750000
50% 4.000000 5.000000
75% 4.000000 6.750000
max 6.000000 16.000000
# average age? animals.age.mean()
5.75\displaystyle 5.75
# which ones are younger than 6? animals[animals.age < 6]
name type tail legs age color
2 susan cat Y 4 4.0 black
3 frank fish N 0 1.0 green
4 olivia spider N 6 0.5 black
animals.ix[animals.legs == 4]
name type tail legs age color
0 joe elephant Y 4 16.0 gray
1 tim dog Y 4 7.0 golden
2 susan cat Y 4 4.0 black
5 minki cat Y 4 6.0 white

Question: which animal(s) has/have the largest number of legs?

animals.name.ix[animals.legs == animals.legs.max()]
4 olivia Name: name, dtype: object
animals.legs
0 4 1 4 2 4 3 0 4 6 5 4 Name: legs, dtype: int64
animals.legs.plot.hist()

Plotting

In a related library "statsmodels", there are also various datasets included. Try plotting one of them (where it makes sense)

%auto from statsmodels import datasets
x = datasets.statecrime d = x.load_pandas().data # accesses the pandas dataframe d.head(5) # .head or .tail give you the n-th rows at top or at the bottom
violent murder hs_grad poverty single white urban
state
Alabama 459.9 7.1 82.1 17.5 29.0 70.0 48.65
Alaska 632.6 3.2 91.4 9.0 25.5 68.3 44.46
Arizona 423.2 5.5 84.2 16.5 25.7 80.0 80.07
Arkansas 530.3 6.3 82.4 18.8 26.3 78.4 39.54
California 473.4 5.4 80.6 14.2 27.8 62.7 89.73
d.violent.plot.hist()