CoCalc -- 2016-05-13.sagews

Real-time collaboration for Jupyter Notebooks, Linux Terminals, LaTeX, VS Code, R IDE, and more,
all in one place. Commercial Alternative to JupyterHub.

| Download

Project: math480-2016

Path: lectures / 2016-05-13 / 2016-05-13.sagews

Views: ³¹⁸³

2016-05-13

William Stein

Lectures 21: Stats (part 3 of 3)

Announcements:

Homework due at 6pm tonight.
Peer grading deadline extended until 6pm on Monday evening.
Next week: numpy
Today:
- statsmodels
- scikit-learn
- working session/QA.
REMIND ME: screencast
Here's a slideshow about the "data analysis and statistics in python", or why Python is (quickly getting) better than R.

statsmodels

"Python module that allows users to explore data, estimate statistical models, and perform statistical tests."

Documentation: http://statsmodels.sourceforge.net/stable/

%auto
%default_mode python
%typeset_mode True
import statsmodels.api as sm
import pandas
from patsy import dmatrices

1. The statsmodels "Getting started" tutorial!

We download the Guerry dataset, a collection of historical data used in support of Andre-Michel Guerry’s 1833 Essay on the Moral Statistics of France. The data set is hosted online in comma-separated values format (CSV) by the Rdatasets repository. We could download the file locally and then load it using read_csv, but pandas takes care of all of this automatically for us:

df = sm.datasets.get_rdataset("Guerry", "HistData").data   # this is a familiar Pandas dataframe
df.head()

	dept	Region	Department	Crime_pers	Crime_prop	Literacy	Donations	Infants	Suicides	MainCity	Wealth	Commerce	Clergy	Crime_parents	Infanticide	Donation_clergy	Lottery	Desertion	Instruction	Prostitutes	Distance	Area	Pop1831
0	1	E	Ain	28870	15890	37	5098	33120	35039	2:Med	73	58	11	71	60	69	41	55	46	13	218.372	5762	346.03
1	2	N	Aisne	26226	5521	51	8901	14572	12831	2:Med	22	10	82	4	82	36	38	82	24	327	65.945	7369	513.00
2	3	C	Allier	26747	7925	13	10973	17044	114121	2:Med	61	66	68	46	42	76	66	16	85	34	161.927	7340	298.26
3	4	E	Basses-Alpes	12935	7289	46	2733	23018	14238	1:Sm	76	49	5	70	12	37	80	32	29	2	351.399	6925	155.90
4	5	E	Hautes-Alpes	17488	8174	69	6962	23076	16171	1:Sm	83	65	10	22	23	64	79	35	7	1	320.280	5549	129.10

We select the variables of interest and look at the bottom 5 rows:

df = df[ ['Department', 'Lottery', 'Literacy', 'Wealth', 'Region'] ]
df.tail(5)

	Department	Lottery	Literacy	Wealth	Region
81	Vienne	40	25	68	W
82	Haute-Vienne	55	13	67	C
83	Vosges	14	62	82	E
84	Yonne	51	47	30	C
85	Corse	83	49	37	NaN

Notice that there is one missing observation in the Region column. We eliminate it using a DataFrame method provided by pandas:

df = df.dropna()
df.tail(5)

	Department	Lottery	Literacy	Wealth	Region
80	Vendee	68	28	56	W
81	Vienne	40	25	68	W
82	Haute-Vienne	55	13	67	C
83	Vosges	14	62	82	E
84	Yonne	51	47	30	C

Some statistics...

Substantive motivation and model: We want to know whether literacy rates in the 86 French departments are associated with per capita wagers on the Royal Lottery in the 1820s. We need to control for the level of wealth in each department, and we also want to include a series of dummy variables on the right-hand side of our regression equation to control for unobserved heterogeneity due to regional effects. The model is estimated using ordinary least squares regression (OLS).

(WARNING: I'm not a statistician...)

Use patsy‘s to create design matrices, then use statsmodels to do an ordinary least squares fit.

y,X = dmatrices('Lottery ~ Literacy + Wealth + Region', data=df, return_type='dataframe')
mod = sm.OLS(y, X)    # Describe model
res = mod.fit()       # Fit model
res.summary()         # Summarize model

                            OLS Regression Results                            
==============================================================================
Dep. Variable:                Lottery   R-squared:                       0.338
Model:                            OLS   Adj. R-squared:                  0.287
Method:                 Least Squares   F-statistic:                     6.636
Date:                Fri, 13 May 2016   Prob (F-statistic):           1.07e-05
Time:                        18:39:29   Log-Likelihood:                -375.30
No. Observations:                  85   AIC:                             764.6
Df Residuals:                      78   BIC:                             781.7
Df Model:                           6                                         
Covariance Type:            nonrobust                                         
===============================================================================
                  coef    std err          t      P>|t|      [95.0% Conf. Int.]
-------------------------------------------------------------------------------
Intercept      38.6517      9.456      4.087      0.000        19.826    57.478
Region[T.E]   -15.4278      9.727     -1.586      0.117       -34.793     3.938
Region[T.N]   -10.0170      9.260     -1.082      0.283       -28.453     8.419
Region[T.S]    -4.5483      7.279     -0.625      0.534       -19.039     9.943
Region[T.W]   -10.0913      7.196     -1.402      0.165       -24.418     4.235
Literacy       -0.1858      0.210     -0.886      0.378        -0.603     0.232
Wealth          0.4515      0.103      4.390      0.000         0.247     0.656
==============================================================================
Omnibus:                        3.049   Durbin-Watson:                   1.785
Prob(Omnibus):                  0.218   Jarque-Bera (JB):                2.694
Skew:                          -0.340   Prob(JB):                        0.260
Kurtosis:                       2.454   Cond. No.                         371.
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

statsmodels also provides graphics functions. For example, we can draw a plot of partial regression for a set of regressors by:

sm.graphics.plot_partregress('Lottery', 'Wealth', ['Region', 'Literacy'],
                              data=df, obs_labels=False)

2. Scikit Learn: Easy Machine Learning

From their website:

Machine Learning in Python
Simple and efficient tools for data mining and data analysis
Accessible to everybody, and reusable in various contexts
Built on NumPy, SciPy, and matplotlib
Open source, commercially usable - BSD license

from sklearn import datasets
digits = datasets.load_digits()

This "digits" thing is 1797 hand-written low-resolution numbers from the postcal code numbers on letters (zip codes)...

type(digits.data)

<type 'numpy.ndarray'>

digits.data

[[  0.   0.   5. ...,   0.   0.   0.]
 [  0.   0.   0. ...,  10.   0.   0.]
 [  0.   0.   0. ...,  16.   9.   0.]
 ..., 
 [  0.   0.   1. ...,   6.   0.   0.]
 [  0.   0.   2. ...,  12.   0.   0.]
 [  0.   0.  10. ...,  12.   1.   0.]]

digits.target

[0 1 2 ..., 8 9 8]

len(digits.target)

\displaystyle 1797

import matplotlib.pyplot as plt
def plot_data(n):
    print "scan of %s"%digits.target[n]
    show(plt.matshow(digits.data[n].reshape(8,8), interpolation="nearest", cmap=plt.cm.Greys_r))

digits.data[0]

[  0.   0.   5.  13.   9.   1.   0.   0.   0.   0.  13.  15.  10.  15.   5.
  0.   3.  15.   2.   0.  11.   8.   0.   0.   4.  12.   0.   0.   8.
  0.   0.   5.   8.   0.   0.   9.   8.   0.   0.   4.  11.   0.   1.
  7.   0.   0.   2.  14.   5.  10.  12.   0.   0.   0.   0.   6.  13.
  0.   0.   0.]

plot_data(0)

scan of 0

plot_data(389)

scan of 3

Example/goal here:

"In the case of the digits dataset, the task is to predict, given an image, which digit it represents. We are given 1797 samples of each of the 10 possible classes (the digits zero through nine) on which we fit an estimator to be able to predict the classes to which unseen samples belong."

from sklearn import svm     # use a support vector machine...
clf = svm.SVC(gamma=0.001, C=100.)
# We set params manually;, but scikit learn has techniques for finding these params automatically

plot_data(1796)

scan of 8

This clf is a "classifier". Right now it doesn't know anything at all about our actual data.

We teach it by passing in our data to the fit method...

"We use all the images of our dataset apart from the last one. We select this training set with the [:-1] Python syntax, which produces a new array that contains all but the last entry of digits.data:"

# Train our classifier by giving it 1796 scanned digits!
clf.fit(digits.data[:-1], digits.target[:-1])

SVC(C=100.0, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape=None, degree=3, gamma=0.001, kernel='rbf',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False)

plot_data(1796)  # the last one -- not given

scan of 8

/projects/sage/sage-6.10/local/lib/python2.7/site-packages/matplotlib-1.5.0-py2.7-linux-x86_64.egg/matplotlib/pyplot.py:516: RuntimeWarning: More than 20 figures have been opened. Figures created through the pyplot interface (`matplotlib.pyplot.figure`) are retained until explicitly closed and may consume too much memory. (To control this warning, see the rcParam `figure.max_open_warning`).
  max_open_warning, RuntimeWarning)

digits.data[-1:]

[[  0.   0.  10.  14.   8.   1.   0.   0.   0.   2.  16.  14.   6.   1.
  0.   0.   0.  15.  15.   8.  15.   0.   0.   0.   0.   5.  16.
 10.   0.   0.   0.   0.  12.  15.  15.  12.   0.   0.   0.   4.
  6.   4.  16.   6.   0.   0.   8.  16.  10.   8.  16.   8.   0.
  1.   8.  12.  14.  12.   1.   0.]]

digits.target[1796]

\displaystyle 8

# it predicts it is an 8 correctly!
clf.predict(digits.data[-1:])

[8]

** Exercise for you right now: ** Run the prediction on all 1797 scaned values to see how many are correct.

# To get you started, here is how to test the 5 scalled value.
digits.target[5]
clf.predict(digits.data[5:6])

\displaystyle 5

[5]

n = 15
digits.target[n] == clf.predict(digits.data[n:n+1])[0]

True

# After you do this -- completely recreate clf but trained on way less data!
︠2b70db74-74a3-483e-bd1b-fc159738507as︠
plot_data(5)

scan of 5

bonus

plots of low dimensional embeddings of the digits dataset:

http://scikit-learn.org/stable/auto_examples/manifold/plot_lle_digits.html

Real-time collaboration for Jupyter Notebooks, Linux Terminals, LaTeX, VS Code, R IDE, and more,
all in one place. Commercial Alternative to JupyterHub.

Math 480: Open Source Mathematical Software

2016-05-13

William Stein

Lectures 21: Stats (part 3 of 3)

statsmodels

1. The statsmodels "Getting started" tutorial!

2. Scikit Learn: Easy Machine Learning

bonus

Product

Resources

Company

Real-time collaboration for Jupyter Notebooks, Linux Terminals, LaTeX, VS Code, R IDE, and more, all in one place. Commercial Alternative to JupyterHub.

Math 480: Open Source Mathematical Software

2016-05-13

William Stein

**Lectures 21: Stats (part 3 of 3) **

statsmodels

1. The statsmodels "Getting started" tutorial!

2. Scikit Learn: Easy Machine Learning

bonus

Real-time collaboration for Jupyter Notebooks, Linux Terminals, LaTeX, VS Code, R IDE, and more,
all in one place. Commercial Alternative to JupyterHub.

Lectures 21: Stats (part 3 of 3)