GitHub Repository: YStrano/DataScience_GA
Path: blob/master/april_18/lessons/lesson-03/code/solution-code/solution-code-3.ipynb

Kernel: Python 2

Lesson 3 - Solutions

Instructor: Amy Roberts, PhD

In [ ]:

#General imports
from sklearn import datasets
from sklearn import metrics
import pandas as pd
import numpy as np

import matplotlib.pyplot as plt
%matplotlib inline

Part 1. Basic Stats

Methods available include: 
	.min() - Compute minimum value
	.max() - Compute maximum value
	.mean() - Compute mean value
	.median() - Compute median value
	.mode() - Compute mode value(s)
	.count() - Count the number of observations

Read in the examples

In [ ]:

df = pd.DataFrame({'example1' : [18, 24, 17, 21, 24, 16, 29, 18], 'example2' : [75, 87, 49, 68, 75, 84, 98, 92], 'example3' : [55, 47, 38, 66, 56, 64, 44, 39] })
print df

Instructor example: Calculate the mean for each column

In [ ]:

df.mean()

Students: Calculate median, mode, max, min for the example

Note: All answers should match your hand calculations

In [ ]:

df.max()

In [ ]:

df.min()

In [ ]:

df.median()

In [ ]:

df.mode()
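To check the hand calculations, the median and mode can also be worked out in plain Python. This is a minimal sketch for example1 only; `median_by_hand` and `modes_by_hand` are illustrative helper names, not part of pandas.

```python
from collections import Counter

example1 = [18, 24, 17, 21, 24, 16, 29, 18]

def median_by_hand(values):
    """Middle value of the sorted data (average of the two middle values if n is even)."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return float(ordered[mid])
    return (ordered[mid - 1] + ordered[mid]) / 2.0

def modes_by_hand(values):
    """All values that share the highest count, in ascending order."""
    counts = Counter(values)
    highest = max(counts.values())
    return sorted(v for v, c in counts.items() if c == highest)

print(median_by_hand(example1))  # 19.5
print(modes_by_hand(example1))   # [18, 24]
```

example1 has two modes because 18 and 24 each appear twice, which is why `.mode()` can return more than one row.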

Part 2. Box Plot

Instructor: Interquartile range

In [ ]:

print "50% quantile (second quartile):"
print df.quantile(.50)
print "Median (red line of the box):"
print df.median()

In [ ]:

print "25% (bottom of the box)"
print df.quantile(0.25)
print "75% (top of the box)"
print df.quantile(0.75)
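The quartiles above usually fall between two observations, so pandas interpolates. A minimal sketch of that calculation, assuming the default linear interpolation; `quantile_by_hand` is an illustrative helper, not a pandas function.

```python
import math
import pandas as pd

def quantile_by_hand(values, q):
    """Linear interpolation between the two nearest sorted observations."""
    ordered = sorted(values)
    position = q * (len(ordered) - 1)       # fractional index into the sorted data
    lower = int(math.floor(position))
    upper = min(lower + 1, len(ordered) - 1)
    fraction = position - lower
    return ordered[lower] + fraction * (ordered[upper] - ordered[lower])

example1 = [18, 24, 17, 21, 24, 16, 29, 18]

for q in (0.25, 0.50, 0.75):
    by_hand = quantile_by_hand(example1, q)
    from_pandas = pd.Series(example1).quantile(q)
    print("q=%.2f  by hand: %.2f  pandas: %.2f" % (q, by_hand, from_pandas))
```

For example1 the 25% quantile sits at position 1.75 in the sorted data, three quarters of the way from 17 to 18, giving 17.75.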

In [ ]:

df['example1'].plot(kind='box')

Student: Create plots for examples 2 and 3 and check the quartiles

In [ ]:

df.plot(kind="box")

What does the cross in example 2 represent?

Answer: an outlier
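To see why that point is flagged, the whisker limits can be computed directly. This sketch assumes the plot uses matplotlib's default whisker length of 1.5 times the interquartile range (IQR); the variable names are illustrative.

```python
import pandas as pd

example2 = pd.Series([75, 87, 49, 68, 75, 84, 98, 92])

q1 = example2.quantile(0.25)
q3 = example2.quantile(0.75)
iqr = q3 - q1

# Points beyond these fences are drawn as individual markers, not as part of the whiskers
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

outliers = example2[(example2 < lower_fence) | (example2 > upper_fence)]
print("fences: %.2f to %.2f" % (lower_fence, upper_fence))
print(outliers.tolist())  # [49]
```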

Part 3. Standard Deviation and Variance

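Variance: A measure of how spread out the observations are around their mean, computed from the squared distances from the mean. (In a modeling context, variance also describes how much the predictions for a given point vary between different realizations of the model.)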

Standard Deviation: The square root of the variance

<img src='../../assets/images/biasVsVarianceImage.png' style="width: 30%; height: 30%">

In Pandas

Methods include: 
	.std() - Compute Standard Deviation
	.var() - Compute variance

Let's calculate variance by hand first.

<img src='../../assets/images/samplevarstd.png' style="width: 50%; height: 50%">
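For reference, the sample variance and standard deviation written out. This is the standard formulation with an n - 1 denominator, matching the hand calculation in this section; that the image shows exactly this notation is an assumption.

$$
s^2 = \frac{1}{n-1}\sum_{i=1}^{n}\left(x_i - \bar{x}\right)^2, \qquad s = \sqrt{s^2}
$$

Here $x_i$ are the observations, $\bar{x}$ is their mean, and $n$ is the number of observations.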

In [ ]:

#example1
mean = df["example1"].mean()
n= df["example1"].count()

print df["example1"]
print mean
print n

In [ ]:

# written out by hand for instructional purposes 
#if there is time, have the students refactor this to create a function to calculate variance for any dataset
#find the squared distance from the mean
obs0 = (18 - mean)**2
obs1 = (24 - mean)**2
obs2 = (17 - mean)**2
obs3 = (21 - mean)**2
obs4 = (24 - mean)**2
obs5 = (16 - mean)**2
obs6 = (29 - mean)**2
obs7 = (18 - mean)**2

print obs0, obs1, obs2, obs3, obs4, obs5, obs6, obs7

#sum each observation's squared distance from the mean
numerator = obs0 + obs1 + obs2 + obs3 + obs4 + obs5 + obs6 + obs7
#sample variance divides by n - 1 rather than n
denominator = n - 1
variance = numerator/denominator
print numerator
print denominator
print variance
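One possible refactor of the hand calculation into a reusable function, as suggested in the comment above. This is a sketch, not the only way to write it; `sample_variance` is an illustrative name.

```python
def sample_variance(values):
    """Sample variance: sum of squared distances from the mean, divided by n - 1."""
    n = len(values)
    if n < 2:
        raise ValueError("sample variance needs at least two observations")
    mean = sum(values) / float(n)
    squared_distances = [(x - mean) ** 2 for x in values]
    return sum(squared_distances) / (n - 1)

example1 = [18, 24, 17, 21, 24, 16, 29, 18]
example2 = [75, 87, 49, 68, 75, 84, 98, 92]
example3 = [55, 47, 38, 66, 56, 64, 44, 39]

print(sample_variance(example1))  # 20.125
print(sample_variance(example2))
print(sample_variance(example3))  # 116.125
```

The results should agree with `df.var()`, which also uses the n - 1 denominator by default.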

In [ ]:

# in pandas
print "Variance"
print df["example1"].var()

Students: Calculate the standard deviation by hand for each sample

Recall that standard deviation is the square root of the variance.

In [ ]:

df.var()

In [ ]:

#standard deviation: square root of each variance value printed by df.var() above
print "example 1 SD = ", (20.125**(0.5))
print "example 2 SD = ", (238.571429**(0.5))
print "example 3 SD = ", (116.125**(0.5))
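The same calculation can be done without copying the variance values by hand. A minimal sketch that takes the square root of `df.var()` directly, so no precision is lost to rounding:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'example1': [18, 24, 17, 21, 24, 16, 29, 18],
                   'example2': [75, 87, 49, 68, 75, 84, 98, 92],
                   'example3': [55, 47, 38, 66, 56, 64, 44, 39]})

# Standard deviation is the square root of the variance, column by column
sd_from_variance = np.sqrt(df.var())
print(sd_from_variance)
```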

In [ ]:

#now with pandas
df.std()

Shortcut!

In [ ]:

df.describe()

Student: Check understanding

Which value in the above table is the median?

Answer: 50%
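A quick way to confirm this is to pull the 50% row out of the summary table and compare it with `df.median()`. A small sketch, with `summary` as an illustrative variable name:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'example1': [18, 24, 17, 21, 24, 16, 29, 18],
                   'example2': [75, 87, 49, 68, 75, 84, 98, 92],
                   'example3': [55, 47, 38, 66, 56, 64, 44, 39]})

summary = df.describe()

# The row labelled '50%' holds the 50th percentile of each column
print(summary.loc['50%'])
print(df.median())
```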

Part 4: Correlation

In [ ]:

df.corr()
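By default `df.corr()` reports the Pearson correlation coefficient for each pair of columns. As with variance, it can be computed by hand. This is a minimal sketch for one pair; `pearson_by_hand` is an illustrative helper name.

```python
import pandas as pd

df = pd.DataFrame({'example1': [18, 24, 17, 21, 24, 16, 29, 18],
                   'example2': [75, 87, 49, 68, 75, 84, 98, 92],
                   'example3': [55, 47, 38, 66, 56, 64, 44, 39]})

def pearson_by_hand(x, y):
    """Sum of paired deviation products, scaled by the size of each column's deviations."""
    x_dev = x - x.mean()
    y_dev = y - y.mean()
    return (x_dev * y_dev).sum() / ((x_dev ** 2).sum() * (y_dev ** 2).sum()) ** 0.5

r = pearson_by_hand(df['example1'], df['example2'])
print(r)
print(df.corr().loc['example1', 'example2'])
```

Both printed values should agree. A value near 1 or -1 indicates a strong linear relationship, and a value near 0 indicates little or none.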