GitHub Repository: YStrano/DataScience_GA
Path: blob/master/april_18/lessons/lesson-03/code/solution-code/solution-code-3.ipynb
Kernel: Python 2

Lesson 3 - Solutions

Instructor: Amy Roberts, PhD

# General imports
from sklearn import datasets
from sklearn import metrics
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

Part 1. Basic Stats

Methods available include:
.min() - Compute minimum value
.max() - Compute maximum value
.mean() - Compute mean value
.median() - Compute median value
.mode() - Compute mode value(s)
.count() - Count the number of observations

Read in the examples

df = pd.DataFrame({'example1' : [18, 24, 17, 21, 24, 16, 29, 18],
                   'example2' : [75, 87, 49, 68, 75, 84, 98, 92],
                   'example3' : [55, 47, 38, 66, 56, 64, 44, 39] })
print df

Instructor example: Calculate the mean for each column

df.mean()

Students: Calculate median, mode, max, min for the example

Note: All answers should match your hand calculations

df.max()
df.min()
df.median()
df.mode()
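
One way to check the hand calculations is a plain-Python sketch of the median and mode for a single column. The helper names `median` and `modes` are hypothetical (they are not pandas functions), and the prints use one parenthesized argument so the code should behave the same under Python 2 or 3. The results for example1 should agree with `df.median()` and `df.mode()`.

```python
# Cross-check sketch in plain Python (no pandas); helper names are hypothetical.
from collections import Counter

example1 = [18, 24, 17, 21, 24, 16, 29, 18]

def median(values):
    """Middle value of the sorted data; average the two middle values if n is even."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return float(ordered[mid])
    return (ordered[mid - 1] + ordered[mid]) / 2.0

def modes(values):
    """All values that share the highest count, in ascending order."""
    counts = Counter(values)
    highest = max(counts.values())
    return sorted(v for v, c in counts.items() if c == highest)

print("median: {}".format(median(example1)))  # 19.5
print("modes: {}".format(modes(example1)))    # [18, 24]
```

example1 has two modes because 18 and 24 each appear twice.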

Part 2. Box Plot

Instructor: Interquartile range

print "50% Quartile:"
print df.quantile(.50)
print "Median (red line of the box)"
print df.median()

print "25% (bottom of the box)"
print df.quantile(0.25)
print "75% (top of the box)"
print df.quantile(0.75)
df['example1'].plot(kind='box')

Student: Create plots for examples 2 and 3 and check the quartiles

df.plot(kind="box")

What does the cross in example 2 represent?

Answer: an outlier
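
To see why that point is drawn separately, the sketch below applies the 1.5 × IQR rule that matplotlib box plots use by default to example2. It is plain Python with a hypothetical `quantile` helper that interpolates linearly between neighbouring values, which is also the default behaviour of `df.quantile()`.

```python
# Sketch of the 1.5 * IQR outlier rule in plain Python; `quantile` is a hypothetical helper.
example2 = [75, 87, 49, 68, 75, 84, 98, 92]

def quantile(values, q):
    """Quantile with linear interpolation between the two nearest sorted values."""
    ordered = sorted(values)
    pos = q * (len(ordered) - 1)          # fractional index into the sorted data
    lower = int(pos)
    upper = min(lower + 1, len(ordered) - 1)
    frac = pos - lower
    return ordered[lower] + frac * (ordered[upper] - ordered[lower])

q1 = quantile(example2, 0.25)
q3 = quantile(example2, 0.75)
iqr = q3 - q1

# Points beyond these fences are plotted individually instead of inside the whiskers.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = [v for v in example2 if v < lower_fence or v > upper_fence]

print("Q1 = {}, Q3 = {}, IQR = {}".format(q1, q3, iqr))
print("fences: {} to {}".format(lower_fence, upper_fence))
print("outliers: {}".format(outliers))  # [49]
```

The value 49 falls below the lower fence of 50.75, so it is the only point in example2 flagged as an outlier.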

Part 3. Standard Deviation and Variance

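Variance: A measure of how spread out the values are around their mean, computed as the average squared distance from the mean. In a modeling context, variance also describes how much the predictions for a given point vary between different realizations of the model.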

Standard Deviation: The square root of the variance

<img src='../../assets/images/biasVsVarianceImage.png' style="width: 30%; height: 30%">

In Pandas

Methods include:
.std() - Compute standard deviation
.var() - Compute variance

Let's calculate variance by hand first.

<img src='../../assets/images/samplevarstd.png' style="width: 50%; height: 50%">

# example1
mean = df["example1"].mean()
n = df["example1"].count()
print df["example1"]
print mean
print n

# written out by hand for instructional purposes
# if there is time, have the students refactor this to create a function to calculate variance for any dataset

# find the squared distance from the mean
obs0 = (18 - mean)**2
obs1 = (24 - mean)**2
obs2 = (17 - mean)**2
obs3 = (21 - mean)**2
obs4 = (24 - mean)**2
obs5 = (16 - mean)**2
obs6 = (29 - mean)**2
obs7 = (18 - mean)**2
print obs0, obs1, obs2, obs3, obs4, obs5, obs6, obs7

# sum each observation's squared distance from the mean
numerator = obs0 + obs1 + obs2 + obs3 + obs4 + obs5 + obs6 + obs7
denominator = n - 1
variance = numerator/denominator
print numerator
print denominator
print variance

# in pandas
print "Variance"
print df["example1"].var()
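
The refactoring suggested in the comments could look like the sketch below: a hypothetical `sample_variance` function in plain Python that repeats the same steps (squared distances from the mean, summed, divided by n - 1) for any list of numbers. This is one possible refactoring, not the only one.

```python
# Sketch of a reusable sample-variance function; the name `sample_variance` is hypothetical.
def sample_variance(values):
    """Sum of squared distances from the mean, divided by n - 1."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two observations")
    mean = sum(values) / float(n)
    squared_distances = [(x - mean) ** 2 for x in values]
    return sum(squared_distances) / (n - 1)

example1 = [18, 24, 17, 21, 24, 16, 29, 18]
print("variance: {}".format(sample_variance(example1)))  # 20.125
```

The result for example1 matches the hand calculation and `df["example1"].var()`.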

Students: Calculate the standard deviation by hand for each sample

Recall that standard deviation is the square root of the variance.

df.var()
# standard deviation
print "example 1 SD = ", (20.125**(0.5))
print "example 2 SD = ", (238.571429**(0.5))
print "example 3 SD = ", (116.125**(0.5))

# now with pandas
df.std()
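
Typing the variances in by hand can be avoided by computing the standard deviation directly from the data. The sketch below is plain Python with a hypothetical `std` helper; its `ddof` argument mirrors the parameter of the same name in pandas and NumPy. `df.std()` defaults to `ddof=1` (divide by n - 1), while `np.std` defaults to `ddof=0` (divide by n), so the two give different answers on the same data unless `ddof` is set explicitly.

```python
# Sketch: standard deviation with a ddof switch; `std` is a hypothetical helper.
import math

def std(values, ddof=1):
    """Square root of the variance; ddof=1 is the sample form, ddof=0 the population form."""
    n = len(values)
    mean = sum(values) / float(n)
    total = sum((x - mean) ** 2 for x in values)
    return math.sqrt(total / (n - ddof))

example1 = [18, 24, 17, 21, 24, 16, 29, 18]

sample_sd = std(example1)              # divides by n - 1, like df.std()
population_sd = std(example1, ddof=0)  # divides by n, like np.std's default

print("sample SD: {}".format(sample_sd))
print("population SD: {}".format(population_sd))
```

The sample form is always the larger of the two, since it divides the same sum by a smaller number.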

Short Cut!

df.describe()

Student: Check understanding

Which value in the above table is the median?

Answer: the 50% row (the 50th percentile)

Part 4. Correlation

df.corr()
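
`df.corr()` returns pairwise Pearson correlation coefficients by default. As a by-hand counterpart, the sketch below computes the coefficient for two columns in plain Python. The `pearson` helper is hypothetical, and its result for example1 and example2 should match the corresponding entry of the `df.corr()` table.

```python
# Sketch of the Pearson correlation coefficient in plain Python; `pearson` is a hypothetical helper.
import math

def pearson(xs, ys):
    """Sum of paired deviation products, divided by the root of the two sums of squares."""
    if len(xs) != len(ys):
        raise ValueError("both columns need the same number of observations")
    n = len(xs)
    mean_x = sum(xs) / float(n)
    mean_y = sum(ys) / float(n)
    # How the two columns move together around their means
    co_moment = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # How much each column varies on its own
    ss_x = sum((x - mean_x) ** 2 for x in xs)
    ss_y = sum((y - mean_y) ** 2 for y in ys)
    return co_moment / math.sqrt(ss_x * ss_y)

example1 = [18, 24, 17, 21, 24, 16, 29, 18]
example2 = [75, 87, 49, 68, 75, 84, 98, 92]

r = pearson(example1, example2)
print("correlation of example1 and example2: {}".format(r))
```

The coefficient always lies between -1 and 1, and a column correlated with itself gives 1, which is why the diagonal of the `df.corr()` table is all ones.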