GitHub Repository: YStrano/DataScience_GA
Path: blob/master/april_18/lessons/lesson-03/code/starter-code/starter-code-3.ipynb
Kernel: Python 2

Lesson 3 Codealong

Instructor: Amy Roberts, PhD

#General imports
from sklearn import datasets
from sklearn import metrics
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

Part 1. Basic Stats

Methods available include:
.min() - Compute minimum value
.max() - Compute maximum value
.mean() - Compute mean value
.median() - Compute median value
.mode() - Compute mode value(s)
.count() - Count the number of observations
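As a quick illustration of these methods, here is a minimal sketch on a small made-up Series. The name `toy` and its values are hypothetical and are not part of the lesson data. Note that `.mode()` returns a Series rather than a single number, because a dataset can have more than one mode. Each `print()` call takes a single argument so the snippet should run under both Python 2 and Python 3.

```python
import pandas as pd

# Hypothetical toy data, not one of the lesson's examples
toy = pd.Series([2, 4, 4, 5, 10])

print("min:    {}".format(toy.min()))
print("max:    {}".format(toy.max()))
print("mean:   {}".format(toy.mean()))
print("median: {}".format(toy.median()))
# .mode() returns a Series, so convert it to a list for display
print("mode:   {}".format(toy.mode().tolist()))
print("count:  {}".format(toy.count()))
```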

Read in the examples

df = pd.DataFrame({'example1' : [18, 24, 17, 21, 24, 16, 29, 18],
                   'example2' : [75, 87, 49, 68, 75, 84, 98, 92],
                   'example3' : [55, 47, 38, 66, 56, 64, 44, 39] })
print df

Instructor example: Calculate the mean for each column

df.mean()

Students: Calculate the median, mode, max, and min for each example

Note: All answers should match your hand calculations

#maximum
#minimum
#median
#mode

Part 2. Box Plot

Instructor: Interquartile range

print "50% Quartile:"
print df.quantile(.50)
print "Median (red line of the box)"
print df.median()

print "25% (bottom of the box)"
print df.quantile(0.25)
print "75% (top of the box)"
print df.quantile(0.75)
df['example1'].plot(kind='box')
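To see where the quartile values come from, the sketch below recomputes them for example 1 without pandas. It assumes the linear interpolation that `quantile` uses by default; the helper name `quantile_linear` is hypothetical and introduced only for illustration.

```python
def quantile_linear(values, q):
    """Return the q-th quantile (0 <= q <= 1) using linear interpolation."""
    ordered = sorted(values)
    # Fractional position of the quantile within the sorted data
    position = q * (len(ordered) - 1)
    lower = int(position)                     # index at or just below the position
    upper = min(lower + 1, len(ordered) - 1)  # next index, capped at the last item
    fraction = position - lower
    return ordered[lower] + fraction * (ordered[upper] - ordered[lower])

example1 = [18, 24, 17, 21, 24, 16, 29, 18]

q1 = quantile_linear(example1, 0.25)  # bottom of the box
q2 = quantile_linear(example1, 0.50)  # median line
q3 = quantile_linear(example1, 0.75)  # top of the box
iqr = q3 - q1                         # interquartile range

print("25%: {}".format(q1))   # 17.75
print("50%: {}".format(q2))   # 19.5
print("75%: {}".format(q3))   # 24.0
print("IQR: {}".format(iqr))  # 6.25
```

The interquartile range is the height of the box: the distance between the 25% and 75% quantiles.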

Student: Create plots for examples 2 and 3 and check the quartiles

What does the cross in example 2 represent?

Answer:

Part 3. Standard Deviation and Variance

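Variance: In the bias-variance sense shown in the image below, variance is how much the predictions for a given point vary between different realizations of the model. In the descriptive statistics computed in this part, the variance of a dataset measures how far its observations spread around their mean, based on their squared distances from the mean.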

Standard Deviation: The square root of the variance

<img src='../../assets/images/biasVsVarianceImage.png' style="width: 30%; height: 30%">

In Pandas

Methods include:
.std() - Compute standard deviation
.var() - Compute variance

Let's calculate variance by hand first.

<img src='../../assets/images/samplevarstd.png' style="width: 50%; height: 50%">
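The image itself is not reproduced in this text. Assuming it shows the usual sample formulas, which would be consistent with the `n - 1` denominator used in the hand calculation in this part, they can be written as follows, where $\bar{x}$ is the sample mean and $n$ is the number of observations:

```latex
s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2,
\qquad
s = \sqrt{s^2}
```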

#example1
mean = df["example1"].mean()
n = df["example1"].count()

print df["example1"]
print mean
print n

# written out by hand for instructional purposes
#if there is time, have the students refactor this to create a function to calculate variance for any dataset

#find the squared distance from the mean
obs0 = (18 - mean)**2
obs1 = (24 - mean)**2
obs2 = (17 - mean)**2
obs3 = (21 - mean)**2
obs4 = (24 - mean)**2
obs5 = (16 - mean)**2
obs6 = (29 - mean)**2
obs7 = (18 - mean)**2

print obs0, obs1, obs2, obs3, obs4, obs5, obs6, obs7

#sum each observation's squared distance from the mean
numerator = obs0 + obs1 + obs2 + obs3 + obs4 + obs5 + obs6 + obs7
denominator = n - 1
variance = numerator/denominator

print numerator
print denominator
print variance

# in pandas
print "Variance"
print df["example1"].var()
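The instructor note above suggests refactoring the hand calculation into a function. One possible sketch follows. The name `sample_variance` is hypothetical, and it keeps the same `n - 1` denominator as the hand calculation, which is also the default for the pandas `.var()` method.

```python
def sample_variance(values):
    """Sample variance: summed squared distances from the mean over (n - 1)."""
    values = list(values)
    n = len(values)
    if n < 2:
        raise ValueError("need at least two observations")
    # float() keeps the division exact under Python 2 as well
    mean = sum(values) / float(n)
    # squared distance of each observation from the mean
    squared_distances = [(x - mean) ** 2 for x in values]
    return sum(squared_distances) / (n - 1)

example1 = [18, 24, 17, 21, 24, 16, 29, 18]
variance = sample_variance(example1)
print("Variance of example1: {}".format(variance))  # 20.125
```

Because the function accepts any list of numbers, the same call should work for the other example columns, for instance by passing `df["example2"].tolist()`.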

Students: Calculate the standard deviation by hand for each sample

Recall that the standard deviation is the square root of the variance.

#find the variance for each dataset
#calculate standard deviation by hand
#now do it with pandas!

Short Cut!

df.describe()

Student: Check understanding

Which value in the above table is the median?

Answer:

Part 4. Correlation

df.corr()
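`df.corr()` returns a table of pairwise correlation coefficients between the columns, and by default pandas computes the Pearson coefficient. As a rough sketch of what a single entry in that table means, the hypothetical helper `pearson_r` below computes the coefficient for two lists by hand. It is for illustration only and is not how pandas implements it internally.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / float(n)
    mean_y = sum(ys) / float(n)
    # how the two variables move together around their means
    co_movement = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # how much each variable moves on its own
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return co_movement / (spread_x * spread_y)

example1 = [18, 24, 17, 21, 24, 16, 29, 18]
example2 = [75, 87, 49, 68, 75, 84, 98, 92]

r_12 = pearson_r(example1, example2)
print("example1 vs example2: {}".format(r_12))
```

The result always lies between -1 and 1. Values near 1 mean the two columns tend to rise together, values near -1 mean one tends to fall as the other rises, and values near 0 indicate little linear relationship.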