GitHub Repository: YStrano/DataScience_GA
Path: blob/master/lessons/lesson_03/code/solution-code/code_3 (done).ipynb
Kernel: Python 3

Lesson 3 Code

Instructor: Amy Roberts, PhD

# General imports
from sklearn import datasets
from sklearn import metrics
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

Part 1. Basic Stats

Methods available include:
.min() - Compute minimum value
.max() - Compute maximum value
.mean() - Compute mean value
.median() - Compute median value
.mode() - Compute mode value(s)
.count() - Count the number of observations

Read in the examples

df = pd.DataFrame({'example1' : [18, 24, 17, 21, 24, 16, 29, 18],
                   'example2' : [75, 87, 49, 68, 75, 84, 98, 92],
                   'example3' : [55, 47, 38, 66, 56, 64, 44, 39]})
print(df)
   example1  example2  example3
0        18        75        55
1        24        87        47
2        17        49        38
3        21        68        66
4        24        75        56
5        16        84        64
6        29        98        44
7        18        92        39

Instructor example: Calculate the mean for each column

df.mean()
example1    20.875
example2    78.500
example3    51.125
dtype: float64

Students: Calculate the median, mode, max, and min for each example

Note: All answers should match your hand calculations

# maximum
df.max()
example1    29
example2    98
example3    66
dtype: int64
# minimum
df.min()
example1    16
example2    49
example3    38
dtype: int64
# median
df.median()
example1    19.5
example2    79.5
example3    51.0
dtype: float64
# mode
df.mode()
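The output of the mode cell is not shown above. As a hedged illustration, the sketch below rebuilds the same DataFrame and shows that `df.mode()` returns a DataFrame rather than a single row: each column lists all of its modes, and shorter columns are padded with NaN. The per-column loop is only one possible way to view the result without the padding.

```python
import pandas as pd

df = pd.DataFrame({'example1': [18, 24, 17, 21, 24, 16, 29, 18],
                   'example2': [75, 87, 49, 68, 75, 84, 98, 92],
                   'example3': [55, 47, 38, 66, 56, 64, 44, 39]})

# One row per mode; columns with fewer modes are padded with NaN
modes = df.mode()
print(modes)

# Per-column view without the NaN padding
for col in df.columns:
    print(col, df[col].mode().tolist())
```

Note that example1 has two modes (18 and 24 each appear twice), while every value in example3 appears exactly once, so pandas reports all eight values as modes.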

Part 2. Box Plot

Instructor: Interquartile range

print("50% Quartile:")
print(df.quantile(.50))
print("Median (red line of the box)")
print(df.median())
50% Quartile:
example1    19.5
example2    79.5
example3    51.0
Name: 0.5, dtype: float64
Median (red line of the box)
example1    19.5
example2    79.5
example3    51.0
dtype: float64
print("25% (bottom of the box)")
print(df.quantile(0.25))
print("75% (top of the box)")
print(df.quantile(0.75))
25% (bottom of the box)
example1    17.75
example2    73.25
example3    42.75
Name: 0.25, dtype: float64
75% (top of the box)
example1    24.00
example2    88.25
example3    58.00
Name: 0.75, dtype: float64
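The cells above print the quartiles but never the interquartile range itself. A minimal sketch, assuming the same DataFrame as above: the IQR is the 75% quantile minus the 25% quantile, which is the height of the box in the plot. The variable names `q1`, `q3`, and `iqr` are illustrative.

```python
import pandas as pd

df = pd.DataFrame({'example1': [18, 24, 17, 21, 24, 16, 29, 18],
                   'example2': [75, 87, 49, 68, 75, 84, 98, 92],
                   'example3': [55, 47, 38, 66, 56, 64, 44, 39]})

q1 = df.quantile(0.25)  # bottom of the box
q3 = df.quantile(0.75)  # top of the box
iqr = q3 - q1           # interquartile range: height of the box
print(iqr)
```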
df['example1'].plot(kind='box')
[Image: box plot of example1]

Student: Create plots for examples 2 and 3 and check the quartiles

df['example2'].plot(kind='box')
[Image: box plot of example2]
df['example3'].plot(kind='box')
[Image: box plot of example3]
df.plot(kind='box')
[Image: box plots of all three examples]

What does the circle in example 2 represent?

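Answer: The circle marks an outlier. The value 49 lies more than 1.5 times the interquartile range below the bottom of the box: 73.25 - 1.5 * (88.25 - 73.25) = 50.75, and 49 is below that limit.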
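One way to check which points a box plot would draw as circles is to apply the 1.5 * IQR rule directly. This is a sketch under the assumption that the plot uses matplotlib's default whisker setting of 1.5; `iqr_outliers` is a hypothetical helper, not part of pandas.

```python
import pandas as pd

df = pd.DataFrame({'example1': [18, 24, 17, 21, 24, 16, 29, 18],
                   'example2': [75, 87, 49, 68, 75, 84, 98, 92],
                   'example3': [55, 47, 38, 66, 56, 64, 44, 39]})

def iqr_outliers(series, whis=1.5):
    """Return the values lying more than whis * IQR outside the box."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    lower = q1 - whis * iqr
    upper = q3 + whis * iqr
    return series[(series < lower) | (series > upper)]

for col in df.columns:
    print(col, iqr_outliers(df[col]).tolist())
```

Only example2 has a point outside its limits (49), which matches the single circle in the plot.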

Part 3. Standard Deviation and Variance

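Variance: For a dataset, the variance measures how far the observations spread around their mean; it is the average squared distance from the mean. (In the bias-variance setting shown in the image below, variance is how much the predictions for a given point vary between different realizations of the model.)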

Standard Deviation: The square root of the variance

<img src='../../assets/images/biasVsVarianceImage.png' style="width: 30%; height: 30%">

In Pandas

Methods include:
.std() - Compute standard deviation
.var() - Compute variance

Let's calculate variance by hand first.
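For reference, the hand calculation in this part follows the usual sample formulas, written here with n observations x_i and sample mean x-bar. This notation is an assumption on my part and may differ from the symbols used in the lesson's image.

```latex
s^2 = \frac{1}{n-1}\sum_{i=1}^{n}\left(x_i - \bar{x}\right)^2,
\qquad
s = \sqrt{s^2}
```

The denominator is n - 1 rather than n, which matches the `denominator = n - 1` step in the code that follows.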

<img src='../../assets/images/samplevarstd.png' style="width: 50%; height: 50%">

# example1
mean = df["example1"].mean()
n = df["example1"].count()
print(df["example1"])
print('mean = ', mean)
print('n = ', n)
0    18
1    24
2    17
3    21
4    24
5    16
6    29
7    18
Name: example1, dtype: int64
mean =  20.875
n =  8
# written out by hand for instructional purposes
# if there is time, have the students refactor this to create a function to calculate variance for any dataset
# find the squared distance from the mean
obs0 = (18 - mean)**2
obs1 = (24 - mean)**2
obs2 = (17 - mean)**2
obs3 = (21 - mean)**2
obs4 = (24 - mean)**2
obs5 = (16 - mean)**2
obs6 = (29 - mean)**2
obs7 = (18 - mean)**2
print(obs0, obs1, obs2, obs3, obs4, obs5, obs6, obs7)
# sum each observation's squared distance from the mean
numerator = obs0 + obs1 + obs2 + obs3 + obs4 + obs5 + obs6 + obs7
denominator = n - 1
variance = numerator/denominator
print(numerator)
print(denominator)
print(variance)
8.265625 9.765625 15.015625 0.015625 9.765625 23.765625 66.015625 8.265625
140.875
7
20.125
# in pandas
print("Variance")
print(df["example1"].var())
Variance
20.125

Students: Calculate the standard deviation by hand for each sample

Recall that the standard deviation is the square root of the variance.

df.var()
example1     20.125000
example2    238.571429
example3    116.125000
dtype: float64
# standard deviation
print("example 1 SD = ", (20.125**(0.5)))
print("example 2 SD = ", (238.571429**(0.5)))
print("example 3 SD = ", (116.125**(0.5)))
example 1 SD =  4.4860896112315904
example 2 SD =  15.445757637616873
example 3 SD =  10.776131031126154
# now with pandas
df.std()
example1     4.486090
example2    15.445758
example3    10.776131
dtype: float64
df.head()
df.describe()
df.shape
df.dtypes
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8 entries, 0 to 7
Data columns (total 3 columns):
example1    8 non-null int64
example2    8 non-null int64
example3    8 non-null int64
dtypes: int64(3)
memory usage: 272.0 bytes
# find the variance for each dataset
a = df["example1"].var()
b = df["example2"].var()
c = df["example3"].var()
# df.var()
# calculate standard deviation by hand
a_std = np.sqrt(a)
b_std = b**(1/2)
c_std = np.sqrt(c)
print(a_std, b_std, c_std)
4.4860896112315904 15.44575762374344 10.776131031126154
# now do it with pandas!
df.std()
example1     4.486090
example2    15.445758
example3    10.776131
dtype: float64

Short Cut!

df.describe()

Student: Check understanding

Which value in the above table is the median?

Answer: The 50% row. The 50th percentile is the median, so it matches df.median().
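The table from `df.describe()` is not shown above. As a quick check, and assuming the same DataFrame as earlier in the lesson, the sketch below prints the summary and pulls out the `50%` row to compare it with `df.median()`.

```python
import pandas as pd

df = pd.DataFrame({'example1': [18, 24, 17, 21, 24, 16, 29, 18],
                   'example2': [75, 87, 49, 68, 75, 84, 98, 92],
                   'example3': [55, 47, 38, 66, 56, 64, 44, 39]})

summary = df.describe()  # count, mean, std, min, 25%, 50%, 75%, max
print(summary)

# The 50% row holds the 50th percentile of each column
median_row = summary.loc['50%']
print(median_row)
print(df.median())
```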

Part 4: Correlation

df.corr()
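By default `df.corr()` computes the Pearson correlation coefficient for every pair of columns. In the spirit of the by-hand variance calculation, the sketch below computes one coefficient manually and compares it with pandas. `pearson_r` is a hypothetical helper written for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'example1': [18, 24, 17, 21, 24, 16, 29, 18],
                   'example2': [75, 87, 49, 68, 75, 84, 98, 92],
                   'example3': [55, 47, 38, 66, 56, 64, 44, 39]})

def pearson_r(x, y):
    """Pearson correlation: co-deviation divided by the product of the spreads."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    dx = x - x.mean()  # distances from the mean of x
    dy = y - y.mean()  # distances from the mean of y
    return (dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum())

r_by_hand = pearson_r(df['example1'], df['example2'])
r_pandas = df['example1'].corr(df['example2'])
print(r_by_hand, r_pandas)
```

The two values should agree up to floating-point rounding, and the diagonal of `df.corr()` is always 1 because each column is perfectly correlated with itself.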
import seaborn as sns
sns.heatmap(df.corr(), cmap="YlGnBu")
[Image: heatmap of the correlation matrix]
sns.pairplot(df.corr())
[Image: pair plot of the correlation matrix]
sns.pairplot(df)
[Image: pair plot of the three examples]
df