GitHub Repository: YStrano/DataScience_GA
Path: blob/master/lessons/lesson_03/code/solution-code/code_3 (done).ipynb

Kernel: Python 3

Lesson 3 Code

Instructor: Amy Roberts, PhD

In [1]:

#General imports
from sklearn import datasets
from sklearn import metrics
import pandas as pd
import numpy as np

import matplotlib.pyplot as plt
%matplotlib inline

Part 1. Basic Stats

Methods available include: 
	.min() - Compute minimum value
	.max() - Compute maximum value
	.mean() - Compute mean value
	.median() - Compute median value
	.mode() - Compute mode value(s)
	.count() - Count the number of observations

Read in the examples

In [2]:

df = pd.DataFrame({'example1' : [18, 24, 17, 21, 24, 16, 29, 18], 
                   'example2' : [75, 87, 49, 68, 75, 84, 98, 92], 
                   'example3' : [55, 47, 38, 66, 56, 64, 44, 39] })
print(df)

Out[2]:

   example1  example2  example3
0        18        75        55
1        24        87        47
2        17        49        38
3        21        68        66
4        24        75        56
5        16        84        64
6        29        98        44
7        18        92        39

Instructor example: Calculate the mean for each column

In [3]:

df.mean()

Out[3]:

example1    20.875
example2    78.500
example3    51.125
dtype: float64

Students: Calculate median, mode, max, min for each example

Note: All answers should match your hand calculations

In [4]:

#maximum
df.max()

Out[4]:

example1    29
example2    98
example3    66
dtype: int64

In [5]:

#minimum
df.min()

Out[5]:

example1    16
example2    49
example3    38
dtype: int64

In [6]:

#median
df.median()

Out[6]:

example1    19.5
example2    79.5
example3    51.0
dtype: float64
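
To compare the pandas result with a hand calculation, the median can be worked out step by step. The sketch below does this for example1 only; the names `ordered` and `hand_median` are illustrative and not part of the lesson code.

```python
import pandas as pd

example1 = pd.Series([18, 24, 17, 21, 24, 16, 29, 18])

# Step 1: sort the observations
ordered = sorted(example1.tolist())
n = len(ordered)

# Step 2: take the middle value, or average the two middle values when n is even
if n % 2 == 1:
    hand_median = ordered[n // 2]
else:
    hand_median = (ordered[n // 2 - 1] + ordered[n // 2]) / 2

print(ordered)
print(hand_median)
```

With eight observations there is no single middle value, so the median is the average of the 4th and 5th sorted values.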

In [7]:

#mode
df.mode()

Out[7]:
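
A column can have more than one mode, which is why `.mode()` returns a table rather than a single number per column. As a minimal sketch (the names `counts` and `modes` are illustrative), the modes of example1 can be checked by counting how often each value occurs:

```python
import pandas as pd

df = pd.DataFrame({'example1': [18, 24, 17, 21, 24, 16, 29, 18],
                   'example2': [75, 87, 49, 68, 75, 84, 98, 92],
                   'example3': [55, 47, 38, 66, 56, 64, 44, 39]})

# Count how often each value occurs in example1
counts = df['example1'].value_counts()
print(counts)

# The modes are all values that reach the highest count
modes = sorted(counts[counts == counts.max()].index.tolist())
print(modes)
```

In example3 every value occurs exactly once, so every value ties for the mode.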

Part 2. Box Plot

Instructor: Interquartile range

In [8]:

print ("50% Quartile:")
print (df.quantile(.50))
print ("Median (red line of the box)")
print (df.median())

Out[8]:

50% Quartile:
example1    19.5
example2    79.5
example3    51.0
Name: 0.5, dtype: float64
Median (red line of the box)
example1    19.5
example2    79.5
example3    51.0
dtype: float64

In [9]:

print("25% (bottom of the box)")
print(df.quantile(0.25))
print("75% (top of the box)")
print(df.quantile(0.75))

Out[9]:

25% (bottom of the box)
example1    17.75
example2    73.25
example3    42.75
Name: 0.25, dtype: float64
75% (top of the box)
example1    24.00
example2    88.25
example3    58.00
Name: 0.75, dtype: float64
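
The interquartile range (IQR) itself is the distance between these two quartiles, which is the height of the box. A short sketch, using the illustrative names `q1`, `q3`, and `iqr`:

```python
import pandas as pd

df = pd.DataFrame({'example1': [18, 24, 17, 21, 24, 16, 29, 18],
                   'example2': [75, 87, 49, 68, 75, 84, 98, 92],
                   'example3': [55, 47, 38, 66, 56, 64, 44, 39]})

q1 = df.quantile(0.25)  # bottom of the box
q3 = df.quantile(0.75)  # top of the box
iqr = q3 - q1           # height of the box

print(iqr)
```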

In [10]:

df['example1'].plot(kind='box')

Out[10]:

<matplotlib.axes._subplots.AxesSubplot at 0xb325c50>

Student: Create plots for examples 2 and 3 and check the quartiles

In [11]:

df['example2'].plot(kind='box')

Out[11]:

<matplotlib.axes._subplots.AxesSubplot at 0xb4f4470>

In [12]:

df['example3'].plot(kind='box')

Out[12]:

<matplotlib.axes._subplots.AxesSubplot at 0xb578048>

In [13]:

df.plot(kind='box')

Out[13]:

<matplotlib.axes._subplots.AxesSubplot at 0xb537080>

What does the circle in example 2 represent?

Answer:
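
One way to work this out: by default, matplotlib extends the whiskers to the most extreme observations that lie within 1.5 × IQR of the box, and draws any observation beyond that as an individual point (an outlier). The sketch below applies that rule to example2; the names `lower_fence` and `upper_fence` are illustrative.

```python
import pandas as pd

example2 = pd.Series([75, 87, 49, 68, 75, 84, 98, 92], name='example2')

q1 = example2.quantile(0.25)
q3 = example2.quantile(0.75)
iqr = q3 - q1

# Observations outside these limits are drawn as individual points
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

outliers = example2[(example2 < lower_fence) | (example2 > upper_fence)]
print(lower_fence, upper_fence)
print(outliers.tolist())
```

Only the value 49 falls below the lower limit, so it is the point plotted as a circle.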

Part 3. Standard Deviation and Variance

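Variance: A measure of how spread out the observations are, computed as the average squared distance of each observation from the mean. (In a modeling context, as in the bias-variance image below, variance describes how much the predictions for a given point vary between different realizations of the model.)

Standard Deviation: The square root of the variance, expressed in the same units as the data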

<img src='../../assets/images/biasVsVarianceImage.png' style="width: 30%; height: 30%">

In Pandas

Methods include: 
	.std() - Compute Standard Deviation
	.var() - Compute variance

Let's calculate variance by hand first.

<img src='../../assets/images/samplevarstd.png' style="width: 50%; height: 50%">
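
Assuming the image shows the usual sample formulas (it is not reproduced here), they can be written as follows, where $x_i$ are the observations, $\bar{x}$ is their mean, and $n$ is the number of observations:

```latex
s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1},
\qquad
s = \sqrt{s^2}
```

The denominator $n - 1$ matches the `denominator = n - 1` used in the hand calculation in this section.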

In [14]:

#example1
mean = df["example1"].mean()
n= df["example1"].count()

print (df["example1"])
print ('mean = ', mean)
print ('n = ', n)

Out[14]:

0    18
1    24
2    17
3    21
4    24
5    16
6    29
7    18
Name: example1, dtype: int64
mean =  20.875
n =  8

In [15]:

# Written out by hand for instructional purposes.
# If there is time, have the students refactor this into a function that calculates the variance of any dataset.
# Find each observation's squared distance from the mean.

obs0 = (18 - mean)**2
obs1 = (24 - mean)**2
obs2 = (17 - mean)**2
obs3 = (21 - mean)**2
obs4 = (24 - mean)**2
obs5 = (16 - mean)**2
obs6 = (29 - mean)**2
obs7 = (18 - mean)**2

print (obs0, obs1, obs2, obs3, obs4, obs5, obs6, obs7)

# Sum each observation's squared distance from the mean
numerator = obs0 + obs1 + obs2 + obs3 + obs4 + obs5 + obs6 + obs7
# Divide by n - 1 (sample variance) rather than n
denominator = n - 1
variance = numerator / denominator
print(numerator)
print(denominator)
print(variance)

Out[15]:

8.265625 9.765625 15.015625 0.015625 9.765625 23.765625 66.015625 8.265625
140.875
7
20.125
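
The refactoring suggested in the comments could look something like the sketch below. The function name `sample_variance` is hypothetical, and this is only one possible way to write it; the result for each column should agree with pandas' `.var()`.

```python
import pandas as pd

def sample_variance(values):
    """Sample variance: sum of squared distances from the mean, divided by n - 1."""
    values = list(values)
    n = len(values)
    if n < 2:
        raise ValueError("need at least two observations")
    mean = sum(values) / n
    # Squared distance of each observation from the mean
    squared_distances = [(x - mean) ** 2 for x in values]
    return sum(squared_distances) / (n - 1)

df = pd.DataFrame({'example1': [18, 24, 17, 21, 24, 16, 29, 18],
                   'example2': [75, 87, 49, 68, 75, 84, 98, 92],
                   'example3': [55, 47, 38, 66, 56, 64, 44, 39]})

# Compare the hand-written function with pandas for every column
for column in df.columns:
    print(column, sample_variance(df[column]), df[column].var())
```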

In [16]:

# in pandas
print ("Variance")
print (df["example1"].var())

Out[16]:

Variance
20.125
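
The pandas result matches the hand calculation because `.var()` divides by n - 1 by default (`ddof=1`). NumPy's `np.var` divides by n by default (`ddof=0`), so it gives a smaller number unless `ddof=1` is passed. A brief sketch of the difference, with illustrative variable names:

```python
import numpy as np
import pandas as pd

example1 = pd.Series([18, 24, 17, 21, 24, 16, 29, 18])
values = example1.to_numpy()

sample_var = example1.var()          # pandas default: divide by n - 1
population_var = np.var(values)      # NumPy default: divide by n
matched_var = np.var(values, ddof=1) # NumPy with ddof=1: divide by n - 1

print(sample_var, population_var, matched_var)
```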

Students: Calculate the standard deviation by hand for each sample

Recall that the standard deviation is the square root of the variance.

In [28]:

df.var()

Out[28]:

example1     20.125000
example2    238.571429
example3    116.125000
dtype: float64

In [29]:

#standard deviation
print("example 1 SD = ", (20.125**(0.5)))
print("example 2 SD = ", (238.571429**(0.5)))
print("example 3 SD = ", (116.125**(0.5)))

Out[29]:

example 1 SD =  4.4860896112315904
example 2 SD =  15.445757637616873
example 3 SD =  10.776131031126154

In [30]:

#now with pandas
df.std()

Out[30]:

example1     4.486090
example2    15.445758
example3    10.776131
dtype: float64

In [17]:

df.head()
df.describe()
df.shape
df.dtypes
df.info()

Out[17]:

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8 entries, 0 to 7
Data columns (total 3 columns):
example1    8 non-null int64
example2    8 non-null int64
example3    8 non-null int64
dtypes: int64(3)
memory usage: 272.0 bytes

In [18]:

#find the variance for each dataset
a = df["example1"].var()
b = df["example2"].var()
c = df["example3"].var()
#df.var()

In [19]:

#calculate standard deviation by hand
a_std = np.sqrt(a)
b_std = b**(1/2)
c_std = np.sqrt(c)

print(a_std, b_std, c_std)

Out[19]:

4.4860896112315904 15.44575762374344 10.776131031126154

In [20]:

#now do it with pandas!
df.std()

Out[20]:

example1     4.486090
example2    15.445758
example3    10.776131
dtype: float64

Shortcut!

In [21]:

df.describe()

Out[21]:

Student: Check understanding

Which value in the above table is the median?
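
One way to check: `describe()` labels the quartiles as rows `25%`, `50%`, and `75%`, and the `50%` row is the median. A quick sketch comparing that row with `df.median()`:

```python
import pandas as pd

df = pd.DataFrame({'example1': [18, 24, 17, 21, 24, 16, 29, 18],
                   'example2': [75, 87, 49, 68, 75, 84, 98, 92],
                   'example3': [55, 47, 38, 66, 56, 64, 44, 39]})

summary = df.describe()

# The 50% row of the summary table holds the median of each column
print(summary.loc['50%'])
print(df.median())
```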