YStrano
GitHub Repository: YStrano/DataScience_GA
Path: blob/master/projects/project_2/starter-code/Project - 2.ipynb
Kernel: Python 3

Project 2

Step 1: Load the python libraries you will need for this project

# imports
# from __future__ import division  (only needed for Python 2)
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import statsmodels.api as sm
import pylab as pl
%matplotlib inline
/anaconda3/lib/python3.6/site-packages/statsmodels/compat/pandas.py:56: FutureWarning: The pandas.core.datetools module is deprecated and will be removed in a future version. Please use the pandas.tseries module instead. from pandas.core import datetools

Step 2: Read in your data set

# Read in data from source
df_raw = pd.read_csv("../assets/admissions.csv")
print(df_raw.head())
   admit    gre   gpa  prestige
0      0  380.0  3.61       3.0
1      1  660.0  3.67       3.0
2      1  800.0  4.00       1.0
3      1  640.0  3.19       4.0
4      0  520.0  2.93       4.0
list(df_raw.columns)
['admit', 'gre', 'gpa', 'prestige']

Questions

Question 1. How many observations are in our dataset?

df_raw.count() # count() returns a series with the values being the number of observations and the index being the column name
admit       400
gre         398
gpa         398
prestige    399
dtype: int64
df_raw.count().sum()
1595

Answer: 1595 non-null values in total across the four columns; the DataFrame itself contains 400 rows (observations).
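
To get the row count directly, independent of missing values, a minimal check:

len(df_raw)  # 400 rows; equivalently df_raw.shape[0]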

Question 2. Create a summary table

#function
df_raw.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 400 entries, 0 to 399
Data columns (total 4 columns):
admit       400 non-null int64
gre         398 non-null float64
gpa         398 non-null float64
prestige    399 non-null float64
dtypes: float64(3), int64(1)
memory usage: 12.6 KB
df_raw.describe()

Question 3. Why would GRE have a larger STD than GPA?

Answer: gre is measured in much larger units, so its std will be much larger in absolute terms (gpa runs from roughly 2.2 min to 4.0 max vs. 220 min to 800 max for gre)

A way to quantify this:

  • spread for gpa = 1.8, which relative to the lower bound of the range (1.8 / 2.2) is about .82

  • spread for gre = 580, which relative to the lower bound of the range (580 / 220) is about 2.64

so the relative spreads are roughly 82% versus 264%: gre has roughly three times as much room to vary even after normalizing each spread against its lower bound (see the sketch below)
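
A rough sketch of the comparison above, using df_raw from earlier (min, max and std all skip NaN values):

# compare absolute and relative spread for gpa and gre
# relative spread = (max - min) / min, i.e. the spread normalized by the lower bound
for col in ["gpa", "gre"]:
    col_min, col_max = df_raw[col].min(), df_raw[col].max()
    spread = col_max - col_min
    print(col, "std:", round(df_raw[col].std(), 2),
          "relative spread:", round(spread / col_min, 2))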

cont = []
for i in range(1, 21):
    cont.append(9 / i)
cont = [9/i for i in range(1, 21)]  # list comprehension: write the expression to append first, then the loop, and you can add an if at the end
cont
[9.0, 4.5, 3.0, 2.25, 1.8, 1.5, 1.2857142857142858, 1.125, 1.0, 0.9, 0.8181818181818182, 0.75, 0.6923076923076923, 0.6428571428571429, 0.6, 0.5625, 0.5294117647058824, 0.5, 0.47368421052631576, 0.45]
ind = list(range(1, 21))
df = pd.DataFrame(ind)
df['cont'] = cont
df.columns = ['denominator', 'value']
df = df.set_index('denominator')
df

in pandas, when you plot a DataFrame or Series, the index is automatically used as the x axis and the values as the y axis.

ax = df.plot(figsize=[11,8])  # keyword arguments don't get a space around '='
tit = plt.title('function: 9/x')
Image in a Jupyter notebook
  • this is a non-linear function

  • the rate of change (over this interval) is negative because the values are decreasing

  • the rate of change is the first-order derivative

  • the rate of change of the rate of change is not zero; it would only be zero if the function were linear and the rate of change constant, which in this case it is not

  • the rate of change of the rate of change is slowing down

  • if our rate of change is negative, what does it mean for it to be slowing down?

  • it means that the rate of change is becoming less negative, and the way to become less negative is to change positively

  • so the rate of change of the rate of change, or the second order derivative, is positive.

  • i.e. this function is convex, as opposed to being concave (if the second order derivative is negative)

  • things to remember when dealing with derivatives:

  • position is where you are

  • velocity is the rate of change of your position

  • acceleration is the rate of change of your velocity

  • extrapolation is predicting a data point outside of the dataset. think extra. e.g. if today is the last point in the dataset and we want tomorrow's data, predicting that would be an extrapolation.

  • interpolation is predicting values within the confines of the dataset, e.g. filling in missing values

  • we have been trying to calculate 9/11 using the known points 9/10 and 9/12, which is a non-linear interpolation problem.

  • if it were a linear interpolation, we would take the difference between 9/10 (.9) and 9/12 (.75), which is .15, divide it by 2 to get .075, so the midpoint would be .75 + .075 or .9 - .075 = .825 (see the sketch after this list)

  • we know that 9/11 is closer to .75 than .90, because the rate of change is more negative at the lower values.

  • we know the midpoint, which is .825, and we know that the value of 9/11 is closer to 9/12 (.75), so the value of 9/11 has to be between .75 and .825 (this may be enough tolerance to stop, but we will continue with one more cycle)

  • 9/11 > 8/10 (see notes) so we know that .8 < 9/11 < .825

  • 9/11 = .818
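
A small sketch of the interpolation argument above, comparing the straight-line midpoint with the true value of 9/11:

# linear interpolation of 9/11 from the neighbouring points 9/10 and 9/12
y0, y1 = 9 / 10, 9 / 12            # 0.9 and 0.75
linear_estimate = (y0 + y1) / 2    # midpoint of the chord: 0.825
true_value = 9 / 11                # ~0.818, below the chord because 9/x is convex
print(linear_estimate, true_value)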

Question 4. Drop data points with missing data

df_raw.isnull().sum()
admit       0
gre         2
gpa         2
prestige    1
dtype: int64
df_raw[df_raw.isnull().any(axis=1)]
new_df = df_raw.dropna()
new_df2 = df_raw[~df_raw.isnull().any(axis=1)].copy()
new_df2.count()
admit       397
gre         397
gpa         397
prestige    397
dtype: int64

Question 5. Confirm that you dropped the correct data. How can you tell?

new_df.count()
admit       397
gre         397
gpa         397
prestige    397
dtype: int64

Answer: .count() gives the same result for all columns.

knowing that we started with different non-null counts in different columns (due to null values), seeing them all line up at 397 per column means we dropped the rows with null values in the relevant columns.

three rows contained null values, which drops the row count from 400 to 397.

Question 6. Create box plots for GRE and GPA

# boxplot 1
p_gre = plt.boxplot(new_df['gre'])
Image in a Jupyter notebook
p2_gre = plt.boxplot(new_df['gre'], whis=[5,95])
Image in a Jupyter notebook
# boxplot 2
p_gpa = plt.boxplot(new_df['gpa'])
Image in a Jupyter notebook
p2_gpa = plt.boxplot(new_df['gpa'], whis=[0,100])
Image in a Jupyter notebook

Question 7. What do these plots show?

Answer:

the mean and std provide measures of central tendency and dispersion. these measures can be very useful, however they are sensitive to outliers. instead of aggregating the data arithmetically, you can also rank your data; rankings are generally much less prone to bias from outliers.

with that in mind, the box plot provides a nice way to visualize how the data is distributed with respect to its percentile ranks. the box represents the interquartile range, with its top at the 75th percentile and its bottom at the 25th percentile, and the line in the middle is the 50th percentile, or median.

the whiskers in a box plot are generally set up to represent some measure of the edges of the data: they can represent the min and max, or, as is commonly the case, the 5th and 95th percentiles of the data.

when the whiskers do not represent the min and max, the box plot draws the points outside the whiskers as outliers.

by default, matplotlib extends the whiskers to the most extreme data points within 1.5 x (75th percentile - 25th percentile) of the box edges.

you can set the whiskers to the 5th and 95th percentiles, or to the min and max (see the examples above).

in this case, the gre and gpa box plots show left/negative skewness (quantified in the sketch after the bullets below). there is an asymmetrical dispersion, with the higher values concentrated near the mean and the lower values more spread out. this means that

  • gpa: 75% of people score between 3.1 and 4.0, and the remaining 25% get much lower grades

  • gre: 75% of people score 520 and above, with that top 75% fitting into less of the plot's range than the bottom 25%

  • most people come well prepared for the gre or work hard for their gpa, and there may be an extraneous variable affecting the bottom 25%, such as resource constraints, language barriers, or a learning disability
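
A minimal sketch to quantify the skewness described above and to reproduce the default 1.5 x IQR whisker bounds for gre, using new_df from earlier:

# negative skew values confirm the left skewness of gre and gpa
print(new_df[["gre", "gpa"]].skew())

# default matplotlib whisker bounds for gre: 1.5 * IQR beyond the box edges
q1, q3 = new_df["gre"].quantile([0.25, 0.75])
iqr = q3 - q1
print("gre whisker bounds:", q1 - 1.5 * iqr, q3 + 1.5 * iqr)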

Question 8. Describe each distribution

# plot the distribution of each variable
new_df['admit'].value_counts() #no need to hist or plot this, bc this is categorical, so value counts works better
0    271
1    126
Name: admit, dtype: int64
new_df['admit'].value_counts() / new_df['admit'].value_counts().sum()
0    0.68262
1    0.31738
Name: admit, dtype: float64
ax1 = new_df['admit'].value_counts().plot.barh()
title1 = ax1.set_title('admit')
Image in a Jupyter notebook
  • more people were rejected than admitted

  • 32% acceptance rate and a 68% rejection rate

  • the ratio of admittance to non-admittance is roughly 1 to 2

ax2 = new_df['gre'].hist(figsize=(11,8), bins=40)
Image in a Jupyter notebook
new_df['gre'].mode()
0    620.0
dtype: float64
  • this data is not continuous

  • this is left/negative skewed

  • there's a large concentration of people who get the perfect score with a small bucket who just miss that mark

  • the modal bucket is a little over 600

ax3 = new_df['gpa'].hist(figsize=(11,8), bins=40, color="turquoise", alpha=.65)
Image in a Jupyter notebook
  • this data looks to be more continuous than the previous set

  • this is left/negative skewed

  • there's a large concentration of people who get a perfect gpa

  • certain gpa values at the rounder numbers between whole grades (.25 and .50) have a higher concentration of scores

  • the modal bucket is 4.0

new_df['gpa'].mode()
0    4.0
dtype: float64
new_df['prestige'].value_counts() #no need to hist or plot this, bc this is categorical, so value counts works better
2.0    148
3.0    121
4.0     67
1.0     61
Name: prestige, dtype: int64
new_df['prestige'].value_counts() / new_df['prestige'].value_counts().sum()
2.0    0.372796
3.0    0.304786
4.0    0.168766
1.0    0.153652
Name: prestige, dtype: float64
ax1 = (new_df['prestige'].value_counts()
       .sort_index(ascending=False)
       .plot.barh(figsize=(8, 5)))
title1 = ax1.set_title('prestige')
Image in a Jupyter notebook
  • 2 and 3 are the most common prestige scores, together making up about 68%

  • 1 and 4 contribute almost equally to the whole, at roughly 15-17% each

Question 9. If our model had an assumption of a normal distribution would we meet that requirement?

Answer:

  • for the categorical data, such as admittance and prestige, no

  • as for the continuous variables (gre and gpa), there is skewness to the left and an accumulation of values at the extremes, which also means no (see the sketch after this list)
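
As a sketch of how this could be checked formally, assuming scipy is available (it is not imported above), the D'Agostino-Pearson normality test:

from scipy import stats

# null hypothesis: the sample comes from a normal distribution
for col in ["gre", "gpa"]:
    stat, p = stats.normaltest(new_df[col])
    print(col, "p-value:", p)  # a very small p-value is evidence against normality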

Question 10. Does this distribution need correction? If so, why? How?

Answer:

  • yes, because the gre and gpa data have outliers and exhibit skewness

  • a log transformation could help reduce the skewness in the gpa and gre data sets

Question 11. Which of our variables are potentially colinear?

# create a correlation matrix for the data
new_df.corr()

rough rules for correlation:

  • anything above 5% is probably tangible

  • anything above 10% is tangible

correlation matrix:

  • with correlation you are looking at co-movement between two variables

  • with categorical data this doesn't really make sense, especially for binary data such as admittance

  • with prestige for example, there is presumably an ordering where the higher the value the higher the "prestige"

  • but what does an increase of 1 in prestige mean, what is the unit? these are buckets...

  • even more so for admit: 1 could just as easily mean non-admit and 0 admit, the data would have the same structure, and you could build an identical model to predict whether someone was admitted

  • there is no nuance or difference between someone who was almost admitted and someone who is nowhere near being admitted

  • whereas if you had a continuous probability of admittance, for example, looking at the correlation between that and gpa would be useful

  • thus correlation across categorical variables does not make much sense.

  • in this situation, leveraging relational data analysis techniques, e.g. looking at avg. gpa across the different prestige values would lead to much more useful information around how connected these variables are

Collinearity

  • collinearity in statistics refers to being able to predict, with accuracy, an explanatory variable in a dataset through a linear combination of the other explanatory variables.

  • i.e. you can predict with accuracy one explanatory variable using a multivariate regression trained on the others

  • a correlation matrix will only show you bivariate interconnectedness, as opposed to the multivariate interconnectedness required for collinearity

  • therefore a correlation matrix seems to be a poor approach for collinearity detection (see the VIF sketch after this list)

  • you have exogenous (X) and endogenous (y) variables and you are trying to predict y using X

  • if you can take one of the elements of X and predict it using a linear combination of the rest of X (which is a multivariate regression), that's not good

  • if you have highly interconnected X variables, that can lead to an unstable model that is very sensitive to slight changes in X

  • you might try to get rid of some of the variables involved in the interconnectedness to allow for a more stable model
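
One multivariate alternative, sketched here as an assumption rather than something done in the original notebook, is the variance inflation factor (VIF), which measures how well each X variable is explained by a linear combination of the others:

from statsmodels.stats.outliers_influence import variance_inflation_factor

# add a constant so each VIF is computed against a regression with an intercept
X = sm.add_constant(new_df[["gre", "gpa", "prestige"]])
vifs = pd.Series([variance_inflation_factor(X.values, i) for i in range(X.shape[1])],
                 index=X.columns)
vifs  # values well above ~5-10 suggest a variable is largely explained by the others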

groupby:

  • with a groupby you take one or more variables, break the remaining data columns into groups based on the values of what you are grouping by, and then run some kind of aggregation function over each of those groups.

order of operations (OOO):

  • select the columns you want to analyze

  • .groupby one or more of those columns, which means you break up the data into chunks based on what value of the grouped by column it falls under

  • run the function that you want to apply to each of those groups

new_df[["gpa", "gre", "prestige"] ].groupby("prestige").mean()
grouped = pd.concat(
    {"mean": new_df[["gpa", "gre", "prestige"]].groupby("prestige").mean()}, axis=1
).join(
    pd.concat(
        {"std": new_df[["gpa", "gre", "prestige"]].groupby("prestige").std()}, axis=1
    )
).swaplevel(axis=1)
grouped = grouped[sorted(grouped.columns)]
grouped
prestige_grouped = (new_df[["admit", "prestige"]]
                    .groupby(["admit", "prestige"])
                    .size()
                    .rename("count")
                    .to_frame())
prestige_grouped

Question 12. What did you find?

Answer: using the correlation matrix, gpa and gre appear noticeably correlated, at almost 40%

Question 13. Write an analysis plan for exploring the association between grad school admissions rates and prestige of undergraduate schools.

Answer:

  • prestige is categorical and it would be good to understand how it is bucketed

  • could more detailed data underlying the prestige assignment be obtained?

  • it would be good to know the variables associated with the buckets of admittance and prestige

  • i plan to use a groupby analysis to show the relationship between admittance and prestige (see the sketch after this list)
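
A rough sketch of that planned groupby (admit is 0/1, so the mean of admit within each prestige group is the admission rate):

admit_rate_by_prestige = new_df.groupby("prestige")["admit"].mean()
admit_rate_by_prestige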

Question 14. What is your hypothesis?

Answer:

my hypothesis is that the prestige of the undergrad school has an effect on grad school admittance.

remember: the null hypothesis is that prestige does not have an effect on admittance.

Bonus/Advanced

1. Bonus: Explore alternatives to dropping observations with missing data

2. Bonus: Log transform the skewed data

ax2 = new_df['gre'].hist(figsize=(11,8), bins=40)
Image in a Jupyter notebook
ax2 = new_df['gre'].apply(np.log).hist(figsize=(11,8), bins=40)
Image in a Jupyter notebook
ax3 = new_df['gpa'].hist(figsize=(11,8), bins=40, color="turquoise", alpha=.65)
Image in a Jupyter notebook
ax3 = new_df['gpa'].apply(np.log).hist(figsize=(11,8), bins=40, color="turquoise", alpha=.65)
Image in a Jupyter notebook
  • log reduces positive skewness, and thus increases negative skewness

  • need to research how to reduce negative skewness (one option is sketched below)
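
One common option, sketched here as an assumption rather than something done in the notebook, is to reflect the data about its maximum and then log-transform; this turns left skew into right skew before the log compresses it (note the transformed axis runs in the opposite direction):

# reflect gpa about its maximum (+1 to avoid log(0)), then log-transform
reflected_log_gpa = np.log(new_df["gpa"].max() + 1 - new_df["gpa"])
ax4 = reflected_log_gpa.hist(figsize=(11, 8), bins=40)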

3. Advanced: Impute missing data
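
A minimal imputation sketch, assuming simple median/mode fills are acceptable here (the variable name df_imputed is hypothetical):

df_imputed = df_raw.copy()
df_imputed["gre"] = df_imputed["gre"].fillna(df_imputed["gre"].median())
df_imputed["gpa"] = df_imputed["gpa"].fillna(df_imputed["gpa"].median())
df_imputed["prestige"] = df_imputed["prestige"].fillna(df_imputed["prestige"].mode()[0])
df_imputed.isnull().sum()  # should now be all zeros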