GitHub Repository: YStrano/DataScience_GA
Path: blob/master/projects/project_2/starter-code/Project - 2 - Yair Strano.ipynb
Kernel: Python 3

Project 2

In this project, you will implement the exploratory analysis plan developed in Project 1. This will lay the groundwork for our first modeling exercise in Project 3.

Step 1: Load the Python libraries you will need for this project

# imports
# from __future__ import division  # only needed for Python 2
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import statsmodels.api as sm
import pylab as pl
%matplotlib inline
C:\Users\ystrano\AppData\Local\Continuum\anaconda3\lib\site-packages\statsmodels\compat\pandas.py:56: FutureWarning: The pandas.core.datetools module is deprecated and will be removed in a future version. Please use the pandas.tseries module instead.
  from pandas.core import datetools

Step 2: Read in your data set

# Read in data from source
df_raw = pd.read_csv("../assets/admissions.csv")
print(df_raw.head())
   admit    gre   gpa  prestige
0      0  380.0  3.61       3.0
1      1  660.0  3.67       3.0
2      1  800.0  4.00       1.0
3      1  640.0  3.19       4.0
4      0  520.0  2.93       4.0
list(df_raw.columns)
['admit', 'gre', 'gpa', 'prestige']

Questions

Question 1. How many observations are in our dataset?

df_raw.count()
admit       400
gre         398
gpa         398
prestige    399
dtype: int64
df_raw.count().sum()
1595

Answer: the dataset has 400 observations (rows); summing the non-null counts across all four columns gives 1595 values.
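For reference, a direct way to check the number of rows (a small sketch added for clarity, not part of the original notebook):

# shape gives (rows, columns); len() gives the row count alone
print(df_raw.shape)
print(len(df_raw))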

Question 2. Create a summary table

df_raw.describe()

Question 3. Why would GRE have a larger STD than GPA?

Answer:

the units of measure are much larger for gre, so its std will be much larger in absolute terms (gpa runs from a 2.2 min to a 4.0 max, versus a 220 min to an 800 max for gre).

A way to quantify this:

  • spread for gpa = 4.0 - 2.2 = 1.8, which proportionate to the lower bound of the range (1.8 / 2.2) is about 0.82

  • spread for gre = 800 - 220 = 580, which proportionate to the lower bound of the range (580 / 220) is about 2.64

so we have a relative spread of roughly 82% versus 264%; even after normalizing each spread against its lower bound, gre has roughly 3 times as much room to vary.
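A quick sketch of these relative-spread numbers computed directly from the data (the coefficient-of-variation comparison at the end is an added aside, not part of the original answer):

desc = df_raw[['gre', 'gpa']].describe()

# relative spread: (max - min) / min, as calculated above
rel_spread = (desc.loc['max'] - desc.loc['min']) / desc.loc['min']
print(rel_spread)

# coefficient of variation (std / mean): another scale-free measure of dispersion
print(desc.loc['std'] / desc.loc['mean'])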

Question 4. Drop data points with missing data

df_raw.isnull().sum()
admit       0
gre         2
gpa         2
prestige    1
dtype: int64
new_df = df_raw.dropna()
new_df.count()
admit       397
gre         397
gpa         397
prestige    397
dtype: int64

Question 5. Confirm that you dropped the correct data. How can you tell?

new_df.count()
admit       397
gre         397
gpa         397
prestige    397
dtype: int64

Answer: .count() gives the same result for all columns.

knowing that we started with different non-null counts in different columns (due to null values), seeing them all line up at 397 per column means we dropped the rows containing nulls.

three rows contained null values, which drops the row count from 400 to 397 (the five missing values are concentrated in those three rows, so some rows were missing more than one field).
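One way to see exactly which rows were dropped (a minimal sketch added for illustration):

# rows with at least one null value; these are the three rows dropna() removes
df_raw[df_raw.isnull().any(axis=1)]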

Question 6. Create box plots for GRE and GPA

# boxplot 1
p_gre = plt.boxplot(new_df['gre'])
Image in a Jupyter notebook
p2_gre = plt.boxplot(new_df['gre'], whis=[5,95])
Image in a Jupyter notebook
# boxplot 2
p_gpa = plt.boxplot(new_df['gpa'])
Image in a Jupyter notebook
p2_gpa = plt.boxplot(new_df['gpa'], whis=[0,100])
Image in a Jupyter notebook

Question 7. What do these plots show?

Answer:

the mean and std provide measures of central tendency and dispersion. these measures can be very useful, but they are very sensitive to outliers. instead of summarizing the data with averages, you can also rank your data; rank-based (percentile) statistics are generally much less prone to bias from outliers.

with that in mind, the box plot provides a nice way to visualize how the data is distributed with respect to its percentile ranks. the box represents the interquartile range, with its top at the 75th percentile and its bottom at the 25th percentile; the line in the middle is the 50th percentile, i.e. the median.

the whiskers in a box plot are generally set up to represent some measure of the edges of the data. they can mark the min and max of the data, or, as is also commonly the case, the 5th and 95th percentiles.

when the whiskers do not represent the min and max, the box plot draws any points beyond them individually as outliers.

matplotlib defaults to extending the whiskers to the last data point within 1.5 x IQR of the box, i.e. no higher than the 75th percentile + 1.5 x (75th percentile - 25th percentile) and no lower than the 25th percentile - 1.5 x (75th percentile - 25th percentile).

you can instead set the whiskers to the 5th and 95th percentiles, or to the min and max, by passing a pair of percentiles to whis (see the examples above).

in this case, the gre and gpa box plots show left/negative skewness. there is an asymmetrical dispersion, with the higher values more concentrated near the median and the lower values more spread out (a quick numerical check follows the list below). this means that

  • gpa: 75% of people score between roughly 3.1 and 4.0, and the remaining 25% get much lower grades.

  • gre: 75% of people score 520 and above, with that top 75% fitting in less space on the plot than the bottom 25%.

  • most people come very prepared for the gre or work hard for their gpa, and there may be an extraneous variable that the bottom 25% share, such as resource problems, language problems, or a learning disability.
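A quick numerical check of the skewness claim (a small sketch; the .skew() call is an addition to the original notebook):

# negative values indicate left/negative skew
new_df[['gre', 'gpa']].skew()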

Question 8. Describe each distribution

# plot the distribution of each variable
new_df['admit'].value_counts()
0    271
1    126
Name: admit, dtype: int64
new_df['admit'].value_counts() / new_df['admit'].value_counts().sum()
0    0.68262
1    0.31738
Name: admit, dtype: float64
ax1 = new_df['admit'].value_counts().plot.barh()
title1 = ax1.set_title('admit')
Image in a Jupyter notebook
  • more people were rejected than admitted

  • 32% acceptance rate and a 68% rejection rate

  • the ratio of admittance to non-admittance is roughly 1 to 2

ax2 = new_df['gre'].hist(figsize=(11,8), bins=40)
Image in a Jupyter notebook
new_df['gre'].mode()
0    620.0
dtype: float64
  • this data is not continuous; gre scores fall on discrete values

  • it is left/negative skewed

  • there is a large concentration of people who get the perfect score, with a smaller bucket of people who just miss that mark

  • the modal bucket is a little over 600 (the mode is 620)

ax3 = new_df['gpa'].hist(figsize=(11,8), bins=40, color="turquoise", alpha=.65)
Image in a Jupyter notebook
  • this data looks to be more continuous than the gre scores

  • it is left/negative skewed

  • there is a large concentration of people who get a perfect gpa

  • certain gpa's that fall between the rounder numbers (the .25 and .50 marks) have a higher concentration of scores

  • the modal value is 4.0

new_df['gpa'].mode()
0    4.0
dtype: float64
new_df['prestige'].value_counts()
2.0    148
3.0    121
4.0     67
1.0     61
Name: prestige, dtype: int64
new_df['prestige'].value_counts() / new_df['prestige'].value_counts().sum()
2.0    0.372796
3.0    0.304786
4.0    0.168766
1.0    0.153652
Name: prestige, dtype: float64
ax1 = new_df['prestige'].value_counts().sort_index(ascending=False).plot.barh(figsize=(8, 5))
title1 = ax1.set_title('prestige')
Image in a Jupyter notebook
  • 2 and 3 are the most common prestige scores, together making up about 68% of the data

  • 1 and 4 contribute almost equally to the whole, at roughly 15-17% each
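As an aside (not part of the original notebook), pandas can also draw histograms for every numeric column in one call, which is a convenient way to eyeball all of the distributions before describing them:

# one histogram per numeric column of new_df
axes = new_df.hist(figsize=(11, 8), bins=20)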

Question 9. If our model had an assumption of a normal distribution would we meet that requirement?

Answer:

  • for the categorical data such as admittance and prestige, no.

  • as for gre and gpa, there is skewness to the left and an accumulation of values at the upper extreme, which means no.
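If a more formal check were wanted, a sketch using SciPy (an addition; SciPy is not imported in the original notebook) could test the continuous variables:

from scipy import stats

# D'Agostino-Pearson normality test: a small p-value suggests the data is not normal
for col in ['gre', 'gpa']:
    stat, p = stats.normaltest(new_df[col])
    print(col, round(stat, 2), round(p, 4))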

Question 10. Does this distribution need correction? If so, why? How?

Answer:

  • yes, because the gre and gpa data have outliers and exhibit skewness.

  • a log transformation is one common way to reduce skewness in the gpa and gre data (though, as the Bonus section below shows, a plain log reduces positive skew, so this left-skewed data may need a reflected or otherwise adjusted transform).

Question 11. Which of our variables are potentially colinear?

# create a correlation matrix for the data
new_df.corr()

Question 12. What did you find?

Answer:

using the correlation matrix, gpa and gre appear noticeably correlated, with a coefficient of almost 0.40.

correlation matrix:

  • with correlation you are looking at co-movement between two variables.

  • with categorical data this doesn't really make sense, especially for binary data such as admittance.

  • with prestige for example, there is presumably an ordering where the higher the value, the higher the "prestige".

  • but what does an increase of 1 in prestige mean; what is the unit? these are buckets...

  • even more so for admit: 1 could just as easily mean non-admit and 0 admit; the data would have the same structure, and you could build an identical model to predict whether someone was admitted.

  • there is no nuance or difference between someone who was almost admitted and someone who is nowhere near being admitted.

  • whereas if you had a continuous probability of admittance, for example, looking at the correlation between that and gpa would be useful.

  • thus correlation across categorical variables does not make much sense.

  • in this situation, leveraging relational data analysis techniques, e.g. looking at the avg. gpa across the different prestige values (as done for Question 13 below), would give much more useful information about how connected these variables are.

collinearity:

  • collinearity, in statistics, refers to being able to accurately predict one explanatory variable in a dataset through a linear combination of the other explanatory variables.

  • i.e. you can predict with accuracy one explanatory variable using a multivariate regression trained on the others.

  • a correlation matrix will only show you bivariate interconnectedness, as opposed to the multivariate interconnectedness required for collinearity.

  • therefore a correlation matrix seems to be a poor approach for collinearity detection (a VIF-style check is sketched below).
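A multivariate alternative (a sketch, not part of the original notebook) is the variance inflation factor from statsmodels, which regresses each explanatory variable on all of the others:

from statsmodels.stats.outliers_influence import variance_inflation_factor

# explanatory variables plus an intercept column
X = sm.add_constant(new_df[['gre', 'gpa', 'prestige']])

# VIF = 1 / (1 - R^2) from regressing each column on the rest;
# values well above 5-10 suggest collinearity (ignore the 'const' row)
for i, col in enumerate(X.columns):
    print(col, variance_inflation_factor(X.values, i))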

Question 13. Write an analysis plan for exploring the association between grad school admissions rates and prestige of undergraduate schools.

Answer:

  • prestige is categorical and it would be good to understand how it is bucketed.

  • could more detailed data underlying prestige assignment be obtained?

  • it would be good to know the variables associated with the buckets of admittance and prestige.

  • I plan to use a groupby analysis to show the relationship between admittance and prestige.

prestige_grouped = (
    new_df[["admit", "prestige"]]
    .groupby(["admit", "prestige"])
    .size()
    .rename("count")
    .to_frame()
)
prestige_grouped
new_df[["gpa", "gre", "prestige"]].groupby("prestige").mean()
grouped = (
    pd.concat({"mean": new_df[["gpa", "gre", "prestige"]].groupby("prestige").mean()}, axis=1)
    .join(pd.concat({"std": new_df[["gpa", "gre", "prestige"]].groupby("prestige").std()}, axis=1))
    .swaplevel(axis=1)
)
grouped = grouped[sorted(grouped.columns)]
grouped
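A complementary view of the same relationship (a sketch; pd.crosstab is not used in the original notebook) is the admission rate within each prestige bucket:

# proportion admitted (admit = 1) within each prestige tier
pd.crosstab(new_df['prestige'], new_df['admit'], normalize='index')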

Question 14. What is your hypothesis?

Answer:

my hypothesis is that the prestige of the undergrad institution has an effect on grad school admittance.
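One way this hypothesis could eventually be tested (a sketch using SciPy, offered as an assumption rather than part of the stated plan) is a chi-square test of independence on the admit-by-prestige contingency table:

from scipy.stats import chi2_contingency

# contingency table of counts: admit (rows) by prestige (columns)
table = pd.crosstab(new_df['admit'], new_df['prestige'])

# a small p-value would suggest admittance and prestige are not independent
chi2, p, dof, expected = chi2_contingency(table)
print(chi2, p, dof)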

Bonus/Advanced

1. Bonus: Explore alternatives to dropping observations with missing data

2. Bonus: Log transform the skewed data

ax2 = new_df['gre'].hist(figsize=(11,8), bins=40)
Image in a Jupyter notebook
ax2 = new_df['gre'].apply(np.log).hist(figsize=(11,8), bins=40)
Image in a Jupyter notebook
ax3 = new_df['gpa'].hist(figsize=(11,8), bins=40, color="turquoise", alpha=.65)
Image in a Jupyter notebook
ax3 = new_df['gpa'].apply(np.log).hist(figsize=(11,8), bins=40, color="turquoise", alpha=.65)
Image in a Jupyter notebook
  • a log transform reduces positive skewness, and so it actually increases the negative skewness already present here.

  • need to research how to reduce negative skewness; one common option is sketched below.
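One common option (an addition, sketched here rather than researched in the original notebook) is to reflect the variable before taking the log, turning the left skew into a right skew that the log can then reduce:

# reflect gpa so the long left tail becomes a right tail, then log-transform
reflected_gpa = new_df['gpa'].max() + 1 - new_df['gpa']
ax = np.log(reflected_gpa).hist(figsize=(11, 8), bins=40)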

3. Advanced: Impute missing data
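A minimal imputation sketch (an assumption about how this bonus might be approached, not the notebook's actual solution) that keeps all 400 rows by filling the numeric columns with their medians and prestige with its most common bucket:

imputed_df = df_raw.copy()

# numeric columns: fill missing values with the column median
for col in ['gre', 'gpa']:
    imputed_df[col] = imputed_df[col].fillna(imputed_df[col].median())

# categorical column: fill missing values with the most common bucket
imputed_df['prestige'] = imputed_df['prestige'].fillna(imputed_df['prestige'].mode()[0])

imputed_df.count()  # every column should now show 400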