GitHub Repository: YStrano/DataScience_GA
Path: blob/master/projects/project_2/starter-code/Project - 2 - Yair Strano.ipynb
Kernel: Python 3

Project 2

In this project, you will implement the exploratory analysis plan developed in Project 1. This will lay the groundwork for our first modeling exercise in Project 3.

Step 1: Load the Python libraries you will need for this project

# imports
# from __future__ import division  # only needed for Python 2
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import statsmodels.api as sm
import pylab as pl
%matplotlib inline
C:\Users\ystrano\AppData\Local\Continuum\anaconda3\lib\site-packages\statsmodels\compat\pandas.py:56: FutureWarning: The pandas.core.datetools module is deprecated and will be removed in a future version. Please use the pandas.tseries module instead.
  from pandas.core import datetools

Step 2: Read in your data set

# Read in data from source
df_raw = pd.read_csv("../assets/admissions.csv")
print(df_raw.head())
   admit    gre   gpa  prestige
0      0  380.0  3.61       3.0
1      1  660.0  3.67       3.0
2      1  800.0  4.00       1.0
3      1  640.0  3.19       4.0
4      0  520.0  2.93       4.0
list(df_raw.columns)
['admit', 'gre', 'gpa', 'prestige']

Questions

Question 1. How many observations are in our dataset?

df_raw.count()
admit       400
gre         398
gpa         398
prestige    399
dtype: int64
df_raw.count().sum()
1595

Answer: the dataset has 400 observations (rows); summing the non-null counts across all four columns gives 1595 values.
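For reference, a direct way to check the number of rows (a small sketch added for clarity, not part of the original notebook):

# shape gives (rows, columns); len() gives the row count alone
print(df_raw.shape)
print(len(df_raw))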

Question 2. Create a summary table

df_raw.describe()

Question 3. Why would GRE have a larger STD than GPA?

Answer:

the units of measure are much larger for gre, so its std will be much larger in absolute terms (gpa runs from a 2.2 min to a 4.0 max, versus a 220 min to an 800 max for gre).

A way to quantify this:

  • spread for gpa = 4.0 - 2.2 = 1.8, which proportionate to the lower bound of the range (1.8 / 2.2) is about 0.82

  • spread for gre = 800 - 220 = 580, which proportionate to the lower bound of the range (580 / 220) is about 2.64

so we have a relative spread of roughly 82% versus 264%; even after normalizing each spread against its lower bound, gre has roughly 3 times as much room to vary.
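A quick sketch of these relative-spread numbers computed directly from the data (the coefficient-of-variation comparison at the end is an added aside, not part of the original answer):

desc = df_raw[['gre', 'gpa']].describe()

# relative spread: (max - min) / min, as calculated above
rel_spread = (desc.loc['max'] - desc.loc['min']) / desc.loc['min']
print(rel_spread)

# coefficient of variation (std / mean): another scale-free measure of dispersion
print(desc.loc['std'] / desc.loc['mean'])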

Question 4. Drop data points with missing data

df_raw.isnull().sum()
admit       0
gre         2
gpa         2
prestige    1
dtype: int64
new_df = df_raw.dropna()
new_df.count()
admit       397
gre         397
gpa         397
prestige    397
dtype: int64

Question 5. Confirm that you dropped the correct data. How can you tell?

new_df.count()
admit       397
gre         397
gpa         397
prestige    397
dtype: int64

Answer: .count() gives the same result for all columns.

knowing that we started with different non-null counts in different columns (due to null values), seeing them all line up at 397 per column means we dropped the rows containing nulls.

three rows contained null values, which drops the row count from 400 to 397 (the five missing values are concentrated in those three rows, so some rows were missing more than one field).
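One way to see exactly which rows were dropped (a minimal sketch added for illustration):

# rows with at least one null value; these are the three rows dropna() removes
df_raw[df_raw.isnull().any(axis=1)]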

Question 6. Create box plots for GRE and GPA

# boxplot 1
p_gre = plt.boxplot(new_df['gre'])
Image in a Jupyter notebook
p2_gre = plt.boxplot(new_df['gre'], whis=[5,95])
Image in a Jupyter notebook
# boxplot 2
p_gpa = plt.boxplot(new_df['gpa'])
Image in a Jupyter notebook
p2_gpa = plt.boxplot(new_df['gpa'], whis=[0,100])
Image in a Jupyter notebook

Question 7. What do these plots show?

Answer:

the mean and std provide measures of central tendency and dispersion. these measures can be very useful, but they are very sensitive to outliers. instead of summarizing the data with averages, you can also rank your data; rank-based (percentile) statistics are generally much less prone to bias from outliers.

with that in mind, the box plot provides a nice way to visualize how the data is distributed with respect to its percentile ranks. the box represents the interquartile range, with its top at the 75th percentile and its bottom at the 25th percentile; the line in the middle is the 50th percentile, i.e. the median.

the whiskers in a box plot are generally set up to represent some measure of the edges of the data. they can mark the min and max of the data, or, as is also commonly the case, the 5th and 95th percentiles.

when the whiskers do not represent the min and max, the box plot draws any points beyond them individually as outliers.

matplotlib defaults to extending the whiskers to the last data point within 1.5 x IQR of the box, i.e. no higher than the 75th percentile + 1.5 x (75th percentile - 25th percentile) and no lower than the 25th percentile - 1.5 x (75th percentile - 25th percentile).

you can instead set the whiskers to the 5th and 95th percentiles, or to the min and max, by passing a pair of percentiles to whis (see the examples above).

in this case, the gre and gpa box plots show left/negative skewness. there is an asymmetrical dispersion, with the higher values more concentrated near the median and the lower values more spread out (a quick numerical check follows the list below). this means that

  • gpa: 75% of people score between roughly 3.1 and 4.0, and the remaining 25% get much lower grades.

  • gre: 75% of people score 520 and above, with that top 75% fitting in less space on the plot than the bottom 25%.

  • most people come very prepared for the gre or work hard for their gpa, and there may be an extraneous variable that the bottom 25% share, such as resource problems, language problems, or a learning disability.
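A quick numerical check of the skewness claim (a small sketch; the .skew() call is an addition to the original notebook):

# negative values indicate left/negative skew
new_df[['gre', 'gpa']].skew()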

Question 8. Describe each distribution

# plot the distribution of each variable
new_df['admit'].value_counts()
0    271
1    126
Name: admit, dtype: int64
new_df['admit'].value_counts() / new_df['admit'].value_counts().sum()
0    0.68262
1    0.31738
Name: admit, dtype: float64
ax1 = new_df['admit'].value_counts().plot.barh()
title1 = ax1.set_title('admit')
Image in a Jupyter notebook
  • more people were rejected than admitted

  • 32% acceptance rate and a 68% rejection rate

  • the ratio of admittance to non-admittance is roughly 1 to 2

ax2 = new_df['gre'].hist(figsize=(11,8), bins=40)
Image in a Jupyter notebook
new_df['gre'].mode()
0    620.0
dtype: float64
  • this data is not continuous; gre scores fall on discrete values

  • it is left/negative skewed

  • there is a large concentration of people who get the perfect score, with a smaller bucket of people who just miss that mark

  • the modal bucket is a little over 600 (the mode is 620)

ax3 = new_df['gpa'].hist(figsize=(11,8), bins=40, color="turquoise", alpha=.65)
Image in a Jupyter notebook
  • this data looks to be more continuous than the gre scores

  • it is left/negative skewed

  • there is a large concentration of people who get a perfect gpa

  • certain gpa's that fall between the rounder numbers (the .25 and .50 marks) have a higher concentration of scores

  • the modal value is 4.0

new_df['gpa'].mode()
0    4.0
dtype: float64
new_df['prestige'].value_counts()
2.0    148
3.0    121
4.0     67
1.0     61
Name: prestige, dtype: int64
new_df['prestige'].value_counts() / new_df['prestige'].value_counts().sum()
2.0    0.372796
3.0    0.304786
4.0    0.168766
1.0    0.153652
Name: prestige, dtype: float64
ax1 = new_df['prestige'].value_counts().sort_index(ascending=False).plot.barh(figsize=(8, 5))
title1 = ax1.set_title('prestige')
Image in a Jupyter notebook
  • 2 and 3 are the most common prestige scores, together making up about 68% of the data

  • 1 and 4 contribute almost equally to the whole, at roughly 15-17% each
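As an aside (not part of the original notebook), pandas can also draw histograms for every numeric column in one call, which is a convenient way to eyeball all of the distributions before describing them:

# one histogram per numeric column of new_df
axes = new_df.hist(figsize=(11, 8), bins=20)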

Question 9. If our model had an assumption of a normal distribution would we meet that requirement?

Answer:

  • for the categorical data such as admittance and prestige, no.

  • as for gre and gpa, there is skewness to the left and an accumulation of values at the upper extreme, which means no.
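If a more formal check were wanted, a sketch using SciPy (an addition; SciPy is not imported in the original notebook) could test the continuous variables:

from scipy import stats

# D'Agostino-Pearson normality test: a small p-value suggests the data is not normal
for col in ['gre', 'gpa']:
    stat, p = stats.normaltest(new_df[col])
    print(col, round(stat, 2), round(p, 4))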

Question 10. Does this distribution need correction? If so, why? How?

Answer:

  • yes, because the gre and gpa data have outliers and exhibit skewness.

  • a log transformation is one common way to reduce skewness in the gpa and gre data (though, as the Bonus section below shows, a plain log reduces positive skew, so this left-skewed data may need a reflected or otherwise adjusted transform).

Question 11. Which of our variables are potentially colinear?

# create a correlation matrix for the data
new_df.corr()

Question 12. What did you find?

Answer:

using the correlation matrix, gpa and gre appear noticeably correlated, with a coefficient of almost 0.40.

correlation matrix:

  • with correlation you are looking at co-movement between two variables.

  • with categorical data this doesn't really make sense, especially for binary data such as admittance.

  • with prestige for example, there is presumably an ordering where the higher the value, the higher the "prestige".

  • but what does an increase of 1 in prestige mean; what is the unit? these are buckets...

  • even more so for admit: 1 could just as easily mean non-admit and 0 admit; the data would have the same structure, and you could build an identical model to predict whether someone was admitted.

  • there is no nuance or difference between someone who was almost admitted and someone who is nowhere near being admitted.

  • whereas if you had a continuous probability of admittance, for example, looking at the correlation between that and gpa would be useful.

  • thus correlation across categorical variables does not make much sense.

  • in this situation, leveraging relational data analysis techniques, e.g. looking at the avg. gpa across the different prestige values (as done for Question 13 below), would give much more useful information about how connected these variables are.

collinearity:

  • collinearity, in statistics, refers to being able to accurately predict one explanatory variable in a dataset through a linear combination of the other explanatory variables.

  • i.e. you can predict with accuracy one explanatory variable using a multivariate regression trained on the others.

  • a correlation matrix will only show you bivariate interconnectedness, as opposed to the multivariate interconnectedness required for collinearity.

  • therefore a correlation matrix seems to be a poor approach for collinearity detection (a VIF-style check is sketched below).
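A multivariate alternative (a sketch, not part of the original notebook) is the variance inflation factor from statsmodels, which regresses each explanatory variable on all of the others:

from statsmodels.stats.outliers_influence import variance_inflation_factor

# explanatory variables plus an intercept column
X = sm.add_constant(new_df[['gre', 'gpa', 'prestige']])

# VIF = 1 / (1 - R^2) from regressing each column on the rest;
# values well above 5-10 suggest collinearity (ignore the 'const' row)
for i, col in enumerate(X.columns):
    print(col, variance_inflation_factor(X.values, i))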

Question 13. Write an analysis plan for exploring the association between grad school admissions rates and prestige of undergraduate schools.

Answer:

  • prestige is categorical and it would be good to understand how it is bucketed.

  • could more detailed data underlying prestige assignment be obtained?

  • it would be good to know the variables associated with the buckets of admittance and prestige.

  • I plan to use a groupby analysis to show the relationship between admittance and prestige.

prestige_grouped = (
    new_df[["admit", "prestige"]]
    .groupby(["admit", "prestige"])
    .size()
    .rename("count")
    .to_frame()
)
prestige_grouped
new_df[["gpa", "gre", "prestige"]].groupby("prestige").mean()
grouped = (
    pd.concat({"mean": new_df[["gpa", "gre", "prestige"]].groupby("prestige").mean()}, axis=1)
    .join(pd.concat({"std": new_df[["gpa", "gre", "prestige"]].groupby("prestige").std()}, axis=1))
    .swaplevel(axis=1)
)
grouped = grouped[sorted(grouped.columns)]
grouped
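A complementary view of the same relationship (a sketch; pd.crosstab is not used in the original notebook) is the admission rate within each prestige bucket:

# proportion admitted (admit = 1) within each prestige tier
pd.crosstab(new_df['prestige'], new_df['admit'], normalize='index')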

Question 14. What is your hypothesis?

Answer:

my hypothesis is that the prestige of the undergrad institution has an effect on grad school admittance.
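One way this hypothesis could eventually be tested (a sketch using SciPy, offered as an assumption rather than part of the stated plan) is a chi-square test of independence on the admit-by-prestige contingency table:

from scipy.stats import chi2_contingency

# contingency table of counts: admit (rows) by prestige (columns)
table = pd.crosstab(new_df['admit'], new_df['prestige'])

# a small p-value would suggest admittance and prestige are not independent
chi2, p, dof, expected = chi2_contingency(table)
print(chi2, p, dof)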

Bonus/Advanced

1. Bonus: Explore alternatives to dropping observations with missing data

2. Bonus: Log transform the skewed data

ax2 = new_df['gre'].hist(figsize=(11,8), bins=40)
Image in a Jupyter notebook
ax2 = new_df['gre'].apply(np.log).hist(figsize=(11,8), bins=40)
Image in a Jupyter notebook
ax3 = new_df['gpa'].hist(figsize=(11,8), bins=40, color="turquoise", alpha=.65)
Image in a Jupyter notebook
ax3 = new_df['gpa'].apply(np.log).hist(figsize=(11,8), bins=40, color="turquoise", alpha=.65)
Image in a Jupyter notebook
  • a log transform reduces positive skewness, and so it actually increases the negative skewness already present here.

  • need to research how to reduce negative skewness; one common option is sketched below.
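One common option (an addition, sketched here rather than researched in the original notebook) is to reflect the variable before taking the log, turning the left skew into a right skew that the log can then reduce:

# reflect gpa so the long left tail becomes a right tail, then log-transform
reflected_gpa = new_df['gpa'].max() + 1 - new_df['gpa']
ax = np.log(reflected_gpa).hist(figsize=(11, 8), bins=40)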

3. Advanced: Impute missing data
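A minimal imputation sketch (an assumption about how this bonus might be approached, not the notebook's actual solution) that keeps all 400 rows by filling the numeric columns with their medians and prestige with its most common bucket:

imputed_df = df_raw.copy()

# numeric columns: fill missing values with the column median
for col in ['gre', 'gpa']:
    imputed_df[col] = imputed_df[col].fillna(imputed_df[col].median())

# categorical column: fill missing values with the most common bucket
imputed_df['prestige'] = imputed_df['prestige'].fillna(imputed_df['prestige'].mode()[0])

imputed_df.count()  # every column should now show 400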