YStrano
GitHub Repository: YStrano/DataScience_GA
Path: blob/master/projects/project_2/starter-code/Project - 2.ipynb
Kernel: Python 3

Project 2

Step 1: Load the python libraries you will need for this project

# imports
# from __future__ import division  (only needed for Python 2)
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import statsmodels.api as sm
import pylab as pl
%matplotlib inline
/anaconda3/lib/python3.6/site-packages/statsmodels/compat/pandas.py:56: FutureWarning: The pandas.core.datetools module is deprecated and will be removed in a future version. Please use the pandas.tseries module instead. from pandas.core import datetools

Step 2: Read in your data set

# Read in data from source
df_raw = pd.read_csv("../assets/admissions.csv")
print(df_raw.head())
   admit    gre   gpa  prestige
0      0  380.0  3.61       3.0
1      1  660.0  3.67       3.0
2      1  800.0  4.00       1.0
3      1  640.0  3.19       4.0
4      0  520.0  2.93       4.0
list(df_raw.columns)
['admit', 'gre', 'gpa', 'prestige']

Questions

Question 1. How many observations are in our dataset?

df_raw.count() # count() returns a series with the values being the number of observations and the index being the column name
admit       400
gre         398
gpa         398
prestige    399
dtype: int64
df_raw.count().sum()
1595

Answer: 1595 non-null values in total across the four columns; the DataFrame itself contains 400 rows (observations).
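
To get the row count directly, independent of missing values, a minimal check:

len(df_raw)  # 400 rows; equivalently df_raw.shape[0]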

Question 2. Create a summary table

#function
df_raw.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 400 entries, 0 to 399
Data columns (total 4 columns):
admit       400 non-null int64
gre         398 non-null float64
gpa         398 non-null float64
prestige    399 non-null float64
dtypes: float64(3), int64(1)
memory usage: 12.6 KB
df_raw.describe()

Question 3. Why would GRE have a larger STD than GPA?

Answer: gre is measured in much larger units, so its std will be much larger in absolute terms (gpa runs from roughly 2.2 min to 4.0 max vs. 220 min to 800 max for gre)

A way to quantify this:

  • spread for gpa = 1.8, which relative to the lower bound of the range (1.8 / 2.2) is about .82

  • spread for gre = 580, which relative to the lower bound of the range (580 / 220) is about 2.64

so the relative spreads are roughly 82% versus 264%: gre has roughly three times as much room to vary even after normalizing each spread against its lower bound (see the sketch below)
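
A rough sketch of the comparison above, using df_raw from earlier (min, max and std all skip NaN values):

# compare absolute and relative spread for gpa and gre
# relative spread = (max - min) / min, i.e. the spread normalized by the lower bound
for col in ["gpa", "gre"]:
    col_min, col_max = df_raw[col].min(), df_raw[col].max()
    spread = col_max - col_min
    print(col, "std:", round(df_raw[col].std(), 2),
          "relative spread:", round(spread / col_min, 2))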

cont = []
for i in range(1, 21):
    cont.append(9 / i)
cont = [9/i for i in range(1, 21)]  # list comprehension: write the expression to append first, then the loop, and you can add an if at the end
cont
[9.0, 4.5, 3.0, 2.25, 1.8, 1.5, 1.2857142857142858, 1.125, 1.0, 0.9, 0.8181818181818182, 0.75, 0.6923076923076923, 0.6428571428571429, 0.6, 0.5625, 0.5294117647058824, 0.5, 0.47368421052631576, 0.45]
ind = list(range(1, 21))
df = pd.DataFrame(ind)
df['cont'] = cont
df.columns = ['denominator', 'value']
df = df.set_index('denominator')
df

in pandas, when you plot a DataFrame or Series, the index is automatically used as the x axis and the values as the y axis.

ax = df.plot(figsize=[11,8])  # keyword arguments don't get a space around '='
tit = plt.title('function: 9/x')
Image in a Jupyter notebook
  • this is a non-linear function

  • the rate of change (over this interval) is negative because the values are decreasing

  • the rate of change is the first-order derivative

  • the rate of change of the rate of change is not zero; it would only be zero if the function were linear and the rate of change constant, which in this case it is not

  • the rate of change of the rate of change is slowing down

  • if our rate of change is negative, what does it mean for it to be slowing down?

  • it means that the rate of change is becoming less negative, and the way to become less negative is to change positively

  • so the rate of change of the rate of change, or the second order derivative, is positive.

  • i.e. this function is convex, as opposed to being concave (if the second order derivative is negative)

  • things to remember when dealing with derivatives:

  • position is where you are

  • velocity is the rate of change of your position

  • acceleration is the rate of change of your velocity

  • extrapolation is predicting a data point outside of the dataset. think extra. e.g. if today is the last point in the dataset and we want tomorrow's data, predicting that would be an extrapolation.

  • interpolation is predicting values within the confines of the dataset, e.g. filling in missing values

  • we have been trying to calculate 9/11 using the known points 9/10 and 9/12, which is a non-linear interpolation problem.

  • if it were a linear interpolation, we would take the difference between 9/10 (.9) and 9/12 (.75), which is .15, divide it by 2 to get .075, so the midpoint would be .75 + .075 or .9 - .075 = .825 (see the sketch after this list)

  • we know that 9/11 is closer to .75 than .90, because the rate of change is more negative at the lower values.

  • we know the midpoint, which is .825, and we know that the value of 9/11 is closer to 9/12 (.75), so the value of 9/11 has to be between .75 and .825 (this may be enough tolerance to stop, but we will continue with one more cycle)

  • 9/11 > 8/10 (see notes) so we know that .8 < 9/11 < .825

  • 9/11 = .818
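
A small sketch of the interpolation argument above, comparing the straight-line midpoint with the true value of 9/11:

# linear interpolation of 9/11 from the neighbouring points 9/10 and 9/12
y0, y1 = 9 / 10, 9 / 12            # 0.9 and 0.75
linear_estimate = (y0 + y1) / 2    # midpoint of the chord: 0.825
true_value = 9 / 11                # ~0.818, below the chord because 9/x is convex
print(linear_estimate, true_value)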

Question 4. Drop data points with missing data

df_raw.isnull().sum()
admit       0
gre         2
gpa         2
prestige    1
dtype: int64
df_raw[df_raw.isnull().any(axis=1)]
new_df = df_raw.dropna()
new_df2 = df_raw[~df_raw.isnull().any(axis=1)].copy()
new_df2.count()
admit       397
gre         397
gpa         397
prestige    397
dtype: int64

Question 5. Confirm that you dropped the correct data. How can you tell?

new_df.count()
admit       397
gre         397
gpa         397
prestige    397
dtype: int64

Answer: .count() gives the same result for all columns.

knowing that we started with different non-null counts in different columns (due to null values), seeing them all line up at 397 per column means we dropped the rows with null values in the relevant columns.

three rows contained null values, which drops the row count from 400 to 397.

Question 6. Create box plots for GRE and GPA

# boxplot 1
p_gre = plt.boxplot(new_df['gre'])
Image in a Jupyter notebook
p2_gre = plt.boxplot(new_df['gre'], whis=[5,95])
Image in a Jupyter notebook
# boxplot 2
p_gpa = plt.boxplot(new_df['gpa'])
Image in a Jupyter notebook
p2_gpa = plt.boxplot(new_df['gpa'], whis=[0,100])
Image in a Jupyter notebook

Question 7. What do these plots show?

Answer:

the mean and std provide measures of central tendency and dispersion. these measures can be very useful, however they are sensitive to outliers. instead of aggregating the data arithmetically, you can also rank your data; rankings are generally much less prone to bias from outliers.

with that in mind, the box plot provides a nice way to visualize how the data is distributed with respect to its percentile ranks. the box represents the interquartile range, with its top at the 75th percentile and its bottom at the 25th percentile, and the line in the middle is the 50th percentile, or median.

the whiskers in a box plot are generally set up to represent some measure of the edges of the data: they can represent the min and max, or, as is commonly the case, the 5th and 95th percentiles of the data.

when the whiskers do not represent the min and max, the box plot draws the points outside the whiskers as outliers.

by default, matplotlib extends the whiskers to the most extreme data points within 1.5 x (75th percentile - 25th percentile) of the box edges.

you can set the whiskers to the 5th and 95th percentiles, or to the min and max (see the examples above).

in this case, the gre and gpa box plots show left/negative skewness (quantified in the sketch after the bullets below). there is an asymmetrical dispersion, with the higher values concentrated near the mean and the lower values more spread out. this means that

  • gpa: 75% of people score between 3.1 and 4.0, and the remaining 25% get much lower grades

  • gre: 75% of people score 520 and above, with that top 75% fitting into less of the plot's range than the bottom 25%

  • most people come well prepared for the gre or work hard for their gpa, and there may be an extraneous variable affecting the bottom 25%, such as resource constraints, language barriers, or a learning disability
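
A minimal sketch to quantify the skewness described above and to reproduce the default 1.5 x IQR whisker bounds for gre, using new_df from earlier:

# negative skew values confirm the left skewness of gre and gpa
print(new_df[["gre", "gpa"]].skew())

# default matplotlib whisker bounds for gre: 1.5 * IQR beyond the box edges
q1, q3 = new_df["gre"].quantile([0.25, 0.75])
iqr = q3 - q1
print("gre whisker bounds:", q1 - 1.5 * iqr, q3 + 1.5 * iqr)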

Question 8. Describe each distribution

# plot the distribution of each variable
new_df['admit'].value_counts() #no need to hist or plot this, bc this is categorical, so value counts works better
0    271
1    126
Name: admit, dtype: int64
new_df['admit'].value_counts() / new_df['admit'].value_counts().sum()
0    0.68262
1    0.31738
Name: admit, dtype: float64
ax1 = new_df['admit'].value_counts().plot.barh()
title1 = ax1.set_title('admit')
Image in a Jupyter notebook
  • more people were rejected than admitted

  • 32% acceptance rate and a 68% rejection rate

  • the ratio of admittance to non-admittance is roughly 1 to 2

ax2 = new_df['gre'].hist(figsize=(11,8), bins=40)
Image in a Jupyter notebook
new_df['gre'].mode()
0    620.0
dtype: float64
  • this data is not continuous

  • this is left/negative skewed

  • there's a large concentration of people who get the perfect score with a small bucket who just miss that mark

  • the modal bucket is a little over 600

ax3 = new_df['gpa'].hist(figsize=(11,8), bins=40, color="turquoise", alpha=.65)
Image in a Jupyter notebook
  • this data looks to be more continuous than the previous set

  • this is left/negative skewed

  • there's a large concentration of people who get a perfect gpa

  • certain gpa values at the rounder numbers between whole grades (.25 and .50) have a higher concentration of scores

  • the modal bucket is 4.0

new_df['gpa'].mode()
0    4.0
dtype: float64
new_df['prestige'].value_counts() #no need to hist or plot this, bc this is categorical, so value counts works better
2.0    148
3.0    121
4.0     67
1.0     61
Name: prestige, dtype: int64
new_df['prestige'].value_counts() / new_df['prestige'].value_counts().sum()
2.0    0.372796
3.0    0.304786
4.0    0.168766
1.0    0.153652
Name: prestige, dtype: float64
ax1 = (new_df['prestige'].value_counts()
       .sort_index(ascending=False)
       .plot.barh(figsize=(8, 5)))
title1 = ax1.set_title('prestige')
Image in a Jupyter notebook
  • 2 and 3 are the most common prestige scores, together making up about 68%

  • 1 and 4 contribute almost equally to the whole, at roughly 15-17% each

Question 9. If our model had an assumption of a normal distribution would we meet that requirement?

Answer:

  • for the categorical data, such as admittance and prestige, no

  • as for the continuous variables (gre and gpa), there is skewness to the left and an accumulation of values at the extremes, which also means no (see the sketch after this list)
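
As a sketch of how this could be checked formally, assuming scipy is available (it is not imported above), the D'Agostino-Pearson normality test:

from scipy import stats

# null hypothesis: the sample comes from a normal distribution
for col in ["gre", "gpa"]:
    stat, p = stats.normaltest(new_df[col])
    print(col, "p-value:", p)  # a very small p-value is evidence against normality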

Question 10. Does this distribution need correction? If so, why? How?

Answer:

  • yes, because the gre and gpa data have outliers and exhibit skewness

  • a log transformation could help reduce the skewness in the gpa and gre data sets

Question 11. Which of our variables are potentially colinear?

# create a correlation matrix for the data
new_df.corr()

rough rules for correlation:

  • anything above 5% is probably tangible

  • anything above 10% is tangible

correlation matrix:

  • with correlation you are looking at co-movement between two variables

  • with categorical data this doesn't really make sense, especially for binary data such as admittance

  • with prestige for example, there is presumably an ordering where the higher the value the higher the "prestige"

  • but what does an increase of 1 in prestige mean, what is the unit? these are buckets...

  • even more so for admit: 1 could just as easily mean non-admit and 0 admit, the data would have the same structure, and you could build an identical model to predict whether someone was admitted

  • there is no nuance or difference between someone who was almost admitted and someone who is nowhere near being admitted

  • whereas if you had a continuous probability of admittance, for example, looking at the correlation between that and gpa would be useful

  • thus correlation across categorical variables does not make much sense.

  • in this situation, leveraging relational data analysis techniques, e.g. looking at avg. gpa across the different prestige values would lead to much more useful information around how connected these variables are

Collinearity

  • collinearity in statistics refers to being able to predict, with accuracy, an explanatory variable in a dataset through a linear combination of the other explanatory variables.

  • i.e. you can predict with accuracy one explanatory variable using a multivariate regression trained on the others

  • a correlation matrix will only show you bivariate interconnectedness, as opposed to the multivariate interconnectedness required for collinearity

  • therefore a correlation matrix seems to be a poor approach for collinearity detection (see the VIF sketch after this list)

  • you have exogenous (X) and endogenous (y) variables and you are trying to predict y using X

  • if you can take one of the elements of X and predict it using a linear combination of the rest of X (which is a multivariate regression), that's not good

  • if you have highly interconnected X variables, that can lead to an unstable model that is very sensitive to slight changes in X

  • you might try to get rid of some of the variables involved in the interconnectedness to allow for a more stable model
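
One multivariate alternative, sketched here as an assumption rather than something done in the original notebook, is the variance inflation factor (VIF), which measures how well each X variable is explained by a linear combination of the others:

from statsmodels.stats.outliers_influence import variance_inflation_factor

# add a constant so each VIF is computed against a regression with an intercept
X = sm.add_constant(new_df[["gre", "gpa", "prestige"]])
vifs = pd.Series([variance_inflation_factor(X.values, i) for i in range(X.shape[1])],
                 index=X.columns)
vifs  # values well above ~5-10 suggest a variable is largely explained by the others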

groupby:

  • with a groupby you take one or more variables, break the remaining data columns into groups based on the values of what you are grouping by, and then run some kind of aggregation function over each of those groups.

order of operations (OOO):

  • select the columns you want to analyze

  • .groupby one or more of those columns, which means you break up the data into chunks based on what value of the grouped by column it falls under

  • run the function that you want to apply to each of those groups

new_df[["gpa", "gre", "prestige"] ].groupby("prestige").mean()
grouped = pd.concat(
    {"mean": new_df[["gpa", "gre", "prestige"]].groupby("prestige").mean()}, axis=1
).join(
    pd.concat(
        {"std": new_df[["gpa", "gre", "prestige"]].groupby("prestige").std()}, axis=1
    )
).swaplevel(axis=1)
grouped = grouped[sorted(grouped.columns)]
grouped
prestige_grouped = (new_df[["admit", "prestige"]]
                    .groupby(["admit", "prestige"])
                    .size()
                    .rename("count")
                    .to_frame())
prestige_grouped

Question 12. What did you find?

Answer: using the correlation matrix, gpa and gre appear noticeably correlated, at almost 40%

Question 13. Write an analysis plan for exploring the association between grad school admissions rates and prestige of undergraduate schools.

Answer:

  • prestige is categorical and it would be good to understand how it is bucketed

  • could more detailed data underlying the prestige assignment be obtained?

  • it would be good to know the variables associated with the buckets of admittance and prestige

  • i plan to use a groupby analysis to show the relationship between admittance and prestige (see the sketch after this list)
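
A rough sketch of that planned groupby (admit is 0/1, so the mean of admit within each prestige group is the admission rate):

admit_rate_by_prestige = new_df.groupby("prestige")["admit"].mean()
admit_rate_by_prestige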

Question 14. What is your hypothesis?

Answer:

my hypothesis is that the prestige of the undergrad school has an effect on grad school admittance.

remember: the null hypothesis is that prestige does not have an effect on admittance.

Bonus/Advanced

1. Bonus: Explore alternatives to dropping observations with missing data

2. Bonus: Log transform the skewed data

ax2 = new_df['gre'].hist(figsize=(11,8), bins=40)
Image in a Jupyter notebook
ax2 = new_df['gre'].apply(np.log).hist(figsize=(11,8), bins=40)
Image in a Jupyter notebook
ax3 = new_df['gpa'].hist(figsize=(11,8), bins=40, color="turquoise", alpha=.65)
Image in a Jupyter notebook
ax3 = new_df['gpa'].apply(np.log).hist(figsize=(11,8), bins=40, color="turquoise", alpha=.65)
Image in a Jupyter notebook
  • log reduces positive skewness, and thus increases negative skewness

  • need to research how to reduce negative skewness (one option is sketched below)
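
One common option, sketched here as an assumption rather than something done in the notebook, is to reflect the data about its maximum and then log-transform; this turns left skew into right skew before the log compresses it (note the transformed axis runs in the opposite direction):

# reflect gpa about its maximum (+1 to avoid log(0)), then log-transform
reflected_log_gpa = np.log(new_df["gpa"].max() + 1 - new_df["gpa"])
ax4 = reflected_log_gpa.hist(figsize=(11, 8), bins=40)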

3. Advanced: Impute missing data
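
A minimal imputation sketch, assuming simple median/mode fills are acceptable here (the variable name df_imputed is hypothetical):

df_imputed = df_raw.copy()
df_imputed["gre"] = df_imputed["gre"].fillna(df_imputed["gre"].median())
df_imputed["gpa"] = df_imputed["gpa"].fillna(df_imputed["gpa"].median())
df_imputed["prestige"] = df_imputed["prestige"].fillna(df_imputed["prestige"].mode()[0])
df_imputed.isnull().sum()  # should now be all zeros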