Project 2
Step 1: Load the python libraries you will need for this project
Step 2: Read in your data set
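A minimal sketch of these two steps; the file name admissions.csv is an assumption (match it to the starter code), and these imports (np, pd, plt) are reused by the snippets further down:

```python
# libraries used throughout the project
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# read the admissions data into a DataFrame
# ("admissions.csv" is an assumed file name; adjust the path as needed)
df = pd.read_csv("admissions.csv")
df.head()
```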
Questions
Question 1. How many observations are in our dataset?
Answer: 400 observations. (Summing .count() across the four columns gives 1595 because of missing values, but the dataset itself has 400 rows; see Question 5, where dropping the nulls takes it to 397.)
Question 2. Create a summary table
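pandas does this in one call:

```python
# count, mean, std, min, quartiles and max for every numeric column
df.describe()
```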
Question 3. Why would GRE have a larger STD than GPA?
Answer: GRE is measured in much larger units, so its STD is much larger in absolute terms (GPA runs from a min of 2.2 to a max of 4.0, versus 220 to 800 for GRE).
A way to quantify this:
spread for gpa = 4.0 - 2.2 = 1.8, which as a proportion of the range's lower bound (1.8 / 2.2) is .818
spread for gre = 800 - 220 = 580, which as a proportion of the range's lower bound (580 / 220) is 2.64
so we have a relative spread of 82% versus 264%: gre has roughly 3 times as much room to vary when each spread is normalized against its lower bound
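A quick check of that arithmetic, assuming the columns are named gpa and gre:

```python
# relative spread = (max - min) / min for each column
for col in ["gpa", "gre"]:
    spread = df[col].max() - df[col].min()
    print(col, round(spread / df[col].min(), 3))
# gpa: 1.8 / 2.2 = 0.818; gre: 580 / 220 = 2.636
```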
in pandas, if you plot a Series or DataFrame, it will automatically use the index as the x axis and the values as the y axis.
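For example, using the three values from the interpolation notes below:

```python
# the index ("9/10", "9/11", "9/12") becomes the x axis automatically
s = pd.Series([0.90, 0.818, 0.75], index=["9/10", "9/11", "9/12"])
s.plot()
plt.show()
```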
this is a non-linear function
the rate of change (over this interval) is negative because the values are decreasing
the rate of change is the first order derivative
the rate of change of the rate of change is not zero, because it would only be zero if the function were linear and the rate of change constant, which in this case it is not
the rate of change of the rate of change is slowing down
if our rate of change is negative, what does it mean for it to be slowing down?
it means that the rate of change is becoming less negative; how do you become less negative? through changing positively
so the rate of change of the rate of change, or the second order derivative, is positive.
i.e. this function is convex, as opposed to being concave (if the second order derivative is negative)
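A numeric sanity check of that argument, using the three values from the interpolation notes below:

```python
import numpy as np

y = np.array([0.90, 0.818, 0.75])   # values at 9/10, 9/11, 9/12

print(np.diff(y))        # first differences: [-0.082 -0.068] -> negative, decreasing
print(np.diff(y, n=2))   # second difference: [0.014] -> positive, convex
```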
things to remember when dealing with derivatives:
position is where you are
velocity is the rate of change of your position
acceleration is the rate of change of your velocity
extrapolation is predicting a data point outside of the dataset. think "extra". e.g. if today is the last point in the dataset and we want tomorrow's value, predicting that would be an extrapolation.
interpolation is predicting values within the confines of the dataset, e.g. filling in missing values
we have been trying to calculate the value at 9/11 using the known points at 9/10 and 9/12, which is a non-linear interpolation problem.
if it were a linear interpolation, we would take the two known values, .9 at 9/10 and .75 at 9/12; their difference is .15, half of which is .075, so the midpoint would be .75 + .075 (or equivalently .9 - .075) = .825
we know that the value at 9/11 is closer to .75 than to .90, because the rate of change is most negative at the start of the interval: the function drops fastest early and then flattens
we know the linear midpoint is .825, and we know the true value sits below it, so the value at 9/11 has to be between .75 and .825 (this may be enough tolerance to stop, but we will continue with one more cycle)
the value at 9/11 is greater than 8/10 = .8 (see notes), so we know .8 < the value at 9/11 < .825
value at 9/11 ≈ .818
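A sketch of the linear half of this, using np.interp for the straight-line calculation:

```python
import numpy as np

# treat 9/10 as day 0 and 9/12 as day 2, with known values .90 and .75
linear_mid = np.interp(1, [0, 2], [0.90, 0.75])
print(linear_mid)   # 0.825, the linear midpoint

# the true function is convex, so the actual 9/11 value (~.818)
# sits below this straight-line estimate
```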
Question 4. Drop data points with missing data
Question 5. Confirm that you dropped the correct data. How can you tell?
Answer: .count() gives the same result for all columns.
knowing that we started with different non-null counts in different columns (due to null values), seeing them all line up at 397 means we dropped the rows with null values in the relevant columns.
three rows had null values, which drops the row count from 400 to 397
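A sketch of the drop-and-confirm steps:

```python
print(df.count())    # per-column non-null counts differ before the drop

df = df.dropna()     # drop every row that contains a null value

print(df.count())    # now all columns report 397
print(len(df))       # 397 rows, down from 400
```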
Question 6. Create box plots for GRE and GPA
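A minimal version with the pandas boxplot helper (separate plots, since the two columns are on very different scales):

```python
df.boxplot(column=["gre"])
plt.show()
df.boxplot(column=["gpa"])
plt.show()
```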
Question 7. What do these plots show?
Answer:
the mean and std provide a measure of central tendency and dispersion. these measures can be very useful, however they are very prone to outliers. instead of aggregating the data and chopping it into equal pieces, which is what the arithmetic mean does, you can also rank your data. these rankings are generally much less prone to bias as a result of outliers.
with that in mind, the box plot provides us a nice way to visualize how our data is distributed with respect to its percentile rank. the box plot shows us the median as well as the quartile above and below that. the box represents the interquartile range, with its top being the 75th percentile and its bottom being the 25th percentile. the line in the middle is the 50th percentile or the median.
the whiskers in a box plot are generally set up to represent some measurement of the edges of the data. they can represent the min and max of the data, or, as is commonly the case, the 5th and 95th percentiles.
in the case that the whiskers do not represent the min and max, the box plot will represent points outside of this as outliers.
matplotlib will default to putting the whiskers at 75th percentile + 1.5 x (75th percentile - 25th percentile) and 25th percentile - 1.5 x (75th percentile - 25th percentile).
you can set the whiskers to be at the 5th and 95th percentile, or the min and max (see examples above)
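A sketch of those whisker settings; the whis argument is passed through pandas to matplotlib's boxplot:

```python
df.boxplot(column=["gre"])                  # default: 1.5 x IQR beyond the box
plt.show()
df.boxplot(column=["gre"], whis=(5, 95))    # whiskers at the 5th/95th percentiles
plt.show()
df.boxplot(column=["gre"], whis=(0, 100))   # whiskers at the min and max
plt.show()
```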
in this case, the gre and gpa box plots show left/negative skewness. there is an asymmetrical dispersion, with the higher values concentrated near the median and the lower values more spread out. this means that:
gpa: 75% of people score between 3.1 and 4.0, and the straggling 25% get much lower grades
gre: 75% of people score 520 and above, with that top 75% fitting into less space on the plot than the bottom 25%
a likely story: most people come well prepared for the gre or work hard for their gpa, and some extraneous variable links to the bottom 25%, such as resource problems, language problems or a learning disability.
Question 8. Describe each distribution
Answer:
admit:
more people were rejected than admitted: a 32% acceptance rate and a 68% rejection rate
the ratio of admittance to non-admittance is roughly 1 to 2
gre:
this data is not continuous
it is left/negative skewed
there's a large concentration of people who get the perfect score, with a small bucket who just miss that mark
the modal bucket is a little over 600
gpa:
this data looks more continuous than the gre set
it is left/negative skewed
there's a large concentration of people who get a perfect gpa
certain gpas between the rounder numbers (.25 and .50) have a higher concentration of scores
the modal bucket is 4.0
prestige:
2 and 3 are the most common prestige scores, making up 67%
1 and 4 contribute almost equally to the whole, at ~15% each
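A sketch of the four plots these notes describe:

```python
# quick histograms; admit and prestige are categorical, so bar charts of
# value counts would arguably suit them better than true histograms
df.hist(column=["admit", "gre", "gpa", "prestige"])
plt.show()
```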
Question 9. If our model had an assumption of a normal distribution would we meet that requirement?
Answer:
for the categorical data, such as admittance and prestige, no
as for gre and gpa, there's skewness to the left and an accumulation of values at the extremes, which also means no
Question 10. Does this distribution need correction? If so, why? How?
Answer:
yes, because the gre and gpa data have outliers and exhibit skewness
a log transformation is the classic way to reduce skewness, though note the bonus section below: a plain log targets positive skew, so our negatively skewed gpa and gre data need a variant of it
Question 11. Which of our variables are potentially colinear?
rough rules for correlation:
anything above 5% is probably tangible
anything above 10% is tangible
correlation matrix:
with correlation you are looking at co-movement between two variables
with categorical data this doesn't really make sense, especially for binary data such as admittance
with prestige for example, there is presumably an ordering where the higher the value the higher the "prestige"
but what does an increase of 1 in prestige mean, what is the unit? these are buckets...
even more so for admit: 1 could just as easily mean non-admit and 0 admit; the data would have the same structure and you could build an identical model to predict whether someone was admitted
there is no nuance or difference between someone who was almost admitted and someone who is nowhere near being admitted
whereas if you had a continuous probability of admittance, for example, looking at the correlation between that and gpa would be useful
thus correlation across categorical variables does not make much sense.
in this situation, leveraging relational data analysis techniques, e.g. looking at avg. gpa across the different prestige values would lead to much more useful information around how connected these variables are
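For example:

```python
# average gpa within each prestige bucket
print(df.groupby("prestige")["gpa"].mean())
```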
Collinearity
collinearity in statistics refers to being able to predict, with accuracy, one explanatory variable in a dataset through a linear combination of the other explanatory variables.
i.e. you can predict with accuracy one explanatory variable using a multivariate regression trained on the others
a correlation matrix will only show you bivariate interconnectedness, as opposed to the multivariate interconnectedness required for collinearity
therefore a correlation matrix seems to be a poor approach for collinearity detection
you have exogenous (X) and endogenous (y) variables and you are trying to predict y using X
if you can take one of the elements of X and predict it using a linear combination of the rest of X (which is a multivariate regression), that's not good
if you have highly interconnected X variables, that can lead to an unstable model that is very sensitive to slight changes in X
you might try to get rid of some of the variables involved in the interconnectedness to allow for a more stable model
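Since a correlation matrix only captures pairwise relationships, a regression-based check is the more direct test for this. A minimal sketch of the variance inflation factor (VIF), assuming statsmodels is available (it is not otherwise used in this notebook):

```python
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

# VIF regresses each explanatory column on all the others;
# a value well above ~5-10 flags a column as largely predictable from the rest
X = sm.add_constant(df[["gre", "gpa", "prestige"]])
for i, col in enumerate(X.columns):
    print(col, variance_inflation_factor(X.values, i))
```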
groupby:
with a groupby you take a variable or variables, break up the remaining data columns into groups based on the values of what you are grouping by and then run some kind of aggregation function over each of these groups.
OOO (order of operations):
select the columns you want to analyze
.groupby one or more of those columns, which means you break up the data into chunks based on what value of the grouped by column it falls under
run the function that you want to apply to each of those groups
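That order of operations, applied to this dataset:

```python
# 1. select the columns of interest
# 2. group by prestige
# 3. aggregate each group: the mean of the 0/1 admit column is the admit rate
print(df[["prestige", "admit"]].groupby("prestige").mean())
```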
Question 12. What did you find?
Answer: using the correlation matrix, gpa and gre appear to be the most correlated pair, at almost 40%
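The matrix itself:

```python
# pairwise correlations; gre/gpa stands out at almost 40%
print(df.corr())
```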
Question 13. Write an analysis plan for exploring the association between grad school admissions rates and prestige of undergraduate schools.
Answer:
prestige is categorical and it would be good to understand how it is bucketed
could more detailed data underlying the prestige assignment be obtained?
it would be good to know the variables associated with the buckets of admittance and prestige
i plan to use a groupby analysis to show the relationship between admittance and prestige
Question 14. What is your hypothesis?
Answer:
my hypothesis is that the prestige of someone's undergrad school has an effect on grad school admittance.
remember: the null hypothesis is the no-effect position, i.e. that prestige does not affect admittance
Bonus/Advanced
1. Bonus: Explore alternatives to dropping observations with missing data
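A couple of standard alternatives, sketched with pandas fills (which fill goes with which column is a judgment call):

```python
raw = pd.read_csv("admissions.csv")   # re-read so the nulls are still present

raw["gre"] = raw["gre"].fillna(raw["gre"].median())   # numeric: central value
raw["gpa"] = raw["gpa"].fillna(raw["gpa"].mean())
raw["prestige"] = raw["prestige"].fillna(raw["prestige"].mode()[0])  # categorical: mode

print(raw.count())   # every column back to 400
```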
2. Bonus: Log transform the skewed data
a log transform reduces positive (right) skewness, so applied to our negatively skewed data it would make the skew worse
need to research how to reduce negative skewness; one standard trick is to reflect the data so the skew becomes positive and then take the log (sketched below)
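A sketch of that trick; the reflection constant (max + 1) is a common convention, not something from the original notes:

```python
import numpy as np

# reflect so the long left tail becomes a right tail, then log;
# a power transform (e.g. squaring) is another standard option
reflected_log_gpa = np.log(df["gpa"].max() + 1 - df["gpa"])
reflected_log_gpa.hist()
plt.show()
```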