GitHub Repository: YStrano/DataScience_GA
Path: blob/master/april_18/projects/unit-projects/project-2/starter-code/project2-starter.ipynb

Kernel: Python 2

Project 2

In this project, you will implement the exploratory analysis plan developed in Project 1. This will lay the groundwork for our first modeling exercise in Project 3.

Step 1: Load the Python libraries you will need for this project

In [1]:

#imports
from __future__ import division
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import statsmodels.api as sm
import pylab as pl
%matplotlib inline

Step 2: Read in your data set

In [2]:

# Read in data from source
df_raw = pd.read_csv("../assets/admissions.csv")
print(df_raw.head())

Out[2]:

   admit  gre   gpa  prestige
0      0  380  3.61         3
1      1  660  3.67         3
2      1  800  4.00         1
3      1  640  3.19         4
4      0  520  2.93         4

Questions

Question 1. How many observations are in our dataset?

In [3]:

df_raw.count()

Out[3]:

admit       400
gre         398
gpa         398
prestige    399
dtype: int64
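
Note that `count()` reports the non-null values in each column, which is why the columns above disagree. A minimal sketch of the difference between the row count and the per-column counts, using a small made-up frame (`toy` and its values are hypothetical, not the admissions data):

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for df_raw; the values are made up.
toy = pd.DataFrame({
    "admit": [0, 1, 1, 0],
    "gre": [380, np.nan, 800, 520],
    "gpa": [3.61, 3.67, np.nan, 2.93],
})

n_rows = len(toy)               # number of observations (rows)
non_null = toy.count()          # non-null values per column
n_missing = toy.isnull().sum()  # missing values per column

print(n_rows)
print(non_null)
print(n_missing)
```

Read the same way, Out[3] suggests 400 rows if `admit` has no missing values, with 2 missing values each in `gre` and `gpa` and 1 in `prestige`.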

Answer:

Question 2. Create a summary table

In [ ]:

#function
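
One common way to build a summary table is `DataFrame.describe()`. The sketch below applies it to the five rows printed in Out[2] (the name `sample` is an assumption for illustration); on the real data the same call would be made on `df_raw`.

```python
import pandas as pd

# The five rows printed in Out[2]; `sample` is a hypothetical name.
sample = pd.DataFrame({
    "gre": [380, 660, 800, 640, 520],
    "gpa": [3.61, 3.67, 4.00, 3.19, 2.93],
})

# count, mean, std, min, quartiles and max for each numeric column
summary = sample.describe()
print(summary)
```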

In [ ]:

Question 3. Why would GRE have a larger STD than GPA?
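
One thing worth checking here is that the standard deviation is expressed in the units of the variable, so it changes with the measurement scale. A small sketch on the five rows from Out[2] (`sample` is a hypothetical name), comparing the raw standard deviation with the unit-free ratio of standard deviation to mean:

```python
import pandas as pd

# The five rows printed in Out[2]; `sample` is a hypothetical name.
sample = pd.DataFrame({
    "gre": [380, 660, 800, 640, 520],
    "gpa": [3.61, 3.67, 4.00, 3.19, 2.93],
})

std = sample.std()                 # in the units of each column
cv = sample.std() / sample.mean()  # coefficient of variation, unit-free

# Rescaling a column rescales its standard deviation by the same factor.
std_gre_rescaled = (sample["gre"] / 100.0).std()

print(std)
print(cv)
print(std_gre_rescaled)
```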

Answer:

Question 4. Drop data points with missing data

In [ ]:
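
A minimal sketch of dropping incomplete rows with `DataFrame.dropna()`, together with the kind of check Question 5 asks for. The frame `toy` and its values are made up for illustration; on the real data the calls would be made on `df_raw`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with one gap in each of three rows.
toy = pd.DataFrame({
    "admit": [0, 1, 1, 1, 0],
    "gre": [380, np.nan, 800, 640, 520],
    "gpa": [3.61, 3.67, np.nan, 3.19, 2.93],
    "prestige": [3, 3, 1, np.nan, 4],
})

# Rows that contain at least one missing value
rows_with_gaps = int(toy.isnull().any(axis=1).sum())

# Drop every row that has a missing value in any column
clean = toy.dropna()

# Checks: no missing values remain, and only the incomplete rows are gone
remaining_missing = int(clean.isnull().sum().sum())
print(remaining_missing)
print(len(toy) - len(clean) == rows_with_gaps)
```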

Question 5. Confirm that you dropped the correct data. How can you tell?

Answer:

Question 6. Create box plots for GRE and GPA

In [ ]:

#boxplot 1

In [ ]:

#boxplot 2
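
A possible sketch of the two box plots, drawn on separate axes because GRE and GPA sit on very different scales. It uses the five rows from Out[2] (`sample` is a hypothetical name). The `matplotlib.use("Agg")` line is only there so the sketch runs as a plain script; in the notebook, `%matplotlib inline` already handles rendering and that line can be dropped.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs outside a notebook
import matplotlib.pyplot as plt
import pandas as pd

# The five rows printed in Out[2]; `sample` is a hypothetical name.
sample = pd.DataFrame({
    "gre": [380, 660, 800, 640, 520],
    "gpa": [3.61, 3.67, 4.00, 3.19, 2.93],
})

# One subplot per variable so each keeps its own y-axis scale
fig, axes = plt.subplots(1, 2, figsize=(8, 4))
sample.boxplot(column="gre", ax=axes[0])
axes[0].set_title("GRE")
sample.boxplot(column="gpa", ax=axes[1])
axes[1].set_title("GPA")
fig.tight_layout()
```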

Question 7. What do these plots show?

Answer:

Question 8. Describe each distribution

In [ ]:

# plot the distribution of each variable
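
One way to look at each distribution is a histogram per column plus a numeric skewness value. The sketch below uses the five rows from Out[2] (`sample` is a hypothetical name), which are far too few to describe a shape; on the real data the same calls would be made on the cleaned frame. As in any script outside the notebook, a headless backend is selected first.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs outside a notebook
import pandas as pd

# The five rows printed in Out[2]; `sample` is a hypothetical name.
sample = pd.DataFrame({
    "admit": [0, 1, 1, 1, 0],
    "gre": [380, 660, 800, 640, 520],
    "gpa": [3.61, 3.67, 4.00, 3.19, 2.93],
    "prestige": [3, 3, 1, 4, 4],
})

# One histogram per column, returned as an array of matplotlib axes
axes = sample.hist(bins=5, figsize=(8, 6))

# Sample skewness per column: positive values indicate a longer right tail
skewness = sample.skew()
print(skewness)
```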

Question 9. If our model assumed a normal distribution, would we meet that requirement?

Answer:

Question 10. Does this distribution need correction? If so, why? How?

Answer:

Question 11. Which of our variables are potentially collinear?

In [ ]:

# create a correlation matrix for the data
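
A minimal sketch of a correlation matrix with `DataFrame.corr()`, followed by a ranking of the variable pairs by absolute correlation. It runs on the five rows from Out[2] (`sample` is a hypothetical name), so the numbers it prints say nothing about the full dataset.

```python
import pandas as pd

# The five rows printed in Out[2]; `sample` is a hypothetical name.
sample = pd.DataFrame({
    "admit": [0, 1, 1, 1, 0],
    "gre": [380, 660, 800, 640, 520],
    "gpa": [3.61, 3.67, 4.00, 3.19, 2.93],
    "prestige": [3, 3, 1, 4, 4],
})

# Pairwise Pearson correlations between all columns
corr = sample.corr()
print(corr)

# List each pair of distinct columns once, strongest correlation first
cols = list(corr.columns)
pairs = [(a, b, corr.loc[a, b]) for i, a in enumerate(cols) for b in cols[i + 1:]]
pairs.sort(key=lambda t: abs(t[2]), reverse=True)
for a, b, r in pairs:
    print("%s vs %s: %.2f" % (a, b, r))
```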

Question 12. What did you find?

Answer:

Question 13. Write an analysis plan for exploring the association between grad school admissions rates and prestige of undergraduate schools.
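
One possible first step for such a plan is to tabulate admissions by prestige level. The sketch below does this on the five rows from Out[2] (`sample` is a hypothetical name); it only illustrates the mechanics, since five rows cannot support any conclusion.

```python
import pandas as pd

# The five rows printed in Out[2]; `sample` is a hypothetical name.
sample = pd.DataFrame({
    "admit": [0, 1, 1, 1, 0],
    "prestige": [3, 3, 1, 4, 4],
})

# Share of admitted applicants within each prestige level
admit_rate = sample.groupby("prestige")["admit"].mean()

# Counts of rejected (0) and admitted (1) applicants per prestige level
counts = pd.crosstab(sample["prestige"], sample["admit"])

print(admit_rate)
print(counts)
```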

Question 14. What is your hypothesis?

Bonus/Advanced

1. Bonus: Explore alternatives to dropping observations with missing data

2. Bonus: Log transform the skewed data

3. Advanced: Impute missing data
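
The bonus items above can be sketched as follows. Median imputation is just one simple option among many, chosen here as an assumption, and both `toy` and `skewed` contain made-up values for illustration only.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with gaps; the values are made up.
toy = pd.DataFrame({
    "gre": [380, np.nan, 800, 640, 520],
    "gpa": [3.61, 3.67, np.nan, 3.19, 2.93],
})

# Fill each gap with the median of its own column instead of dropping the row
medians = toy.median()
imputed = toy.fillna(medians)
print(imputed)

# Hypothetical right-skewed series: each value doubles the previous one
skewed = pd.Series([1, 2, 4, 8, 16, 32, 64], dtype=float)

# A log transform compresses the long right tail
logged = np.log(skewed)
print(skewed.skew())
print(logged.skew())
```

`np.log` requires strictly positive values; `np.log1p` is a common substitute when zeros are present.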
