GitHub Repository: YStrano/DataScience_GA
Path: blob/master/april_18/projects/unit-projects/project-2/starter-code/project2-starter.ipynb
Kernel: Python 2

Project 2

In this project, you will implement the exploratory analysis plan developed in Project 1. This will lay the groundwork for our first modeling exercise in Project 3.

Step 1: Load the python libraries you will need for this project

#imports
from __future__ import division
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import statsmodels.api as sm
import pylab as pl
%matplotlib inline

Step 2: Read in your data set

#Read in data from source
df_raw = pd.read_csv("../assets/admissions.csv")
print(df_raw.head())

   admit  gre   gpa  prestige
0      0  380  3.61         3
1      1  660  3.67         3
2      1  800  4.00         1
3      1  640  3.19         4
4      0  520  2.93         4

Questions

Question 1. How many observations are in our dataset?

df_raw.count()
admit       400
gre         398
gpa         398
prestige    399
dtype: int64
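
Note that `count()` reports the number of non-missing values per column, which is why the four columns disagree. The sketch below contrasts it with the total row count. It runs on a made-up toy frame, not on `df_raw`: the first five rows copy the `head()` output shown in Step 2, and the last row is invented to include missing values.

```python
import numpy as np
import pandas as pd

# Toy stand-in for df_raw. First five rows match the head() output in Step 2;
# the sixth row is hypothetical and exists only to introduce missing values.
toy = pd.DataFrame({
    "admit":    [0, 1, 1, 1, 0, 1],
    "gre":      [380, 660, 800, 640, 520, np.nan],
    "gpa":      [3.61, 3.67, 4.00, 3.19, 2.93, np.nan],
    "prestige": [3, 3, 1, 4, 4, 2],
})

n_rows = len(toy)               # total rows, missing values included
non_null = toy.count()          # non-missing values per column
n_complete = len(toy.dropna())  # rows with no missing value in any column

print(n_rows)
print(non_null)
print(n_complete)
```

On the real data, the same three quantities can be read off `df_raw` in the same way.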

Answer:

Question 2. Create a summary table

#function
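
One possible starting point is pandas' built-in `describe()`, which returns count, mean, standard deviation, min, quartiles, and max for each numeric column. The wrapper name `summary_table` below is hypothetical, and the toy frame only copies the five rows shown by `head()` in Step 2, so the numbers are not the answer for the full dataset.

```python
import pandas as pd

# Toy frame: the five rows printed by head() in Step 2, not the full dataset.
toy = pd.DataFrame({
    "admit":    [0, 1, 1, 1, 0],
    "gre":      [380, 660, 800, 640, 520],
    "gpa":      [3.61, 3.67, 4.00, 3.19, 2.93],
    "prestige": [3, 3, 1, 4, 4],
})


def summary_table(df):
    """Hypothetical helper: return descriptive statistics for numeric columns."""
    return df.describe()


summary = summary_table(toy)
print(summary)
```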

Question 3. Why would GRE have a larger STD than GPA?

Answer:

Question 4. Drop data points with missing data
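
A minimal sketch of one way to do this, using `DataFrame.dropna()`. By default it removes every row that has at least one missing value and returns a new frame, leaving the original untouched. The toy frame and the name `clean` are illustrative assumptions; on the real data the call would be made on `df_raw`.

```python
import numpy as np
import pandas as pd

# Toy frame; the last row is made up and has missing gre and gpa values.
toy = pd.DataFrame({
    "admit":    [0, 1, 1, 1, 0, 1],
    "gre":      [380, 660, 800, 640, 520, np.nan],
    "gpa":      [3.61, 3.67, 4.00, 3.19, 2.93, np.nan],
    "prestige": [3, 3, 1, 4, 4, 2],
})

# Drop every row with at least one missing value.
# dropna() returns a new DataFrame; toy itself is not modified.
clean = toy.dropna()

print(len(toy))
print(len(clean))
```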

Question 5. Confirm that you dropped the correct data. How can you tell?
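
One way to check, sketched on a made-up toy frame: count the rows that contain any missing value before dropping, and verify that exactly that many rows disappeared and that no missing values remain. All variable names here are illustrative.

```python
import numpy as np
import pandas as pd

# Toy frame with two hypothetical incomplete rows (the last two).
toy = pd.DataFrame({
    "admit":    [0, 1, 1, 1, 0, 1, 0],
    "gre":      [380, 660, 800, 640, 520, np.nan, 700],
    "gpa":      [3.61, 3.67, 4.00, 3.19, 2.93, 3.50, np.nan],
    "prestige": [3, 3, 1, 4, 4, 2, np.nan],
})
clean = toy.dropna()

# Rows that had at least one missing value before dropping.
rows_with_missing = int(toy.isnull().any(axis=1).sum())
# Rows actually removed by dropna().
rows_dropped = len(toy) - len(clean)
# Missing values left over after dropping (should be zero).
remaining_missing = int(clean.isnull().sum().sum())

print(rows_with_missing)
print(rows_dropped)
print(remaining_missing)
```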

Answer:

Question 6. Create box plots for GRE and GPA

#boxplot 1
#boxplot 2
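
A sketch of the two box plots using `DataFrame.boxplot()`, drawn on separate axes because GRE and GPA are on very different scales. The toy frame copies only the five `head()` rows from Step 2, so the resulting plots are illustrative. The `Agg` backend line is an assumption for running outside a notebook; inside the notebook, `%matplotlib inline` already handles display.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend for scripts; not needed with %matplotlib inline
import matplotlib.pyplot as plt
import pandas as pd

# Toy frame: the five rows printed by head() in Step 2.
toy = pd.DataFrame({
    "gre": [380, 660, 800, 640, 520],
    "gpa": [3.61, 3.67, 4.00, 3.19, 2.93],
})

# One axis per variable so each box plot gets its own y-scale.
fig, axes = plt.subplots(1, 2, figsize=(8, 4))

#boxplot 1
toy.boxplot(column="gre", ax=axes[0])
axes[0].set_title("GRE")

#boxplot 2
toy.boxplot(column="gpa", ax=axes[1])
axes[1].set_title("GPA")

fig.tight_layout()
```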

Question 7. What do these plots show?

Answer:

Question 8. Describe each distribution

# plot the distribution of each variable
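
A possible sketch: histograms via `DataFrame.hist()` give a visual impression of each distribution, and `skew()` gives a numeric measure of asymmetry (values near 0 suggest a roughly symmetric distribution). The toy frame below uses only the five `head()` rows from Step 2, so it says nothing reliable about the full data; the bin count is an arbitrary choice.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend for scripts; not needed with %matplotlib inline
import pandas as pd

# Toy frame: the five rows printed by head() in Step 2.
toy = pd.DataFrame({
    "gre": [380, 660, 800, 640, 520],
    "gpa": [3.61, 3.67, 4.00, 3.19, 2.93],
})

# One histogram per column; hist() returns an array of matplotlib axes.
axes = toy[["gre", "gpa"]].hist(bins=5)

# Sample skewness per column: negative means a longer left tail,
# positive means a longer right tail.
skewness = toy[["gre", "gpa"]].skew()
print(skewness)
```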

Question 9. If our model had an assumption of a normal distribution would we meet that requirement?

Answer:

Question 10. Does this distribution need correction? If so, why? How?

Answer:

Question 11. Which of our variables are potentially collinear?

# create a correlation matrix for the data
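
A minimal sketch using `DataFrame.corr()`, which computes pairwise Pearson correlations between numeric columns by default. Pairs of predictors with a correlation far from 0 are candidates for collinearity. The toy frame holds only the five `head()` rows from Step 2, so the values it produces are not meaningful for the full dataset.

```python
import pandas as pd

# Toy frame: the five rows printed by head() in Step 2.
toy = pd.DataFrame({
    "admit":    [0, 1, 1, 1, 0],
    "gre":      [380, 660, 800, 640, 520],
    "gpa":      [3.61, 3.67, 4.00, 3.19, 2.93],
    "prestige": [3, 3, 1, 4, 4],
})

# Pairwise Pearson correlation matrix: symmetric, with 1.0 on the diagonal.
corr = toy.corr()
print(corr)
```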

Question 12. What did you find?

Answer:

Question 13. Write an analysis plan for exploring the association between grad school admissions rates and prestige of undergraduate schools.
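
As one possible first step in such a plan (a sketch, not the expected answer), the admission rate can be computed per prestige level by grouping on `prestige` and averaging the 0/1 `admit` column. The toy frame uses only the five `head()` rows from Step 2, and the name `admit_rate` is illustrative.

```python
import pandas as pd

# Toy frame: the five rows printed by head() in Step 2.
toy = pd.DataFrame({
    "admit":    [0, 1, 1, 1, 0],
    "prestige": [3, 3, 1, 4, 4],
})

# The mean of a 0/1 column is the share of 1s, i.e. the admission rate
# within each prestige level.
admit_rate = toy.groupby("prestige")["admit"].mean()
print(admit_rate)
```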

Answer:

Question 14. What is your hypothesis?

Answer:

Bonus/Advanced

1. Bonus: Explore alternatives to dropping observations with missing data

2. Bonus: Log transform the skewed data
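
A sketch of the idea on a made-up, right-skewed series (not a column from the admissions data): taking the natural log with `np.log` compresses large values and typically reduces right skew. It requires strictly positive values; `np.log1p` is a common alternative when zeros are present.

```python
import numpy as np
import pandas as pd

# Hypothetical right-skewed values, chosen only to illustrate the transform.
values = pd.Series([1, 2, 2, 3, 4, 5, 8, 40, 120], dtype=float)

# Natural log transform; valid because all values are strictly positive.
logged = np.log(values)

skew_before = values.skew()
skew_after = logged.skew()
print(skew_before)
print(skew_after)
```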

3. Advanced: Impute missing data
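
One simple approach, sketched here under the assumption that median imputation is acceptable for the analysis, is to fill each missing value with the median of its column using `fillna()`. Other strategies (mean, mode, model-based imputation) may suit the data better. The toy frame's last row is made up to contain missing values.

```python
import numpy as np
import pandas as pd

# Toy frame; the last row is hypothetical and has missing gre and gpa values.
toy = pd.DataFrame({
    "admit":    [0, 1, 1, 1, 0, 1],
    "gre":      [380, 660, 800, 640, 520, np.nan],
    "gpa":      [3.61, 3.67, 4.00, 3.19, 2.93, np.nan],
    "prestige": [3, 3, 1, 4, 4, 2],
})

# Column medians are computed from the observed (non-missing) values only,
# then used to fill the gaps column by column.
imputed = toy.fillna(toy.median())
print(imputed)
```

Unlike dropping rows, this keeps every observation, at the cost of slightly understating the variance of the imputed columns.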