Project #2: Exploratory Analysis
DS | Unit Project 2
PROMPT
In this project, you will implement the exploratory analysis plan developed in Project 1. This will lay the groundwork for our first modeling exercise in Project 3.
Before completing an analysis, it is critical to understand your data. You will need to identify all the biases and variables in your model in order to accurately assess the strengths and limitations of your analysis and predictions.
Following these steps will help you better understand your dataset.
Goal: An IPython notebook writeup that provides a dataset overview with visualizations and statistical analysis.
DELIVERABLES
IPython Notebook Exploratory Analysis
Requirements:
Read in your dataset, determine how many samples are present, and identify any missing data
Create a table of descriptive statistics for each of the variables (n, mean, median, standard deviation)
Describe the distributions of your data
Plot box plots for each variable
Create a covariance matrix
Determine any issues or limitations, based on your exploratory analysis
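As a rough starting point, the requirements above (apart from the box plots) might be sketched with pandas as follows. The rows in csv_text are made up for illustration and only stand in for Admissions.csv; in your notebook you would pass the path to the real file to pd.read_csv instead. The variable names (n_samples, summary, and so on) are placeholders, not part of the starter code.

```python
import io

import pandas as pd

# Made-up rows standing in for Admissions.csv; the real values will differ.
csv_text = """admit,gre,gpa,rank
0,400,3.20,3
1,620,3.60,2
1,780,3.90,1
0,500,3.00,4
1,700,,1
0,,2.80,3
1,660,3.40,2
0,540,3.10,4
"""

# In the project, replace io.StringIO(csv_text) with the path to the CSV file.
df = pd.read_csv(io.StringIO(csv_text))

# Number of samples (rows) and number of missing values in each column.
n_samples = len(df)
missing_per_column = df.isnull().sum()

# Descriptive statistics: n (non-missing count), mean, median, standard deviation.
summary = df.agg(["count", "mean", "median", "std"]).T

# Pairwise covariance matrix; pandas excludes missing values pair by pair.
cov_matrix = df.cov()

print("samples:", n_samples)
print(missing_per_column)
print(summary)
print(cov_matrix)
```

Note that count here is the number of non-missing values, so it can be smaller than the number of samples for columns with missing data.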
Bonus:
Replace missing values using the median replacement method
Log transform data to meet normality requirements
Advanced Option: Replace missing values using multiple imputation methods
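The first two bonus items could look something like the sketch below. It assumes a small made-up DataFrame in place of the real data, and it assumes the column being log transformed is strictly positive. Whether a log transform actually brings a variable closer to normal depends on the shape of its distribution, so compare the skewness before and after rather than taking the improvement for granted.

```python
import numpy as np
import pandas as pd

# Made-up values for illustration only; np.nan marks a missing entry.
df = pd.DataFrame({
    "gre": [400.0, 620.0, np.nan, 700.0, 540.0],
    "gpa": [3.2, np.nan, 3.9, 3.5, 2.8],
})

# Median replacement: fill each column's missing values with that column's median.
df_filled = df.fillna(df.median())

# Log transform: natural log of a strictly positive column, kept as a new column.
df_filled["log_gre"] = np.log(df_filled["gre"])

# Compare skewness before and after the transform.
print(df_filled[["gre", "log_gre"]].skew())
```

Keeping the transformed values in a separate column (the hypothetical log_gre) leaves the original scores available for comparison.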
Submission:
Instructor TBD
TIMELINE
Deadline | Deliverable | Description |
---|---|---|
Lesson 5 | Project 2 | Exploratory Data Analysis |
EVALUATION
Your project will be assessed using the following standards:
Parse the Data
Rubric: Click here for the complete rubric.
Requirements for these standards will be assessed using the scale below:
While your total score is a helpful gauge of whether you've met overall project goals, specific scores are more important since they'll show you where to focus your efforts in the future!
RESOURCES
Dataset
We'll be using the same dataset as UCLA's Logistic Regression in R tutorial to explore logistic regression in Python, as explained in yhat's blog. This is an excellent resource for using logistic regression and summary statistics to explore a relevant dataset. Our goal will be to identify the various factors that may influence admission into graduate school. The dataset contains four variables: admit, gre, gpa, and rank.
'admit' is a binary variable. It indicates whether a candidate was admitted (admit = 1) or not (admit = 0)
'gre' is GRE score
'gpa' stands for Grade Point Average
'rank' is the rank of an applicant's undergraduate alma mater, with 1 being the highest and 4 the lowest
Dataset: Admissions.csv
Starter code
For this project we will be using an IPython notebook. This notebook will use matplotlib for plotting and visualizing our data. This type of visualization is handy for prototyping and quick data analysis. We will discuss more advanced data visualizations for disseminating your work.
Open the starter code instructions in an IPython notebook.
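For the box plot requirement, one possible matplotlib approach is sketched below. This is not the starter code itself. The DataFrame holds made-up values, and the Agg backend line is only there so the sketch also runs outside a notebook; inside an IPython notebook you would normally drop it and let the figure render inline.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend; omit this line inside a notebook

import matplotlib.pyplot as plt
import pandas as pd

# Made-up values for illustration only.
df = pd.DataFrame({
    "admit": [0, 1, 1, 0, 1, 0],
    "gre": [400, 620, 780, 500, 700, 540],
    "gpa": [3.2, 3.6, 3.9, 3.0, 3.5, 2.8],
    "rank": [3, 2, 1, 4, 1, 3],
})

# One box plot per variable, each on its own axis because the scales differ.
fig, axes = plt.subplots(1, len(df.columns), figsize=(12, 3))
for ax, column in zip(axes, df.columns):
    ax.boxplot(df[column].dropna().to_numpy())
    ax.set_title(column)
fig.tight_layout()
```

Separate axes are used here because gre is on a scale of hundreds while gpa and rank are single digits; on a shared axis the smaller variables would be flattened.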
Suggestions for Getting Started
Read in your dataset
Try out a few pandas commands for describing your data:
df.describe(),
df['columnName'].sum(),
df['columnName'].mean(),
df['columnName'].count(),
df['columnName'].skew(),
df.corr()
Read the docs for Pandas. Most of the time, there is a tutorial that you can follow, but not always, and learning to read documentation is crucial to your success as a data scientist.
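The pandas commands suggested above can be tried on a small example before you run them on the real data. The values below are made up, and df is simply an assumed name for your DataFrame.

```python
import pandas as pd

# Made-up values for illustration only.
df = pd.DataFrame({
    "admit": [0, 1, 1, 0, 1, 0],
    "gre": [400, 620, 780, 500, 700, 540],
    "gpa": [3.2, 3.6, 3.9, 3.0, 3.5, 2.8],
    "rank": [3, 2, 1, 4, 1, 3],
})

print(df.describe())       # count, mean, std, min, quartiles, max for every column
print(df["admit"].sum())   # number admitted, since admit is coded as 0 or 1
print(df["gre"].mean())    # average GRE score
print(df["gpa"].count())   # number of non-missing GPA values
print(df["gre"].skew())    # sample skewness of the GRE scores
print(df.corr())           # pairwise correlation matrix
```

Because admit is coded as 0 or 1, its sum is the number of admitted candidates and its mean is the admission rate.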
Past Projects
Look at some sample notebooks for an example of the types of visualizations you can use in your notebook.