Path: blob/master/lessons/lesson_12/data_cleaning.ipynb
1904 views
Kernel: Python 3
Data Cleaning
Datasets are available here: https://drive.google.com/open?id=1A-Y55qgJMB0L-xzeMqmpmzYBPzfP3Vi2
In [1]:
Get List of Files
In [2]:
Out[2]:
example file: AAPL.csv company: AAPL
Read in first file
In [3]:
Out[3]:
In [4]:
Out[4]:
(8924, 8)
Read in second file to demonstrate concatenation
In [5]:
In [6]:
Out[6]:
(4772, 8)
In [7]:
Out[7]:
In [8]:
Out[8]:
In [9]:
Out[9]:
(13696, 8)
In [10]:
Out[10]:
True
Read in rest of the files and append
In [ ]:
In [ ]:
In [ ]:
Group By
You can group data and work with grouped data, similar to how you would in SQL
In [11]:
Out[11]:
In [12]:
Out[12]:
In [14]:
Out[14]:
company
AAPL 818342205800
Name: Volume, dtype: int64
Hit "Tab" after the dot to see additional attributes
In [15]:
Out[15]:
File "<ipython-input-15-0eb7437abb25>", line 1
grouped.
^
SyntaxError: invalid syntax
You can apply multiple aggregations at once
In [17]:
Out[17]:
You can apply different functions to different columns using a dictionary
In [18]:
Out[18]:
Excercise:
Create a new variable "grouped2" where you group our original dataframe by company.
Then choose a column or two and perform several aggregations on it
In [19]:
In [20]:
Out[20]:
Use "Transform" to modify observations within groups
First we will randomly create "NA" values in the 'Close" column and then we will set those NA values to the mean within each group
In [21]:
In [22]:
Out[22]:
C:\Users\ystrano\AppData\Local\Continuum\anaconda3\lib\site-packages\ipykernel_launcher.py:1: FutureWarning:
Passing list-likes to .loc or [] with any missing label will raise
KeyError in the future, you can use .reindex() as an alternative.
See the documentation here:
http://pandas.pydata.org/pandas-docs/stable/indexing.html#deprecate-loc-reindex-listlike
"""Entry point for launching an IPython kernel.
C:\Users\ystrano\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\core\indexing.py:1367: FutureWarning:
Passing list-likes to .loc or [] with any missing label will raise
KeyError in the future, you can use .reindex() as an alternative.
See the documentation here:
http://pandas.pydata.org/pandas-docs/stable/indexing.html#deprecate-loc-reindex-listlike
return self._getitem_tuple(key)
In [ ]:
In [ ]:
In [ ]:
In [ ]:
In [ ]:
In [ ]:
Excercise:
Create a function and perform a transformation on your grouped data
In [ ]:
A pivot works just like excel
In [ ]:
In [ ]:
Check out the nulls
In [ ]:
In [ ]:
Look at the nulls directly
In [ ]:
Nans are just ignored when we do arithmetic calculations
In [ ]:
We can fill those in with 0's if we want
In [ ]:
Excercise:
Fill in the nulls with another value
In [ ]:
Have to add the 'in place = True'
In [ ]:
Time Series and Indexing
Pandas has many time series functions you can do
In [ ]:
In [ ]:
In [ ]:
In [ ]:
In [ ]:
In [ ]:
In [ ]:
In [ ]:
In [ ]:
In [ ]:
In [ ]:
Excercise
Answer the following question:
In [ ]:
Pipes
In [ ]:
In [ ]:
Save some Memory
In [ ]:
Load in a New dataset
In [ ]:
In [2]:
Out[2]:
/anaconda3/lib/python3.6/site-packages/IPython/core/interactiveshell.py:2728: DtypeWarning: Columns (6,7,22,39,61,69,79,80) have mixed types. Specify dtype option on import or set low_memory=False.
interactivity=interactivity, compiler=compiler, result=result)
In [3]:
Out[3]:
(200000, 97)
Look at the Data
In [4]:
Out[4]:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 200000 entries, 0 to 199999
Data columns (total 97 columns):
Unnamed: 0 200000 non-null int64
Job # 200000 non-null int64
Doc # 200000 non-null int64
Borough 200000 non-null object
House # 200000 non-null object
Street Name 200000 non-null object
Block 199419 non-null object
Lot 199419 non-null object
Bin # 200000 non-null int64
Job Type 200000 non-null object
Job Status 200000 non-null object
Job Status Descrp 200000 non-null object
Latest Action Date 200000 non-null object
Building Type 200000 non-null object
Community - Board 199855 non-null float64
Cluster 59734 non-null object
Landmarked 191389 non-null object
Adult Estab 163907 non-null object
Loft Board 119834 non-null object
City Owned 10062 non-null object
Little e 56566 non-null object
PC Filed 132700 non-null object
eFiling Filed 50896 non-null object
Plumbing 77611 non-null object
Mechanical 29906 non-null object
Boiler 5332 non-null object
Fuel Burning 1252 non-null object
Fuel Storage 681 non-null object
Standpipe 1216 non-null object
Sprinkler 8015 non-null object
Fire Alarm 4287 non-null object
Equipment 59437 non-null object
Fire Suppression 2423 non-null object
Curb Cut 33471 non-null object
Other 109696 non-null object
Other Description 109656 non-null object
Applicant's First Name 199986 non-null object
Applicant's Last Name 200000 non-null object
Applicant Professional Title 199962 non-null object
Applicant License # 179187 non-null object
Professional Cert 122887 non-null object
Pre- Filing Date 200000 non-null object
Paid 199328 non-null object
Fully Paid 198810 non-null object
Assigned 135054 non-null object
Approved 161232 non-null object
Fully Permitted 136099 non-null object
Initial Cost 200000 non-null object
Total Est. Fee 200000 non-null object
Fee Status 200000 non-null object
Existing Zoning Sqft 200000 non-null int64
Proposed Zoning Sqft 200000 non-null int64
Horizontal Enlrgmt 3838 non-null object
Vertical Enlrgmt 2655 non-null object
Enlargement SQ Footage 200000 non-null int64
Street Frontage 200000 non-null int64
ExistingNo. of Stories 200000 non-null int64
Proposed No. of Stories 200000 non-null int64
Existing Height 200000 non-null int64
Proposed Height 200000 non-null int64
Existing Dwelling Units 31755 non-null float64
Proposed Dwelling Units 96028 non-null object
Existing Occupancy 71634 non-null object
Proposed Occupancy 110700 non-null object
Site Fill 173893 non-null object
Zoning Dist1 147124 non-null object
Zoning Dist2 17589 non-null object
Zoning Dist3 853 non-null object
Special District 1 17178 non-null object
Special District 2 3309 non-null object
Owner Type 185788 non-null object
Non-Profit 186299 non-null object
Owner's First Name 199952 non-null object
Owner's Last Name 199964 non-null object
Owner's Business Name 152176 non-null object
Owner's House Number 199956 non-null object
Owner'sHouse Street Name 199995 non-null object
City 199981 non-null object
State 199973 non-null object
Zip 199920 non-null object
Owner'sPhone # 197069 non-null object
Job Description 91499 non-null object
DOBRunDate 200000 non-null object
JOB_S1_NO 200000 non-null int64
TOTAL_CONSTRUCTION_FLOOR_AREA 200000 non-null int64
WITHDRAWAL_FLAG 200000 non-null int64
SIGNOFF_DATE 117103 non-null object
SPECIAL_ACTION_STATUS 184037 non-null object
SPECIAL_ACTION_DATE 16604 non-null object
BUILDING_CLASS 197846 non-null object
JOB_NO_GOOD_COUNT 200000 non-null int64
GIS_LATITUDE 198805 non-null float64
GIS_LONGITUDE 198805 non-null float64
GIS_COUNCIL_DISTRICT 198805 non-null float64
GIS_CENSUS_TRACT 198805 non-null float64
GIS_NTA_NAME 198805 non-null object
GIS_BIN 194686 non-null float64
dtypes: float64(7), int64(16), object(74)
memory usage: 148.0+ MB
Some of those datatypes will cause a problem for plotting
In [5]:
Out[5]:
Unnamed: 0 int64
Job # int64
Doc # int64
Borough object
House # object
Street Name object
Block object
Lot object
Bin # int64
Job Type object
Job Status object
Job Status Descrp object
Latest Action Date object
Building Type object
Community - Board float64
Cluster object
Landmarked object
Adult Estab object
Loft Board object
City Owned object
Little e object
PC Filed object
eFiling Filed object
Plumbing object
Mechanical object
Boiler object
Fuel Burning object
Fuel Storage object
Standpipe object
Sprinkler object
...
Zoning Dist3 object
Special District 1 object
Special District 2 object
Owner Type object
Non-Profit object
Owner's First Name object
Owner's Last Name object
Owner's Business Name object
Owner's House Number object
Owner'sHouse Street Name object
City object
State object
Zip object
Owner'sPhone # object
Job Description object
DOBRunDate object
JOB_S1_NO int64
TOTAL_CONSTRUCTION_FLOOR_AREA int64
WITHDRAWAL_FLAG int64
SIGNOFF_DATE object
SPECIAL_ACTION_STATUS object
SPECIAL_ACTION_DATE object
BUILDING_CLASS object
JOB_NO_GOOD_COUNT int64
GIS_LATITUDE float64
GIS_LONGITUDE float64
GIS_COUNCIL_DISTRICT float64
GIS_CENSUS_TRACT float64
GIS_NTA_NAME object
GIS_BIN float64
Length: 97, dtype: object
Lets look at the Initial Cost and Total Estimated Fee Columns
In [6]:
Out[6]:
Convert the 'Initial Cost and Total Est. Fee' columns to float
In [7]:
Look at the Borough Column
In [8]:
Out[8]:
BROOKLYN 72853
QUEENS 47902
MANHATTAN 41527
STATEN ISLAND 20247
BRONX 17471
Name: Borough, dtype: int64
In [9]:
Out[9]:
(array([0, 1, 2, 3, 4]), <a list of 5 Text xticklabel objects>)
Look at the scatter plot of Initial cost vs Total Estimated. What do those 0's represent
In [10]:
Out[10]:
In [ ]:
In [11]:
Out[11]:
In [12]:
Out[12]:
In [13]:
Out[13]:
In [ ]:
Lets compare regression outputs
In [14]:
In [15]:
Out[15]:
Index(['Unnamed:_0', 'Job_#', 'Doc_#', 'Borough', 'House_#', 'Street_Name',
'Block', 'Lot', 'Bin_#', 'Job_Type', 'Job_Status', 'Job_Status_Descrp',
'Latest_Action_Date', 'Building_Type', 'Community_-_Board', 'Cluster',
'Landmarked', 'Adult_Estab', 'Loft_Board', 'City_Owned', 'Little_e',
'PC_Filed', 'eFiling_Filed', 'Plumbing', 'Mechanical', 'Boiler',
'Fuel_Burning', 'Fuel_Storage', 'Standpipe', 'Sprinkler', 'Fire_Alarm',
'Equipment', 'Fire_Suppression', 'Curb_Cut', 'Other',
'Other_Description', 'Applicant's_First_Name', 'Applicant's_Last_Name',
'Applicant_Professional_Title', 'Applicant_License_#',
'Professional_Cert', 'Pre-_Filing_Date', 'Paid', 'Fully_Paid',
'Assigned', 'Approved', 'Fully_Permitted', 'Initial_Cost',
'Total_Est_Fee', 'Fee_Status', 'Existing_Zoning_Sqft',
'Proposed_Zoning_Sqft', 'Horizontal_Enlrgmt', 'Vertical_Enlrgmt',
'Enlargement_SQ_Footage', 'Street_Frontage', 'ExistingNo_of_Stories',
'Proposed_No_of_Stories', 'Existing_Height', 'Proposed_Height',
'Existing_Dwelling_Units', 'Proposed_Dwelling_Units',
'Existing_Occupancy', 'Proposed_Occupancy', 'Site_Fill', 'Zoning_Dist1',
'Zoning_Dist2', 'Zoning_Dist3', 'Special_District_1',
'Special_District_2', 'Owner_Type', 'Non-Profit', 'Owner's_First_Name',
'Owner's_Last_Name', 'Owner's_Business_Name', 'Owner's_House_Number',
'Owner'sHouse_Street_Name', 'City_', 'State', 'Zip', 'Owner'sPhone_#',
'Job_Description', 'DOBRunDate', 'JOB_S1_NO',
'TOTAL_CONSTRUCTION_FLOOR_AREA', 'WITHDRAWAL_FLAG', 'SIGNOFF_DATE',
'SPECIAL_ACTION_STATUS', 'SPECIAL_ACTION_DATE', 'BUILDING_CLASS',
'JOB_NO_GOOD_COUNT', 'GIS_LATITUDE', 'GIS_LONGITUDE',
'GIS_COUNCIL_DISTRICT', 'GIS_CENSUS_TRACT', 'GIS_NTA_NAME', 'GIS_BIN'],
dtype='object')
In [16]:
In [17]:
Out[17]:
[(array(['BRONX'], dtype=object),
<statsmodels.regression.linear_model.RegressionResultsWrapper at 0x10931eb00>),
(array(['BROOKLYN'], dtype=object),
<statsmodels.regression.linear_model.RegressionResultsWrapper at 0x109422518>),
(array(['MANHATTAN'], dtype=object),
<statsmodels.regression.linear_model.RegressionResultsWrapper at 0x109304d30>),
(array(['QUEENS'], dtype=object),
<statsmodels.regression.linear_model.RegressionResultsWrapper at 0x109422c18>),
(array(['STATEN ISLAND'], dtype=object),
<statsmodels.regression.linear_model.RegressionResultsWrapper at 0x10f97e630>)]
In [18]:
Out[18]:
['BRONX']
OLS Regression Results
==============================================================================
Dep. Variable: Total_Est_Fee R-squared: 0.914
Model: OLS Adj. R-squared: 0.914
Method: Least Squares F-statistic: 3.575e+04
Date: Wed, 23 May 2018 Prob (F-statistic): 0.00
Time: 20:10:56 Log-Likelihood: -33004.
No. Observations: 3374 AIC: 6.601e+04
Df Residuals: 3372 BIC: 6.602e+04
Df Model: 1
Covariance Type: nonrobust
================================================================================
coef std err t P>|t| [0.025 0.975]
--------------------------------------------------------------------------------
Intercept -329.1929 75.125 -4.382 0.000 -476.488 -181.898
Initial_Cost 0.0125 6.6e-05 189.083 0.000 0.012 0.013
==============================================================================
Omnibus: 7979.867 Durbin-Watson: 1.994
Prob(Omnibus): 0.000 Jarque-Bera (JB): 252323646.504
Skew: 22.723 Prob(JB): 0.00
Kurtosis: 1341.943 Cond. No. 1.16e+06
==============================================================================
Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 1.16e+06. This might indicate that there are
strong multicollinearity or other numerical problems.
['BROOKLYN']
OLS Regression Results
==============================================================================
Dep. Variable: Total_Est_Fee R-squared: 0.796
Model: OLS Adj. R-squared: 0.796
Method: Least Squares F-statistic: 1.220e+05
Date: Wed, 23 May 2018 Prob (F-statistic): 0.00
Time: 20:10:56 Log-Likelihood: -3.1998e+05
No. Observations: 31186 AIC: 6.400e+05
Df Residuals: 31184 BIC: 6.400e+05
Df Model: 1
Covariance Type: nonrobust
================================================================================
coef std err t P>|t| [0.025 0.975]
--------------------------------------------------------------------------------
Intercept 215.1242 39.312 5.472 0.000 138.071 292.177
Initial_Cost 0.0088 2.53e-05 349.307 0.000 0.009 0.009
==============================================================================
Omnibus: 119771.922 Durbin-Watson: 1.996
Prob(Omnibus): 0.000 Jarque-Bera (JB): 302567458230.073
Skew: -89.393 Prob(JB): 0.00
Kurtosis: 15261.333 Cond. No. 1.56e+06
==============================================================================
Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 1.56e+06. This might indicate that there are
strong multicollinearity or other numerical problems.
['MANHATTAN']
OLS Regression Results
==============================================================================
Dep. Variable: Total_Est_Fee R-squared: 0.967
Model: OLS Adj. R-squared: 0.967
Method: Least Squares F-statistic: 7.302e+05
Date: Wed, 23 May 2018 Prob (F-statistic): 0.00
Time: 20:10:56 Log-Likelihood: -2.4991e+05
No. Observations: 24608 AIC: 4.998e+05
Df Residuals: 24606 BIC: 4.998e+05
Df Model: 1
Covariance Type: nonrobust
================================================================================
coef std err t P>|t| [0.025 0.975]
--------------------------------------------------------------------------------
Intercept 90.0480 39.839 2.260 0.024 11.962 168.134
Initial_Cost 0.0105 1.23e-05 854.495 0.000 0.010 0.011
==============================================================================
Omnibus: 66181.946 Durbin-Watson: 2.004
Prob(Omnibus): 0.000 Jarque-Bera (JB): 34443300750.119
Skew: 31.456 Prob(JB): 0.00
Kurtosis: 5798.545 Cond. No. 3.26e+06
==============================================================================
Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 3.26e+06. This might indicate that there are
strong multicollinearity or other numerical problems.
['QUEENS']
OLS Regression Results
==============================================================================
Dep. Variable: Total_Est_Fee R-squared: 0.962
Model: OLS Adj. R-squared: 0.962
Method: Least Squares F-statistic: 2.236e+05
Date: Wed, 23 May 2018 Prob (F-statistic): 0.00
Time: 20:10:56 Log-Likelihood: -81202.
No. Observations: 8785 AIC: 1.624e+05
Df Residuals: 8783 BIC: 1.624e+05
Df Model: 1
Covariance Type: nonrobust
================================================================================
coef std err t P>|t| [0.025 0.975]
--------------------------------------------------------------------------------
Intercept 5.4920 26.893 0.204 0.838 -47.224 58.208
Initial_Cost 0.0100 2.12e-05 472.863 0.000 0.010 0.010
==============================================================================
Omnibus: 5206.919 Durbin-Watson: 1.989
Prob(Omnibus): 0.000 Jarque-Bera (JB): 4165718100.760
Skew: 0.308 Prob(JB): 0.00
Kurtosis: 3376.492 Cond. No. 1.28e+06
==============================================================================
Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 1.28e+06. This might indicate that there are
strong multicollinearity or other numerical problems.
['STATEN ISLAND']
OLS Regression Results
==============================================================================
Dep. Variable: Total_Est_Fee R-squared: 0.998
Model: OLS Adj. R-squared: 0.998
Method: Least Squares F-statistic: 9.305e+05
Date: Wed, 23 May 2018 Prob (F-statistic): 0.00
Time: 20:10:56 Log-Likelihood: -15273.
No. Observations: 1969 AIC: 3.055e+04
Df Residuals: 1967 BIC: 3.056e+04
Df Model: 1
Covariance Type: nonrobust
================================================================================
coef std err t P>|t| [0.025 0.975]
--------------------------------------------------------------------------------
Intercept -41.4691 12.821 -3.234 0.001 -66.613 -16.325
Initial_Cost 0.0103 1.07e-05 964.608 0.000 0.010 0.010
==============================================================================
Omnibus: 1809.142 Durbin-Watson: 1.896
Prob(Omnibus): 0.000 Jarque-Bera (JB): 673436.226
Skew: -3.473 Prob(JB): 0.00
Kurtosis: 93.334 Cond. No. 1.21e+06
==============================================================================
Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 1.21e+06. This might indicate that there are
strong multicollinearity or other numerical problems.
In [19]:
Out[19]:
[(array(['BRONX'], dtype=object), 0.012478883889366172),
(array(['BROOKLYN'], dtype=object), 0.008849756985075879),
(array(['MANHATTAN'], dtype=object), 0.010478379863994247),
(array(['QUEENS'], dtype=object), 0.010001338677093165),
(array(['STATEN ISLAND'], dtype=object), 0.010302690934398567)]
In [20]:
Out[20]:
In [ ]: