YStrano
GitHub Repository: YStrano/DataScience_GA
Path: blob/master/lessons/lesson_12/data_cleaning.ipynb
Kernel: Python 3
import pandas as pd
import numpy as np
import os

Get List of Files

PATH = 'python-notebooks-data-wrangling/data/financial/raw/companies'
files_to_read = os.listdir(PATH)  # pulls in a list of all file names in the directory
company_names = [i.split('.')[0] for i in files_to_read]
print('example file: ', files_to_read[0], 'company: ', company_names[0])
example file: AAPL.csv company: AAPL

Read in first file

df = pd.read_csv(os.path.join(PATH, files_to_read[0]))
df['company'] = company_names[0]
df.head()
df.shape
(8924, 8)

Read in second file to demonstrate concatenation

df2 = pd.read_csv(os.path.join(PATH, files_to_read[1]))
df2['company'] = company_names[1]
df2.shape
(4772, 8)
df3 = pd.concat([df, df2])
df3.head()
df3.tail()
df3.shape
(13696, 8)
df3.shape[0] == df.shape[0] + df2.shape[0]
True
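The shape check above generalizes: `pd.concat` stacks frames vertically and, by default, keeps each frame's original index. A minimal sketch with hypothetical toy frames ('AAA' and 'BBB' are made-up tickers, not the notebook's data):

```python
import pandas as pd

# Toy stand-ins for two company files (hypothetical data)
a = pd.DataFrame({'Close': [1.0, 2.0], 'company': 'AAA'})
b = pd.DataFrame({'Close': [3.0], 'company': 'BBB'})

# By default pd.concat keeps each source frame's index (0, 1, 0 here);
# ignore_index=True rebuilds a clean 0..n-1 RangeIndex instead
combined = pd.concat([a, b], ignore_index=True)
```

Row counts add up exactly as in the `df.shape[0] + df2.shape[0]` check above.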

Read in rest of the files and append

for name, file in zip(company_names[1:], files_to_read[1:]):
    print('reading: ', name)
    new_df = pd.read_csv(os.path.join(PATH, file))
    new_df['company'] = name
    df = pd.concat([df, new_df])
df.shape
df.info()
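Concatenating inside the loop copies the accumulated frame on every iteration. A common alternative, shown here as a sketch on made-up data rather than the notebook's CSV files, is to collect the pieces in a list and call `pd.concat` once:

```python
import pandas as pd

frames = []
for name in ['AAA', 'BBB', 'CCC']:  # stand-ins for company_names
    # in the real loop this would be a pd.read_csv(...) per file
    new_df = pd.DataFrame({'Close': [1.0, 2.0], 'company': name})
    frames.append(new_df)

# one concat at the end instead of one per iteration
combined = pd.concat(frames, ignore_index=True)
```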

Group By

You can group data and work with the grouped result, much as you would with GROUP BY in SQL.

grouped = df.groupby('company')
grouped.get_group('AAPL')
grouped.head()  # returns the first 5 rows of each company
grouped['Volume'].sum()
company
AAPL    818342205800
Name: Volume, dtype: int64
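The sum above is the pandas analogue of `SELECT company, SUM(Volume) ... GROUP BY company`. A self-contained toy version with hypothetical numbers:

```python
import pandas as pd

prices = pd.DataFrame({'company': ['AAA', 'AAA', 'BBB'],
                       'Volume': [10, 20, 5]})
# equivalent to: SELECT company, SUM(Volume) FROM prices GROUP BY company
totals = prices.groupby('company')['Volume'].sum()
```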

Hit "Tab" after the dot to see additional attributes

grouped.
  File "<ipython-input-15-0eb7437abb25>", line 1
    grouped.
            ^
SyntaxError: invalid syntax

You can apply multiple aggregations at once

grouped['Volume'].agg(['sum', 'mean']).head()  # .agg applies several aggregations at once, passed as strings

You can apply different functions to different columns using a dictionary

grouped.agg({'Volume' : 'sum', 'Close' : 'mean'}).head()
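In pandas 0.25 and later there is also "named aggregation", which labels each output column explicitly instead of producing a MultiIndex. A sketch on toy data (not the notebook's frame):

```python
import pandas as pd

toy = pd.DataFrame({'company': ['AAA', 'AAA', 'BBB'],
                    'Volume': [10, 20, 5],
                    'Close': [1.0, 3.0, 2.0]})
# each keyword names an output column: (input column, aggregation)
summary = toy.groupby('company').agg(total_volume=('Volume', 'sum'),
                                     avg_close=('Close', 'mean'))
```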

Exercise:

  • Create a new variable "grouped2" where you group our original dataframe by company.

  • Then choose a column or two and perform several aggregations on it

grouped2 = df.groupby('company')
grouped2[['High', 'Low', 'Close']].agg(['min', 'max', 'mean']).head()

Use "Transform" to modify observations within groups

First we will randomly create NA values in the 'Close' column, and then we will set those NA values to the mean within each group.

# df.loc['row_filter', 'select columns']
df.loc[df['High'] > 50, ['Close', 'company']].head()
df.loc[np.random.choice(df.index, size = 2000), 'Close'] = np.nan
df.info()
df.loc[df['Close'].isnull(),['company','Close']]
len(df.loc[df['Close'].isnull(),['company','Close']])
## Define the function to use with transform
def get_mean(x):
    return x.fillna(x.mean())

grouped = df.groupby('company')
transformed = grouped.transform(get_mean)
# transform takes a function and applies it to each group in "grouped",
# returning a result aligned to the original index.
# This replaces writing the loop below:
# for group in grouped:
#     group = get_mean(group)
transformed.info()
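The key property of `transform` is that the result is aligned to the original index, one value per input row. A minimal reproduction of the fill-with-group-mean pattern on made-up data:

```python
import pandas as pd
import numpy as np

toy = pd.DataFrame({'company': ['AAA', 'AAA', 'BBB', 'BBB'],
                    'Close': [1.0, np.nan, 10.0, np.nan]})
# each NaN is replaced by the mean of its own company, not the overall mean
filled = toy.groupby('company')['Close'].transform(lambda x: x.fillna(x.mean()))
```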

Exercise:

  • Create a function and perform a transformation on your grouped data

A pivot works just like a pivot table in Excel

pivoted = df.pivot(index = 'Date', columns='company', values = 'Close')
pivoted.tail()
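One caveat worth knowing: `pivot` raises an error if an (index, column) pair appears more than once, while `pivot_table` handles duplicates by aggregating. A toy long-to-wide example with hypothetical dates and tickers:

```python
import pandas as pd

long_df = pd.DataFrame({'Date': ['d1', 'd1', 'd2', 'd2'],
                        'company': ['AAA', 'BBB', 'AAA', 'BBB'],
                        'Close': [1.0, 2.0, 3.0, 4.0]})
# one row per Date, one column per company
wide = long_df.pivot(index='Date', columns='company', values='Close')
```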

Check out the nulls

pivoted.info()
pivoted.isnull().sum()

Look at the nulls directly

pivoted.isna() #list(pivoted.isna()==False)

NaNs are simply skipped when we do aggregation calculations like the mean

pivoted.mean()

We can fill those in with 0's if we want

pivoted.fillna(0)
pivoted.head()

Exercise:

Fill in the nulls with another value

#pivoted.fillna(pivoted.mean(), inplace=True)
pivoted.head()

You have to add "inplace=True" (or assign the result back) for the fill to stick

#pivoted.fillna(0, inplace=True)
pivoted.head()
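The difference between the two cells above is that `fillna` returns a new object unless you either assign the result back or pass `inplace=True`. A small sketch:

```python
import pandas as pd
import numpy as np

s_df = pd.DataFrame({'x': [1.0, np.nan]})
filled_df = s_df.fillna(0)  # returns a filled copy
# s_df itself is unchanged unless you assign back or use inplace=True
```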

Time Series and Indexing

Pandas has many time series functions you can use

pivoted.index.dtype
pivoted.index = pd.to_datetime(pivoted.index)
pivoted.index
pivoted.index.dtype
pivoted['month'] = pivoted.index.month
pivoted.groupby(['month']).mean()
pivoted['year'] = pivoted.index.year
pivoted.groupby(['year','month']).mean().tail()
pivoted.dropna(inplace=True)
pivoted.pct_change()
pivoted.resample('M').mean().head()   # gives the avg. by month
pivoted.resample('W').mean().head()   # gives the avg. by week
pivoted.resample('M').last().head()   # gives the last value by month
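`resample` only needs a DatetimeIndex and a frequency alias ('M' is calendar month end, 'W' is week; newer pandas versions prefer the spelling 'ME' over 'M'). A self-contained toy example with made-up dates:

```python
import pandas as pd

idx = pd.to_datetime(['2018-01-15', '2018-01-31', '2018-02-10'])
s = pd.Series([1.0, 3.0, 5.0], index=idx)
# January averages 1.0 and 3.0; February has only 5.0
monthly = s.resample('M').mean()
```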

Exercise

Answer the following question:

In which month does Apple stock have the highest average monthly return? Which month has the lowest?
pivoted.groupby(['month']).mean().pct_change()['AAPL'].sort_values()*100

Pipes

def monthly_avg(df):
    return df.resample('M').mean()

def pct_change(df):
    return df.pct_change()

def rolling_mean(df, period):
    return df.rolling(window=period).mean()  # rolling-window function in pandas

def dropna(df):
    return df.dropna()

rolling_3m = (pivoted.pipe(dropna)
                     .pipe(monthly_avg)
                     .pipe(pct_change)
                     .pipe(rolling_mean, period=3))
rolling_3m
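`pipe` simply passes the frame as the first argument of the function, forwarding any extra keyword arguments, so chains read top to bottom. A minimal sketch with made-up order data (names like `add_tax` are illustrative only):

```python
import pandas as pd

def add_tax(df, rate):
    # df arrives as the first argument via .pipe; rate is forwarded
    return df.assign(total=df['price'] * (1 + rate))

def round_total(df):
    return df.assign(total=df['total'].round(2))

orders = pd.DataFrame({'price': [10.0, 20.0]})
result = orders.pipe(add_tax, rate=0.1).pipe(round_total)
```

Each step returns a new frame, so the original `orders` is untouched.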

Save some Memory

del df
del pivoted

Load in a New dataset

# NYC Dept. of Buildings job application dataset
df = pd.read_csv('DOB_job_filings.csv')
DtypeWarning: Columns (6,7,22,39,61,69,79,80) have mixed types. Specify dtype option on import or set low_memory=False.
df.shape
(200000, 97)

Look at the Data

df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 200000 entries, 0 to 199999
Data columns (total 97 columns):
Unnamed: 0                       200000 non-null int64
Job #                            200000 non-null int64
Doc #                            200000 non-null int64
Borough                          200000 non-null object
House #                          200000 non-null object
Street Name                      200000 non-null object
Block                            199419 non-null object
Lot                              199419 non-null object
Bin #                            200000 non-null int64
Job Type                         200000 non-null object
Job Status                       200000 non-null object
Job Status Descrp                200000 non-null object
Latest Action Date               200000 non-null object
Building Type                    200000 non-null object
Community - Board                199855 non-null float64
Cluster                          59734 non-null object
Landmarked                       191389 non-null object
Adult Estab                      163907 non-null object
Loft Board                       119834 non-null object
City Owned                       10062 non-null object
Little e                         56566 non-null object
PC Filed                         132700 non-null object
eFiling Filed                    50896 non-null object
Plumbing                         77611 non-null object
Mechanical                       29906 non-null object
Boiler                           5332 non-null object
Fuel Burning                     1252 non-null object
Fuel Storage                     681 non-null object
Standpipe                        1216 non-null object
Sprinkler                        8015 non-null object
Fire Alarm                       4287 non-null object
Equipment                        59437 non-null object
Fire Suppression                 2423 non-null object
Curb Cut                         33471 non-null object
Other                            109696 non-null object
Other Description                109656 non-null object
Applicant's First Name           199986 non-null object
Applicant's Last Name            200000 non-null object
Applicant Professional Title     199962 non-null object
Applicant License #              179187 non-null object
Professional Cert                122887 non-null object
Pre- Filing Date                 200000 non-null object
Paid                             199328 non-null object
Fully Paid                       198810 non-null object
Assigned                         135054 non-null object
Approved                         161232 non-null object
Fully Permitted                  136099 non-null object
Initial Cost                     200000 non-null object
Total Est. Fee                   200000 non-null object
Fee Status                       200000 non-null object
Existing Zoning Sqft             200000 non-null int64
Proposed Zoning Sqft             200000 non-null int64
Horizontal Enlrgmt               3838 non-null object
Vertical Enlrgmt                 2655 non-null object
Enlargement SQ Footage           200000 non-null int64
Street Frontage                  200000 non-null int64
ExistingNo. of Stories           200000 non-null int64
Proposed No. of Stories          200000 non-null int64
Existing Height                  200000 non-null int64
Proposed Height                  200000 non-null int64
Existing Dwelling Units          31755 non-null float64
Proposed Dwelling Units          96028 non-null object
Existing Occupancy               71634 non-null object
Proposed Occupancy               110700 non-null object
Site Fill                        173893 non-null object
Zoning Dist1                     147124 non-null object
Zoning Dist2                     17589 non-null object
Zoning Dist3                     853 non-null object
Special District 1               17178 non-null object
Special District 2               3309 non-null object
Owner Type                       185788 non-null object
Non-Profit                       186299 non-null object
Owner's First Name               199952 non-null object
Owner's Last Name                199964 non-null object
Owner's Business Name            152176 non-null object
Owner's House Number             199956 non-null object
Owner'sHouse Street Name         199995 non-null object
City                             199981 non-null object
State                            199973 non-null object
Zip                              199920 non-null object
Owner'sPhone #                   197069 non-null object
Job Description                  91499 non-null object
DOBRunDate                       200000 non-null object
JOB_S1_NO                        200000 non-null int64
TOTAL_CONSTRUCTION_FLOOR_AREA    200000 non-null int64
WITHDRAWAL_FLAG                  200000 non-null int64
SIGNOFF_DATE                     117103 non-null object
SPECIAL_ACTION_STATUS            184037 non-null object
SPECIAL_ACTION_DATE              16604 non-null object
BUILDING_CLASS                   197846 non-null object
JOB_NO_GOOD_COUNT                200000 non-null int64
GIS_LATITUDE                     198805 non-null float64
GIS_LONGITUDE                    198805 non-null float64
GIS_COUNCIL_DISTRICT             198805 non-null float64
GIS_CENSUS_TRACT                 198805 non-null float64
GIS_NTA_NAME                     198805 non-null object
GIS_BIN                          194686 non-null float64
dtypes: float64(7), int64(16), object(74)
memory usage: 148.0+ MB

Some of those datatypes will cause a problem for plotting

df.dtypes
Unnamed: 0                       int64
Job #                            int64
Doc #                            int64
Borough                         object
House #                         object
Street Name                     object
Block                           object
Lot                             object
Bin #                            int64
Job Type                        object
Job Status                      object
Job Status Descrp               object
Latest Action Date              object
Building Type                   object
Community - Board              float64
Cluster                         object
Landmarked                      object
Adult Estab                     object
Loft Board                      object
City Owned                      object
Little e                        object
PC Filed                        object
eFiling Filed                   object
Plumbing                        object
Mechanical                      object
Boiler                          object
Fuel Burning                    object
Fuel Storage                    object
Standpipe                       object
Sprinkler                       object
                                 ...
Zoning Dist3                    object
Special District 1              object
Special District 2              object
Owner Type                      object
Non-Profit                      object
Owner's First Name              object
Owner's Last Name               object
Owner's Business Name           object
Owner's House Number            object
Owner'sHouse Street Name        object
City                            object
State                           object
Zip                             object
Owner'sPhone #                  object
Job Description                 object
DOBRunDate                      object
JOB_S1_NO                        int64
TOTAL_CONSTRUCTION_FLOOR_AREA    int64
WITHDRAWAL_FLAG                  int64
SIGNOFF_DATE                    object
SPECIAL_ACTION_STATUS           object
SPECIAL_ACTION_DATE             object
BUILDING_CLASS                  object
JOB_NO_GOOD_COUNT                int64
GIS_LATITUDE                   float64
GIS_LONGITUDE                  float64
GIS_COUNCIL_DISTRICT           float64
GIS_CENSUS_TRACT               float64
GIS_NTA_NAME                    object
GIS_BIN                        float64
Length: 97, dtype: object

Let's look at the Initial Cost and Total Estimated Fee columns

%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns

df[['Initial Cost', 'Total Est. Fee']].head()

Convert the 'Initial Cost' and 'Total Est. Fee' columns to float

df['Initial Cost'] = df['Initial Cost'].str.replace('$', '').astype('float')
df['Total Est. Fee'] = df['Total Est. Fee'].str.replace('$', '').astype('float')
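One hedge on the cell above: in recent pandas versions `str.replace` treats the pattern as a regular expression by default, and '$' is the end-of-string anchor, so passing `regex=False` (or escaping as '\$') is the safer spelling. A sketch on made-up fee strings:

```python
import pandas as pd

fees = pd.Series(['$100.00', '$2,500.50'])
# regex=False strips the literal '$' (and ',') before converting to float
clean = (fees.str.replace('$', '', regex=False)
             .str.replace(',', '', regex=False)
             .astype('float'))
```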

Look at the Borough Column

df['Borough'].value_counts(dropna=False)
BROOKLYN         72853
QUEENS           47902
MANHATTAN        41527
STATEN ISLAND    20247
BRONX            17471
Name: Borough, dtype: int64
sns.factorplot(kind='box', x='Borough', y='Initial Cost', data=df)
plt.xticks(rotation=70)
(array([0, 1, 2, 3, 4]), <a list of 5 Text xticklabel objects>)
Image in a Jupyter notebook

Look at the scatter plot of Initial Cost vs. Total Est. Fee. What do those 0's represent?

def show_scatter():
    g = sns.FacetGrid(df, col='Borough', sharex=False, sharey=False)
    g.map(plt.scatter, 'Initial Cost', 'Total Est. Fee')

show_scatter()
Image in a Jupyter notebook
#there are a lot of values at x=0, perhaps that is all the nan values, hypothesis
cols = ['Initial Cost', 'Total Est. Fee', 'Job Type', 'Job Status',
        'Job Status Descrp', 'Latest Action Date']
df.loc[df['Initial Cost'] == 0, cols].head()
df.loc[df['Initial Cost'] == 0].describe()
df.loc[df['Initial Cost'] == 0, 'Initial Cost'] = np.nan
show_scatter()
Image in a Jupyter notebook
# this removed the zeros that were standing in for NaNs

Let's compare regression outputs

df.columns = df.columns.str.replace(' ', '_')
df.columns = df.columns.str.replace('.', '')
df.columns
Index(['Unnamed:_0', 'Job_#', 'Doc_#', 'Borough', 'House_#', 'Street_Name', 'Block', 'Lot', 'Bin_#', 'Job_Type', 'Job_Status', 'Job_Status_Descrp', 'Latest_Action_Date', 'Building_Type', 'Community_-_Board', 'Cluster', 'Landmarked', 'Adult_Estab', 'Loft_Board', 'City_Owned', 'Little_e', 'PC_Filed', 'eFiling_Filed', 'Plumbing', 'Mechanical', 'Boiler', 'Fuel_Burning', 'Fuel_Storage', 'Standpipe', 'Sprinkler', 'Fire_Alarm', 'Equipment', 'Fire_Suppression', 'Curb_Cut', 'Other', 'Other_Description', 'Applicant's_First_Name', 'Applicant's_Last_Name', 'Applicant_Professional_Title', 'Applicant_License_#', 'Professional_Cert', 'Pre-_Filing_Date', 'Paid', 'Fully_Paid', 'Assigned', 'Approved', 'Fully_Permitted', 'Initial_Cost', 'Total_Est_Fee', 'Fee_Status', 'Existing_Zoning_Sqft', 'Proposed_Zoning_Sqft', 'Horizontal_Enlrgmt', 'Vertical_Enlrgmt', 'Enlargement_SQ_Footage', 'Street_Frontage', 'ExistingNo_of_Stories', 'Proposed_No_of_Stories', 'Existing_Height', 'Proposed_Height', 'Existing_Dwelling_Units', 'Proposed_Dwelling_Units', 'Existing_Occupancy', 'Proposed_Occupancy', 'Site_Fill', 'Zoning_Dist1', 'Zoning_Dist2', 'Zoning_Dist3', 'Special_District_1', 'Special_District_2', 'Owner_Type', 'Non-Profit', 'Owner's_First_Name', 'Owner's_Last_Name', 'Owner's_Business_Name', 'Owner's_House_Number', 'Owner'sHouse_Street_Name', 'City_', 'State', 'Zip', 'Owner'sPhone_#', 'Job_Description', 'DOBRunDate', 'JOB_S1_NO', 'TOTAL_CONSTRUCTION_FLOOR_AREA', 'WITHDRAWAL_FLAG', 'SIGNOFF_DATE', 'SPECIAL_ACTION_STATUS', 'SPECIAL_ACTION_DATE', 'BUILDING_CLASS', 'JOB_NO_GOOD_COUNT', 'GIS_LATITUDE', 'GIS_LONGITUDE', 'GIS_COUNCIL_DISTRICT', 'GIS_CENSUS_TRACT', 'GIS_NTA_NAME', 'GIS_BIN'], dtype='object')
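The same regex caveat from the fee cleanup applies to the column renaming above: '.' is a regex wildcard, so `regex=False` (or '\.') makes the intent explicit in newer pandas. A toy version using two of the real column names:

```python
import pandas as pd

cols = pd.Index(['Total Est. Fee', 'Job #'])
# replace spaces with underscores, then strip literal dots
clean_cols = (cols.str.replace(' ', '_', regex=False)
                  .str.replace('.', '', regex=False))
```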
import statsmodels.formula.api as smf

def regress(df, borough):
    return (borough,
            smf.ols('Total_Est_Fee ~ Initial_Cost', data=df, missing='drop').fit())

separate_regressions = [i for i in df.groupby('Borough')
                        .apply(lambda x: regress(x, x['Borough'].unique()))]
# the above replaces writing the loop below:
# for group in df.groupby('Borough'):
#     regress(group)
separate_regressions
[(array(['BRONX'], dtype=object),
  <statsmodels.regression.linear_model.RegressionResultsWrapper at 0x10931eb00>),
 (array(['BROOKLYN'], dtype=object),
  <statsmodels.regression.linear_model.RegressionResultsWrapper at 0x109422518>),
 (array(['MANHATTAN'], dtype=object),
  <statsmodels.regression.linear_model.RegressionResultsWrapper at 0x109304d30>),
 (array(['QUEENS'], dtype=object),
  <statsmodels.regression.linear_model.RegressionResultsWrapper at 0x109422c18>),
 (array(['STATEN ISLAND'], dtype=object),
  <statsmodels.regression.linear_model.RegressionResultsWrapper at 0x10f97e630>)]
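The per-group regression pattern does not depend on statsmodels: any function applied with `groupby(...).apply` runs once per group. A dependency-light sketch using numpy's least-squares fit on made-up, perfectly linear data:

```python
import pandas as pd
import numpy as np

toy = pd.DataFrame({'g': ['a', 'a', 'a', 'b', 'b', 'b'],
                    'x': [1.0, 2.0, 3.0, 1.0, 2.0, 3.0],
                    'y': [2.0, 4.0, 6.0, 3.0, 6.0, 9.0]})
# one least-squares slope per group, analogous to one OLS fit per borough
slopes = toy.groupby('g').apply(lambda d: np.polyfit(d['x'], d['y'], 1)[0])
```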
for i in separate_regressions:
    print(i[0])
    print(i[1].summary())
['BRONX']
                            OLS Regression Results
==============================================================================
Dep. Variable:          Total_Est_Fee   R-squared:                       0.914
Model:                            OLS   Adj. R-squared:                  0.914
Method:                 Least Squares   F-statistic:                 3.575e+04
Date:                Wed, 23 May 2018   Prob (F-statistic):               0.00
Time:                        20:10:56   Log-Likelihood:                -33004.
No. Observations:                3374   AIC:                         6.601e+04
Df Residuals:                    3372   BIC:                         6.602e+04
Df Model:                           1
Covariance Type:            nonrobust
================================================================================
                   coef    std err          t      P>|t|      [0.025      0.975]
--------------------------------------------------------------------------------
Intercept     -329.1929     75.125     -4.382      0.000    -476.488    -181.898
Initial_Cost     0.0125    6.6e-05    189.083      0.000       0.012       0.013
==============================================================================
Omnibus:                     7979.867   Durbin-Watson:                   1.994
Prob(Omnibus):                  0.000   Jarque-Bera (JB):        252323646.504
Skew:                          22.723   Prob(JB):                         0.00
Kurtosis:                    1341.943   Cond. No.                     1.16e+06
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 1.16e+06. This might indicate that there are
strong multicollinearity or other numerical problems.

['BROOKLYN']
                            OLS Regression Results
==============================================================================
Dep. Variable:          Total_Est_Fee   R-squared:                       0.796
Model:                            OLS   Adj. R-squared:                  0.796
Method:                 Least Squares   F-statistic:                 1.220e+05
Date:                Wed, 23 May 2018   Prob (F-statistic):               0.00
Time:                        20:10:56   Log-Likelihood:             -3.1998e+05
No. Observations:               31186   AIC:                         6.400e+05
Df Residuals:                   31184   BIC:                         6.400e+05
Df Model:                           1
Covariance Type:            nonrobust
================================================================================
                   coef    std err          t      P>|t|      [0.025      0.975]
--------------------------------------------------------------------------------
Intercept      215.1242     39.312      5.472      0.000     138.071     292.177
Initial_Cost     0.0088   2.53e-05    349.307      0.000       0.009       0.009
==============================================================================
Omnibus:                   119771.922   Durbin-Watson:                   1.996
Prob(Omnibus):                  0.000   Jarque-Bera (JB):     302567458230.073
Skew:                         -89.393   Prob(JB):                         0.00
Kurtosis:                   15261.333   Cond. No.                     1.56e+06
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 1.56e+06. This might indicate that there are
strong multicollinearity or other numerical problems.

['MANHATTAN']
                            OLS Regression Results
==============================================================================
Dep. Variable:          Total_Est_Fee   R-squared:                       0.967
Model:                            OLS   Adj. R-squared:                  0.967
Method:                 Least Squares   F-statistic:                 7.302e+05
Date:                Wed, 23 May 2018   Prob (F-statistic):               0.00
Time:                        20:10:56   Log-Likelihood:             -2.4991e+05
No. Observations:               24608   AIC:                         4.998e+05
Df Residuals:                   24606   BIC:                         4.998e+05
Df Model:                           1
Covariance Type:            nonrobust
================================================================================
                   coef    std err          t      P>|t|      [0.025      0.975]
--------------------------------------------------------------------------------
Intercept       90.0480     39.839      2.260      0.024      11.962     168.134
Initial_Cost     0.0105   1.23e-05    854.495      0.000       0.010       0.011
==============================================================================
Omnibus:                    66181.946   Durbin-Watson:                   2.004
Prob(Omnibus):                  0.000   Jarque-Bera (JB):      34443300750.119
Skew:                          31.456   Prob(JB):                         0.00
Kurtosis:                    5798.545   Cond. No.                     3.26e+06
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 3.26e+06. This might indicate that there are
strong multicollinearity or other numerical problems.

['QUEENS']
                            OLS Regression Results
==============================================================================
Dep. Variable:          Total_Est_Fee   R-squared:                       0.962
Model:                            OLS   Adj. R-squared:                  0.962
Method:                 Least Squares   F-statistic:                 2.236e+05
Date:                Wed, 23 May 2018   Prob (F-statistic):               0.00
Time:                        20:10:56   Log-Likelihood:                -81202.
No. Observations:                8785   AIC:                         1.624e+05
Df Residuals:                    8783   BIC:                         1.624e+05
Df Model:                           1
Covariance Type:            nonrobust
================================================================================
                   coef    std err          t      P>|t|      [0.025      0.975]
--------------------------------------------------------------------------------
Intercept        5.4920     26.893      0.204      0.838     -47.224      58.208
Initial_Cost     0.0100   2.12e-05    472.863      0.000       0.010       0.010
==============================================================================
Omnibus:                     5206.919   Durbin-Watson:                   1.989
Prob(Omnibus):                  0.000   Jarque-Bera (JB):       4165718100.760
Skew:                           0.308   Prob(JB):                         0.00
Kurtosis:                    3376.492   Cond. No.                     1.28e+06
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 1.28e+06. This might indicate that there are
strong multicollinearity or other numerical problems.

['STATEN ISLAND']
                            OLS Regression Results
==============================================================================
Dep. Variable:          Total_Est_Fee   R-squared:                       0.998
Model:                            OLS   Adj. R-squared:                  0.998
Method:                 Least Squares   F-statistic:                 9.305e+05
Date:                Wed, 23 May 2018   Prob (F-statistic):               0.00
Time:                        20:10:56   Log-Likelihood:                -15273.
No. Observations:                1969   AIC:                         3.055e+04
Df Residuals:                    1967   BIC:                         3.056e+04
Df Model:                           1
Covariance Type:            nonrobust
================================================================================
                   coef    std err          t      P>|t|      [0.025      0.975]
--------------------------------------------------------------------------------
Intercept      -41.4691     12.821     -3.234      0.001     -66.613     -16.325
Initial_Cost     0.0103   1.07e-05    964.608      0.000       0.010       0.010
==============================================================================
Omnibus:                     1809.142   Durbin-Watson:                   1.896
Prob(Omnibus):                  0.000   Jarque-Bera (JB):           673436.226
Skew:                          -3.473   Prob(JB):                         0.00
Kurtosis:                      93.334   Cond. No.                     1.21e+06
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 1.21e+06. This might indicate that there are
strong multicollinearity or other numerical problems.
[(i[0], i[1].params[1]) for i in separate_regressions]
[(array(['BRONX'], dtype=object), 0.012478883889366172),
 (array(['BROOKLYN'], dtype=object), 0.008849756985075879),
 (array(['MANHATTAN'], dtype=object), 0.010478379863994247),
 (array(['QUEENS'], dtype=object), 0.010001338677093165),
 (array(['STATEN ISLAND'], dtype=object), 0.010302690934398567)]
pd.DataFrame([(i[0], round(i[1].params[1], 4)) for i in separate_regressions],
             columns=['Borough', 'Beta'])