YStrano
GitHub Repository: YStrano/DataScience_GA
Path: blob/master/lessons/lesson_12/data_cleaning.ipynb
Kernel: Python 3
import pandas as pd
import numpy as np
import os

Get List of Files

PATH = 'python-notebooks-data-wrangling/data/financial/raw/companies'
files_to_read = os.listdir(PATH)  # pulls in a list of all file names in the directory
company_names = [i.split('.')[0] for i in files_to_read]
print('example file: ', files_to_read[0], 'company: ', company_names[0])
example file: AAPL.csv company: AAPL

Read in first file

df = pd.read_csv(os.path.join(PATH, files_to_read[0]))
df['company'] = company_names[0]
df.head()
df.shape
(8924, 8)

Read in second file to demonstrate concatenation

df2 = pd.read_csv(os.path.join(PATH, files_to_read[1]))
df2['company'] = company_names[1]
df2.shape
(4772, 8)
df3 = pd.concat([df, df2])
df3.head()
df3.tail()
df3.shape
(13696, 8)
df3.shape[0] == df.shape[0] + df2.shape[0]
True
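The shape check above generalizes: `pd.concat` stacks frames vertically and, by default, keeps each frame's original index. A minimal sketch with hypothetical toy frames ('AAA' and 'BBB' are made-up tickers, not the notebook's data):

```python
import pandas as pd

# Toy stand-ins for two company files (hypothetical data)
a = pd.DataFrame({'Close': [1.0, 2.0], 'company': 'AAA'})
b = pd.DataFrame({'Close': [3.0], 'company': 'BBB'})

# By default pd.concat keeps each source frame's index (0, 1, 0 here);
# ignore_index=True rebuilds a clean 0..n-1 RangeIndex instead
combined = pd.concat([a, b], ignore_index=True)
```

Row counts add up exactly as in the `df.shape[0] + df2.shape[0]` check above.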

Read in rest of the files and append

for name, file in zip(company_names[1:], files_to_read[1:]):
    print('reading: ', name)
    new_df = pd.read_csv(os.path.join(PATH, file))
    new_df['company'] = name
    df = pd.concat([df, new_df])
df.shape
df.info()
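Concatenating inside the loop copies the accumulated frame on every iteration. A common alternative, shown here as a sketch on made-up data rather than the notebook's CSV files, is to collect the pieces in a list and call `pd.concat` once:

```python
import pandas as pd

frames = []
for name in ['AAA', 'BBB', 'CCC']:  # stand-ins for company_names
    # in the real loop this would be a pd.read_csv(...) per file
    new_df = pd.DataFrame({'Close': [1.0, 2.0], 'company': name})
    frames.append(new_df)

# one concat at the end instead of one per iteration
combined = pd.concat(frames, ignore_index=True)
```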

Group By

You can group data and work with the grouped result, much as you would with GROUP BY in SQL.

grouped = df.groupby('company')
grouped.get_group('AAPL')
grouped.head()  # returns the first 5 rows of each company
grouped['Volume'].sum()
company
AAPL    818342205800
Name: Volume, dtype: int64
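The sum above is the pandas analogue of `SELECT company, SUM(Volume) ... GROUP BY company`. A self-contained toy version with hypothetical numbers:

```python
import pandas as pd

prices = pd.DataFrame({'company': ['AAA', 'AAA', 'BBB'],
                       'Volume': [10, 20, 5]})
# equivalent to: SELECT company, SUM(Volume) FROM prices GROUP BY company
totals = prices.groupby('company')['Volume'].sum()
```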

Hit "Tab" after the dot to see additional attributes

grouped.
  File "<ipython-input-15-0eb7437abb25>", line 1
    grouped.
            ^
SyntaxError: invalid syntax

You can apply multiple aggregations at once

grouped['Volume'].agg(['sum', 'mean']).head()  # .agg applies several aggregations at once, passed as strings

You can apply different functions to different columns using a dictionary

grouped.agg({'Volume' : 'sum', 'Close' : 'mean'}).head()
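In pandas 0.25 and later there is also "named aggregation", which labels each output column explicitly instead of producing a MultiIndex. A sketch on toy data (not the notebook's frame):

```python
import pandas as pd

toy = pd.DataFrame({'company': ['AAA', 'AAA', 'BBB'],
                    'Volume': [10, 20, 5],
                    'Close': [1.0, 3.0, 2.0]})
# each keyword names an output column: (input column, aggregation)
summary = toy.groupby('company').agg(total_volume=('Volume', 'sum'),
                                     avg_close=('Close', 'mean'))
```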

Exercise:

  • Create a new variable "grouped2" where you group our original dataframe by company.

  • Then choose a column or two and perform several aggregations on it

grouped2 = df.groupby('company')
grouped2[['High', 'Low', 'Close']].agg(['min', 'max', 'mean']).head()

Use "Transform" to modify observations within groups

First we will randomly create NA values in the 'Close' column, and then we will set those NA values to the mean within each group.

# df.loc['row_filter', 'select columns']
df.loc[df['High'] > 50, ['Close', 'company']].head()
df.loc[np.random.choice(df.index, size = 2000), 'Close'] = np.nan
df.info()
df.loc[df['Close'].isnull(),['company','Close']]
len(df.loc[df['Close'].isnull(),['company','Close']])
## Define the function to use with transform
def get_mean(x):
    return x.fillna(x.mean())

grouped = df.groupby('company')
transformed = grouped.transform(get_mean)
# transform takes a function and applies it to each group in "grouped",
# returning a result aligned to the original index.
# This replaces writing the loop below:
# for group in grouped:
#     group = get_mean(group)
transformed.info()
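The key property of `transform` is that the result is aligned to the original index, one value per input row. A minimal reproduction of the fill-with-group-mean pattern on made-up data:

```python
import pandas as pd
import numpy as np

toy = pd.DataFrame({'company': ['AAA', 'AAA', 'BBB', 'BBB'],
                    'Close': [1.0, np.nan, 10.0, np.nan]})
# each NaN is replaced by the mean of its own company, not the overall mean
filled = toy.groupby('company')['Close'].transform(lambda x: x.fillna(x.mean()))
```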

Exercise:

  • Create a function and perform a transformation on your grouped data

A pivot works just like a pivot table in Excel

pivoted = df.pivot(index = 'Date', columns='company', values = 'Close')
pivoted.tail()
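One caveat worth knowing: `pivot` raises an error if an (index, column) pair appears more than once, while `pivot_table` handles duplicates by aggregating. A toy long-to-wide example with hypothetical dates and tickers:

```python
import pandas as pd

long_df = pd.DataFrame({'Date': ['d1', 'd1', 'd2', 'd2'],
                        'company': ['AAA', 'BBB', 'AAA', 'BBB'],
                        'Close': [1.0, 2.0, 3.0, 4.0]})
# one row per Date, one column per company
wide = long_df.pivot(index='Date', columns='company', values='Close')
```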

Check out the nulls

pivoted.info()
pivoted.isnull().sum()

Look at the nulls directly

pivoted.isna() #list(pivoted.isna()==False)

NaNs are simply skipped when we do aggregation calculations like the mean

pivoted.mean()

We can fill those in with 0's if we want

pivoted.fillna(0)
pivoted.head()

Exercise:

Fill in the nulls with another value

#pivoted.fillna(pivoted.mean(), inplace=True)
pivoted.head()

You have to add "inplace=True" (or assign the result back) for the fill to stick

#pivoted.fillna(0, inplace=True)
pivoted.head()
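The difference between the two cells above is that `fillna` returns a new object unless you either assign the result back or pass `inplace=True`. A small sketch:

```python
import pandas as pd
import numpy as np

s_df = pd.DataFrame({'x': [1.0, np.nan]})
filled_df = s_df.fillna(0)  # returns a filled copy
# s_df itself is unchanged unless you assign back or use inplace=True
```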

Time Series and Indexing

Pandas has many time series functions you can use

pivoted.index.dtype
pivoted.index = pd.to_datetime(pivoted.index)
pivoted.index
pivoted.index.dtype
pivoted['month'] = pivoted.index.month
pivoted.groupby(['month']).mean()
pivoted['year'] = pivoted.index.year
pivoted.groupby(['year','month']).mean().tail()
pivoted.dropna(inplace=True)
pivoted.pct_change()
pivoted.resample('M').mean().head()   # gives the avg. by month
pivoted.resample('W').mean().head()   # gives the avg. by week
pivoted.resample('M').last().head()   # gives the last value by month
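`resample` only needs a DatetimeIndex and a frequency alias ('M' is calendar month end, 'W' is week; newer pandas versions prefer the spelling 'ME' over 'M'). A self-contained toy example with made-up dates:

```python
import pandas as pd

idx = pd.to_datetime(['2018-01-15', '2018-01-31', '2018-02-10'])
s = pd.Series([1.0, 3.0, 5.0], index=idx)
# January averages 1.0 and 3.0; February has only 5.0
monthly = s.resample('M').mean()
```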

Exercise

Answer the following question:

In which month does Apple stock have the highest average monthly return? Which month has the lowest?
pivoted.groupby(['month']).mean().pct_change()['AAPL'].sort_values()*100

Pipes

def monthly_avg(df):
    return df.resample('M').mean()

def pct_change(df):
    return df.pct_change()

def rolling_mean(df, period):
    return df.rolling(window=period).mean()  # rolling-window function in pandas

def dropna(df):
    return df.dropna()

rolling_3m = (pivoted.pipe(dropna)
                     .pipe(monthly_avg)
                     .pipe(pct_change)
                     .pipe(rolling_mean, period=3))
rolling_3m
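`pipe` simply passes the frame as the first argument of the function, forwarding any extra keyword arguments, so chains read top to bottom. A minimal sketch with made-up order data (names like `add_tax` are illustrative only):

```python
import pandas as pd

def add_tax(df, rate):
    # df arrives as the first argument via .pipe; rate is forwarded
    return df.assign(total=df['price'] * (1 + rate))

def round_total(df):
    return df.assign(total=df['total'].round(2))

orders = pd.DataFrame({'price': [10.0, 20.0]})
result = orders.pipe(add_tax, rate=0.1).pipe(round_total)
```

Each step returns a new frame, so the original `orders` is untouched.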

Save some Memory

del df
del pivoted

Load in a New dataset

# NYC Dept. of Buildings job application dataset
df = pd.read_csv('DOB_job_filings.csv')
DtypeWarning: Columns (6,7,22,39,61,69,79,80) have mixed types. Specify dtype option on import or set low_memory=False.
df.shape
(200000, 97)

Look at the Data

df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 200000 entries, 0 to 199999
Data columns (total 97 columns):
Unnamed: 0                       200000 non-null int64
Job #                            200000 non-null int64
Doc #                            200000 non-null int64
Borough                          200000 non-null object
House #                          200000 non-null object
Street Name                      200000 non-null object
Block                            199419 non-null object
Lot                              199419 non-null object
Bin #                            200000 non-null int64
Job Type                         200000 non-null object
Job Status                       200000 non-null object
Job Status Descrp                200000 non-null object
Latest Action Date               200000 non-null object
Building Type                    200000 non-null object
Community - Board                199855 non-null float64
Cluster                          59734 non-null object
Landmarked                       191389 non-null object
Adult Estab                      163907 non-null object
Loft Board                       119834 non-null object
City Owned                       10062 non-null object
Little e                         56566 non-null object
PC Filed                         132700 non-null object
eFiling Filed                    50896 non-null object
Plumbing                         77611 non-null object
Mechanical                       29906 non-null object
Boiler                           5332 non-null object
Fuel Burning                     1252 non-null object
Fuel Storage                     681 non-null object
Standpipe                        1216 non-null object
Sprinkler                        8015 non-null object
Fire Alarm                       4287 non-null object
Equipment                        59437 non-null object
Fire Suppression                 2423 non-null object
Curb Cut                         33471 non-null object
Other                            109696 non-null object
Other Description                109656 non-null object
Applicant's First Name           199986 non-null object
Applicant's Last Name            200000 non-null object
Applicant Professional Title     199962 non-null object
Applicant License #              179187 non-null object
Professional Cert                122887 non-null object
Pre- Filing Date                 200000 non-null object
Paid                             199328 non-null object
Fully Paid                       198810 non-null object
Assigned                         135054 non-null object
Approved                         161232 non-null object
Fully Permitted                  136099 non-null object
Initial Cost                     200000 non-null object
Total Est. Fee                   200000 non-null object
Fee Status                       200000 non-null object
Existing Zoning Sqft             200000 non-null int64
Proposed Zoning Sqft             200000 non-null int64
Horizontal Enlrgmt               3838 non-null object
Vertical Enlrgmt                 2655 non-null object
Enlargement SQ Footage           200000 non-null int64
Street Frontage                  200000 non-null int64
ExistingNo. of Stories           200000 non-null int64
Proposed No. of Stories          200000 non-null int64
Existing Height                  200000 non-null int64
Proposed Height                  200000 non-null int64
Existing Dwelling Units          31755 non-null float64
Proposed Dwelling Units          96028 non-null object
Existing Occupancy               71634 non-null object
Proposed Occupancy               110700 non-null object
Site Fill                        173893 non-null object
Zoning Dist1                     147124 non-null object
Zoning Dist2                     17589 non-null object
Zoning Dist3                     853 non-null object
Special District 1               17178 non-null object
Special District 2               3309 non-null object
Owner Type                       185788 non-null object
Non-Profit                       186299 non-null object
Owner's First Name               199952 non-null object
Owner's Last Name                199964 non-null object
Owner's Business Name            152176 non-null object
Owner's House Number             199956 non-null object
Owner'sHouse Street Name         199995 non-null object
City                             199981 non-null object
State                            199973 non-null object
Zip                              199920 non-null object
Owner'sPhone #                   197069 non-null object
Job Description                  91499 non-null object
DOBRunDate                       200000 non-null object
JOB_S1_NO                        200000 non-null int64
TOTAL_CONSTRUCTION_FLOOR_AREA    200000 non-null int64
WITHDRAWAL_FLAG                  200000 non-null int64
SIGNOFF_DATE                     117103 non-null object
SPECIAL_ACTION_STATUS            184037 non-null object
SPECIAL_ACTION_DATE              16604 non-null object
BUILDING_CLASS                   197846 non-null object
JOB_NO_GOOD_COUNT                200000 non-null int64
GIS_LATITUDE                     198805 non-null float64
GIS_LONGITUDE                    198805 non-null float64
GIS_COUNCIL_DISTRICT             198805 non-null float64
GIS_CENSUS_TRACT                 198805 non-null float64
GIS_NTA_NAME                     198805 non-null object
GIS_BIN                          194686 non-null float64
dtypes: float64(7), int64(16), object(74)
memory usage: 148.0+ MB

Some of those datatypes will cause a problem for plotting

df.dtypes
Unnamed: 0                       int64
Job #                            int64
Doc #                            int64
Borough                         object
House #                         object
Street Name                     object
Block                           object
Lot                             object
Bin #                            int64
Job Type                        object
Job Status                      object
Job Status Descrp               object
Latest Action Date              object
Building Type                   object
Community - Board              float64
Cluster                         object
Landmarked                      object
Adult Estab                     object
Loft Board                      object
City Owned                      object
Little e                        object
PC Filed                        object
eFiling Filed                   object
Plumbing                        object
Mechanical                      object
Boiler                          object
Fuel Burning                    object
Fuel Storage                    object
Standpipe                       object
Sprinkler                       object
                                 ...
Zoning Dist3                    object
Special District 1              object
Special District 2              object
Owner Type                      object
Non-Profit                      object
Owner's First Name              object
Owner's Last Name               object
Owner's Business Name           object
Owner's House Number            object
Owner'sHouse Street Name        object
City                            object
State                           object
Zip                             object
Owner'sPhone #                  object
Job Description                 object
DOBRunDate                      object
JOB_S1_NO                        int64
TOTAL_CONSTRUCTION_FLOOR_AREA    int64
WITHDRAWAL_FLAG                  int64
SIGNOFF_DATE                    object
SPECIAL_ACTION_STATUS           object
SPECIAL_ACTION_DATE             object
BUILDING_CLASS                  object
JOB_NO_GOOD_COUNT                int64
GIS_LATITUDE                   float64
GIS_LONGITUDE                  float64
GIS_COUNCIL_DISTRICT           float64
GIS_CENSUS_TRACT               float64
GIS_NTA_NAME                    object
GIS_BIN                        float64
Length: 97, dtype: object

Let's look at the Initial Cost and Total Estimated Fee columns

%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns

df[['Initial Cost', 'Total Est. Fee']].head()

Convert the 'Initial Cost' and 'Total Est. Fee' columns to float

df['Initial Cost'] = df['Initial Cost'].str.replace('$', '').astype('float')
df['Total Est. Fee'] = df['Total Est. Fee'].str.replace('$', '').astype('float')
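One hedge on the cell above: in recent pandas versions `str.replace` treats the pattern as a regular expression by default, and '$' is the end-of-string anchor, so passing `regex=False` (or escaping as '\$') is the safer spelling. A sketch on made-up fee strings:

```python
import pandas as pd

fees = pd.Series(['$100.00', '$2,500.50'])
# regex=False strips the literal '$' (and ',') before converting to float
clean = (fees.str.replace('$', '', regex=False)
             .str.replace(',', '', regex=False)
             .astype('float'))
```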

Look at the Borough Column

df['Borough'].value_counts(dropna=False)
BROOKLYN         72853
QUEENS           47902
MANHATTAN        41527
STATEN ISLAND    20247
BRONX            17471
Name: Borough, dtype: int64
sns.factorplot(kind='box', x='Borough', y='Initial Cost', data=df)
plt.xticks(rotation=70)
(array([0, 1, 2, 3, 4]), <a list of 5 Text xticklabel objects>)
Image in a Jupyter notebook

Look at the scatter plot of Initial Cost vs. Total Est. Fee. What do those 0's represent?

def show_scatter():
    g = sns.FacetGrid(df, col='Borough', sharex=False, sharey=False)
    g.map(plt.scatter, 'Initial Cost', 'Total Est. Fee')

show_scatter()
Image in a Jupyter notebook
#there are a lot of values at x=0, perhaps that is all the nan values, hypothesis
cols = ['Initial Cost', 'Total Est. Fee', 'Job Type', 'Job Status',
        'Job Status Descrp', 'Latest Action Date']
df.loc[df['Initial Cost'] == 0, cols].head()
df.loc[df['Initial Cost'] == 0].describe()
df.loc[df['Initial Cost'] == 0, 'Initial Cost'] = np.nan
show_scatter()
Image in a Jupyter notebook
# this removed the zeros that were standing in for NaNs

Let's compare regression outputs

df.columns = df.columns.str.replace(' ', '_')
df.columns = df.columns.str.replace('.', '')
df.columns
Index(['Unnamed:_0', 'Job_#', 'Doc_#', 'Borough', 'House_#', 'Street_Name', 'Block', 'Lot', 'Bin_#', 'Job_Type', 'Job_Status', 'Job_Status_Descrp', 'Latest_Action_Date', 'Building_Type', 'Community_-_Board', 'Cluster', 'Landmarked', 'Adult_Estab', 'Loft_Board', 'City_Owned', 'Little_e', 'PC_Filed', 'eFiling_Filed', 'Plumbing', 'Mechanical', 'Boiler', 'Fuel_Burning', 'Fuel_Storage', 'Standpipe', 'Sprinkler', 'Fire_Alarm', 'Equipment', 'Fire_Suppression', 'Curb_Cut', 'Other', 'Other_Description', 'Applicant's_First_Name', 'Applicant's_Last_Name', 'Applicant_Professional_Title', 'Applicant_License_#', 'Professional_Cert', 'Pre-_Filing_Date', 'Paid', 'Fully_Paid', 'Assigned', 'Approved', 'Fully_Permitted', 'Initial_Cost', 'Total_Est_Fee', 'Fee_Status', 'Existing_Zoning_Sqft', 'Proposed_Zoning_Sqft', 'Horizontal_Enlrgmt', 'Vertical_Enlrgmt', 'Enlargement_SQ_Footage', 'Street_Frontage', 'ExistingNo_of_Stories', 'Proposed_No_of_Stories', 'Existing_Height', 'Proposed_Height', 'Existing_Dwelling_Units', 'Proposed_Dwelling_Units', 'Existing_Occupancy', 'Proposed_Occupancy', 'Site_Fill', 'Zoning_Dist1', 'Zoning_Dist2', 'Zoning_Dist3', 'Special_District_1', 'Special_District_2', 'Owner_Type', 'Non-Profit', 'Owner's_First_Name', 'Owner's_Last_Name', 'Owner's_Business_Name', 'Owner's_House_Number', 'Owner'sHouse_Street_Name', 'City_', 'State', 'Zip', 'Owner'sPhone_#', 'Job_Description', 'DOBRunDate', 'JOB_S1_NO', 'TOTAL_CONSTRUCTION_FLOOR_AREA', 'WITHDRAWAL_FLAG', 'SIGNOFF_DATE', 'SPECIAL_ACTION_STATUS', 'SPECIAL_ACTION_DATE', 'BUILDING_CLASS', 'JOB_NO_GOOD_COUNT', 'GIS_LATITUDE', 'GIS_LONGITUDE', 'GIS_COUNCIL_DISTRICT', 'GIS_CENSUS_TRACT', 'GIS_NTA_NAME', 'GIS_BIN'], dtype='object')
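The same regex caveat from the fee cleanup applies to the column renaming above: '.' is a regex wildcard, so `regex=False` (or '\.') makes the intent explicit in newer pandas. A toy version using two of the real column names:

```python
import pandas as pd

cols = pd.Index(['Total Est. Fee', 'Job #'])
# replace spaces with underscores, then strip literal dots
clean_cols = (cols.str.replace(' ', '_', regex=False)
                  .str.replace('.', '', regex=False))
```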
import statsmodels.formula.api as smf

def regress(df, borough):
    return (borough,
            smf.ols('Total_Est_Fee ~ Initial_Cost', data=df, missing='drop').fit())

separate_regressions = [i for i in df.groupby('Borough')
                        .apply(lambda x: regress(x, x['Borough'].unique()))]
# the above replaces writing the loop below:
# for group in df.groupby('Borough'):
#     regress(group)
separate_regressions
[(array(['BRONX'], dtype=object),
  <statsmodels.regression.linear_model.RegressionResultsWrapper at 0x10931eb00>),
 (array(['BROOKLYN'], dtype=object),
  <statsmodels.regression.linear_model.RegressionResultsWrapper at 0x109422518>),
 (array(['MANHATTAN'], dtype=object),
  <statsmodels.regression.linear_model.RegressionResultsWrapper at 0x109304d30>),
 (array(['QUEENS'], dtype=object),
  <statsmodels.regression.linear_model.RegressionResultsWrapper at 0x109422c18>),
 (array(['STATEN ISLAND'], dtype=object),
  <statsmodels.regression.linear_model.RegressionResultsWrapper at 0x10f97e630>)]
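The per-group regression pattern does not depend on statsmodels: any function applied with `groupby(...).apply` runs once per group. A dependency-light sketch using numpy's least-squares fit on made-up, perfectly linear data:

```python
import pandas as pd
import numpy as np

toy = pd.DataFrame({'g': ['a', 'a', 'a', 'b', 'b', 'b'],
                    'x': [1.0, 2.0, 3.0, 1.0, 2.0, 3.0],
                    'y': [2.0, 4.0, 6.0, 3.0, 6.0, 9.0]})
# one least-squares slope per group, analogous to one OLS fit per borough
slopes = toy.groupby('g').apply(lambda d: np.polyfit(d['x'], d['y'], 1)[0])
```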
for i in separate_regressions:
    print(i[0])
    print(i[1].summary())
['BRONX']
                            OLS Regression Results
==============================================================================
Dep. Variable:          Total_Est_Fee   R-squared:                       0.914
Model:                            OLS   Adj. R-squared:                  0.914
Method:                 Least Squares   F-statistic:                 3.575e+04
Date:                Wed, 23 May 2018   Prob (F-statistic):               0.00
Time:                        20:10:56   Log-Likelihood:                -33004.
No. Observations:                3374   AIC:                         6.601e+04
Df Residuals:                    3372   BIC:                         6.602e+04
Df Model:                           1
Covariance Type:            nonrobust
================================================================================
                   coef    std err          t      P>|t|      [0.025      0.975]
--------------------------------------------------------------------------------
Intercept     -329.1929     75.125     -4.382      0.000    -476.488    -181.898
Initial_Cost     0.0125    6.6e-05    189.083      0.000       0.012       0.013
==============================================================================
Omnibus:                     7979.867   Durbin-Watson:                   1.994
Prob(Omnibus):                  0.000   Jarque-Bera (JB):        252323646.504
Skew:                          22.723   Prob(JB):                         0.00
Kurtosis:                    1341.943   Cond. No.                     1.16e+06
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 1.16e+06. This might indicate that there are
strong multicollinearity or other numerical problems.

['BROOKLYN']
                            OLS Regression Results
==============================================================================
Dep. Variable:          Total_Est_Fee   R-squared:                       0.796
Model:                            OLS   Adj. R-squared:                  0.796
Method:                 Least Squares   F-statistic:                 1.220e+05
Date:                Wed, 23 May 2018   Prob (F-statistic):               0.00
Time:                        20:10:56   Log-Likelihood:             -3.1998e+05
No. Observations:               31186   AIC:                         6.400e+05
Df Residuals:                   31184   BIC:                         6.400e+05
Df Model:                           1
Covariance Type:            nonrobust
================================================================================
                   coef    std err          t      P>|t|      [0.025      0.975]
--------------------------------------------------------------------------------
Intercept      215.1242     39.312      5.472      0.000     138.071     292.177
Initial_Cost     0.0088   2.53e-05    349.307      0.000       0.009       0.009
==============================================================================
Omnibus:                   119771.922   Durbin-Watson:                   1.996
Prob(Omnibus):                  0.000   Jarque-Bera (JB):     302567458230.073
Skew:                         -89.393   Prob(JB):                         0.00
Kurtosis:                   15261.333   Cond. No.                     1.56e+06
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 1.56e+06. This might indicate that there are
strong multicollinearity or other numerical problems.

['MANHATTAN']
                            OLS Regression Results
==============================================================================
Dep. Variable:          Total_Est_Fee   R-squared:                       0.967
Model:                            OLS   Adj. R-squared:                  0.967
Method:                 Least Squares   F-statistic:                 7.302e+05
Date:                Wed, 23 May 2018   Prob (F-statistic):               0.00
Time:                        20:10:56   Log-Likelihood:             -2.4991e+05
No. Observations:               24608   AIC:                         4.998e+05
Df Residuals:                   24606   BIC:                         4.998e+05
Df Model:                           1
Covariance Type:            nonrobust
================================================================================
                   coef    std err          t      P>|t|      [0.025      0.975]
--------------------------------------------------------------------------------
Intercept       90.0480     39.839      2.260      0.024      11.962     168.134
Initial_Cost     0.0105   1.23e-05    854.495      0.000       0.010       0.011
==============================================================================
Omnibus:                    66181.946   Durbin-Watson:                   2.004
Prob(Omnibus):                  0.000   Jarque-Bera (JB):      34443300750.119
Skew:                          31.456   Prob(JB):                         0.00
Kurtosis:                    5798.545   Cond. No.                     3.26e+06
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 3.26e+06. This might indicate that there are
strong multicollinearity or other numerical problems.

['QUEENS']
                            OLS Regression Results
==============================================================================
Dep. Variable:          Total_Est_Fee   R-squared:                       0.962
Model:                            OLS   Adj. R-squared:                  0.962
Method:                 Least Squares   F-statistic:                 2.236e+05
Date:                Wed, 23 May 2018   Prob (F-statistic):               0.00
Time:                        20:10:56   Log-Likelihood:                -81202.
No. Observations:                8785   AIC:                         1.624e+05
Df Residuals:                    8783   BIC:                         1.624e+05
Df Model:                           1
Covariance Type:            nonrobust
================================================================================
                   coef    std err          t      P>|t|      [0.025      0.975]
--------------------------------------------------------------------------------
Intercept        5.4920     26.893      0.204      0.838     -47.224      58.208
Initial_Cost     0.0100   2.12e-05    472.863      0.000       0.010       0.010
==============================================================================
Omnibus:                     5206.919   Durbin-Watson:                   1.989
Prob(Omnibus):                  0.000   Jarque-Bera (JB):       4165718100.760
Skew:                           0.308   Prob(JB):                         0.00
Kurtosis:                    3376.492   Cond. No.                     1.28e+06
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 1.28e+06. This might indicate that there are
strong multicollinearity or other numerical problems.

['STATEN ISLAND']
                            OLS Regression Results
==============================================================================
Dep. Variable:          Total_Est_Fee   R-squared:                       0.998
Model:                            OLS   Adj. R-squared:                  0.998
Method:                 Least Squares   F-statistic:                 9.305e+05
Date:                Wed, 23 May 2018   Prob (F-statistic):               0.00
Time:                        20:10:56   Log-Likelihood:                -15273.
No. Observations:                1969   AIC:                         3.055e+04
Df Residuals:                    1967   BIC:                         3.056e+04
Df Model:                           1
Covariance Type:            nonrobust
================================================================================
                   coef    std err          t      P>|t|      [0.025      0.975]
--------------------------------------------------------------------------------
Intercept      -41.4691     12.821     -3.234      0.001     -66.613     -16.325
Initial_Cost     0.0103   1.07e-05    964.608      0.000       0.010       0.010
==============================================================================
Omnibus:                     1809.142   Durbin-Watson:                   1.896
Prob(Omnibus):                  0.000   Jarque-Bera (JB):           673436.226
Skew:                          -3.473   Prob(JB):                         0.00
Kurtosis:                      93.334   Cond. No.                     1.21e+06
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 1.21e+06. This might indicate that there are
strong multicollinearity or other numerical problems.
[(i[0], i[1].params[1]) for i in separate_regressions]
[(array(['BRONX'], dtype=object), 0.012478883889366172),
 (array(['BROOKLYN'], dtype=object), 0.008849756985075879),
 (array(['MANHATTAN'], dtype=object), 0.010478379863994247),
 (array(['QUEENS'], dtype=object), 0.010001338677093165),
 (array(['STATEN ISLAND'], dtype=object), 0.010302690934398567)]
pd.DataFrame([(i[0], round(i[1].params[1], 4)) for i in separate_regressions],
             columns=['Borough', 'Beta'])