Path: blob/master/lessons/Data Cleaning & Preprocessing.ipynb
Kernel: Python 3
groupby
In [24]:
In [25]:
In [26]:
In [27]:
In [28]:
data information
In [29]:
Out[29]:
In [30]:
Out[30]:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 15 columns):
survived 891 non-null int64
pclass 891 non-null int64
sex 891 non-null object
age 714 non-null float64
sibsp 891 non-null int64
parch 891 non-null int64
fare 891 non-null float64
embarked 889 non-null object
class 891 non-null category
who 891 non-null object
adult_male 891 non-null bool
deck 203 non-null category
embark_town 889 non-null object
alive 891 non-null object
alone 891 non-null bool
dtypes: bool(2), category(2), float64(2), int64(4), object(5)
memory usage: 80.6+ KB
In [31]:
Out[31]:
In [32]:
Out[32]:
['survived',
'pclass',
'sex',
'age',
'sibsp',
'parch',
'fare',
'embarked',
'class',
'who',
'adult_male',
'deck',
'embark_town',
'alive',
'alone']
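The cell sources for this section are not shown, so here is a minimal sketch of how output like the above is typically produced. The small frame is a hypothetical stand-in for the Titanic data (values are made up), and the variable names are assumptions.

```python
import numpy as np
import pandas as pd

# Tiny hypothetical stand-in for the Titanic frame (values are made up).
df = pd.DataFrame({
    "survived": [0, 1, 1, 0, 1],
    "sex": ["male", "female", "female", "male", "female"],
    "age": [22.0, 38.0, np.nan, 35.0, np.nan],
    "embarked": ["S", "C", "S", None, "Q"],
})

df.info()                         # dtypes and non-null counts per column
columns = df.columns.tolist()     # column names as a plain Python list
null_counts = df.isnull().sum()   # number of missing values per column

print(columns)
print(null_counts)
```

Comparing each column's non-null count with the total number of entries is a quick way to spot which columns need attention; in the real data above those are age, deck, embarked, and embark_town.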
handling null values
filtering
In [66]:
In [72]:
Out[72]:
In [69]:
Out[69]:
714
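The 714 above matches the non-null count for age in the info() output, so the cell presumably counted rows with a known age. A minimal sketch of that kind of filtering with a boolean mask, on a hypothetical toy frame:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; two of the five ages are missing.
df = pd.DataFrame({
    "age": [22.0, np.nan, 26.0, np.nan, 35.0],
    "fare": [7.25, 71.28, 7.92, 53.10, 8.05],
})

has_age = df["age"].notnull()   # boolean mask: True where age is present
with_age = df[has_age]          # rows with a known age
missing_age = df[~has_age]      # rows where age is missing

print(len(with_age), len(missing_age))
```

Filtering leaves the original frame untouched, which makes it a safe way to look at the missing rows before deciding how to handle them.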
dropping
In [36]:
In [37]:
In [38]:
Out[38]:
182
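The 182 above is presumably the number of rows left after dropping every row that has any missing value. A hedged sketch of the common `dropna` variants, on a hypothetical toy frame:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with gaps in two columns.
df = pd.DataFrame({
    "age": [22.0, np.nan, 26.0, 35.0],
    "deck": ["C", "E", None, None],
    "fare": [7.25, 71.28, 7.92, 53.10],
})

complete_rows = df.dropna()                # drop rows with ANY missing value
age_known = df.dropna(subset=["age"])      # drop rows only where age is missing
dense_columns = df.dropna(axis=1)          # drop COLUMNS that contain missing values

print(len(complete_rows), len(age_known), dense_columns.columns.tolist())
```

Dropping every incomplete row is costly when one sparse column (like deck here) is responsible for most of the gaps; restricting with `subset` or dropping the sparse column first keeps far more data.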
imputing
Missing values can be filled in with the column's mean, median, or mode. For categorical data, only the mode applies.
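A minimal sketch of the three fill strategies using `fillna`, on a hypothetical toy frame (the column names mirror the Titanic data, the values are made up). Note that `mode()` returns a Series, as the `0 24.0` output further down shows, so its first element is taken with `[0]`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: a numeric and a categorical column with gaps.
df = pd.DataFrame({
    "age": [20.0, 24.0, np.nan, 40.0, 24.0, np.nan],
    "embarked": ["S", "C", "S", None, "S", "Q"],
})

mean_age = df["age"].mean()        # 27.0 (missing values are skipped)
median_age = df["age"].median()    # 24.0
mode_age = df["age"].mode()[0]     # 24.0; mode() returns a Series

# Fill into new columns so the three strategies can be compared side by side.
df["age_mean"] = df["age"].fillna(mean_age)
df["age_median"] = df["age"].fillna(median_age)
df["age_mode"] = df["age"].fillna(mode_age)

# Categorical column: only the mode is meaningful.
df["embarked_filled"] = df["embarked"].fillna(df["embarked"].mode()[0])
```

The median is usually the safer choice for skewed numeric columns, since a few extreme values pull the mean but not the median.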
In [39]:
In [40]:
Out[40]:
29.69911764705882
In [41]:
In [42]:
Out[42]:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 15 columns):
survived 891 non-null int64
pclass 891 non-null int64
sex 891 non-null object
age 891 non-null float64
sibsp 891 non-null int64
parch 891 non-null int64
fare 891 non-null float64
embarked 889 non-null object
class 891 non-null category
who 891 non-null object
adult_male 891 non-null bool
deck 203 non-null category
embark_town 889 non-null object
alive 891 non-null object
alone 891 non-null bool
dtypes: bool(2), category(2), float64(2), int64(4), object(5)
memory usage: 80.6+ KB
In [46]:
In [45]:
Out[45]:
28.0
In [47]:
In [49]:
Out[49]:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 15 columns):
survived 891 non-null int64
pclass 891 non-null int64
sex 891 non-null object
age 891 non-null float64
sibsp 891 non-null int64
parch 891 non-null int64
fare 891 non-null float64
embarked 889 non-null object
class 891 non-null category
who 891 non-null object
adult_male 891 non-null bool
deck 203 non-null category
embark_town 889 non-null object
alive 891 non-null object
alone 891 non-null bool
dtypes: bool(2), category(2), float64(2), int64(4), object(5)
memory usage: 80.6+ KB
In [50]:
In [51]:
Out[51]:
0 24.0
dtype: float64
In [52]:
In [53]:
Out[53]:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 15 columns):
survived 891 non-null int64
pclass 891 non-null int64
sex 891 non-null object
age 891 non-null float64
sibsp 891 non-null int64
parch 891 non-null int64
fare 891 non-null float64
embarked 889 non-null object
class 891 non-null category
who 891 non-null object
adult_male 891 non-null bool
deck 203 non-null category
embark_town 889 non-null object
alive 891 non-null object
alone 891 non-null bool
dtypes: bool(2), category(2), float64(2), int64(4), object(5)
memory usage: 80.6+ KB
For next time: add a more sophisticated method that predicts each missing value from the other columns, using logistic regression for a categorical column, linear regression for a continuous one, or a random forest for either.
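One way the model-based idea could look, as a sketch only: train a random forest on the rows where age is known, then predict age for the rows where it is missing. The data is synthetic, and the choice of predictor columns and model settings are assumptions, not part of the lesson.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Synthetic, hypothetical data: age loosely depends on passenger class.
rng = np.random.default_rng(0)
n = 200
df = pd.DataFrame({
    "pclass": rng.integers(1, 4, size=n),
    "sibsp": rng.integers(0, 3, size=n),
    "fare": rng.uniform(5, 100, size=n),
})
df["age"] = 50 - 8 * df["pclass"] + rng.normal(0, 5, size=n)

# Knock out 40 ages to simulate missing values.
df.loc[rng.choice(n, size=40, replace=False), "age"] = np.nan

features = ["pclass", "sibsp", "fare"]     # assumed predictor columns
known = df[df["age"].notnull()]
unknown = df[df["age"].isnull()]

# Fit on rows with a known age, then predict the missing ones.
model = RandomForestRegressor(n_estimators=100, random_state=0)
model.fit(known[features], known["age"])
df.loc[unknown.index, "age"] = model.predict(unknown[features])
```

For a categorical target the same pattern applies with a classifier (for example `LogisticRegression` or `RandomForestClassifier`) in place of the regressor.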
data distribution
TO DO:
Transformers: CountVectorizer (use instead of get_dummies?)
Randomized search: like grid search, but samples parameter combinations at random instead of trying every one
TF-IDF
NLP
StandardScaler and imputer
spaCy
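As a preview of the "StandardScaler and imputer" item, here is a hedged sketch that chains the two in a scikit-learn `Pipeline`. It assumes a recent scikit-learn, where the imputer is `sklearn.impute.SimpleImputer`; the array values are made up.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical numeric features (age, fare) with missing entries.
X = np.array([
    [22.0, 7.25],
    [np.nan, 71.28],
    [26.0, 7.92],
    [35.0, np.nan],
])

prep = Pipeline([
    ("impute", SimpleImputer(strategy="median")),  # fill gaps with the column median
    ("scale", StandardScaler()),                   # zero mean, unit variance per column
])

X_clean = prep.fit_transform(X)
print(X_clean)
```

Wrapping both steps in one pipeline means the medians and scaling statistics are learned from the training data only and then reused unchanged on new data via `prep.transform`.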
In [2]:
Out[2]:
2.0.11