GitHub Repository: YStrano/DataScience_GA
Path: blob/master/lessons/Data Cleaning & Preprocessing.ipynb
Kernel: Python 3

groupby

import seaborn as sns
titanic_df = sns.load_dataset('titanic')
flights_df = sns.load_dataset('flights')
exercise_df = sns.load_dataset('exercise')
planets_df = sns.load_dataset('planets')

data information

titanic_df.head()
titanic_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 15 columns):
survived       891 non-null int64
pclass         891 non-null int64
sex            891 non-null object
age            714 non-null float64
sibsp          891 non-null int64
parch          891 non-null int64
fare           891 non-null float64
embarked       889 non-null object
class          891 non-null category
who            891 non-null object
adult_male     891 non-null bool
deck           203 non-null category
embark_town    889 non-null object
alive          891 non-null object
alone          891 non-null bool
dtypes: bool(2), category(2), float64(2), int64(4), object(5)
memory usage: 80.6+ KB
titanic_df.describe()
list(titanic_df.columns)
['survived', 'pclass', 'sex', 'age', 'sibsp', 'parch', 'fare', 'embarked', 'class', 'who', 'adult_male', 'deck', 'embark_town', 'alive', 'alone']

handling null values

filtering

titanic_df_filter = titanic_df.copy()
titanic_df_filter.isna()   # True where a value is null
titanic_df_filter.notna()  # False where a value is null
len(titanic_df_filter[titanic_df_filter['age'].notna()])  # number of rows where age is not null
714
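Before filtering, it often helps to count the nulls in each column. The sketch below uses a small made-up frame (toy_df and its values are assumptions, not the titanic data) so it runs without downloading anything.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the titanic frame; the values are invented for illustration.
toy_df = pd.DataFrame({
    "age": [22.0, np.nan, 26.0, np.nan],
    "fare": [7.25, 71.28, 7.92, 53.10],
    "deck": ["C", None, None, None],
})

null_counts = toy_df.isna().sum()          # number of nulls in each column
null_share = toy_df.isna().mean()          # fraction of nulls in each column
age_known = toy_df[toy_df["age"].notna()]  # rows where age is present

print(null_counts)
print(null_share)
print(len(age_known))
```

The same three lines should work unchanged on titanic_df_filter.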

dropping

titanic_df_drop = titanic_df.copy()
titanic_df_drop.dropna(inplace=True)  # drops every row that has a null in any column
len(titanic_df_drop)
182
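Only 182 of the 891 rows survive because dropna() removes a row if any column is null, and deck is non-null in just 203 rows. A possible alternative is to restrict the check to the columns you care about with the subset parameter. The toy frame below is an assumption made for illustration.

```python
import numpy as np
import pandas as pd

# Toy frame: age is missing in one row, deck is missing in two.
toy_df = pd.DataFrame({
    "age": [22.0, np.nan, 26.0, 35.0],
    "deck": ["C", "E", None, None],
})

all_complete = toy_df.dropna()                # keeps rows with no nulls at all
age_complete = toy_df.dropna(subset=["age"])  # only checks the age column

print(len(all_complete), len(age_complete))
```

Neither call uses inplace=True, so toy_df itself is left untouched.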

imputing

you can fill missing values with the column's mean, median or mode (for categorical data only the mode is defined)

titanic_df_mean = titanic_df.copy()
titanic_df_mean['age'].mean()
29.69911764705882
titanic_df_mean.loc[titanic_df_mean['age'].isna(), 'age'] = titanic_df_mean['age'].mean()
titanic_df_mean.info() #no more null values for age
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 15 columns):
survived       891 non-null int64
pclass         891 non-null int64
sex            891 non-null object
age            891 non-null float64
sibsp          891 non-null int64
parch          891 non-null int64
fare           891 non-null float64
embarked       889 non-null object
class          891 non-null category
who            891 non-null object
adult_male     891 non-null bool
deck           203 non-null category
embark_town    889 non-null object
alive          891 non-null object
alone          891 non-null bool
dtypes: bool(2), category(2), float64(2), int64(4), object(5)
memory usage: 80.6+ KB
titanic_df_median = titanic_df.copy()
titanic_df_median['age'].median()
28.0
titanic_df_median.loc[titanic_df_median['age'].isna(), 'age'] = titanic_df_median['age'].median()
titanic_df_median.info() #no more null values for age
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 15 columns):
survived       891 non-null int64
pclass         891 non-null int64
sex            891 non-null object
age            891 non-null float64
sibsp          891 non-null int64
parch          891 non-null int64
fare           891 non-null float64
embarked       889 non-null object
class          891 non-null category
who            891 non-null object
adult_male     891 non-null bool
deck           203 non-null category
embark_town    889 non-null object
alive          891 non-null object
alone          891 non-null bool
dtypes: bool(2), category(2), float64(2), int64(4), object(5)
memory usage: 80.6+ KB
titanic_df_mode = titanic_df.copy()
titanic_df_mode['age'].mode()
0 24.0 dtype: float64
titanic_df_mode.loc[titanic_df_mode['age'].isna(), 'age'] = titanic_df_mode['age'].mode()[0]  # mode() returns a Series, so take its first value
titanic_df_mode.info() #no more null values for age
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 15 columns):
survived       891 non-null int64
pclass         891 non-null int64
sex            891 non-null object
age            891 non-null float64
sibsp          891 non-null int64
parch          891 non-null int64
fare           891 non-null float64
embarked       889 non-null object
class          891 non-null category
who            891 non-null object
adult_male     891 non-null bool
deck           203 non-null category
embark_town    889 non-null object
alive          891 non-null object
alone          891 non-null bool
dtypes: bool(2), category(2), float64(2), int64(4), object(5)
memory usage: 80.6+ KB
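The cells above only impute age, which is numeric. For a categorical column such as embarked, the mode is the only one of the three statistics that applies. A minimal sketch on a made-up column (the values are assumptions, not the real data):

```python
import pandas as pd

# Toy categorical column with two missing entries.
toy_df = pd.DataFrame({"embarked": ["S", "C", "S", None, "Q", None]})

# mode() ignores nulls and returns a Series; [0] picks the most frequent value.
most_common = toy_df["embarked"].mode()[0]
toy_df["embarked"] = toy_df["embarked"].fillna(most_common)

print(most_common)
print(toy_df["embarked"].tolist())
```

If several values tie for most frequent, mode() returns all of them and [0] takes the first in sorted order, so the choice is somewhat arbitrary in that case.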

for next time, add a more sophisticated method: predict the missing values with a model, e.g. logistic regression for a categorical column, linear regression for a continuous one, or a random forest for either
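One way the model-based idea could look for a continuous column is sketched below, using scikit-learn's LinearRegression. The toy frame, the choice of fare as the only predictor, and the perfectly linear values are all assumptions made so the result is easy to check; this is not the notebook's method.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Toy data: in the known rows, age happens to equal fare + 10.
toy_df = pd.DataFrame({
    "fare": [10.0, 20.0, 30.0, 40.0, 50.0],
    "age": [20.0, 30.0, np.nan, 50.0, np.nan],
})

known = toy_df[toy_df["age"].notna()]    # rows used to fit the model
missing = toy_df[toy_df["age"].isna()]   # rows whose age will be predicted

model = LinearRegression().fit(known[["fare"]], known["age"])
toy_df.loc[missing.index, "age"] = model.predict(missing[["fare"]])

print(toy_df)
```

On real data you would likely use more predictors, and those predictors must themselves be free of nulls in the rows being predicted.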

data distribution

TO DO:

Transformers: CountVectorizer (use instead of get_dummies?)

randomized search: like grid search, but it samples parameter combinations at random instead of trying every one

TFIDF

NLP

StandardScaler and imputer
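As a starting point for the StandardScaler and imputer item, the sketch below chains scikit-learn's SimpleImputer and StandardScaler in a pipeline. The single-column array and the median strategy are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# One numeric feature with a missing value in the middle row.
X = np.array([[1.0], [np.nan], [3.0]])

# Fill nulls with the column median, then scale to zero mean and unit variance.
pipeline = make_pipeline(SimpleImputer(strategy="median"), StandardScaler())
X_scaled = pipeline.fit_transform(X)

print(X_scaled)
```

Putting both steps in one pipeline means the median and the scaling statistics are learned from the training data only and can then be reused on new data with pipeline.transform.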

SpaCy

import spacy
from spacy import displacy
print(spacy.__version__)
nlp = spacy.load('en')  # the 'en' shortcut works in spaCy 2.x; spaCy 3 needs a full model name such as 'en_core_web_sm'
2.0.11