YStrano
GitHub Repository: YStrano/DataScience_GA
Path: blob/master/lessons/lesson_11/Titanic Data Set (done).ipynb
Kernel: Python 3

Import Titanic Data Set

import numpy as np
import pandas as pd
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import RandomForestClassifier

# read Pclass as strings so get_dummies later treats it as categorical
df = pd.read_csv("data/titanic.csv", dtype={'Pclass': object})
df.drop(['PassengerId', 'Name', 'Ticket', 'Cabin'], inplace=True, axis=1)
df.dropna(inplace=True)
print(df.head())
   Survived Pclass     Sex   Age  SibSp  Parch     Fare Embarked
0         0      3    male  22.0      1      0   7.2500        S
1         1      1  female  38.0      1      0  71.2833        C
2         1      3  female  26.0      0      0   7.9250        S
3         1      1  female  35.0      1      0  53.1000        S
4         0      3    male  35.0      0      0   8.0500        S

Use pd.get_dummies to generate a few features out of the Pclass, Sex, SibSp, and Parch columns.

Data documentation is here: https://www.kaggle.com/c/titanic/data. pd.get_dummies documentation is here: https://pandas.pydata.org/pandas-docs/stable/generated/pandas.get_dummies.html
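Before applying this to the full data set, it can help to see what pd.get_dummies does on a tiny frame. The toy columns below are illustrative, not from the Titanic file; drop_first=True drops one dummy per categorical column to avoid redundant (perfectly collinear) features.

```python
import pandas as pd

# tiny illustrative frame (not the real data)
toy = pd.DataFrame({'Sex': ['male', 'female', 'female'],
                    'Pclass': ['1', '3', '3']})

# one dummy column per category level, minus the first level of each column
print(pd.get_dummies(toy, drop_first=True))
```

With drop_first=True, 'Sex_female' and 'Pclass_1' are dropped, leaving only 'Sex_male' and 'Pclass_3'; a row of all zeros then encodes the dropped levels.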

X = df.drop('Survived', axis=1)
X = pd.get_dummies(X, drop_first=True)  # drop_first avoids redundant dummy columns
y = df.Survived
print(X.head())
    Age  SibSp  Parch     Fare  Pclass_2  Pclass_3  Sex_male  Embarked_Q  Embarked_S
0  22.0      1      0   7.2500         0         1         1           0           1
1  38.0      1      0  71.2833         0         0         0           0           0
2  26.0      0      0   7.9250         0         1         0           0           1
3  35.0      1      0  53.1000         0         0         0           0           1
4  35.0      0      0   8.0500         0         1         1           0           1

Reduce your data set to the features you want, and then create a train/test split.

from sklearn.model_selection import train_test_split
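The cell above imports train_test_split but never calls it. A minimal sketch of how the split would look is below; the toy X and y stand in for the real feature matrix and target, and test_size=0.25 is just one reasonable choice.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier

# toy stand-ins for the Titanic features and target (illustrative only)
X = pd.DataFrame({'Age': [22, 38, 26, 35, 35, 54],
                  'Sex_male': [1, 0, 0, 0, 1, 1]})
y = pd.Series([0, 1, 1, 1, 0, 0])

# hold out 25% of the rows for evaluation; random_state makes the split repeatable
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42)

rfc = RandomForestClassifier(random_state=42)
rfc.fit(X_train, y_train)
print(rfc.score(X_test, y_test))  # accuracy on the held-out rows
```

Fitting on X_train and scoring on X_test gives an honest estimate, unlike the in-sample report further down.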
rfc = RandomForestClassifier()
# 50-fold CV on ~712 rows leaves only ~14 rows per fold; cv=5 or 10 is more typical
scores = cross_val_score(rfc, X, y, cv=50)
print(np.average(scores))
# print(scores)
0.8032893772893773
from sklearn.metrics import classification_report

# note: classification_report's signature is (y_true, y_pred); the arguments
# here are reversed, which swaps the precision and recall columns in the output.
# this is also an in-sample report, since the model is fit and scored on the same data
rfc.fit(X, y)
print(classification_report(rfc.predict(X), y))
             precision    recall  f1-score   support

          0       0.98      0.97      0.97       431
          1       0.95      0.97      0.96       281

avg / total       0.97      0.97      0.97       712

Run a Random Forest Classifier to predict survival

Review the random forest documentation here.

  1. What are the main hyperparameters you have to tune for a decision tree?

  2. What are the main hyperparameters you have to tune for a random forest?

Use GridSearchCV or cross_val_score to tune your parameters. What should your target metric be?
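One way to approach the tuning exercise is a small grid search over a few of the random forest hyperparameters mentioned in the questions above. The synthetic data and the particular grid values below are illustrative choices, not part of the assignment; with the real Titanic data you would pass the X and y built earlier.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# small synthetic stand-in for the Titanic features (illustrative only)
X, y = make_classification(n_samples=200, n_features=8, random_state=0)

# a few common random-forest hyperparameters to search over
param_grid = {
    'n_estimators': [50, 100],       # number of trees
    'max_depth': [3, 5, None],       # tree depth limit
    'min_samples_leaf': [1, 5],      # minimum samples per leaf
}

grid = GridSearchCV(RandomForestClassifier(random_state=0),
                    param_grid, cv=5, scoring='accuracy')
grid.fit(X, y)

print(grid.best_params_)   # best combination found
print(grid.best_score_)    # mean cross-validated accuracy of that combination
```

Accuracy is the default metric here, but if one class is much rarer, a metric such as F1 or ROC AUC (via the scoring parameter) is usually a better target.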