Path: blob/master/lessons/lesson_11/Titanic Data Set (done).ipynb
Kernel: Python 3
Import Titanic Data Set
In [1]:
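(The body of this cell was not captured in the export. A minimal sketch of the likely imports for the rest of the notebook; the exact set is an assumption.)

# Likely imports (reconstructed; the original cell body was not captured)
import pandas as pd
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report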
In [2]:
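(A plausible reconstruction of the load step, consistent with the columns shown in Out[2]; the file name 'train.csv' is an assumption.)

# Load the Kaggle Titanic training data and keep the columns shown below
# (the file name 'train.csv' is an assumption)
df = pd.read_csv('train.csv')
df = df[['Survived', 'Pclass', 'Sex', 'Age', 'SibSp', 'Parch', 'Fare', 'Embarked']]
df.head()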
Out[2]:
Survived Pclass Sex Age SibSp Parch Fare Embarked
0 0 3 male 22.0 1 0 7.2500 S
1 1 1 female 38.0 1 0 71.2833 C
2 1 3 female 26.0 0 0 7.9250 S
3 1 1 female 35.0 1 0 53.1000 S
4 0 3 male 35.0 0 0 8.0500 S
Use pd.get_dummies to generate a few features out of the Pclass, Sex, SibSp, and Parch columns.
Data documentation: https://www.kaggle.com/c/titanic/data. pd.get_dummies documentation: https://pandas.pydata.org/pandas-docs/stable/generated/pandas.get_dummies.html
In [3]:
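(The dummy columns in Out[3] — Pclass_2, Pclass_3, Sex_male, Embarked_Q, Embarked_S — are consistent with drop_first=True; a sketch of the likely cell.)

# One-hot encode the categorical columns; drop_first=True leaves one
# reference level out of each, matching the columns shown in Out[3]
features = pd.get_dummies(df.drop('Survived', axis=1),
                          columns=['Pclass', 'Sex', 'Embarked'],
                          drop_first=True)
features.head()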
Out[3]:
   Age  SibSp  Parch     Fare  Pclass_2  Pclass_3  Sex_male  Embarked_Q  Embarked_S
0  22.0      1      0   7.2500         0         1         1           0           1
1  38.0      1      0  71.2833         0         0         0           0           0
2  26.0      0      0   7.9250         0         1         0           0           1
3  35.0      1      0  53.1000         0         0         0           0           1
4  35.0      0      0   8.0500         0         1         1           0           1
Reduce your data set to the features you want, and then create a train/test split
In [4]:
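(One plausible reconstruction: dropping rows with missing values shrinks the 891-row data set to 712 rows, which matches the support total in the classification report below; random_state=42 is an assumption.)

# Drop rows with missing Age/Embarked (891 rows -> 712), rebuild the
# dummy features, and split into train and test sets
data = df.dropna()
X = pd.get_dummies(data.drop('Survived', axis=1),
                   columns=['Pclass', 'Sex', 'Embarked'], drop_first=True)
y = data['Survived']
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)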
In [5]:
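(The 0.803 in Out[5] looks like a cross-validated accuracy; a sketch of one way it could have been produced. The estimator and cv=5 are assumptions.)

# Cross-validated accuracy for a baseline model
# (LogisticRegression and cv=5 are assumptions)
model = LogisticRegression()
cross_val_score(model, X_train, y_train, cv=5).mean()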
Out[5]:
0.8032893772893773
In [6]:
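(The support column in Out[6] — 431 + 281 = 712 — matches the full cleaned data set rather than a held-out split, and the near-perfect scores suggest a high-variance model evaluated on its own training data; the DecisionTreeClassifier here is a guess.)

# Classification report on the full cleaned data set
# (the choice of estimator is an assumption)
from sklearn.tree import DecisionTreeClassifier
tree = DecisionTreeClassifier().fit(X, y)
classification_report(y, tree.predict(X))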
Out[6]:
             precision    recall  f1-score   support

          0       0.98      0.97      0.97       431
          1       0.95      0.97      0.96       281

avg / total       0.97      0.97      0.97       712
Run a Random Forest Classifier to predict survival
Review the random forest documentation here.
What are the main hyperparameters you have to tune for a decision tree?
What are the main hyperparameters you have to tune for a random forest?
Use GridSearchCV or cross_val_score to tune your hyperparameters. What should your target metric be?
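(A minimal GridSearchCV sketch for the random forest prompt. The parameter grid and scoring choice are assumptions; accuracy is a reasonable default for this roughly balanced data set, but f1 or roc_auc are also defensible targets.)

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Hypothetical grid over the hyperparameters that usually matter most:
# number of trees, tree depth, and leaf size
param_grid = {
    'n_estimators': [50, 100, 200],
    'max_depth': [3, 5, None],
    'min_samples_leaf': [1, 5, 10],
}
grid = GridSearchCV(RandomForestClassifier(random_state=42),
                    param_grid, cv=5, scoring='accuracy')
grid.fit(X_train, y_train)
print(grid.best_params_, grid.best_score_)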