YStrano
GitHub Repository: YStrano/DataScience_GA
Path: blob/master/lessons/lesson_11/Titanic Data Set (done).ipynb
Kernel: Python 3

Import Titanic Data Set

import numpy as np
import pandas as pd
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import RandomForestClassifier

# read Pclass as strings so get_dummies later treats it as categorical
df = pd.read_csv("data/titanic.csv", dtype={'Pclass': object})
df.drop(['PassengerId', 'Name', 'Ticket', 'Cabin'], inplace=True, axis=1)
df.dropna(inplace=True)
print(df.head())
   Survived Pclass     Sex   Age  SibSp  Parch     Fare Embarked
0         0      3    male  22.0      1      0   7.2500        S
1         1      1  female  38.0      1      0  71.2833        C
2         1      3  female  26.0      0      0   7.9250        S
3         1      1  female  35.0      1      0  53.1000        S
4         0      3    male  35.0      0      0   8.0500        S

Use pd.get_dummies to generate a few features out of the Pclass, Sex, SibSp, and Parch columns.

Data documentation is here: https://www.kaggle.com/c/titanic/data. pd.get_dummies documentation is here: https://pandas.pydata.org/pandas-docs/stable/generated/pandas.get_dummies.html
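Before applying this to the full data set, it can help to see what pd.get_dummies does on a tiny frame. The toy columns below are illustrative, not from the Titanic file; drop_first=True drops one dummy per categorical column to avoid redundant (perfectly collinear) features.

```python
import pandas as pd

# tiny illustrative frame (not the real data)
toy = pd.DataFrame({'Sex': ['male', 'female', 'female'],
                    'Pclass': ['1', '3', '3']})

# one dummy column per category level, minus the first level of each column
print(pd.get_dummies(toy, drop_first=True))
```

With drop_first=True, 'Sex_female' and 'Pclass_1' are dropped, leaving only 'Sex_male' and 'Pclass_3'; a row of all zeros then encodes the dropped levels.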

X = df.drop('Survived', axis=1)
X = pd.get_dummies(X, drop_first=True)  # drop_first avoids redundant dummy columns
y = df.Survived
print(X.head())
    Age  SibSp  Parch     Fare  Pclass_2  Pclass_3  Sex_male  Embarked_Q  Embarked_S
0  22.0      1      0   7.2500         0         1         1           0           1
1  38.0      1      0  71.2833         0         0         0           0           0
2  26.0      0      0   7.9250         0         1         0           0           1
3  35.0      1      0  53.1000         0         0         0           0           1
4  35.0      0      0   8.0500         0         1         1           0           1

Reduce your data set to the features you want, and then create a train/test split.

from sklearn.model_selection import train_test_split
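The cell above imports train_test_split but never calls it. A minimal sketch of how the split would look is below; the toy X and y stand in for the real feature matrix and target, and test_size=0.25 is just one reasonable choice.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier

# toy stand-ins for the Titanic features and target (illustrative only)
X = pd.DataFrame({'Age': [22, 38, 26, 35, 35, 54],
                  'Sex_male': [1, 0, 0, 0, 1, 1]})
y = pd.Series([0, 1, 1, 1, 0, 0])

# hold out 25% of the rows for evaluation; random_state makes the split repeatable
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42)

rfc = RandomForestClassifier(random_state=42)
rfc.fit(X_train, y_train)
print(rfc.score(X_test, y_test))  # accuracy on the held-out rows
```

Fitting on X_train and scoring on X_test gives an honest estimate, unlike the in-sample report further down.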
rfc = RandomForestClassifier()
# 50-fold CV on ~712 rows leaves only ~14 rows per fold; cv=5 or 10 is more typical
scores = cross_val_score(rfc, X, y, cv=50)
print(np.average(scores))
# print(scores)
0.8032893772893773
from sklearn.metrics import classification_report

# note: classification_report's signature is (y_true, y_pred); the arguments
# here are reversed, which swaps the precision and recall columns in the output.
# this is also an in-sample report, since the model is fit and scored on the same data
rfc.fit(X, y)
print(classification_report(rfc.predict(X), y))
             precision    recall  f1-score   support

          0       0.98      0.97      0.97       431
          1       0.95      0.97      0.96       281

avg / total       0.97      0.97      0.97       712

Run a Random Forest Classifier to predict survival

Review the random forest documentation here.

  1. What are the main hyperparameters you have to tune for a decision tree?

  2. What are the main hyperparameters you have to tune for a random forest?

Use GridSearchCV or cross_val_score to tune your parameters. What should your target metric be?
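One way to approach the tuning exercise is a small grid search over a few of the random forest hyperparameters mentioned in the questions above. The synthetic data and the particular grid values below are illustrative choices, not part of the assignment; with the real Titanic data you would pass the X and y built earlier.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# small synthetic stand-in for the Titanic features (illustrative only)
X, y = make_classification(n_samples=200, n_features=8, random_state=0)

# a few common random-forest hyperparameters to search over
param_grid = {
    'n_estimators': [50, 100],       # number of trees
    'max_depth': [3, 5, None],       # tree depth limit
    'min_samples_leaf': [1, 5],      # minimum samples per leaf
}

grid = GridSearchCV(RandomForestClassifier(random_state=0),
                    param_grid, cv=5, scoring='accuracy')
grid.fit(X, y)

print(grid.best_params_)   # best combination found
print(grid.best_score_)    # mean cross-validated accuracy of that combination
```

Accuracy is the default metric here, but if one class is much rarer, a metric such as F1 or ROC AUC (via the scoring parameter) is usually a better target.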