GitHub Repository: YStrano/DataScience_GA
Path: blob/master/lessons/lesson_11/Titanic Data Set.ipynb
Kernel: Python 3

Import Titanic Data Set

import numpy as np
import pandas as pd

df = pd.read_csv("data/titanic.csv")
df.head()
list(df.columns)
['PassengerId', 'Survived', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp', 'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked']
df.dtypes
PassengerId      int64
Survived         int64
Pclass           int64
Name            object
Sex             object
Age            float64
SibSp            int64
Parch            int64
Ticket          object
Fare           float64
Cabin           object
Embarked        object
dtype: object
df.describe()
df.isnull().sum()
PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
dtype: int64
# .isnull() returns a boolean DataFrame; .any(axis=1) keeps a row if any cell in it is True.
# (.all(axis=1) would keep a row only if every cell in it is True.)
df[df.isnull().any(axis=1)]

Use pd.get_dummies to generate a few features out of the Pclass, Sex, SibSp, and Parch columns.

Data documentation is here: https://www.kaggle.com/c/titanic/data. pd.get_dummies documentation is here: https://pandas.pydata.org/pandas-docs/stable/generated/pandas.get_dummies.html

tit_dummies = pd.get_dummies(data=df, columns=['Pclass', 'Sex', 'SibSp', 'Parch', 'Embarked'], drop_first=True)
# tit_dummies = pd.get_dummies(df[['Pclass', 'Sex', 'SibSp', 'Parch']])
# The line above also runs, but it is wrong for this purpose: without the
# columns= argument, get_dummies only encodes object/category columns (here
# just Sex) and passes the numeric columns (Pclass, SibSp, Parch) through unchanged.
tit_dummies.head()
list(tit_dummies.columns)
['PassengerId', 'Survived', 'Name', 'Age', 'Ticket', 'Fare', 'Cabin', 'Pclass_2', 'Pclass_3', 'Sex_male', 'SibSp_1', 'SibSp_2', 'SibSp_3', 'SibSp_4', 'SibSp_5', 'SibSp_8', 'Parch_1', 'Parch_2', 'Parch_3', 'Parch_4', 'Parch_5', 'Parch_6', 'Embarked_Q', 'Embarked_S']
tit_dummies = tit_dummies.drop(['PassengerId', 'Name', 'Ticket', 'Cabin'], axis=1)
len(tit_dummies)
891
# Note: dropna() returns a new DataFrame rather than modifying in place, so
# without reassignment (tit_dummies = tit_dummies.dropna()) the NaN rows remain —
# which is why len() below is still 891, and why the fit at the end fails.
tit_dummies.dropna()
len(tit_dummies)
891

Reduce your data set to the features you want, and then create a train test split

from sklearn.model_selection import train_test_split
y = df['Survived']
X = tit_dummies
# Careful: tit_dummies still contains the Survived column, so X includes the
# target here. Drop it (tit_dummies.drop('Survived', axis=1)) to avoid leakage.
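
train_test_split is imported above but never called. A minimal sketch of the split, assuming the leakage and NaN issues flagged above have been handled first:

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y
)
# stratify=y keeps the survived/died ratio the same in both splits.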

Run a Random Forest Classifier to predict survival

Review the random forest documentation here.

  1. What are the main hyperparameters you have to tune for a decision tree?

  2. What are the main hyperparameters you have to tune for a random forest?

Use Grid Search or Cross Val Score to tune your parameters. What should your target metric be?
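A minimal grid-search sketch, assuming a RandomForestClassifier and accuracy as the target metric (ROC AUC is another reasonable choice for this binary task). The grid covers the usual knobs: depth and split size control individual trees, while n_estimators and max_features are forest-level parameters:

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

param_grid = {
    'n_estimators': [50, 100, 200],       # number of trees in the forest
    'max_depth': [None, 5, 10],           # limits how deep each tree can grow
    'min_samples_split': [2, 5, 10],      # minimum samples needed to split a node
    'max_features': ['sqrt', 'log2'],     # features considered at each split
}
grid = GridSearchCV(RandomForestClassifier(random_state=42),
                    param_grid, cv=5, scoring='accuracy')
# grid.fit(X_train, y_train)  # assumes the train/test split sketched above
# print(grid.best_params_, grid.best_score_)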

# Instantiate a RandomForestRegressor.
# from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import RandomForestRegressor
from sklearn import metrics

rf = RandomForestRegressor()
rf = rf.fit(X, y)
preds = rf.predict(X)
np.sqrt(metrics.mean_squared_error(y, preds))

# Comment out the np.sqrt(...) line above if you want to run the below:
# from sklearn.model_selection import cross_val_score
# k_folds = 5
# scores = cross_val_score(rf, X, y, cv=k_folds, scoring='neg_mean_squared_error')
# scoring='neg_mean_squared_error' is a fine baseline for regression; for
# classification it is not meaningful, since the labels are classes, not quantities.
# np.mean(np.sqrt(-scores))
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-74-2f706f1e5e52> in <module>()
      5
      6 rf = RandomForestRegressor()
----> 7 rf = rf.fit(X,y)
      8 preds = rf.predict(X)
      9 np.sqrt(metrics.mean_squared_error(y, preds))

/anaconda3/lib/python3.6/site-packages/sklearn/ensemble/forest.py in fit(self, X, y, sample_weight)
    245         """
    246         # Validate or convert input data
--> 247         X = check_array(X, accept_sparse="csc", dtype=DTYPE)
    248         y = check_array(y, accept_sparse='csc', ensure_2d=False, dtype=None)
    249         if sample_weight is not None:

/anaconda3/lib/python3.6/site-packages/sklearn/utils/validation.py in check_array(array, accept_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, ensure_min_samples, ensure_min_features, warn_on_dtype, estimator)
    451                              % (array.ndim, estimator_name))
    452         if force_all_finite:
--> 453             _assert_all_finite(array)
    454
    455         shape_repr = _shape_repr(array.shape)

/anaconda3/lib/python3.6/site-packages/sklearn/utils/validation.py in _assert_all_finite(X)
     42             and not np.isfinite(X).all()):
     43         raise ValueError("Input contains NaN, infinity"
---> 44                          " or a value too large for %r." % X.dtype)
     45
     46

ValueError: Input contains NaN, infinity or a value too large for dtype('float32').
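
The error follows from the two bugs flagged earlier: dropna() was never reassigned, so Age still contains NaNs, and the model is a regressor even though survival is a binary label. A minimal sketch of the fix, reusing the notebook's variable names:

from sklearn.ensemble import RandomForestClassifier

# Actually drop the NaN rows this time, and keep y aligned with X.
clean = tit_dummies.dropna()
y = clean['Survived']
X = clean.drop('Survived', axis=1)   # remove the target from the features

rf = RandomForestClassifier(random_state=42)
rf.fit(X, y)
preds = rf.predict(X)
(preds == y).mean()  # training accuracy; evaluate on a held-out split in practice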