GitHub Repository: YStrano/DataScience_GA
Path: blob/master/lessons/lesson_11/Titanic Data Set.ipynb
Kernel: Python 3

Import Titanic Data Set

import numpy as np
import pandas as pd

df = pd.read_csv("data/titanic.csv")
df.head()
list(df.columns)
['PassengerId', 'Survived', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp', 'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked']
df.dtypes
PassengerId      int64
Survived         int64
Pclass           int64
Name            object
Sex             object
Age            float64
SibSp            int64
Parch            int64
Ticket          object
Fare           float64
Cabin           object
Embarked        object
dtype: object
df.describe()
df.isnull().sum()
PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
dtype: int64
# .isnull() returns a boolean DataFrame; .any(axis=1) keeps a row if any cell in it is True.
# (.all(axis=1) would keep a row only if every cell in it is True.)
df[df.isnull().any(axis=1)]

Use pd.get_dummies to generate a few features out of the Pclass, Sex, SibSp, and Parch columns.

Data documentation is here: https://www.kaggle.com/c/titanic/data. pd.get_dummies documentation is here: https://pandas.pydata.org/pandas-docs/stable/generated/pandas.get_dummies.html

tit_dummies = pd.get_dummies(data=df, columns=['Pclass', 'Sex', 'SibSp', 'Parch', 'Embarked'], drop_first=True)
# tit_dummies = pd.get_dummies(df[['Pclass', 'Sex', 'SibSp', 'Parch']])
# The line above also runs, but it is wrong for this purpose: without the
# columns= argument, get_dummies only encodes object/category columns (here
# just Sex) and passes the numeric columns (Pclass, SibSp, Parch) through unchanged.
tit_dummies.head()
list(tit_dummies.columns)
['PassengerId', 'Survived', 'Name', 'Age', 'Ticket', 'Fare', 'Cabin', 'Pclass_2', 'Pclass_3', 'Sex_male', 'SibSp_1', 'SibSp_2', 'SibSp_3', 'SibSp_4', 'SibSp_5', 'SibSp_8', 'Parch_1', 'Parch_2', 'Parch_3', 'Parch_4', 'Parch_5', 'Parch_6', 'Embarked_Q', 'Embarked_S']
tit_dummies = tit_dummies.drop(['PassengerId', 'Name', 'Ticket', 'Cabin'], axis=1)
len(tit_dummies)
891
# Note: dropna() returns a new DataFrame rather than modifying in place, so
# without reassignment (tit_dummies = tit_dummies.dropna()) the NaN rows remain —
# which is why len() below is still 891, and why the fit at the end fails.
tit_dummies.dropna()
len(tit_dummies)
891

Reduce your data set to the features you want, and then create a train test split

from sklearn.model_selection import train_test_split
y = df['Survived']
X = tit_dummies
# Careful: tit_dummies still contains the Survived column, so X includes the
# target here. Drop it (tit_dummies.drop('Survived', axis=1)) to avoid leakage.
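
train_test_split is imported above but never called. A minimal sketch of the split, assuming the leakage and NaN issues flagged above have been handled first:

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y
)
# stratify=y keeps the survived/died ratio the same in both splits.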

Run a Random Forest Classifier to predict survival

Review the random forest documentation here.

  1. What are the main hyperparameters you have to tune for a decision tree?

  2. What are the main hyperparameters you have to tune for a random forest?

Use Grid Search or Cross Val Score to tune your parameters. What should your target metric be?
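A minimal grid-search sketch, assuming a RandomForestClassifier and accuracy as the target metric (ROC AUC is another reasonable choice for this binary task). The grid covers the usual knobs: depth and split size control individual trees, while n_estimators and max_features are forest-level parameters:

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

param_grid = {
    'n_estimators': [50, 100, 200],       # number of trees in the forest
    'max_depth': [None, 5, 10],           # limits how deep each tree can grow
    'min_samples_split': [2, 5, 10],      # minimum samples needed to split a node
    'max_features': ['sqrt', 'log2'],     # features considered at each split
}
grid = GridSearchCV(RandomForestClassifier(random_state=42),
                    param_grid, cv=5, scoring='accuracy')
# grid.fit(X_train, y_train)  # assumes the train/test split sketched above
# print(grid.best_params_, grid.best_score_)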

# Instantiate a RandomForestRegressor.
# from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import RandomForestRegressor
from sklearn import metrics

rf = RandomForestRegressor()
rf = rf.fit(X, y)
preds = rf.predict(X)
np.sqrt(metrics.mean_squared_error(y, preds))

# Comment out the np.sqrt(...) line above if you want to run the below:
# from sklearn.model_selection import cross_val_score
# k_folds = 5
# scores = cross_val_score(rf, X, y, cv=k_folds, scoring='neg_mean_squared_error')
# scoring='neg_mean_squared_error' is a fine baseline for regression; for
# classification it is not meaningful, since the labels are classes, not quantities.
# np.mean(np.sqrt(-scores))
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-74-2f706f1e5e52> in <module>()
      5
      6 rf = RandomForestRegressor()
----> 7 rf = rf.fit(X,y)
      8 preds = rf.predict(X)
      9 np.sqrt(metrics.mean_squared_error(y, preds))

/anaconda3/lib/python3.6/site-packages/sklearn/ensemble/forest.py in fit(self, X, y, sample_weight)
    245         """
    246         # Validate or convert input data
--> 247         X = check_array(X, accept_sparse="csc", dtype=DTYPE)
    248         y = check_array(y, accept_sparse='csc', ensure_2d=False, dtype=None)
    249         if sample_weight is not None:

/anaconda3/lib/python3.6/site-packages/sklearn/utils/validation.py in check_array(array, accept_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, ensure_min_samples, ensure_min_features, warn_on_dtype, estimator)
    451                              % (array.ndim, estimator_name))
    452         if force_all_finite:
--> 453             _assert_all_finite(array)
    454
    455         shape_repr = _shape_repr(array.shape)

/anaconda3/lib/python3.6/site-packages/sklearn/utils/validation.py in _assert_all_finite(X)
     42             and not np.isfinite(X).all()):
     43         raise ValueError("Input contains NaN, infinity"
---> 44                          " or a value too large for %r." % X.dtype)
     45
     46

ValueError: Input contains NaN, infinity or a value too large for dtype('float32').
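
The error follows from the two bugs flagged earlier: dropna() was never reassigned, so Age still contains NaNs, and the model is a regressor even though survival is a binary label. A minimal sketch of the fix, reusing the notebook's variable names:

from sklearn.ensemble import RandomForestClassifier

# Actually drop the NaN rows this time, and keep y aligned with X.
clean = tit_dummies.dropna()
y = clean['Survived']
X = clean.drop('Survived', axis=1)   # remove the target from the features

rf = RandomForestClassifier(random_state=42)
rf.fit(X, y)
preds = rf.predict(X)
(preds == y).mean()  # training accuracy; evaluate on a held-out split in practice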