Path: blob/master/lessons/lesson_11/Titanic Data Set.ipynb
Kernel: Python 3
Import Titanic Data Set
In [1]:
In [2]:
Out[2]:
In [3]:
Out[3]:
['PassengerId',
'Survived',
'Pclass',
'Name',
'Sex',
'Age',
'SibSp',
'Parch',
'Ticket',
'Fare',
'Cabin',
'Embarked']
In [4]:
Out[4]:
PassengerId int64
Survived int64
Pclass int64
Name object
Sex object
Age float64
SibSp int64
Parch int64
Ticket object
Fare float64
Cabin object
Embarked object
dtype: object
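The cells above evidently load the CSV and inspect its columns and dtypes. A minimal sketch of that step, using a two-row stand-in frame so it runs without the Kaggle file (the real `train.csv` has 891 rows; the inline data here is an assumption for illustration):

```python
import io
import pandas as pd

# Two-row stand-in for the Kaggle train.csv (real file: 891 rows)
csv = io.StringIO(
    "PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked\n"
    "1,0,3,Braund,male,22.0,1,0,A/5 21171,7.25,,S\n"
    "2,1,1,Cumings,female,38.0,1,0,PC 17599,71.2833,C85,C\n"
)
df = pd.read_csv(csv)

cols = df.columns.tolist()   # the twelve column names listed in Out[3]
dtypes = df.dtypes           # int64 / float64 / object mix, as in Out[4]
```

With the real file you would simply call `pd.read_csv("train.csv")` instead of reading from the string buffer.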
In [63]:
Out[63]:
In [64]:
Out[64]:
PassengerId 0
Survived 0
Pclass 0
Name 0
Sex 0
Age 177
SibSp 0
Parch 0
Ticket 0
Fare 0
Cabin 687
Embarked 2
dtype: int64
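The per-column null counts above (177 missing ages, 687 missing cabins, 2 missing embarkation ports) come from chaining `isnull()` with `sum()`. A sketch on a tiny stand-in frame with the same kind of gaps:

```python
import numpy as np
import pandas as pd

# Tiny stand-in frame with gaps in the same columns as the Titanic data
df = pd.DataFrame({
    "Age": [22.0, np.nan, 38.0],
    "Cabin": [None, None, "C85"],
    "Embarked": ["S", "C", None],
})

# isnull() marks each missing cell True; sum() counts per column,
# producing a Series like Out[64]
missing = df.isnull().sum()
```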
In [65]:
Out[65]:
Use pd.get_dummies to generate a few features out of the Pclass, Sex, SibSp, and Parch columns.
Data documentation: https://www.kaggle.com/c/titanic/data . pd.get_dummies documentation: https://pandas.pydata.org/pandas-docs/stable/generated/pandas.get_dummies.html
In [66]:
Out[66]:
In [67]:
Out[67]:
['PassengerId',
'Survived',
'Name',
'Age',
'Ticket',
'Fare',
'Cabin',
'Pclass_2',
'Pclass_3',
'Sex_male',
'SibSp_1',
'SibSp_2',
'SibSp_3',
'SibSp_4',
'SibSp_5',
'SibSp_8',
'Parch_1',
'Parch_2',
'Parch_3',
'Parch_4',
'Parch_5',
'Parch_6',
'Embarked_Q',
'Embarked_S']
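The column list above has no `Pclass_1`, `Sex_female`, `SibSp_0`, or `Parch_0`, which suggests `drop_first=True` was passed so one redundant level per category is dropped. A sketch on a small stand-in frame (the column values here are assumptions for illustration):

```python
import pandas as pd

df = pd.DataFrame({
    "Pclass": [1, 2, 3],
    "Sex": ["male", "female", "male"],
    "SibSp": [0, 1, 0],
    "Parch": [0, 0, 2],
    "Embarked": ["S", "C", "Q"],
})

# drop_first=True drops the first level of each category, matching the
# output above (no Pclass_1, no Sex_female, no SibSp_0, no Parch_0)
dummies = pd.get_dummies(
    df, columns=["Pclass", "Sex", "SibSp", "Parch", "Embarked"],
    drop_first=True,
)
```

Dropping the first level avoids perfectly collinear dummy columns, which matters for linear models and does no harm for tree ensembles.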
In [68]:
In [69]:
Out[69]:
891
In [70]:
Out[70]:
In [71]:
Out[71]:
891
Reduce your data set to the features you want, and then create a train/test split
In [72]:
In [73]:
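The split itself is one call to `train_test_split`. A sketch with synthetic stand-in features (the real notebook would pass the dummy-encoded frame and the `Survived` column):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the dummy-encoded features and the target
X = pd.DataFrame({"Pclass_3": [1, 0, 1, 0] * 25, "Sex_male": [1, 1, 0, 0] * 25})
y = pd.Series([0, 1, 1, 0] * 25)

# Hold out 25% of rows for evaluation; random_state fixes the shuffle
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)
```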
Run a Random Forest Classifier to predict survival
Review the random forest documentation here.
What are the main hyperparameters you have to tune for a decision tree?
What are the main hyperparameters you have to tune for a random forest?
Use Grid Search or Cross Val Score to tune your parameters. What should your target metric be?
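For a single decision tree the main knobs are depth and leaf-size controls (`max_depth`, `min_samples_split`, `min_samples_leaf`); a random forest adds ensemble-level knobs (`n_estimators`, `max_features`). A minimal `GridSearchCV` sketch over two of them, on synthetic data standing in for the Titanic features (accuracy is a reasonable target metric for this roughly balanced binary problem; AUC is a common alternative):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic binary-classification data standing in for the Titanic features
X, y = make_classification(n_samples=200, n_features=8, random_state=0)

# Search a small grid of forest hyperparameters with 3-fold CV
grid = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid={"n_estimators": [10, 50], "max_depth": [3, None]},
    scoring="accuracy",
    cv=3,
)
grid.fit(X, y)

best_params = grid.best_params_   # the winning combination
best_score = grid.best_score_     # its mean cross-validated accuracy
```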
In [74]:
Out[74]:
----------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-74-2f706f1e5e52> in <module>()
5
6 rf = RandomForestRegressor()
----> 7 rf = rf.fit(X,y)
8 preds = rf.predict(X)
9 np.sqrt(metrics.mean_squared_error(y, preds))
/anaconda3/lib/python3.6/site-packages/sklearn/ensemble/forest.py in fit(self, X, y, sample_weight)
245 """
246 # Validate or convert input data
--> 247 X = check_array(X, accept_sparse="csc", dtype=DTYPE)
248 y = check_array(y, accept_sparse='csc', ensure_2d=False, dtype=None)
249 if sample_weight is not None:
/anaconda3/lib/python3.6/site-packages/sklearn/utils/validation.py in check_array(array, accept_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, ensure_min_samples, ensure_min_features, warn_on_dtype, estimator)
451 % (array.ndim, estimator_name))
452 if force_all_finite:
--> 453 _assert_all_finite(array)
454
455 shape_repr = _shape_repr(array.shape)
/anaconda3/lib/python3.6/site-packages/sklearn/utils/validation.py in _assert_all_finite(X)
42 and not np.isfinite(X).all()):
43 raise ValueError("Input contains NaN, infinity"
---> 44 " or a value too large for %r." % X.dtype)
45
46
ValueError: Input contains NaN, infinity or a value too large for dtype('float32').
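The traceback comes from the 177 missing `Age` values counted earlier: scikit-learn estimators reject NaNs. (The cell also fits a `RandomForestRegressor`, while the exercise asks for a classifier.) A sketch of one fix, on a stand-in frame with the same problem; the median imputation is one simple choice, not necessarily what the author intended:

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Stand-in frame reproducing the problem: NaNs in Age
df = pd.DataFrame({
    "Age": [22.0, np.nan, 38.0, 26.0, np.nan, 35.0] * 10,
    "Sex_male": [1, 1, 0, 0, 1, 0] * 10,
    "Survived": [0, 1, 1, 1, 0, 0] * 10,
})

# Fill missing ages (median here) so no NaNs reach the estimator,
# which is what raised the ValueError above
df["Age"] = df["Age"].fillna(df["Age"].median())

X = df[["Age", "Sex_male"]]
y = df["Survived"]

# A classifier suits the 0/1 survival target better than a regressor
rf = RandomForestClassifier(random_state=0).fit(X, y)
preds = rf.predict(X)
```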
In [ ]: