Random Forest
The random forest algorithm is a supervised learning algorithm. As the name suggests, it builds a forest out of a number of decision trees.
Random forest is an ensemble method. An ensemble is simply a group of classifiers: instead of relying on a single classifier to predict the target, we combine the predictions of several.
In a random forest, the ensemble members are randomly built decision trees. Each tree is a classifier in its own right, and the final prediction is made by majority voting.
Every tree votes for one of the target classes, and the class that receives the most votes becomes the final predicted class.
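The voting step above can be sketched in a few lines. This is a minimal illustration of hard majority voting, not scikit-learn's actual implementation (which averages the trees' predicted class probabilities):

```python
from collections import Counter

def majority_vote(predictions):
    """Return the class that the most trees voted for."""
    votes = Counter(predictions)
    return votes.most_common(1)[0][0]

# Three of five hypothetical trees vote for class 1
print(majority_vote([1, 0, 1, 1, 0]))  # 1
```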
In scikit-learn, a random forest is a meta estimator that fits a number of decision tree classifiers on various sub-samples of the dataset and uses averaging to improve predictive accuracy and control over-fitting.
The sub-sample size is always the same as the original input sample size but the samples are drawn with replacement if bootstrap=True (default).
estimators_ : list of DecisionTreeClassifier — the collection of fitted sub-estimators.
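The meta-estimator behaviour can be seen directly: after fitting, the individual trees are available in the estimators_ attribute. A small sketch using synthetic data in place of the real dataset:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data stands in for the real dataset
X, y = make_classification(n_samples=200, n_features=8, random_state=0)

model = RandomForestClassifier(n_estimators=10, bootstrap=True, random_state=0)
model.fit(X, y)

# estimators_ holds the ten fitted DecisionTreeClassifier sub-estimators
print(len(model.estimators_))               # 10
print(type(model.estimators_[0]).__name__)  # DecisionTreeClassifier
```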
Why Random Forest?
The same random forest algorithm can be used for both classification and regression tasks.
Random forests tolerate missing values better than many algorithms, although scikit-learn's implementation has historically required missing values to be imputed before fitting.
Adding more trees to the forest reduces variance without increasing the risk of overfitting.
Random forests can also model categorical features (in scikit-learn, after they are encoded numerically).
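On the last point: the algorithm itself can split on categories, but scikit-learn's trees require numeric input, so categorical columns are typically one-hot encoded first. A minimal sketch with a hypothetical data frame (column names are assumptions for illustration):

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Hypothetical frame with one categorical column
df = pd.DataFrame({
    "age": [25, 40, 31, 58],
    "smoker": ["yes", "no", "no", "yes"],  # categorical
    "healthy": [1, 1, 0, 0],
})

# scikit-learn trees need numeric input, so one-hot encode first
X = pd.get_dummies(df.drop(columns="healthy"))
y = df["healthy"]

model = RandomForestClassifier(n_estimators=10, random_state=0).fit(X, y)
print(list(X.columns))
```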
A Case Study: Predicting the health status of an individual based on lifestyle and socio-economic behaviour
The dataset was collected by the Centers for Disease Control and Prevention (CDC).
Importing necessary libraries
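The exact import cell is not shown in this excerpt; a typical set for this workflow might look like the following (the RSEED value is an assumption):

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score

RSEED = 50  # random seed used throughout (value assumed)
```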
The labels are imbalanced, which means accuracy is not the best metric.
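To see why, compare accuracy and recall for a classifier that always predicts the majority class (synthetic labels for illustration):

```python
import numpy as np
from sklearn.metrics import accuracy_score, recall_score

# Imbalanced labels: 95% negative, 5% positive
y_true = np.array([0] * 95 + [1] * 5)
y_naive = np.zeros(100, dtype=int)  # always predict the majority class

print(accuracy_score(y_true, y_naive))  # 0.95 -- looks good, but useless
print(recall_score(y_true, y_naive))    # 0.0  -- misses every positive case
```

This is why metrics such as recall or ROC AUC are preferred over raw accuracy here.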
Fitting a decision tree on the raw data fails, because at least one column contains byte-string dates rather than numbers (traceback abridged):

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-12-d548f4b75d42> in <module>()
      7 # Make a decision tree and train
      8 tree = DecisionTreeClassifier(random_state=RSEED)
----> 9 tree.fit(train, train_labels)
...
ValueError: could not convert string to float: "b'12022015'"
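The traceback shows a column of byte-string dates that cannot be cast to float; before fitting, such columns must be dropped or converted to numeric. A minimal sketch of the cleanup (the column name IDATE is an assumption for illustration):

```python
import pandas as pd

# Hypothetical reproduction: a date column read in as byte strings
train = pd.DataFrame({"IDATE": ["b'12022015'", "b'11072015'"],
                      "age": [44, 60]})

# Trees need numeric features, so drop (or parse) the offending column
train = train.drop(columns="IDATE")
print(train.dtypes.tolist())
```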
We can see that our model is extremely deep and has many nodes. To reduce the variance of our model, we could limit the maximum depth or the number of leaf nodes. Another method to reduce the variance is to use more trees, each one trained on a random sampling of the observations. This is where the random forest comes into play.
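The variance-reduction options above can be sketched side by side. An unconstrained tree grows until its leaves are pure; capping max_depth or averaging many trees both restrain it (synthetic data for illustration):

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=500, random_state=0)

# An unconstrained tree grows until every leaf is pure (high variance)
deep = DecisionTreeClassifier(random_state=0).fit(X, y)

# Two ways to reduce variance: cap the depth, or average many trees
shallow = DecisionTreeClassifier(max_depth=5, random_state=0).fit(X, y)
forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

print(deep.get_depth(), shallow.get_depth())
```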
Random Forest
The same error occurs when fitting the random forest on the uncleaned data (traceback abridged):

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-13-f6d156bc568e> in <module>()
      9 # Fit on training data
---> 10 model.fit(train, train_labels)
...
ValueError: could not convert string to float: "b'12022015'"