GitHub Repository: suyashi29/python-su
Path: blob/master/ML/Notebook/Random Forest.ipynb
Kernel: Python 3

Random Forest

The random forest algorithm is a supervised classification algorithm. As the name suggests, it creates a "forest" from a number of decision trees.

  • Random forest is an ensemble classification algorithm. An ensemble is a group of classifiers: instead of relying on a single classifier to predict the target, an ensemble combines the predictions of multiple classifiers.

  • In the case of a random forest, the ensemble members are randomly built decision trees. Each decision tree is an individual classifier, and the target prediction is based on majority voting.

  • Every tree votes for one of the target classes, and the class that receives the most votes is taken as the final predicted class (see the voting sketch after this list).

  • A random forest is a meta-estimator that fits a number of decision tree classifiers on different sub-samples of the dataset and averages their predictions to improve predictive accuracy and control over-fitting.

  • Each sub-sample is the same size as the original input sample, but the samples are drawn with replacement when bootstrap=True (the default).

    • estimators_: the collection of fitted sub-estimators, exposed as a list of DecisionTreeClassifier objects.
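
A minimal sketch of the voting idea on a small synthetic dataset (the dataset and parameter values are illustrative assumptions, not part of the case study below): each fitted tree in estimators_ makes its own prediction, and the forest's prediction corresponds to the majority vote. Note that scikit-learn actually averages the trees' predicted probabilities, which usually agrees with the majority vote.

# Sketch: inspect how individual trees vote and how the forest aggregates them
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=500, n_features=4, random_state=0)
forest = RandomForestClassifier(n_estimators=25, random_state=0).fit(X, y)

sample = X[:1]                                           # one observation
votes = [tree.predict(sample)[0] for tree in forest.estimators_]
print('Votes per class     :', np.bincount(np.array(votes).astype(int)))
print('Forest prediction   :', forest.predict(sample))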


Why Random Forest?

  • The same random forest algorithm can be used for both classification and regression tasks (see the regression sketch after this list).

  • Random forests are relatively robust to missing values, although scikit-learn's implementation still requires missing values to be imputed before fitting (as we do later in this notebook).

  • Adding more trees to the forest does not make the model overfit more; additional trees mainly reduce the variance of the predictions.

  • Random forests can also be applied to categorical features, once they are suitably encoded.
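
As a hedged illustration of the regression case, scikit-learn provides RandomForestRegressor with an interface analogous to the classifier; the synthetic data below is an assumption for demonstration only.

# Sketch: the same forest idea applied to a regression target
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

X, y = make_regression(n_samples=500, n_features=4, noise=10.0, random_state=0)
reg = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)
print(reg.predict(X[:3]))          # predictions for the first three rows
print(round(reg.score(X, y), 3))   # R^2 on the training data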

# A simple example to understand random forest
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=1000, n_features=4,
                           n_informative=2, n_redundant=0,
                           random_state=0, shuffle=False)
clf = RandomForestClassifier(n_estimators=100, max_depth=2, random_state=0)
clf.fit(X, y)
print(clf.feature_importances_)
print(clf.predict([[0, 0, 0, 0]]))
[0.14205973 0.76664038 0.0282433  0.06305659]
[1]

A Case Study: Predicting the health status of an individual based on lifestyle & socio-economic behaviour

The dataset was collected by the Centers for Disease Control and Prevention

Importing necessary libraries

import pandas as pd
import numpy as np

# Set random seed to ensure reproducible runs
RSEED = 50
d = pd.read_csv('F:\\ML & Data Visualization\\health\\2015.csv').sample(100000, random_state=RSEED)
d.head()
# Descriptive statistics for each column
d.describe()
# Label distribution
d['_RFHLTH'] = d['_RFHLTH'].replace({2: 0})
d = d.loc[d['_RFHLTH'].isin([0, 1])].copy()
d = d.rename(columns={'_RFHLTH': 'label'})
d['label'].value_counts()
1.0    81140
0.0    18579
Name: label, dtype: int64

The labels are imbalanced (about 81% of the examples fall in class 1.0), so accuracy is not the best metric: a classifier that always predicts the majority class would already be roughly 81% accurate. Later in the notebook we rely on ROC AUC instead.

# Drop other health-related columns that overlap with the label
d = d.drop(columns=['POORHLTH', 'PHYSHLTH', 'GENHLTH', 'PAINACT2', 'QLMENTL2',
                    'QLSTRES2', 'QLHLTH2', 'HLTHPLN1', 'MENTHLTH'])
from sklearn.model_selection import train_test_split

# Extract the labels
labels = np.array(d.pop('label'))

# 30% of examples in the test data
train, test, train_labels, test_labels = train_test_split(d, labels,
                                                           stratify=labels,
                                                           test_size=0.3,
                                                           random_state=RSEED)
# Imputation of missing values
# We'll fill in the missing values with the mean of each column.
# Note that missing values in the test set are filled with the means of the
# *training* columns: if we get new data, we will only have the training data
# to impute from.
train = train.fillna(train.mean())
test = test.fillna(test.mean())

# Features for feature importances
features = list(train.columns)
train.shape
(69803, 320)
test.shape
(29916, 320)
# Decision Tree on Real Data
# First, we'll train a decision tree on the data.
# Let's leave the depth unlimited and see if we get overfitting!
from sklearn.tree import DecisionTreeClassifier

# Make a decision tree and train it
tree = DecisionTreeClassifier(random_state=RSEED)
tree.fit(train, train_labels)
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-12-d548f4b75d42> in <module>()
      7 # Make a decision tree and train
      8 tree = DecisionTreeClassifier(random_state=RSEED)
----> 9 tree.fit(train, train_labels)

C:\Users\HP\Anaconda3\lib\site-packages\sklearn\tree\tree.py in fit(self, X, y, sample_weight, check_input, X_idx_sorted)
    799             sample_weight=sample_weight,
    800             check_input=check_input,
--> 801             X_idx_sorted=X_idx_sorted)
    802         return self
    803

C:\Users\HP\Anaconda3\lib\site-packages\sklearn\tree\tree.py in fit(self, X, y, sample_weight, check_input, X_idx_sorted)
    114         random_state = check_random_state(self.random_state)
    115         if check_input:
--> 116             X = check_array(X, dtype=DTYPE, accept_sparse="csc")
    117             y = check_array(y, ensure_2d=False, dtype=None)
    118             if issparse(X):

C:\Users\HP\Anaconda3\lib\site-packages\sklearn\utils\validation.py in check_array(array, accept_sparse, accept_large_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, ensure_min_samples, ensure_min_features, warn_on_dtype, estimator)
    525     try:
    526         warnings.simplefilter('error', ComplexWarning)
--> 527         array = np.asarray(array, dtype=dtype, order=order)
    528     except ComplexWarning:
    529         raise ValueError("Complex data not supported\n"

C:\Users\HP\Anaconda3\lib\site-packages\numpy\core\numeric.py in asarray(a, dtype, order)
    499
    500     """
--> 501     return array(a, dtype, copy=False, order=order)
    502
    503

ValueError: could not convert string to float: "b'12022015'"
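
The traceback shows that fit failed because at least one column contains non-numeric values (here a date-like field stored as the byte string b'12022015'). The notebook continues as if the fit had succeeded; a hedged fix, assuming we are content to discard the offending columns, would be to keep only the numeric columns before fitting:

# Sketch: keep only numeric columns so the tree/forest can be fitted.
# Dropping the non-numeric columns is an assumption about how the author
# intended to handle them; parsing or encoding them would be an alternative.
train = train.select_dtypes(include=[np.number])
test = test[train.columns]           # keep the test set aligned with train
features = list(train.columns)       # refresh the feature list

tree = DecisionTreeClassifier(random_state=RSEED)
tree.fit(train, train_labels)
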
# Make probability predictions
train_probs = tree.predict_proba(train)[:, 1]
probs = tree.predict_proba(test)[:, 1]

train_predictions = tree.predict(train)
predictions = tree.predict(test)

from sklearn.metrics import precision_score, recall_score, roc_auc_score, roc_curve

print(f'Train ROC AUC Score: {roc_auc_score(train_labels, train_probs)}')
print(f'Test ROC AUC Score: {roc_auc_score(test_labels, probs)}')
## Confusion Matrix
from sklearn.metrics import confusion_matrix
import matplotlib.pyplot as plt
import itertools

def plot_confusion_matrix(cm, classes, normalize=False,
                          title='Confusion matrix', cmap=plt.cm.Oranges):
    """
    This function prints and plots the confusion matrix.
    Normalization can be applied by setting `normalize=True`.
    Source: http://scikit-learn.org/stable/auto_examples/model_selection/plot_confusion_matrix.html
    """
    if normalize:
        cm = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]
        print("Normalized confusion matrix")
    else:
        print('Confusion matrix, without normalization')
    print(cm)

    plt.figure(figsize=(10, 10))
    plt.imshow(cm, interpolation='nearest', cmap=cmap)
    plt.title(title, size=24)
    plt.colorbar(aspect=4)
    tick_marks = np.arange(len(classes))
    plt.xticks(tick_marks, classes, rotation=45, size=14)
    plt.yticks(tick_marks, classes, size=14)

    fmt = '.2f' if normalize else 'd'
    thresh = cm.max() / 2.

    # Labeling the plot
    for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):
        plt.text(j, i, format(cm[i, j], fmt), fontsize=20,
                 horizontalalignment="center",
                 color="white" if cm[i, j] > thresh else "black")

    plt.grid(None)
    plt.tight_layout()
    plt.ylabel('True label', size=18)
    plt.xlabel('Predicted label', size=18)
cm = confusion_matrix(test_labels, predictions)
plot_confusion_matrix(cm, classes=['Poor Health', 'Good Health'],
                      title='Health Confusion Matrix')
# Feature Importances
# Finally, we can look at the features the decision tree considers most important.
# The values are computed by summing the reduction in Gini impurity over all of
# the nodes of the tree in which the feature is used.
fi = pd.DataFrame({'feature': features,
                   'importance': tree.feature_importances_}).\
     sort_values('importance', ascending=False)
fi.head()
# Visualize the full tree
from sklearn.tree import export_graphviz
from subprocess import call
from IPython.display import Image

# Save tree as dot file
export_graphviz(tree, 'tree_real_data.dot', rounded=True,
                feature_names=features, max_depth=6,
                class_names=['poor health', 'good health'], filled=True)

# Convert to png
call(['dot', '-Tpng', 'tree_real_data.dot', '-o', 'tree_real_data.png', '-Gdpi=200'])

# Visualize
Image(filename='tree_real_data.png')

We can see that our model is extremely deep and has many nodes. To reduce the variance of our model, we could limit the maximum depth or the number of leaf nodes. Another method to reduce the variance is to use more trees, each one trained on a random sampling of the observations. This is where the random forest comes into play.
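
As a brief hedged illustration of the first option, we can constrain the single tree directly; the max_depth and max_leaf_nodes values below are illustrative assumptions, not tuned choices.

# Sketch: reduce variance by limiting the capacity of a single tree
short_tree = DecisionTreeClassifier(max_depth=6, max_leaf_nodes=50,
                                    random_state=RSEED)
short_tree.fit(train, train_labels)
print(f'Nodes: {short_tree.tree_.node_count}, depth: {short_tree.tree_.max_depth}')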

Random Forest

from sklearn.ensemble import RandomForestClassifier

# Create the model with 100 trees
model = RandomForestClassifier(n_estimators=100,
                               random_state=RSEED,
                               max_features='sqrt',
                               n_jobs=-1, verbose=1)

# Fit on training data
model.fit(train, train_labels)
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-13-f6d156bc568e> in <module>()
      8
      9 # Fit on training data
---> 10 model.fit(train, train_labels)

C:\Users\HP\Anaconda3\lib\site-packages\sklearn\ensemble\forest.py in fit(self, X, y, sample_weight)
    248
    249         # Validate or convert input data
--> 250         X = check_array(X, accept_sparse="csc", dtype=DTYPE)
    251         y = check_array(y, accept_sparse='csc', ensure_2d=False, dtype=None)
    252         if sample_weight is not None:

C:\Users\HP\Anaconda3\lib\site-packages\sklearn\utils\validation.py in check_array(array, accept_sparse, accept_large_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, ensure_min_samples, ensure_min_features, warn_on_dtype, estimator)
    525     try:
    526         warnings.simplefilter('error', ComplexWarning)
--> 527         array = np.asarray(array, dtype=dtype, order=order)
    528     except ComplexWarning:
    529         raise ValueError("Complex data not supported\n"

C:\Users\HP\Anaconda3\lib\site-packages\numpy\core\numeric.py in asarray(a, dtype, order)
    499
    500     """
--> 501     return array(a, dtype, copy=False, order=order)
    502
    503

ValueError: could not convert string to float: "b'12022015'"
# We can see how many nodes there are for each tree on average
# and the maximum depth of each tree.
# There are 100 trees in the forest.
n_nodes = []
max_depths = []

for ind_tree in model.estimators_:
    n_nodes.append(ind_tree.tree_.node_count)
    max_depths.append(ind_tree.tree_.max_depth)

print(f'Average number of nodes {int(np.mean(n_nodes))}')
print(f'Average maximum depth {int(np.mean(max_depths))}')

We see that each decision tree in the forest has many nodes and is extremely deep. However, even though each individual decision tree may overfit to a particular subset of the training data, the idea is that the overall random forest should have a reduced variance.

# Results
train_rf_predictions = model.predict(train)
train_rf_probs = model.predict_proba(train)[:, 1]

rf_predictions = model.predict(test)
rf_probs = model.predict_proba(test)[:, 1]
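
The helper evaluate_model called below is not defined anywhere in this section of the notebook. A minimal sketch of such a helper, assuming it is meant to compare train and test metrics (the exact metrics and output format used by the author are unknown), might look like this; it relies on the global train_labels and test_labels defined earlier.

# Sketch of an evaluate_model helper (assumed behaviour, not the author's code)
from sklearn.metrics import precision_score, recall_score, roc_auc_score

def evaluate_model(predictions, probs, train_predictions, train_probs):
    """Print recall, precision and ROC AUC for the train and test sets."""
    results = {
        'recall': recall_score(test_labels, predictions),
        'precision': precision_score(test_labels, predictions),
        'roc': roc_auc_score(test_labels, probs),
    }
    train_results = {
        'recall': recall_score(train_labels, train_predictions),
        'precision': precision_score(train_labels, train_predictions),
        'roc': roc_auc_score(train_labels, train_probs),
    }
    for metric in ['recall', 'precision', 'roc']:
        print(f'{metric} -- train: {train_results[metric]:.3f} '
              f'test: {results[metric]:.3f}')
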
evaluate_model(rf_predictions, rf_probs, train_rf_predictions, train_rf_probs)

The model still achieves perfect measures on the training data, but this time, the testing scores are much better. If we compare the ROC AUC, we see that the random forest does significantly better than a single decision tree.

cm = confusion_matrix(test_labels, rf_predictions)
plot_confusion_matrix(cm, classes=['Poor Health', 'Good Health'],
                      title='Health Confusion Matrix')

Compared to the single decision tree, the model has fewer false positives, although more false negatives. Overall, the random forest does significantly better than a single decision tree.
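
If we want the raw counts behind this comparison, a small sketch using scikit-learn's convention (rows are true labels, columns are predicted labels) is:

# Sketch: read false positives / false negatives off the 2x2 confusion matrix
tn, fp, fn, tp = confusion_matrix(test_labels, rf_predictions).ravel()
print(f'False positives: {fp}, false negatives: {fn}')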

fi_model = pd.DataFrame({'feature': features,
                         'importance': model.feature_importances_}).\
           sort_values('importance', ascending=False)
fi_model.head(10)