GitHub Repository: suyashi29/python-su
Path: blob/master/ML/Notebook/Random Forest.ipynb
Kernel: Python 3

Random Forest

The random forest algorithm is a supervised classification algorithm. As the name suggests, it creates a "forest" from a number of decision trees.

  • Random forest is an ensemble classification algorithm. An ensemble is a group of classifiers: instead of relying on a single classifier to predict the target, an ensemble combines the predictions of multiple classifiers.

  • In the case of a random forest, the ensemble members are randomly built decision trees. Each decision tree is an individual classifier, and the target prediction is based on majority voting.

  • Every tree votes for one of the target classes, and the class that receives the most votes is taken as the final predicted class (see the voting sketch after this list).

  • A random forest is a meta-estimator that fits a number of decision tree classifiers on different sub-samples of the dataset and averages their predictions to improve predictive accuracy and control over-fitting.

  • Each sub-sample is the same size as the original input sample, but the samples are drawn with replacement when bootstrap=True (the default).

    • estimators_: the collection of fitted sub-estimators, exposed as a list of DecisionTreeClassifier objects.
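
A minimal sketch of the voting idea on a small synthetic dataset (the dataset and parameter values are illustrative assumptions, not part of the case study below): each fitted tree in estimators_ makes its own prediction, and the forest's prediction corresponds to the majority vote. Note that scikit-learn actually averages the trees' predicted probabilities, which usually agrees with the majority vote.

# Sketch: inspect how individual trees vote and how the forest aggregates them
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=500, n_features=4, random_state=0)
forest = RandomForestClassifier(n_estimators=25, random_state=0).fit(X, y)

sample = X[:1]                                           # one observation
votes = [tree.predict(sample)[0] for tree in forest.estimators_]
print('Votes per class     :', np.bincount(np.array(votes).astype(int)))
print('Forest prediction   :', forest.predict(sample))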


Why Random Forest?

  • The same random forest algorithm can be used for both classification and regression tasks (see the regression sketch after this list).

  • Random forests are relatively robust to missing values, although scikit-learn's implementation still requires missing values to be imputed before fitting (as we do later in this notebook).

  • Adding more trees to the forest does not make the model overfit more; additional trees mainly reduce the variance of the predictions.

  • Random forests can also be applied to categorical features, once they are suitably encoded.
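
As a hedged illustration of the regression case, scikit-learn provides RandomForestRegressor with an interface analogous to the classifier; the synthetic data below is an assumption for demonstration only.

# Sketch: the same forest idea applied to a regression target
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

X, y = make_regression(n_samples=500, n_features=4, noise=10.0, random_state=0)
reg = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)
print(reg.predict(X[:3]))          # predictions for the first three rows
print(round(reg.score(X, y), 3))   # R^2 on the training data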

# A simple example to understand random forest
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=1000, n_features=4,
                           n_informative=2, n_redundant=0,
                           random_state=0, shuffle=False)
clf = RandomForestClassifier(n_estimators=100, max_depth=2, random_state=0)
clf.fit(X, y)
print(clf.feature_importances_)
print(clf.predict([[0, 0, 0, 0]]))
[0.14205973 0.76664038 0.0282433  0.06305659]
[1]

A Case Study: Predicting the health status of an individual based on lifestyle & socio-economic behaviour

The dataset was collected by the Centers for Disease Control and Prevention

Importing necessary libraries

import pandas as pd
import numpy as np

# Set random seed to ensure reproducible runs
RSEED = 50
d = pd.read_csv('F:\\ML & Data Visualization\\health\\2015.csv').sample(100000, random_state=RSEED)
d.head()
# Descriptive statistics for each column
d.describe()
# Label distribution
d['_RFHLTH'] = d['_RFHLTH'].replace({2: 0})
d = d.loc[d['_RFHLTH'].isin([0, 1])].copy()
d = d.rename(columns={'_RFHLTH': 'label'})
d['label'].value_counts()
1.0    81140
0.0    18579
Name: label, dtype: int64

The labels are imbalanced (about 81% of the examples fall in class 1.0), so accuracy is not the best metric: a classifier that always predicts the majority class would already be roughly 81% accurate. Later in the notebook we rely on ROC AUC instead.

# Drop other health-related columns that overlap with the label
d = d.drop(columns=['POORHLTH', 'PHYSHLTH', 'GENHLTH', 'PAINACT2', 'QLMENTL2',
                    'QLSTRES2', 'QLHLTH2', 'HLTHPLN1', 'MENTHLTH'])
from sklearn.model_selection import train_test_split

# Extract the labels
labels = np.array(d.pop('label'))

# 30% of examples in the test data
train, test, train_labels, test_labels = train_test_split(d, labels,
                                                           stratify=labels,
                                                           test_size=0.3,
                                                           random_state=RSEED)
# Imputation of missing values
# We'll fill in the missing values with the mean of each column.
# Note that missing values in the test set are filled with the means of the
# *training* columns: if we get new data, we will only have the training data
# to impute from.
train = train.fillna(train.mean())
test = test.fillna(test.mean())

# Features for feature importances
features = list(train.columns)
train.shape
(69803, 320)
test.shape
(29916, 320)
# Decision Tree on Real Data
# First, we'll train a decision tree on the data.
# Let's leave the depth unlimited and see if we get overfitting!
from sklearn.tree import DecisionTreeClassifier

# Make a decision tree and train it
tree = DecisionTreeClassifier(random_state=RSEED)
tree.fit(train, train_labels)
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-12-d548f4b75d42> in <module>()
      7 # Make a decision tree and train
      8 tree = DecisionTreeClassifier(random_state=RSEED)
----> 9 tree.fit(train, train_labels)

C:\Users\HP\Anaconda3\lib\site-packages\sklearn\tree\tree.py in fit(self, X, y, sample_weight, check_input, X_idx_sorted)
    799             sample_weight=sample_weight,
    800             check_input=check_input,
--> 801             X_idx_sorted=X_idx_sorted)
    802         return self
    803

C:\Users\HP\Anaconda3\lib\site-packages\sklearn\tree\tree.py in fit(self, X, y, sample_weight, check_input, X_idx_sorted)
    114         random_state = check_random_state(self.random_state)
    115         if check_input:
--> 116             X = check_array(X, dtype=DTYPE, accept_sparse="csc")
    117             y = check_array(y, ensure_2d=False, dtype=None)
    118             if issparse(X):

C:\Users\HP\Anaconda3\lib\site-packages\sklearn\utils\validation.py in check_array(array, accept_sparse, accept_large_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, ensure_min_samples, ensure_min_features, warn_on_dtype, estimator)
    525     try:
    526         warnings.simplefilter('error', ComplexWarning)
--> 527         array = np.asarray(array, dtype=dtype, order=order)
    528     except ComplexWarning:
    529         raise ValueError("Complex data not supported\n"

C:\Users\HP\Anaconda3\lib\site-packages\numpy\core\numeric.py in asarray(a, dtype, order)
    499
    500     """
--> 501     return array(a, dtype, copy=False, order=order)
    502
    503

ValueError: could not convert string to float: "b'12022015'"
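
The traceback shows that fit failed because at least one column contains non-numeric values (here a date-like field stored as the byte string b'12022015'). The notebook continues as if the fit had succeeded; a hedged fix, assuming we are content to discard the offending columns, would be to keep only the numeric columns before fitting:

# Sketch: keep only numeric columns so the tree/forest can be fitted.
# Dropping the non-numeric columns is an assumption about how the author
# intended to handle them; parsing or encoding them would be an alternative.
train = train.select_dtypes(include=[np.number])
test = test[train.columns]           # keep the test set aligned with train
features = list(train.columns)       # refresh the feature list

tree = DecisionTreeClassifier(random_state=RSEED)
tree.fit(train, train_labels)
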
# Make probability predictions
train_probs = tree.predict_proba(train)[:, 1]
probs = tree.predict_proba(test)[:, 1]

train_predictions = tree.predict(train)
predictions = tree.predict(test)

from sklearn.metrics import precision_score, recall_score, roc_auc_score, roc_curve

print(f'Train ROC AUC Score: {roc_auc_score(train_labels, train_probs)}')
print(f'Test ROC AUC Score: {roc_auc_score(test_labels, probs)}')
## Confusion Matrix
from sklearn.metrics import confusion_matrix
import matplotlib.pyplot as plt
import itertools

def plot_confusion_matrix(cm, classes, normalize=False,
                          title='Confusion matrix', cmap=plt.cm.Oranges):
    """
    This function prints and plots the confusion matrix.
    Normalization can be applied by setting `normalize=True`.
    Source: http://scikit-learn.org/stable/auto_examples/model_selection/plot_confusion_matrix.html
    """
    if normalize:
        cm = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]
        print("Normalized confusion matrix")
    else:
        print('Confusion matrix, without normalization')
    print(cm)

    plt.figure(figsize=(10, 10))
    plt.imshow(cm, interpolation='nearest', cmap=cmap)
    plt.title(title, size=24)
    plt.colorbar(aspect=4)
    tick_marks = np.arange(len(classes))
    plt.xticks(tick_marks, classes, rotation=45, size=14)
    plt.yticks(tick_marks, classes, size=14)

    fmt = '.2f' if normalize else 'd'
    thresh = cm.max() / 2.

    # Labeling the plot
    for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):
        plt.text(j, i, format(cm[i, j], fmt), fontsize=20,
                 horizontalalignment="center",
                 color="white" if cm[i, j] > thresh else "black")

    plt.grid(None)
    plt.tight_layout()
    plt.ylabel('True label', size=18)
    plt.xlabel('Predicted label', size=18)
cm = confusion_matrix(test_labels, predictions)
plot_confusion_matrix(cm, classes=['Poor Health', 'Good Health'],
                      title='Health Confusion Matrix')
# Feature Importances
# Finally, we can look at the features the decision tree considers most important.
# The values are computed by summing the reduction in Gini impurity over all of
# the nodes of the tree in which the feature is used.
fi = pd.DataFrame({'feature': features,
                   'importance': tree.feature_importances_}).\
     sort_values('importance', ascending=False)
fi.head()
# Visualize the full tree
from sklearn.tree import export_graphviz
from subprocess import call
from IPython.display import Image

# Save tree as dot file
export_graphviz(tree, 'tree_real_data.dot', rounded=True,
                feature_names=features, max_depth=6,
                class_names=['poor health', 'good health'], filled=True)

# Convert to png
call(['dot', '-Tpng', 'tree_real_data.dot', '-o', 'tree_real_data.png', '-Gdpi=200'])

# Visualize
Image(filename='tree_real_data.png')

We can see that our model is extremely deep and has many nodes. To reduce the variance of our model, we could limit the maximum depth or the number of leaf nodes. Another method to reduce the variance is to use more trees, each one trained on a random sampling of the observations. This is where the random forest comes into play.
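
As a brief hedged illustration of the first option, we can constrain the single tree directly; the max_depth and max_leaf_nodes values below are illustrative assumptions, not tuned choices.

# Sketch: reduce variance by limiting the capacity of a single tree
short_tree = DecisionTreeClassifier(max_depth=6, max_leaf_nodes=50,
                                    random_state=RSEED)
short_tree.fit(train, train_labels)
print(f'Nodes: {short_tree.tree_.node_count}, depth: {short_tree.tree_.max_depth}')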

Random Forest

from sklearn.ensemble import RandomForestClassifier

# Create the model with 100 trees
model = RandomForestClassifier(n_estimators=100,
                               random_state=RSEED,
                               max_features='sqrt',
                               n_jobs=-1, verbose=1)

# Fit on training data
model.fit(train, train_labels)
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-13-f6d156bc568e> in <module>()
      8
      9 # Fit on training data
---> 10 model.fit(train, train_labels)

C:\Users\HP\Anaconda3\lib\site-packages\sklearn\ensemble\forest.py in fit(self, X, y, sample_weight)
    248
    249         # Validate or convert input data
--> 250         X = check_array(X, accept_sparse="csc", dtype=DTYPE)
    251         y = check_array(y, accept_sparse='csc', ensure_2d=False, dtype=None)
    252         if sample_weight is not None:

C:\Users\HP\Anaconda3\lib\site-packages\sklearn\utils\validation.py in check_array(array, accept_sparse, accept_large_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, ensure_min_samples, ensure_min_features, warn_on_dtype, estimator)
    525     try:
    526         warnings.simplefilter('error', ComplexWarning)
--> 527         array = np.asarray(array, dtype=dtype, order=order)
    528     except ComplexWarning:
    529         raise ValueError("Complex data not supported\n"

C:\Users\HP\Anaconda3\lib\site-packages\numpy\core\numeric.py in asarray(a, dtype, order)
    499
    500     """
--> 501     return array(a, dtype, copy=False, order=order)
    502
    503

ValueError: could not convert string to float: "b'12022015'"
# We can see how many nodes there are for each tree on average
# and the maximum depth of each tree.
# There are 100 trees in the forest.
n_nodes = []
max_depths = []

for ind_tree in model.estimators_:
    n_nodes.append(ind_tree.tree_.node_count)
    max_depths.append(ind_tree.tree_.max_depth)

print(f'Average number of nodes {int(np.mean(n_nodes))}')
print(f'Average maximum depth {int(np.mean(max_depths))}')

We see that each decision tree in the forest has many nodes and is extremely deep. However, even though each individual decision tree may overfit to a particular subset of the training data, the idea is that the overall random forest should have a reduced variance.

# Results
train_rf_predictions = model.predict(train)
train_rf_probs = model.predict_proba(train)[:, 1]

rf_predictions = model.predict(test)
rf_probs = model.predict_proba(test)[:, 1]
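
The helper evaluate_model called below is not defined anywhere in this section of the notebook. A minimal sketch of such a helper, assuming it is meant to compare train and test metrics (the exact metrics and output format used by the author are unknown), might look like this; it relies on the global train_labels and test_labels defined earlier.

# Sketch of an evaluate_model helper (assumed behaviour, not the author's code)
from sklearn.metrics import precision_score, recall_score, roc_auc_score

def evaluate_model(predictions, probs, train_predictions, train_probs):
    """Print recall, precision and ROC AUC for the train and test sets."""
    results = {
        'recall': recall_score(test_labels, predictions),
        'precision': precision_score(test_labels, predictions),
        'roc': roc_auc_score(test_labels, probs),
    }
    train_results = {
        'recall': recall_score(train_labels, train_predictions),
        'precision': precision_score(train_labels, train_predictions),
        'roc': roc_auc_score(train_labels, train_probs),
    }
    for metric in ['recall', 'precision', 'roc']:
        print(f'{metric} -- train: {train_results[metric]:.3f} '
              f'test: {results[metric]:.3f}')
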
evaluate_model(rf_predictions, rf_probs, train_rf_predictions, train_rf_probs)

The model still achieves perfect measures on the training data, but this time, the testing scores are much better. If we compare the ROC AUC, we see that the random forest does significantly better than a single decision tree.

cm = confusion_matrix(test_labels, rf_predictions)
plot_confusion_matrix(cm, classes=['Poor Health', 'Good Health'],
                      title='Health Confusion Matrix')

Compared to the single decision tree, the model has fewer false positives, although more false negatives. Overall, the random forest does significantly better than a single decision tree.
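
If we want the raw counts behind this comparison, a small sketch using scikit-learn's convention (rows are true labels, columns are predicted labels) is:

# Sketch: read false positives / false negatives off the 2x2 confusion matrix
tn, fp, fn, tp = confusion_matrix(test_labels, rf_predictions).ravel()
print(f'False positives: {fp}, false negatives: {fn}')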

fi_model = pd.DataFrame({'feature': features,
                         'importance': model.feature_importances_}).\
           sort_values('importance', ascending=False)
fi_model.head(10)