GitHub Repository: AllenDowney/bayesian-analysis-recipes
Path: blob/master/incubator/multiclass-logistic-regression-cover-type.ipynb
Kernel: bayesian
import pymc3 as pm
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import patsy
from sklearn.preprocessing import normalize, MinMaxScaler

%load_ext autoreload
%autoreload 2
%matplotlib inline
%config InlineBackend.figure_format = 'retina'

Introduction

Multi-class Bayesian Logistic Regression

The task at hand is to predict forest cover type from cartographic variables. See https://archive.ics.uci.edu/ml/datasets/covertype for more details.

The purpose of this notebook is to give an overview of how to do multi-class logistic regression using PyMC3.

Data Pre-Processing

Firstly, let's read in the data.

df = pd.read_csv("../datasets/covtype_preprocess.csv", index_col=0)
df.shape
df["Cover_Type"] = df["Cover_Type"].apply(lambda x: str(x))

Target Variables

Firstly, let's get the target variable out as a one-hot, multi-class table.

output_col = "Cover_Type" input_cols = [c for c in df.columns if c != output_col] input_formula = "".join(c + " + " for c in input_cols) input_formula = input_formula + "-1" import patsy from sklearn.preprocessing import scale, normalize X = patsy.dmatrix(formula_like=input_formula, data=df, return_type="dataframe") # X = normalize(X) Y = patsy.dmatrix(formula_like="Cover_Type -1", data=df, return_type="dataframe") print(X.shape, Y.shape)

Class Imbalance

Is there class imbalance in the dataset? Let's check this to see if we need to do some downsampling.

Y.sum(axis=0)

Yes, there is class imbalance: class 4 has roughly 100X fewer samples than classes 1 and 2, and roughly 10X fewer than classes 6 and 7. We need to downsample every class to the size of class 4.
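
If you prefer to read that minimum off programmatically rather than by eye, here is a minimal sketch using the one-hot Y table defined above (the 2747 figure used below is assumed to be this minimum):

# Samples per one-hot class column; the smallest count is the size
# every class will be downsampled to.
class_counts = Y.sum(axis=0)
n_minority = int(class_counts.min())
print(n_minority)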

Downsampling

We will downsample each class to 2747 datapoints (the size of the smallest class), and scale the inputs to the [0, 1] range using scikit-learn's MinMaxScaler from the sklearn.preprocessing module.

downsampled_targets = []
for i in range(1, 7 + 1):
    # print(f'target[{i}]')
    target = Y[Y[f"Cover_Type[{i}]"] == 1]
    # print(len(target))
    downsampled_targets.append(target.sample(2747))

# Scale the inputs to the [0, 1] range, then keep only the downsampled rows.
mms = MinMaxScaler()
X = pm.floatX(mms.fit_transform(df[input_cols]))
Y_downsamp = pd.concat(downsampled_targets)
X_downsamp = X[Y_downsamp.index]

Data Sanity Checks

Let's now check that the downsampled classes are indeed of the same size.

Y_downsamp.sum()
plt.pcolor(Y_downsamp)
X_downsamp.shape
Y_downsamp.shape
print(X.shape[1], Y.shape[1])

Also, let's visualize the distribution of the data.

Missing Data

Firstly, checking for missing values.

import missingno as msno

msno.matrix(pd.DataFrame(X_downsamp))

Min/Max Distribution

Next up, checking the distribution of min/max values.

# plt.pcolor(X_downsamp)
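
As a lighter-weight alternative to the heatmap, a minimal sketch (assuming X_downsamp is the MinMax-scaled array from the downsampling step) is to confirm that every column lies within [0, 1]:

# Per-column minima and maxima of the scaled inputs;
# both should fall inside [0, 1] if MinMaxScaler was applied correctly.
print("smallest column minimum:", X_downsamp.min(axis=0).min())
print("largest column maximum:", X_downsamp.max(axis=0).max())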

I am satisfied that the data have been scaled correctly and are in the correct shape. The caveat of correlations between columns still remains, but I won't deal with it for now, as I just want to get multi-class logistic regression going first.
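
For anyone who does want to eyeball that caveat, a small sketch (not part of the original analysis) is to plot the column correlation matrix of the scaled inputs:

# Pairwise correlations between input columns; strong off-diagonal
# values indicate the correlated columns mentioned above.
corr = pd.DataFrame(X_downsamp).corr()
plt.pcolor(corr, cmap="RdBu", vmin=-1, vmax=1)
plt.colorbar()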

Model Construction

I have chosen to do Bayesian logistic regression. In the original work, neural networks (NNs) were used. Because of the increased modelling capacity of NNs, I expect that they will perform better than Bayesian LR.

Nonetheless, as this is an exercise in implementing simple Bayesian models for others to use as a recipe for their own analyses, I will focus on model construction and critique, and not on model comparison. Thus, no cross-validation.

Logit Function

With that said, let's move on to the model. Firstly, we define the logit function (strictly speaking, this is the inverse logit, i.e. the sigmoid):

$$\text{logit}(X) = \frac{1}{1 + e^{-X}}$$
def logit(X):
    return 1 / (1 + np.exp(-X))

A quick test that the logit function will indeed work with an array of numbers:

logit(np.array([1, 2, 1, 3]))

Model Specification

Now, we implement the model:

# Bayesian multi-class logistic regression to predict the target classes.
import theano.tensor as tt

with pm.Model() as model:
    weights_testval = np.random.randn(X_downsamp.shape[1], Y_downsamp.shape[1])
    # One weight per (input column, class) pair, and one intercept per class.
    weights = pm.Normal(
        "weights", 0.0, 100.0, shape=(X_downsamp.shape[1], Y_downsamp.shape[1])
    )
    intercept = pm.Normal("intercept", 0.0, 100.0, shape=Y_downsamp.shape[1])
    exponent = tt.dot(X_downsamp, weights) + intercept
    p = logit(exponent)
    like = pm.Multinomial("likelihood", n=1, p=p, observed=np.asarray(Y_downsamp))

Some quick checks, with thanks to @junpenglao for providing this.

print(model.logp(model.test_point))
for RV in model.basic_RVs:
    print(RV.name, RV.logp(model.test_point))
with model:
    trace = pm.sample(50000, start=pm.find_MAP(), step=pm.Metropolis())
    # trace = pm.sample(2000, step=pm.Metropolis())  # NaN occurs in optimization

Traces

Visualize the traces to check for convergence in sampling.

pm.traceplot(trace)
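
Alongside the visual check, a numerical summary of the posterior can be helpful. This is a sketch using PyMC3's built-in summary; note that the Gelman-Rubin statistic is only meaningful when more than one chain is sampled:

# Posterior means, standard deviations, and credible intervals
# for the weights and intercepts.
pm.summary(trace)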

Interpretation

Sampling looks pretty good: there are no trends in the intercepts or weights. In fact, including 7 intercept terms (one for each class) pulled the weights closer to zero (shrinkage, though not in the Bayesian hierarchical sense), which I think is a good sign.
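
To see that shrinkage more directly, one option (a sketch, not part of the original analysis) is a forest plot of the posterior intervals; depending on your PyMC3 version this may instead be pm.plot_forest(trace, var_names=["intercept"]):

# Posterior credible intervals for the per-class intercepts; values
# clustered near zero are consistent with the shrinkage described above.
pm.forestplot(trace, varnames=["intercept"])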

Model Evaluation

We will use posterior predictive checks to sample new data from the posterior distributions of the weights and intercepts.

Sample PPC

with model:
    ppc_samps = pm.sample_ppc(trace, samples=100)
ppc_samps["likelihood"].shape

Because this is multi-class classification, we take the class with the highest predicted probability as the predicted label.

probs = ppc_samps["likelihood"].mean(axis=0)
ppc_preds = probs == np.max(probs, axis=1, keepdims=True)
probs
plt.pcolor(probs)
plt.pcolor(ppc_preds)
# from scikitplot.plotters import plot_confusion_matrix
from sklearn.metrics import classification_report

Classification Report

print(classification_report(Y_downsamp, ppc_preds))
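
The commented-out scikit-plot import above hints at a confusion matrix; here is a minimal sketch with plain scikit-learn, assuming Y_downsamp and ppc_preds as defined above:

from sklearn.metrics import confusion_matrix

# Convert the one-hot truths and predictions to integer class labels,
# then tabulate how often each class is confused with another.
y_true = np.asarray(Y_downsamp).argmax(axis=1)
y_pred = ppc_preds.argmax(axis=1)
print(confusion_matrix(y_true, y_pred))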