GitHub Repository: AllenDowney/bayesian-analysis-recipes
Path: blob/master/incubator/multiclass-logistic-regression-cover-type.ipynb
Kernel: bayesian
import pymc3 as pm
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import patsy
from sklearn.preprocessing import normalize, MinMaxScaler

%load_ext autoreload
%autoreload 2
%matplotlib inline
%config InlineBackend.figure_format = 'retina'

Introduction

Multi-class Bayesian Logistic Regression

The task at hand is to predict forest cover type from cartographic variables. See https://archive.ics.uci.edu/ml/datasets/covertype for more details.

The purpose of this notebook is to give an overview of how to do multi-class logistic regression using PyMC3.

Data Pre-Processing

Firstly, let's read in the data.

df = pd.read_csv("../datasets/covtype_preprocess.csv", index_col=0)
df.shape
df["Cover_Type"] = df["Cover_Type"].apply(lambda x: str(x))

Target Variables

Firstly, let's get the target variable out as a one-hot, multi-class table.

output_col = "Cover_Type" input_cols = [c for c in df.columns if c != output_col] input_formula = "".join(c + " + " for c in input_cols) input_formula = input_formula + "-1" import patsy from sklearn.preprocessing import scale, normalize X = patsy.dmatrix(formula_like=input_formula, data=df, return_type="dataframe") # X = normalize(X) Y = patsy.dmatrix(formula_like="Cover_Type -1", data=df, return_type="dataframe") print(X.shape, Y.shape)

Class Imbalance

Is there class imbalance in the dataset? Let's check this to see if we need to do some downsampling.

Y.sum(axis=0)

Yes, there is class imbalance: class 4 has roughly 100X fewer samples than classes 1 and 2, and roughly 10X fewer than classes 6 and 7. We need to downsample every class to the size of class 4.
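
If you prefer to read that minimum off programmatically rather than by eye, here is a minimal sketch using the one-hot Y table defined above (the 2747 figure used below is assumed to be this minimum):

# Samples per one-hot class column; the smallest count is the size
# every class will be downsampled to.
class_counts = Y.sum(axis=0)
n_minority = int(class_counts.min())
print(n_minority)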

Downsampling

We will downsample each class to 2747 datapoints (the size of the smallest class), and scale the inputs to the [0, 1] range using scikit-learn's MinMaxScaler from the sklearn.preprocessing module.

downsampled_targets = []
for i in range(1, 7 + 1):
    # print(f'target[{i}]')
    target = Y[Y[f"Cover_Type[{i}]"] == 1]
    # print(len(target))
    downsampled_targets.append(target.sample(2747))

# Scale the inputs to the [0, 1] range, then keep only the downsampled rows.
mms = MinMaxScaler()
X = pm.floatX(mms.fit_transform(df[input_cols]))
Y_downsamp = pd.concat(downsampled_targets)
X_downsamp = X[Y_downsamp.index]

Data Sanity Checks

Let's now check that the downsampled classes are indeed of the same size.

Y_downsamp.sum()
plt.pcolor(Y_downsamp)
X_downsamp.shape
Y_downsamp.shape
print(X.shape[1], Y.shape[1])

Also, let's visualize the distribution of the data.

Missing Data

Firstly, checking for missing values.

import missingno as msno

msno.matrix(pd.DataFrame(X_downsamp))

Min/Max Distribution

Next up, checking the distribution of min/max values.

# plt.pcolor(X_downsamp)
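
As a lighter-weight alternative to the heatmap, a minimal sketch (assuming X_downsamp is the MinMax-scaled array from the downsampling step) is to confirm that every column lies within [0, 1]:

# Per-column minima and maxima of the scaled inputs;
# both should fall inside [0, 1] if MinMaxScaler was applied correctly.
print("smallest column minimum:", X_downsamp.min(axis=0).min())
print("largest column maximum:", X_downsamp.max(axis=0).max())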

I am satisfied that the data have been scaled correctly and are in the correct shape. The caveat of correlations between columns still remains, but I won't deal with it for now, as I just want to get multi-class logistic regression going first.
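
For anyone who does want to eyeball that caveat, a small sketch (not part of the original analysis) is to plot the column correlation matrix of the scaled inputs:

# Pairwise correlations between input columns; strong off-diagonal
# values indicate the correlated columns mentioned above.
corr = pd.DataFrame(X_downsamp).corr()
plt.pcolor(corr, cmap="RdBu", vmin=-1, vmax=1)
plt.colorbar()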

Model Construction

I have chosen to do Bayesian logistic regression. In the original work, neural networks (NNs) were used. Because of the increased modelling capacity of NNs, I expect that they will perform better than Bayesian LR.

Nonetheless, as this is an exercise in implementing simple Bayesian models for others to use as a recipe for their own analyses, I will focus on model construction and critique, and not on model comparison. Thus, no cross-validation.

Logit Function

With that said, let's move on to the model. Firstly, we define the logit function (strictly speaking, this is the inverse logit, i.e. the sigmoid):

$$\text{logit}(X) = \frac{1}{1 + e^{-X}}$$
def logit(X):
    return 1 / (1 + np.exp(-X))

A quick test that the logit function will indeed work with an array of numbers:

logit(np.array([1, 2, 1, 3]))

Model Specification

Now, we implement the model:

# Bayesian multi-class logistic regression to predict the target classes.
import theano.tensor as tt

with pm.Model() as model:
    weights_testval = np.random.randn(X_downsamp.shape[1], Y_downsamp.shape[1])
    # One weight per (input column, class) pair, and one intercept per class.
    weights = pm.Normal(
        "weights", 0.0, 100.0, shape=(X_downsamp.shape[1], Y_downsamp.shape[1])
    )
    intercept = pm.Normal("intercept", 0.0, 100.0, shape=Y_downsamp.shape[1])
    exponent = tt.dot(X_downsamp, weights) + intercept
    p = logit(exponent)
    like = pm.Multinomial("likelihood", n=1, p=p, observed=np.asarray(Y_downsamp))

Some quick checks, with thanks to @junpenglao for providing this.

print(model.logp(model.test_point))
for RV in model.basic_RVs:
    print(RV.name, RV.logp(model.test_point))
with model:
    trace = pm.sample(50000, start=pm.find_MAP(), step=pm.Metropolis())
    # trace = pm.sample(2000, step=pm.Metropolis())  # NaN occurs in optimization

Traces

Visualize the traces to check for convergence in sampling.

pm.traceplot(trace)
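
Alongside the visual check, a numerical summary of the posterior can be helpful. This is a sketch using PyMC3's built-in summary; note that the Gelman-Rubin statistic is only meaningful when more than one chain is sampled:

# Posterior means, standard deviations, and credible intervals
# for the weights and intercepts.
pm.summary(trace)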

Interpretation

Sampling looks pretty good: there are no trends in the intercepts or weights. In fact, including 7 intercept terms (one for each class) pulled the weights closer to zero (shrinkage, though not in the Bayesian hierarchical sense), which I think is a good sign.
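
To see that shrinkage more directly, one option (a sketch, not part of the original analysis) is a forest plot of the posterior intervals; depending on your PyMC3 version this may instead be pm.plot_forest(trace, var_names=["intercept"]):

# Posterior credible intervals for the per-class intercepts; values
# clustered near zero are consistent with the shrinkage described above.
pm.forestplot(trace, varnames=["intercept"])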

Model Evaluation

We will use posterior predictive checks to sample new data from the posterior distributions of the weights and intercepts.

Sample PPC

with model:
    ppc_samps = pm.sample_ppc(trace, samples=100)
ppc_samps["likelihood"].shape

Because this is multi-class classification, we take the class with the highest predicted probability as the predicted label.

probs = ppc_samps["likelihood"].mean(axis=0)
ppc_preds = probs == np.max(probs, axis=1, keepdims=True)
probs
plt.pcolor(probs)
plt.pcolor(ppc_preds)
# from scikitplot.plotters import plot_confusion_matrix
from sklearn.metrics import classification_report

Classification Report

print(classification_report(Y_downsamp, ppc_preds))
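
The commented-out scikit-plot import above hints at a confusion matrix; here is a minimal sketch with plain scikit-learn, assuming Y_downsamp and ppc_preds as defined above:

from sklearn.metrics import confusion_matrix

# Convert the one-hot truths and predictions to integer class labels,
# then tabulate how often each class is confused with another.
y_true = np.asarray(Y_downsamp).argmax(axis=1)
y_pred = ppc_preds.argmax(axis=1)
print(confusion_matrix(y_true, y_pred))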