Multi-class Bayesian Logistic Regression
Introduction
The task at hand is to predict forest cover type from cartographic variables. See https://archive.ics.uci.edu/ml/datasets/covertype for more details.
The purpose of this notebook is primarily to give an overview of how to do multi-class logistic regression using PyMC3.
Data Pre-Processing
Firstly, let's read in the data.
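The loading code isn't reproduced here; a minimal sketch, assuming the raw covtype.data file from the UCI repository has been downloaded locally (the filename is a hypothetical local path):

```python
import pandas as pd

# Hypothetical local filename; the UCI file ships without a header row.
df = pd.read_csv("covtype.data", header=None)
df.shape
```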
Target Variables
Firstly, let's get the target variables out as a multi-class table.
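A sketch of one way to do this, assuming (as in the UCI file) that the last column holds the integer cover type (1-7) and that df comes from the cell above:

```python
import pandas as pd

# One-hot encode the integer cover type into a multi-class target table.
targets = pd.get_dummies(df.iloc[:, -1], prefix="target")
targets.head()
```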
Class Imbalance
Is there class imbalance in the dataset? Let's check this to see if we need to do some downsampling.
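A quick way to check, building on the targets table from above:

```python
# Number of datapoints per class, smallest first.
class_counts = targets.sum(axis=0)
class_counts.sort_values()
```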
Yes, there is class imbalance. Class 4 has roughly 100x fewer datapoints than classes 1 and 2, and roughly 10x fewer than classes 6 and 7. We need to downsample the other classes to that size.
Downsampling
We will downsample the data to just 2747 datapoints per class, and normalize the data using scikit-learn's normalize function from the sklearn.preprocessing module.
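A minimal sketch of one way to do this; the grouping column, random seed, and variable names X and Y are my assumptions:

```python
import pandas as pd
from sklearn.preprocessing import normalize

# Downsample every class to the size of the smallest one (2747 rows),
# then L2-normalize each feature row (normalize's default behaviour).
n_smallest = int(class_counts.min())
downsampled = (
    df.groupby(df.iloc[:, -1], group_keys=False)
      .apply(lambda grp: grp.sample(n=n_smallest, random_state=42))
)
X = normalize(downsampled.iloc[:, :-1].values)
Y = pd.get_dummies(downsampled.iloc[:, -1]).values
```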
Data Sanity Checks
Let's now check that the downsampled classes are indeed of the same shape.
Also, let's visualize the distribution of data.
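A sketch of both checks, using the downsampled data from above (the plotting choices are mine):

```python
import matplotlib.pyplot as plt

# Every class should now have the same number of rows.
print(downsampled.iloc[:, -1].value_counts())

# Eyeball the distribution of the normalized feature values.
plt.hist(X.flatten(), bins=50)
plt.xlabel("normalized feature value")
plt.ylabel("count")
plt.show()
```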
Missing Data
First, let's check for missing values.
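For example, on the downsampled feature matrix:

```python
import numpy as np

# Count missing values; zero means nothing is missing.
np.isnan(X).sum()
```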
Min/Max Distribution
Next, let's check the distribution of min/max values.
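For example, per-column summaries of the normalized features:

```python
import pandas as pd

# Per-column minima and maxima after normalization.
pd.DataFrame(X).describe().loc[["min", "max"]]
```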
I am satisfied that the data have been normalized correctly, and are in the correct shape. The caveats of correlations between columns still remain, but I won't deal with them for now, as I just want to get multi-class logistic regression going first.
Model Construction
I have chosen to do Bayesian logistic regression. In the original work, neural networks (NNs) were used. Because of the increased modelling capacity of NNs, I expect that they will perform better than Bayesian LR.
Nonetheless, as this is an exercise in implementing simple Bayesian models for others to use as a recipe for their own analyses, I will focus on model construction and critique, and not on model comparison. Thus, no cross-validation.
Logit Function
With that said, let's move onto the model. Firstly, we define the logit function:
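The exact definition isn't reproduced here; a minimal sketch, assuming what's meant is the elementwise inverse logit (sigmoid), written with theano tensors so it also works inside the PyMC3 model (the name invlogit is mine):

```python
import theano.tensor as tt

def invlogit(x):
    # Elementwise inverse logit (sigmoid): maps real values into (0, 1).
    return 1 / (1 + tt.exp(-x))
```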
A quick test that the logit function will indeed work with an array of numbers:
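For instance, using the sketch above:

```python
import numpy as np

# Should broadcast elementwise over an array of real numbers.
invlogit(np.array([-5.0, -1.0, 0.0, 1.0, 5.0])).eval()
```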
Model Specification
Now, we implement the model:
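The priors and likelihood used in the notebook aren't reproduced here; below is a minimal sketch of one way to set this up, treating each class column as an independent Bernoulli (one-vs-rest) with Normal priors, and reusing X, Y, and invlogit from the cells above. The prior scales and sampler settings are assumptions:

```python
import pymc3 as pm

n_features = X.shape[1]
n_classes = Y.shape[1]

with pm.Model() as model:
    # One weight vector and one intercept per class (assumed priors).
    weights = pm.Normal("weights", mu=0, sd=10, shape=(n_features, n_classes))
    intercepts = pm.Normal("intercepts", mu=0, sd=10, shape=n_classes)

    # Linear predictor pushed through the inverse logit gives per-class probabilities.
    p = invlogit(pm.math.dot(X, weights) + intercepts)

    # Each class column is modelled as an independent Bernoulli.
    likelihood = pm.Bernoulli("likelihood", p=p, observed=Y)

    trace = pm.sample(2000, tune=1000)
```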
Some quick checks, with thanks to @junpenglao for providing this.
Traces
Visualize the traces to check for convergence in sampling.
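A sketch of the usual way to do this with PyMC3's built-in plotting:

```python
import pymc3 as pm

# Trace plots for all free parameters; flat, well-mixed chains with
# no drift suggest the sampler has converged.
pm.traceplot(trace)
```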
Interpretation
Sampling looks pretty good: there are no trends in the intercepts or weights. In fact, putting in 7 intercept terms (one for each class) shrank the weights closer to zero (not shrinkage in the Bayesian hierarchical sense), which I think is a good sign.
Model Evaluation
We will use posterior predictive checks to sample new data from the posterior distributions of the weights and intercepts.
Sample PPC
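A sketch of drawing posterior predictive samples (on newer PyMC3 releases the call is pm.sample_posterior_predictive; the number of samples is an assumption):

```python
import pymc3 as pm

with model:
    # Replicated datasets drawn from the posterior predictive distribution.
    ppc = pm.sample_ppc(trace, samples=500)

# Average over posterior samples to estimate per-row, per-class probabilities.
mean_probs = ppc["likelihood"].mean(axis=0)
```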
Because this is multi-class classification, let's take the class with the highest predicted probability as the predicted label.
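Continuing the sketch above:

```python
import numpy as np

# Predicted label = class with the highest posterior predictive probability.
predicted_class = np.argmax(mean_probs, axis=1)
true_class = np.argmax(Y, axis=1)

# Simple accuracy check on the (downsampled) training data.
(predicted_class == true_class).mean()
```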