5.4 Support Vector Machines for binary classification
Support Vector Machines (SVM) are a type of supervised machine learning model. Similar to other machine learning techniques based on regression, training an SVM classifier uses examples with known outcomes and involves optimizing some measure of performance. The resulting classifier can then be applied to classify data with unknown outcomes.
In this notebook, we will demonstrate the process of training an SVM for binary classification using linear and quadratic optimization models. Our implementation will initially focus on linear support vector machines which separate the feature space by means of a hyperplane. We will explore both primal and dual formulations. Then, using kernels, the dual formulation is extended to binary classification in higher-order and nonlinear feature spaces. Several different formulations of the optimization problem are given in Pyomo and applied to a banknote classification application.
Preamble: Install Pyomo and a solver
This cell selects and verifies a global SOLVER for the notebook. If run on Google Colab, the cell installs Pyomo and ipopt, then sets SOLVER to use the ipopt solver. If run elsewhere, it assumes Pyomo and the Mosek solver have been previously installed and sets SOLVER to use the Mosek solver via the Pyomo SolverFactory. It then verifies that SOLVER is available. For linear problems, the solver HiGHS is imported and used.
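A minimal sketch of this setup, assuming Pyomo, ipopt, and HiGHS are already installed locally (the original cell adds Colab-specific installation steps and can substitute Mosek when available):

```python
import pyomo.environ as pyo

# quadratic problems in this notebook: ipopt (or Mosek, if installed)
SOLVER = pyo.SolverFactory("ipopt")

# linear problems: HiGHS through Pyomo's appsi interface
SOLVER_LP = pyo.SolverFactory("appsi_highs")

assert SOLVER.available(), "ipopt is not available"
assert SOLVER_LP.available(), "HiGHS is not available"
```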
Binary classification
Binary classifiers are functions designed to answer questions such as "does this medical test indicate disease?", "will this specific customer enjoy that specific movie?", "does this photo include a car?", or "is this banknote genuine or counterfeit?" These questions are answered based on the values of "features" that may include physical measurements or other types of data collected from a representative data set with known outcomes.
In this notebook we consider a binary classifier that might be installed in a vending machine to detect banknotes. The goal of the device is to accurately identify and accept genuine banknotes while rejecting counterfeit ones. The classifier's performance can be assessed using the definitions in the following table, where "positive" refers to an instance of a genuine banknote.
| | Predicted Positive | Predicted Negative |
| :-- | :--: | :--: |
| Actual Positive | True Positive (TP) | False Negative (FN) |
| Actual Negative | False Positive (FP) | True Negative (TN) |
A vending machine user would be frustrated if a genuine banknote is incorrectly rejected as a false negative. Sensitivity is defined as the number of true positives (TP) divided by the total number of actual positives (TP + FN). A user of the vending machine would prefer high sensitivity because that means genuine banknotes are likely to be accepted.
The vending machine owner/operator, on the other hand, wants to avoid accepting counterfeit banknotes and would therefore prefer a low number of false positives (FP). Precision is the number of true positives (TP) divided by the total number of predicted positives (TP + FP). The owner/operator would prefer high precision because that means almost all the accepted notes are genuine.
To achieve high sensitivity, a classifier can follow the "innocent until proven guilty" standard, rejecting banknotes only when certain they are counterfeit. To achieve high precision, a classifier can adopt the "guilty unless proven innocent" standard, rejecting banknotes unless absolutely certain they are genuine.
The challenge in developing binary classifiers is to balance these conflicting objectives and to optimize performance from both perspectives at the same time.
The data set
The following data set contains measurements from a collection of known genuine and known counterfeit banknote specimens. The data is taken from https://archive.ics.uci.edu/ml/datasets/banknote+authentication and includes four continuous statistical measures obtained from the wavelet transform of banknote images named "variance", "skewness", "curtosis", and "entropy", and a binary variable named "class" which is 0 if genuine and 1 if counterfeit.
Read data
Using built-in pandas functionalities, we can get a quick overview of the data set.
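A sketch of one way to load and summarize the data, assuming it is read directly from the UCI repository (the exact URL and column names are assumptions; the notebook may read a local copy instead):

```python
import pandas as pd

url = (
    "https://archive.ics.uci.edu/ml/machine-learning-databases/00267/"
    "data_banknote_authentication.txt"
)
columns = ["variance", "skewness", "curtosis", "entropy", "class"]
df = pd.read_csv(url, names=columns)

# quick overview of the data set
print(df.describe())
print(df["class"].value_counts())
```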
Select features and training sets
We divide the data set into a training set for training the classifier, and a testing set for evaluating the performance of the trained classifier. In addition, we select a two dimensional subset of the features so that the results can be plotted for better exposition. Since our definition of a positive outcome corresponds to detecting a genuine banknote, the "class" feature is scaled to have values of 1 for genuine banknotes and -1 for counterfeit banknotes.
The following cell defines a function scatter that produces a 2D scatter plot of labeled features. The function assigns default labels and colors, and otherwise passes along other keyword arguments.
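A sketch of these steps under the assumptions above: two features ("variance" and "skewness"), labels rescaled to ±1, an 80/20 train/test split, and a scatter helper. Variable names such as df, X_train, and y_train are choices made for these examples, not necessarily those of the original notebook.

```python
import matplotlib.pyplot as plt

features = ["variance", "skewness"]

# +1 for genuine banknotes (class == 0), -1 for counterfeit (class == 1)
df["y"] = 1 - 2 * df["class"]

# hold out a test set for validating the trained classifiers
df_train = df.sample(frac=0.8, random_state=0)
df_test = df.drop(df_train.index)
X_train, y_train = df_train[features], df_train["y"]
X_test, y_test = df_test[features], df_test["y"]

def scatter(X, y, labels=("genuine", "counterfeit"), colors=("g", "r"), **kwargs):
    """2D scatter plot of labeled features; extra kwargs go to plt.scatter."""
    fig, ax = plt.subplots()
    for label, color, value in zip(labels, colors, (1, -1)):
        idx = y == value
        ax.scatter(X.loc[idx, X.columns[0]], X.loc[idx, X.columns[1]],
                   c=color, alpha=0.5, label=label, **kwargs)
    ax.set_xlabel(X.columns[0])
    ax.set_ylabel(X.columns[1])
    ax.legend()
    return ax

scatter(X_train, y_train);
```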
Support vector machines (SVM)
Linear SVM classifier
A linear support vector machine (SVM) is a binary classification method that employs a linear equation to determine class assignment. The basic formula is expressed as

$$\hat{y}(x) = \text{sgn}\left(w^\top x + b\right),$$

where $x \in \mathbb{R}^p$ is a point in "feature" space. Here $w \in \mathbb{R}^p$ represents a set of coefficients, $w^\top x$ is the dot product, and $b$ is a scalar coefficient. The hyperplane defined by $w$ and $b$ separates the feature space into two classes. Points on one side of the hyperplane have a positive outcome (+1), while points on the other side have a negative outcome (-1).
The following cell presents a simple Python implementation of a linear SVM. An instance of LinearSVM is defined with a coefficient vector $w$ and a scalar $b$. In this implementation, all data and parameters are provided as Pandas Series or DataFrame objects, and the Pandas .dot() function is used to compute the dot product.
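A minimal sketch of such a class; the interface of the notebook's own implementation may differ.

```python
import numpy as np
import pandas as pd

class LinearSVM:
    def __init__(self, w, b):
        self.w = pd.Series(w)   # coefficients indexed by feature name
        self.b = float(b)       # scalar bias

    def __call__(self, X):
        """Return +1/-1 predictions for the rows of the DataFrame X."""
        return np.sign(X.dot(self.w) + self.b)

    def __repr__(self):
        return f"LinearSVM(w={self.w.to_dict()}, b={self.b})"
```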
A visual inspection of the banknote training set shows the two-dimensional feature set can be approximately split along a vertical axis where "variance" is zero. Most of the positive outcomes are on the right of the axis, most of the negative outcomes on the left. Since $w$ is a vector normal to this separating line, we choose $w$ aligned with the "variance" axis and set $b = 0$.
The code cell below evaluates the accuracy of the linear SVM by calculating the accuracy score, which is the fraction of samples that were predicted correctly.
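A sketch of that evaluation on the held-out test set, assuming the LinearSVM class and the data split from the cells above, with $w$ aligned to the "variance" axis and $b = 0$:

```python
# hand-picked classifier: split the feature space at variance = 0
svm = LinearSVM(w={"variance": 1, "skewness": 0}, b=0)

y_pred = svm(X_test)
accuracy = (y_pred == y_test).mean()
print(f"accuracy = {accuracy:.3f}")
```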
Performance metrics
The accuracy score alone is not always a reliable metric for evaluating the performance of binary classifiers. For instance, when one outcome is significantly more frequent than the other, a classifier that always predicts the more common outcome without regard to the feature vector can still achieve a high accuracy score. Moreover, in many applications, the consequences of a false positive can differ from those of a false negative. For these reasons, we seek a more comprehensive set of metrics to compare binary classifiers. A detailed discussion of this topic recommends the Matthews correlation coefficient (MCC) as a reliable performance measure for binary classifiers.
The code below demonstrates an example of a function that evaluates the performance of a binary classifier and returns the Matthews correlation coefficient as its output. The function validate calculates and displays the sensitivity, precision, and Matthews correlation coefficient (MCC) for a binary classifier based on its true labels (y_true) and predicted labels (y_pred).
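A sketch of such a function, computing the metrics directly from the confusion-matrix counts (the notebook's validate function may differ in formatting and return values):

```python
import numpy as np

def validate(y_true, y_pred, verbose=True):
    """Print sensitivity, precision, and MCC; return the MCC."""
    tp = sum((y_true == 1) & (y_pred == 1))
    tn = sum((y_true == -1) & (y_pred == -1))
    fp = sum((y_true == -1) & (y_pred == 1))
    fn = sum((y_true == 1) & (y_pred == -1))

    sensitivity = tp / (tp + fn)
    precision = tp / (tp + fp)
    mcc = (tp * tn - fp * fn) / np.sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)
    )
    if verbose:
        print(f"sensitivity = {sensitivity:.3f}")
        print(f"precision   = {precision:.3f}")
        print(f"MCC         = {mcc:.3f}")
    return mcc

validate(y_test, y_pred)
```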
Linear optimization model
A training or validation set consists of $n$ observations $(x_i, y_i)$ where $x_i \in \mathbb{R}^p$ is a feature vector and $y_i \in \{-1, +1\}$ is its label, for $i = 1, \dots, n$. The training task is to find coefficients $w$ and $b$ that achieve high sensitivity and high precision for the validation set. All points are successfully classified if

$$y_i\left(w^\top x_i + b\right) > 0 \qquad i = 1, \dots, n.$$

As written, this condition imposes no scale for $w$ or $b$ (that is, if the condition is satisfied for a pair $(w, b)$, then it is also satisfied for $(\gamma w, \gamma b)$ for any $\gamma > 0$). To remove the ambiguity, a modified condition for correctly classified points is given by

$$y_i\left(w^\top x_i + b\right) \geq 1 \qquad i = 1, \dots, n,$$

which defines a hard-margin classifier. The size of the margin is determined by the scale of $w$ and $b$.
In practice, it is not always possible to find $w$ and $b$ that perfectly separate all of the data. The condition for a hard-margin classifier is therefore relaxed by introducing non-negative decision variables $z_i \geq 0$ where

$$y_i\left(w^\top x_i + b\right) \geq 1 - z_i \qquad i = 1, \dots, n.$$

The variables $z_i$ measure the distance of a misclassified point from the separating hyperplane. An equivalent notation is to rearrange this expression as

$$z_i = \max\left(0,\, 1 - y_i(w^\top x_i + b)\right),$$

which is the hinge-loss function. The training problem is formulated as minimizing the hinge-loss function over all the data samples:

$$\min_{w, b}\ \frac{1}{n}\sum_{i=1}^n \max\left(0,\, 1 - y_i(w^\top x_i + b)\right).$$
Practice has shown that minimizing this term alone produces classifiers with large entries for $w$ that perform poorly on new data samples. For that reason, regularization adds a term to penalize the magnitude of $w$. In most formulations a norm of $w$ is used for regularization, commonly a sum of squares such as $\|w\|_2^2$. Another choice is $\|w\|_1$ which, similar to Lasso regression, may result in a sparse weighting vector indicating which elements of the feature vector can be neglected for classification purposes. These considerations result in the objective function

$$\min_{w, b}\ \frac{1}{n}\sum_{i=1}^n \max\left(0,\, 1 - y_i(w^\top x_i + b)\right) + \lambda \|w\|_1,$$

where $\lambda \geq 0$ weights the regularization term. The needed weights are a solution to the following linear optimization problem:

$$
\begin{align*}
\min_{w, b, z}\quad & \frac{1}{n}\sum_{i=1}^n z_i + \lambda \|w\|_1 \\
\text{s.t.}\quad & z_i \geq 1 - y_i(w^\top x_i + b) && i = 1, \dots, n \\
& z_i \geq 0 && i = 1, \dots, n.
\end{align*}
$$

This is the primal optimization problem in the decision variables $w$, $b$, and $z$, a total of $p + 1 + n$ unknowns with $n$ constraints. It can be recast as a linear optimization problem with the usual technique of setting $w = w^+ - w^-$, where $w^+$ and $w^-$ are non-negative. Then

$$\|w\|_1 = \sum_{j=1}^p \left(w_j^+ + w_j^-\right),$$

which is linear in the decision variables.
Pyomo implementation
The Pyomo implementation is a factory function. The function accepts a set of training data, creates and solves a Pyomo ConcreteModel for $w$ and $b$, then returns a trained LinearSVM object that can be applied to other feature data.
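A sketch of such a factory function for the linear program above, assuming the SOLVER_LP object, LinearSVM class, and validate function from earlier cells; the function name and the regularization parameter lambd are assumptions.

```python
import pandas as pd
import pyomo.environ as pyo

def svm_fit_lp(X, y, lambd=1.0):
    """Train a LinearSVM by solving the L1-regularized hinge-loss LP."""
    n, p = X.shape
    m = pyo.ConcreteModel("Linear SVM (LP)")
    m.N = pyo.RangeSet(0, n - 1)
    m.P = pyo.Set(initialize=list(X.columns))

    # split w into non-negative parts so |w_j| = wp_j + wm_j at the optimum
    m.wp = pyo.Var(m.P, domain=pyo.NonNegativeReals)
    m.wm = pyo.Var(m.P, domain=pyo.NonNegativeReals)
    m.b = pyo.Var()
    m.z = pyo.Var(m.N, domain=pyo.NonNegativeReals)   # hinge-loss variables

    @m.Constraint(m.N)
    def hinge_loss(m, i):
        xi = X.iloc[i]
        return m.z[i] >= 1 - y.iloc[i] * (
            sum(xi[j] * (m.wp[j] - m.wm[j]) for j in m.P) + m.b
        )

    @m.Objective(sense=pyo.minimize)
    def objective(m):
        return sum(m.z[i] for i in m.N) / n + lambd * sum(
            m.wp[j] + m.wm[j] for j in m.P
        )

    SOLVER_LP.solve(m)
    w = pd.Series({j: m.wp[j]() - m.wm[j]() for j in m.P})
    return LinearSVM(w, m.b())

svm_lp = svm_fit_lp(X_train, y_train, lambd=1.0)
validate(y_test, svm_lp(X_test))
```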
Quadratic optimization model
Primal form
The standard formulation of a linear support vector machine uses training sets with $p$-element feature vectors $x_i \in \mathbb{R}^p$ along with classification labels $y_i \in \{-1, +1\}$ for those vectors. A classifier is defined by two parameters: a weight vector $w \in \mathbb{R}^p$ and a bias term $b$.

If a separating hyperplane exists, then we choose $w$ and $b$ so that a hard-margin classifier exists for the training set where

$$y_i\left(w^\top x_i + b\right) \geq 1 \qquad i = 1, \dots, n.$$

This can always be done if a separating hyperplane exists. But if a separating hyperplane does not exist, we introduce non-negative slack variables $z_i$ to relax the constraints and settle for a soft-margin classifier

$$y_i\left(w^\top x_i + b\right) \geq 1 - z_i, \quad z_i \geq 0 \qquad i = 1, \dots, n.$$

The training objective is to minimize the total distance to misclassified data points. This leads to the optimization problem

$$
\begin{align*}
\min_{w, b, z}\quad & \frac{1}{2}\|w\|_2^2 + \frac{c}{n}\sum_{i=1}^n z_i \\
\text{s.t.}\quad & y_i(w^\top x_i + b) \geq 1 - z_i && i = 1, \dots, n \\
& z_i \geq 0 && i = 1, \dots, n,
\end{align*}
$$

where the term $\frac{1}{2}\|w\|_2^2$ is included to regularize the solution for $w$. Choosing larger values of $c$ will reduce the number and size of misclassifications. The trade-off will be larger weights and the accompanying risk of over-fitting the training data.
The following cell creates a support vector machine (SVM) classifier by solving the quadratic optimization model for the given training data.
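A sketch of the primal quadratic program, again as a factory function returning a LinearSVM; it assumes the SOLVER (ipopt) object and the helpers defined above, and c is the regularization parameter from the formulation.

```python
import pandas as pd
import pyomo.environ as pyo

def svm_fit_qp(X, y, c=1.0):
    """Train a LinearSVM by solving the primal soft-margin QP."""
    n, p = X.shape
    m = pyo.ConcreteModel("Linear SVM (primal QP)")
    m.N = pyo.RangeSet(0, n - 1)
    m.P = pyo.Set(initialize=list(X.columns))

    m.w = pyo.Var(m.P)
    m.b = pyo.Var()
    m.z = pyo.Var(m.N, domain=pyo.NonNegativeReals)   # slack variables

    @m.Constraint(m.N)
    def soft_margin(m, i):
        xi = X.iloc[i]
        return y.iloc[i] * (sum(xi[j] * m.w[j] for j in m.P) + m.b) >= 1 - m.z[i]

    @m.Objective(sense=pyo.minimize)
    def objective(m):
        return 0.5 * sum(m.w[j] ** 2 for j in m.P) + (c / n) * sum(m.z[i] for i in m.N)

    SOLVER.solve(m)
    return LinearSVM(pd.Series({j: m.w[j]() for j in m.P}), m.b())

svm_qp = svm_fit_qp(X_train, y_train, c=1.0)
validate(y_test, svm_qp(X_test))
```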
Dual Formulation
The dual formulation for the SVM provides insight into how a linear SVM works and is essential for extending SVM to nonlinear classification. The dual formulation begins by creating a differentiable Lagrangian with dual variables $\alpha_i \geq 0$ and $\beta_i \geq 0$ for $i = 1, \dots, n$. The task is to find saddle points of

$$\mathcal{L} = \frac{1}{2}\|w\|_2^2 + \frac{c}{n}\sum_{i=1}^n z_i + \sum_{i=1}^n \alpha_i\left(1 - z_i - y_i(w^\top x_i + b)\right) - \sum_{i=1}^n \beta_i z_i.$$

Taking derivatives with respect to the primal variables,

$$
\begin{align*}
\frac{\partial \mathcal{L}}{\partial w} &= w - \sum_{i=1}^n \alpha_i y_i x_i = 0, \\
\frac{\partial \mathcal{L}}{\partial b} &= -\sum_{i=1}^n \alpha_i y_i = 0, \\
\frac{\partial \mathcal{L}}{\partial z_i} &= \frac{c}{n} - \alpha_i - \beta_i = 0 && i = 1, \dots, n.
\end{align*}
$$

This can be arranged in the form of a standard quadratic optimization problem in the variables $\alpha_i$ for $i = 1, \dots, n$:

$$
\begin{align*}
\max_{\alpha}\quad & \sum_{i=1}^n \alpha_i - \frac{1}{2}\sum_{i=1}^n\sum_{j=1}^n \alpha_i \alpha_j y_i y_j\, x_i^\top x_j \\
\text{s.t.}\quad & \sum_{i=1}^n \alpha_i y_i = 0 \\
& 0 \leq \alpha_i \leq \frac{c}{n} && i = 1, \dots, n.
\end{align*}
$$
The symmetric $n \times n$ Gram matrix is defined as

$$G = X X^\top,$$

where each entry is the dot product of two feature vectors, $G_{ij} = x_i^\top x_j$.

Compared to the primal, the dual formulation appears to have reduced the number of decision variables from $p + 1 + n$ to $n$. But this has come with the penalty of introducing a dense matrix with $n^2$ coefficients and the associated storage and processing costs. For large training sets this becomes a prohibitively expensive calculation. In addition, the Gram matrix will be rank deficient for cases where $n > p$.
We can eliminate the need to compute and store the full Gram matrix by introducing the $n \times p$ matrix $F$ with entries

$$F_{ij} = y_i x_{ij} \qquad i = 1, \dots, n, \quad j = 1, \dots, p.$$

Then $w = F^\top \alpha$, which brings the primal variables back into the computational problem. The optimization problem becomes

$$
\begin{align*}
\max_{\alpha, w}\quad & \sum_{i=1}^n \alpha_i - \frac{1}{2} w^\top w \\
\text{s.t.}\quad & w = F^\top \alpha \\
& \sum_{i=1}^n \alpha_i y_i = 0 \\
& 0 \leq \alpha_i \leq \frac{c}{n} && i = 1, \dots, n.
\end{align*}
$$

The solution for the bias term $b$ is obtained by considering the complementarity conditions on the dual variables. The slack variable $z_i$ is zero if $\beta_i > 0$, which is equivalent to $\alpha_i < \frac{c}{n}$. If, in addition, $\alpha_i > 0$, then the margin constraint is active and $y_i(w^\top x_i + b) = 1$. Putting these facts together gives a formula for $b$:

$$b = y_i - w^\top x_i \qquad \text{for any } i \text{ such that } 0 < \alpha_i < \tfrac{c}{n},$$

which in practice is averaged over all such points.
This model is implemented below.
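A sketch of the dual formulation with the $F$-matrix reformulation, assuming SOLVER, LinearSVM, and validate from earlier cells; the tolerance used to identify points strictly inside the bounds is an assumption.

```python
import numpy as np
import pandas as pd
import pyomo.environ as pyo

def svm_fit_dual(X, y, c=1.0, tol=1e-6):
    """Train a LinearSVM by solving the dual QP with w = F' alpha."""
    n, p = X.shape
    F = X.mul(y, axis=0)                    # F[i, j] = y_i * x_ij

    m = pyo.ConcreteModel("Linear SVM (dual QP)")
    m.N = pyo.RangeSet(0, n - 1)
    m.P = pyo.Set(initialize=list(X.columns))

    m.a = pyo.Var(m.N, bounds=(0, c / n))   # dual variables alpha_i
    m.w = pyo.Var(m.P)                      # primal weights recovered from alpha

    @m.Constraint(m.P)
    def weights(m, j):
        return m.w[j] == sum(F.iloc[i][j] * m.a[i] for i in m.N)

    @m.Constraint()
    def balance(m):
        return sum(y.iloc[i] * m.a[i] for i in m.N) == 0

    @m.Objective(sense=pyo.maximize)
    def objective(m):
        return sum(m.a[i] for i in m.N) - 0.5 * sum(m.w[j] ** 2 for j in m.P)

    SOLVER.solve(m)

    w = pd.Series({j: m.w[j]() for j in m.P})
    a = np.array([m.a[i]() for i in m.N])

    # bias from points with 0 < alpha_i < c/n (assumes at least one exists)
    inside = np.where((a > tol) & (a < c / n - tol))[0]
    b = np.mean([y.iloc[i] - X.iloc[i].dot(w) for i in inside])
    return LinearSVM(w, b)

svm_dual = svm_fit_dual(X_train, y_train, c=1.0)
validate(y_test, svm_dual(X_test))
```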
Kernelized SVM
Nonlinear feature spaces
A linear SVM assumes the existence of a linear hyperplane that separates labeled sets of data points. Frequently, however, this is not possible and some sort of nonlinear method is needed.
Consider a binary classification given by a function

$$\hat{y}(x) = \text{sgn}\left(w^\top \phi(x) + b\right),$$

where $\phi$ is a function mapping $x$ into a higher dimensional "feature space". That is, $\phi: \mathbb{R}^p \rightarrow \mathbb{R}^d$ where $d \geq p$. The additional dimensions may include features such as powers of the terms in $x$, or products of those terms, or other types of nonlinear transformations. As before, we wish to find a choice for $w \in \mathbb{R}^d$ and $b$ such that the soft-margin classifier satisfies

$$y_i\left(w^\top \phi(x_i) + b\right) \geq 1 - z_i, \quad z_i \geq 0 \qquad i = 1, \dots, n.$$
Using the same machinery as before, we set up the Lagrangian

$$\mathcal{L} = \frac{1}{2}\|w\|_2^2 + \frac{c}{n}\sum_{i=1}^n z_i + \sum_{i=1}^n \alpha_i\left(1 - z_i - y_i(w^\top \phi(x_i) + b)\right) - \sum_{i=1}^n \beta_i z_i,$$

then take derivatives to find

$$
\begin{align*}
w &= \sum_{i=1}^n \alpha_i y_i \phi(x_i), \\
\sum_{i=1}^n \alpha_i y_i &= 0, \\
\alpha_i + \beta_i &= \frac{c}{n} && i = 1, \dots, n.
\end{align*}
$$

This is similar to the case of a linear SVM, but now the vector of weights $w$ lives in the higher dimensional space of nonlinear features. Working through the algebra, we are once again left with a quadratic optimization problem in the variables $\alpha_i$ for $i = 1, \dots, n$:

$$
\begin{align*}
\max_{\alpha}\quad & \sum_{i=1}^n \alpha_i - \frac{1}{2}\sum_{i=1}^n\sum_{j=1}^n \alpha_i \alpha_j y_i y_j\, \phi(x_i)^\top \phi(x_j) \\
\text{s.t.}\quad & \sum_{i=1}^n \alpha_i y_i = 0 \\
& 0 \leq \alpha_i \leq \frac{c}{n} && i = 1, \dots, n,
\end{align*}
$$

where the resulting classifier is given by

$$\hat{y}(x) = \text{sgn}\left(\sum_{i=1}^n \alpha_i y_i\, \phi(x_i)^\top \phi(x) + b\right).$$
The kernel trick
This is an interesting situation: the separating hyperplane is embedded in a high dimensional space of nonlinear features determined by the mapping $\phi$, but all we need for computation are the inner products $\phi(x_i)^\top \phi(x_j)$ to train the classifier, and the inner products $\phi(x_i)^\top \phi(x)$ to use the classifier. If we had a function $K(x, z)$ that returned the value $\phi(x)^\top \phi(z)$, then we would never need to actually compute $\phi(x)$, $\phi(z)$, or their inner product.
Mercer's theorem turns the analysis on its head by specifying the conditions under which a function $K(x, z)$ can be expressed as an inner product $\phi(x)^\top \phi(z)$ for some mapping $\phi$. If $K$ is symmetric (i.e., $K(x, z) = K(z, x)$), and if the Gram matrix constructed for any collection of points $x_1, x_2, \dots, x_n$,

$$G = \begin{bmatrix} K(x_1, x_1) & \cdots & K(x_1, x_n) \\ \vdots & \ddots & \vdots \\ K(x_n, x_1) & \cdots & K(x_n, x_n) \end{bmatrix},$$

is positive semi-definite, then there is some $\phi$ for which $K(x, z) = \phi(x)^\top \phi(z)$. We call such functions kernels. The practical consequence is that we can train and implement nonlinear classifiers using a kernel without ever needing to compute the higher dimensional features. This remarkable result is called the "kernel trick".
Implementation
To take advantage of the kernel trick, we assume an appropriate kernel $K(x, z)$ has been identified, then replace all instances of the inner product $\phi(x_i)^\top \phi(x_j)$ with the kernel. The "kernelized" SVM is given by a solution to

$$
\begin{align*}
\max_{\alpha}\quad & \sum_{i=1}^n \alpha_i - \frac{1}{2}\sum_{i=1}^n\sum_{j=1}^n \alpha_i \alpha_j y_i y_j\, K(x_i, x_j) \\
\text{s.t.}\quad & \sum_{i=1}^n \alpha_i y_i = 0 \\
& 0 \leq \alpha_i \leq \frac{c}{n} && i = 1, \dots, n,
\end{align*}
$$

where the resulting classifier is given by

$$\hat{y}(x) = \text{sgn}\left(\sum_{i=1}^n \alpha_i y_i\, K(x_i, x) + b\right).$$
We define the symmetric positive semi-definite Gram matrix

$$G_{ij} = K(x_i, x_j) \qquad i, j = 1, \dots, n.$$

We factor $G = F F^\top$ where $F$ has dimensions $n \times r$ and where $r$ is the rank of $G$. The factorization is not unique. As demonstrated in the Python code below, one suitable factorization is the spectral factorization $G = U \Lambda U^\top$, where $\Lambda$ is an $r \times r$ diagonal matrix of the non-zero eigenvalues and $U$ is an $n \times r$ matrix with orthonormal columns such that $U^\top U = I_r$. Then

$$F = U \Lambda^{1/2}.$$
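A short sketch of this factorization using numpy's symmetric eigendecomposition; the function name and the eigenvalue threshold tol are assumptions.

```python
import numpy as np

def factor_gram(G, tol=1e-8):
    """Return F such that G ≈ F @ F.T, discarding eigenvalues below tol."""
    eigvals, eigvecs = np.linalg.eigh(G)                 # G is symmetric
    keep = eigvals > tol * eigvals.max()
    return eigvecs[:, keep] * np.sqrt(eigvals[keep])     # F = U Lambda^{1/2}

# quick check on a random positive semi-definite matrix
A = np.random.rand(5, 3)
G = A @ A.T
F = factor_gram(G)
print(np.allclose(G, F @ F.T))
```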
Once this factorization is complete, the optimization problem for the kernelized SVM takes the same form as the dual formulation of the linear SVM:

$$
\begin{align*}
\max_{\alpha, u}\quad & \sum_{i=1}^n \alpha_i - \frac{1}{2} u^\top u \\
\text{s.t.}\quad & u_k = \sum_{i=1}^n F_{ik}\, y_i \alpha_i && k = 1, \dots, r \\
& \sum_{i=1}^n \alpha_i y_i = 0 \\
& 0 \leq \alpha_i \leq \frac{c}{n} && i = 1, \dots, n.
\end{align*}
$$

The result is a quadratic optimization model for the dual coefficients $\alpha_i$ and the auxiliary variables $u_k$, $k = 1, \dots, r$.

Summarizing, the essential difference between training the linear and kernelized SVM is the need to compute and factor the Gram matrix. The result is a set of non-zero coefficients $\alpha_i$ that define the set of support vectors $\mathcal{SV} = \{x_i : \alpha_i > 0\}$. The classifier is then given by

$$\hat{y}(x) = \text{sgn}\left(\sum_{i\,:\,x_i \in \mathcal{SV}} \alpha_i y_i\, K(x_i, x) + b\right).$$
The implementation of the kernelized SVM is split into two parts. The first part is a class used to create instances of the classifier.
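A minimal sketch of such a class, assuming the training routine supplies the support vectors, their labels, the corresponding dual coefficients, the bias, and the kernel function (names are assumptions):

```python
import numpy as np

class KernelSVM:
    def __init__(self, X_sv, y_sv, a_sv, b, kernel):
        self.X_sv = np.asarray(X_sv, dtype=float)   # support vectors (rows)
        self.y_sv = np.asarray(y_sv, dtype=float)   # labels of support vectors
        self.a_sv = np.asarray(a_sv, dtype=float)   # non-zero dual coefficients
        self.b = float(b)
        self.kernel = kernel

    def __call__(self, X):
        """Return +1/-1 predictions for the rows of X."""
        X = np.asarray(X, dtype=float)
        K = np.array([[self.kernel(u, v) for v in self.X_sv] for u in X])
        return np.sign(K @ (self.a_sv * self.y_sv) + self.b)
```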
The second part of the implementation is a factory function containing the optimization model for training an SVM. Given training data and a kernel function, the factory returns an instance of a kernelized SVM. The default kernel function is a linear kernel. An additional optional argument is the scalar tol, which is the tolerance for the eigenvalue threshold.
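A sketch of such a factory function, assuming SOLVER, KernelSVM, and factor_gram from the cells above; it follows the factored dual problem stated earlier, with a default linear kernel and the eigenvalue tolerance tol.

```python
import numpy as np
import pyomo.environ as pyo

def kernel_svm_fit(X, y, c=1.0, kernel=lambda u, v: u @ v, tol=1e-8):
    """Train a KernelSVM by solving the factored dual QP."""
    X, y = np.asarray(X, dtype=float), np.asarray(y, dtype=float)
    n = len(y)

    # Gram matrix and its low-rank factor G ≈ F F'
    G = np.array([[kernel(u, v) for v in X] for u in X])
    F = factor_gram(G, tol)
    r = F.shape[1]

    m = pyo.ConcreteModel("Kernelized SVM (dual QP)")
    m.N = pyo.RangeSet(0, n - 1)
    m.R = pyo.RangeSet(0, r - 1)

    m.a = pyo.Var(m.N, bounds=(0, c / n))   # dual variables alpha_i
    m.u = pyo.Var(m.R)                      # auxiliary variables u = F' diag(y) alpha

    @m.Constraint(m.R)
    def factor(m, k):
        return m.u[k] == sum(F[i, k] * y[i] * m.a[i] for i in m.N)

    @m.Constraint()
    def balance(m):
        return sum(y[i] * m.a[i] for i in m.N) == 0

    @m.Objective(sense=pyo.maximize)
    def objective(m):
        return sum(m.a[i] for i in m.N) - 0.5 * sum(m.u[k] ** 2 for k in m.R)

    SOLVER.solve(m)

    a = np.array([m.a[i]() for i in m.N])
    sv = a > tol                                    # support vectors
    inside = np.where(sv & (a < c / n - tol))[0]    # margin points determine b
    b = np.mean([y[i] - G[i] @ (a * y) for i in inside])
    return KernelSVM(X[sv], y[sv], a[sv], b, kernel)
```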
Linear kernel
For comparison with the previous cases, the first kernel we consider is a linear kernel

$$K(x, z) = x^\top z,$$

which should reproduce the results obtained earlier.
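A usage sketch with the default linear kernel, assuming the kernel_svm_fit and validate helpers above:

```python
svm_linear = kernel_svm_fit(X_train, y_train, c=1.0)
validate(y_test, svm_linear(X_test))
```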
Polynomial kernels
A polynomial kernel of order $d$ has the form

$$K(x, z) = \left(1 + x^\top z\right)^d.$$
The following cell demonstrates a quadratic kernel applied to the banknote data.
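A usage sketch of a degree-2 polynomial (quadratic) kernel, again assuming the helpers defined above:

```python
svm_quadratic = kernel_svm_fit(
    X_train, y_train, c=1.0, kernel=lambda u, v: (1 + u @ v) ** 2
)
validate(y_test, svm_quadratic(X_test))
```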