2.2 Least Absolute Deviation (LAD) Regression
Linear regression is a supervised machine learning technique that dates back to at least the 19th century. It remains a cornerstone of modern data analysis, generating a linear model that predicts the values of a dependent variable based on one or more independent variables. This notebook introduces an alternative approach to traditional linear regression, employing linear optimization to fit the model according to the Least Absolute Deviation (LAD) criterion.
Unlike standard techniques that aim to minimize the sum of squared errors, this LAD-based method minimizes the sum of absolute differences between observed and estimated values. This corresponds to measuring the errors in the $L_1$ norm, which is known for its robustness against outliers. The methodology presented here closely follows the survey paper by Subhash Narula and John Wellington.
Preamble: Install Pyomo and a solver
The following cell sets up and verifies a global solver for the notebook. If run on Google Colab, the cell installs Pyomo and the HiGHS solver; if run elsewhere, it assumes Pyomo and HiGHS have been previously installed. It then selects HiGHS as the solver via Pyomo's appsi module and performs a test to verify that the solver is available. The solver interface is stored in a global object SOLVER for later use.
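A minimal sketch of such a preamble is shown below. It assumes the HiGHS bindings are distributed as the highspy package and that Pyomo exposes HiGHS under the appsi_highs solver name.

```python
import sys
import subprocess

if "google.colab" in sys.modules:
    # on Colab, install Pyomo and the HiGHS bindings on the fly
    subprocess.run(
        [sys.executable, "-m", "pip", "install", "-q", "pyomo", "highspy"],
        check=True,
    )

import pyomo.environ as pyo

# select HiGHS via Pyomo's appsi interface and verify it is available
SOLVER = pyo.SolverFactory("appsi_highs")
assert SOLVER.available(), f"Solver {SOLVER} is not available."
```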
Generate and visualize data
The Python scikit-learn library for machine learning provides a full-featured collection of tools for regression. The following cell uses make_regression from scikit-learn to generate a synthetic dataset for use in subsequent cells. The data consists of a numpy array y containing n_samples values of one dependent variable $y$, and an array X containing n_samples observations of n_features independent explanatory variables.
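A sketch of such a data-generation cell is given below; the sample size, noise level, and random seed are illustrative assumptions rather than values prescribed by the text.

```python
from sklearn.datasets import make_regression

n_features = 1   # number of explanatory variables
n_samples = 500  # number of observations (assumed value)

# X has shape (n_samples, n_features); y has shape (n_samples,)
X, y = make_regression(
    n_samples=n_samples,
    n_features=n_features,
    noise=30,           # std. dev. of the additive Gaussian noise (assumed value)
    random_state=2023,  # seed for reproducibility (assumed value)
)
```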
Before going further, it is generally useful to prepare an initial visualization of the data. The following cell presents a scatter plot of $y$ versus $x$ for the special case of one explanatory variable, and a histogram of the deviations of $y$ from its mean value $\bar{y}$. This histogram will provide a reference against which to compare the residual errors in $y$ after regression.
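A possible implementation of this visualization, assuming the arrays X and y from the previous cell, could look as follows.

```python
import matplotlib.pyplot as plt

fig, (ax0, ax1) = plt.subplots(1, 2, figsize=(10, 4))

# scatter plot of y versus x (single-feature case)
ax0.scatter(X[:, 0], y, alpha=0.5)
ax0.set_xlabel("x")
ax0.set_ylabel("y")

# histogram of the deviations of y from its mean
ax1.hist(y - y.mean(), bins=30)
ax1.set_xlabel("y - mean(y)")
ax1.set_ylabel("count")

plt.tight_layout()
plt.show()
```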
Model
Suppose that we have a finite dataset consisting of $n$ points $\{(X^{(i)}, y^{(i)})\}_{i=1,\dots,n}$, with $X^{(i)} \in \mathbb{R}^k$ and $y^{(i)} \in \mathbb{R}$. A linear regression model assumes the relationship between the vector $X^{(i)}$ of regressors and the dependent variable $y^{(i)}$ is linear. This relationship is modeled through an error or deviation term $e_i$, which quantifies how much each data point diverges from the model prediction and is defined as follows:

$$
e_i := y^{(i)} - m^\top X^{(i)} - b = y^{(i)} - \sum_{j=1}^k m_j X^{(i)}_j - b,
$$

for some real numbers $m_1, \dots, m_k$ and $b$.
The Least Absolute Deviation (LAD) is a possible statistical optimality criterion for such a linear regression. Similarly to the well-known least-squares technique, it attempts to find a vector of linear coefficients $m$ and an intercept $b$ so that the model closely approximates the given set of data. The method minimizes the sum of absolute errors, that is, $\sum_{i=1}^n |e_i|$.
The LAD regression is formulated as an optimization problem with the intercept $b$, the coefficients $m_j$'s, and the errors $e_i$'s as decision variables, namely

$$
\begin{align}
\min \quad & \sum_{i=1}^n |e_i| \\
\text{s.t.} \quad & e_i = y^{(i)} - m^\top X^{(i)} - b, & \forall\, i = 1, \dots, n.
\end{align}
$$
In general, the appearance of an absolute value term indicates the problem is nonlinear and, worse, that the objective function is not differentiable when any $e_i = 0$. However, for this case where the objective is to minimize a sum of absolute errors, one can reformulate the decision variables to transform this into a linear problem. More specifically, introducing for every term $|e_i|$ two new nonnegative variables $e_i^-, e_i^+ \geq 0$ such that $e_i = e_i^+ - e_i^-$, we can rewrite the model as

$$
\begin{align}
\min \quad & \sum_{i=1}^n (e_i^+ + e_i^-) \\
\text{s.t.} \quad & e_i^+ - e_i^- = y^{(i)} - m^\top X^{(i)} - b, & \forall\, i = 1, \dots, n, \\
& e_i^+, e_i^- \geq 0, & \forall\, i = 1, \dots, n.
\end{align}
$$

At any optimal solution at most one of $e_i^+$ and $e_i^-$ is nonzero (otherwise both could be reduced, improving the objective), so that $e_i^+ + e_i^- = |e_i|$ and the two formulations are equivalent.
The following cell provides a direct implementation of LAD regression.
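A sketch of such an implementation of the linear reformulation above is given below. The function and variable names are illustrative, and the cell assumes the arrays X and y and the global SOLVER defined earlier.

```python
import pyomo.environ as pyo

def lad_regression(X, y):
    m = pyo.ConcreteModel("LAD regression")
    n, k = X.shape

    # index sets for the observations and the features
    m.I = pyo.RangeSet(0, n - 1)
    m.J = pyo.RangeSet(0, k - 1)

    # positive and negative parts of each error term
    m.ep = pyo.Var(m.I, domain=pyo.NonNegativeReals)
    m.em = pyo.Var(m.I, domain=pyo.NonNegativeReals)

    # regression coefficients and intercept
    m.m = pyo.Var(m.J)
    m.b = pyo.Var()

    @m.Constraint(m.I)
    def residuals(m, i):
        return m.ep[i] - m.em[i] == y[i] - sum(X[i, j] * m.m[j] for j in m.J) - m.b

    @m.Objective(sense=pyo.minimize)
    def sum_of_abs_errors(m):
        return sum(m.ep[i] + m.em[i] for i in m.I)

    SOLVER.solve(m)
    return m

model = lad_regression(X, y)
print(f"intercept b = {model.b():.3f}")
print(f"coefficients m = {[round(model.m[j](), 3) for j in model.J]}")
```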
In the above code we used Pyomo's `RangeSet` component to define a set that consists of a range of numerical values. It is a convenient way to specify indices for decision variables, constraints, and other model components. Note that, unlike Python's native `range` function, `RangeSet` is inclusive of both the start and end values.
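The following toy snippet illustrates this difference.

```python
import pyomo.environ as pyo

m = pyo.ConcreteModel()
m.I = pyo.RangeSet(1, 5)

print(list(m.I))          # [1, 2, 3, 4, 5] -- both endpoints included
print(list(range(1, 5)))  # [1, 2, 3, 4]    -- stop value excluded
```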
Visualizing the results
If we have a model with a single feature, we can show in the same plot both the original data points (as blue dots) and the fitted model (as a red line) obtained using the optimal coefficients $m$ and $b$ found above. This is useful for visualizing how well the model fits the data.
A second plot displays a histogram of the residuals, calculated as the difference between the actual values y and the fitted values y_fit. The histogram provides insight into the distribution of errors, which is important for model evaluation.
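A possible implementation of both plots, assuming the model returned by the illustrative lad_regression function sketched above, is as follows.

```python
import numpy as np
import matplotlib.pyplot as plt

# extract the fitted coefficients and compute the fitted values
m_fit = np.array([model.m[j]() for j in model.J])
b_fit = model.b()
y_fit = X @ m_fit + b_fit

fig, (ax0, ax1) = plt.subplots(1, 2, figsize=(10, 4))

# data points and fitted line (single-feature case)
ax0.scatter(X[:, 0], y, alpha=0.5, label="data")
order = np.argsort(X[:, 0])
ax0.plot(X[order, 0], y_fit[order], color="red", label="LAD fit")
ax0.set_xlabel("x")
ax0.set_ylabel("y")
ax0.legend()

# histogram of the residuals after regression
ax1.hist(y - y_fit, bins=30)
ax1.set_xlabel("y - y_fit")
ax1.set_ylabel("count")

plt.tight_layout()
plt.show()
```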