Extra material: Wine quality prediction with regression
Preamble: Install Pyomo and a solver
The following cell sets and verifies a global SOLVER for the notebook. If run on Google Colab, the cell installs Pyomo and the HiGHS solver; if run elsewhere, it assumes that both have already been installed. It then selects HiGHS as the solver through the appsi module and tests that the solver is available. The solver interface is stored in the global object SOLVER for later use.
Problem description
Regression analysis aims to fit a predictive model to a dataset. When this succeeds, the model can generate useful forecasts for new data points. This notebook demonstrates how linear programming techniques coupled with Least Absolute Deviation (LAD) regression can construct a linear model to predict wine quality based on its physicochemical attributes. The example uses a well-known data set from the machine learning community.
In this 2009 article by Cortez et al., a comprehensive set of physical, chemical, and sensory quality metrics was gathered for an extensive range of red and white wines produced in Portugal. This dataset was subsequently contributed to the UCI Machine Learning Repository.
The next code cell downloads the red wine data directly from this repository.
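As a rough sketch of such a download step using only the Python standard library: the URL below is an assumption about where the red wine file lives in the UCI repository, and the helper names `parse_wines` and `download_wines` are hypothetical. The file is assumed to be semicolon-separated with a quoted header row.

```python
import csv
import io
import urllib.request

# Assumed location of the red wine file in the UCI repository.
URL = (
    "https://archive.ics.uci.edu/ml/machine-learning-databases/"
    "wine-quality/winequality-red.csv"
)


def parse_wines(text):
    """Parse semicolon-separated wine data into a list of dicts of floats."""
    reader = csv.DictReader(io.StringIO(text), delimiter=";")
    return [{key: float(value) for key, value in row.items()} for row in reader]


def download_wines(url=URL):
    """Fetch the file at `url` and parse it (requires network access)."""
    with urllib.request.urlopen(url) as response:
        return parse_wines(response.read().decode("utf-8"))
```

Calling `wines = download_wines()` would return one dictionary per wine, keyed by column name. In a notebook that already uses pandas, `pd.read_csv(URL, sep=";")` loads the same file into a DataFrame in a single call.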
Mean Absolute Deviation (MAD)
Given $n$ repeated observations of a response variable $y_i$ (in this case, the wine quality), the mean absolute deviation (MAD) of $y_i$ from the mean value $\bar{y}$ is

$$\text{MAD}\,(y) = \frac{1}{n} \sum_{i=1}^n | y_i - \bar{y}|$$
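In code, this statistic amounts to averaging the absolute differences from the sample mean. A minimal sketch, in which the function name `mad` and the sample scores are illustrative assumptions rather than values from the wine data set:

```python
from statistics import fmean


def mad(values):
    """Mean absolute deviation of `values` from their mean."""
    center = fmean(values)
    return fmean([abs(v - center) for v in values])


# Made-up quality scores, for illustration only.
scores = [5, 5, 6, 7, 5, 6]
print(mad(scores))
```

Applied to the quality column of the data set, the same function would give the baseline MAD that any regression model needs to improve on.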
A preliminary look at the data
The data consists of 1,599 measurements of eleven physical and chemical characteristics plus an integer measure of sensory quality recorded on a scale from 3 to 8. Histograms provide insight into the values and variability of the data set.
Which features influence reported wine quality?
The art of regression is to identify the features that have explanatory value for a response of interest. This is where a person with deep knowledge of an application area, in this case an experienced oenologist, will have a head start compared to the naive data scientist. In the absence of such experience, we proceed by examining the correlations among the variables in the data set.
Collectively, these figures suggest that alcohol is a strong correlate of quality and point to several additional factors as candidates for explanatory variables.
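The same screening can be sketched numerically: compute the Pearson correlation of each feature with quality and rank the features by its absolute value. The helper names and the tiny table below are made up for illustration and are not real wine measurements.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def rank_features(columns, target):
    """Return (name, correlation) pairs sorted by decreasing |correlation|."""
    scores = {name: pearson(col, target) for name, col in columns.items()}
    return sorted(scores.items(), key=lambda item: abs(item[1]), reverse=True)


# Hypothetical toy data, NOT real wine measurements.
quality = [5, 5, 6, 7, 6]
features = {
    "alcohol": [9.4, 9.8, 10.5, 12.0, 11.0],
    "pH": [3.51, 3.20, 3.26, 3.16, 3.30],
}
for name, r in rank_features(features, quality):
    print(f"{name:10s} {r:+.3f}")
```

Ranking by absolute value keeps strongly negative correlates in view, since a feature that lowers quality is as informative as one that raises it.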
LAD line fitting to identify features
An alternative approach is to perform a series of single-feature LAD regressions to determine which features have the largest impact on reducing the mean absolute deviation of the residuals.
This computation has been presented in a prior notebook.
This calculation is performed for all variables to determine which variables are the best candidates to explain deviations in wine quality.
Multivariate $L_1$-regression
Let us now perform a full multivariate $L_1$-regression on the wine dataset to predict the wine quality using the provided wine features. We aim to find the coefficients $m_j$ and $b$ that minimize the mean absolute deviation (MAD) by solving the following problem:

$$\text{MAD}\,(\hat{y}) = \min_{m,\,b} \quad \frac{1}{n} \sum_{i=1}^n \left| y_i - \sum_{j=1}^J x_{i,j} m_j - b \right|$$

where $x_{i,j}$ are values of 'explanatory' variables, i.e., the 11 physical and chemical characteristics of the wines. By taking care of the absolute value appearing in the objective function, this can be implemented in Pyomo as a linear optimization problem as follows:
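One standard way to take care of the absolute value, given here as a sketch rather than as the notebook's exact formulation, is to split each residual into nonnegative positive and negative parts. With $y_i$ the quality of wine $i$, $x_{i,j}$ its $j$-th feature, $m_j$ the coefficients, and $b$ the intercept, and with $e_i^+$ and $e_i^-$ as auxiliary variable names introduced for this illustration, the problem becomes

$$
\begin{align}
\min_{m,\,b,\,e^+,\,e^-} \quad & \frac{1}{n}\sum_{i=1}^n \left(e_i^+ + e_i^-\right) \\
\text{s.t.} \quad & e_i^+ - e_i^- = y_i - \sum_{j=1}^J x_{i,j}\, m_j - b, & i = 1,\dots,n, \\
& e_i^+,\ e_i^- \geq 0, & i = 1,\dots,n.
\end{align}
$$

At an optimal solution at least one of $e_i^+$ and $e_i^-$ is zero for each $i$, since otherwise both could be decreased by the same amount. Hence $e_i^+ + e_i^-$ equals the absolute residual, and the optimal objective value equals the minimal MAD. The objective and all constraints are linear, so the model can be passed to a linear optimization solver.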
How do these models perform?
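A successful regression model would demonstrate a substantial reduction from $\text{MAD}\,(y)$, the mean absolute deviation of the observed quality scores, to $\text{MAD}\,(\hat{y})$, the mean absolute deviation of the regression residuals. The value of $\text{MAD}\,(y)$ sets a benchmark for the regression. The linear regression model clearly has some capability to explain the observed deviations in wine quality. Tabulating the results of the regression using the MAD statistic, we find: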
| Regressors | MAD |
|---|---|
| none | 0.683 |
| alcohol only | 0.541 |
| all | 0.500 |
Are these models good enough to replace human judgment of wine quality? The reader can be the judge.