GitHub Repository: mobook/mo-book
Path: blob/main/notebooks/02/08-L1-regression-wine-quality.ipynb
Kernel: Python 3 (ipykernel)

Extra material: Wine quality prediction with $L_1$ regression

Preamble: Install Pyomo and a solver

The following cell sets up and verifies a global SOLVER object for the notebook. When run on Google Colab, the cell installs Pyomo and the HiGHS solver; when run elsewhere, it assumes Pyomo and HiGHS have already been installed. It then selects HiGHS via the appsi module and performs a quick test to verify that the solver is available. The solver interface is stored in the global object SOLVER for later use.

import sys

if 'google.colab' in sys.modules:
    %pip install pyomo >/dev/null 2>/dev/null
    %pip install highspy >/dev/null 2>/dev/null

solver = 'appsi_highs'

import pyomo.environ as pyo

SOLVER = pyo.SolverFactory(solver)
assert SOLVER.available(), f"Solver {solver} is not available."

Problem description

Regression analysis aims to fit a predictive model to a dataset; when executed successfully, the model can generate valuable forecasts for new data points. This notebook demonstrates how linear programming, coupled with Least Absolute Deviation (LAD) regression, can be used to construct a linear model that predicts wine quality from physicochemical attributes. The example uses a well-known dataset from the machine learning community.

In a 2009 article by Cortez et al., a comprehensive set of physical, chemical, and sensory quality metrics was gathered for an extensive range of red and white wines produced in Portugal. The dataset was subsequently contributed to the UCI Machine Learning Repository.

The next code cell downloads the red wine data directly from this repository.

import pandas as pd
import matplotlib.pyplot as plt
import numpy as np

wines = pd.read_csv(
    "https://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-red.csv",
    sep=";",
)
display(wines)

Mean Absolute Deviation (MAD)

Given $n$ repeated observations of a response variable $y_i$ (in this case, the wine quality), the mean absolute deviation (MAD) of $y_i$ from the mean value $\bar{y}$ is

$$
\text{MAD}\,(y) = \frac{1}{n} \sum_{i=1}^n | y_i - \bar{y}|
$$
def MAD(df):
    return (df - df.mean()).abs().mean()

print(f'MAD = {MAD(wines["quality"]):.5f}')

plt.rcParams["font.size"] = 14
colors = plt.get_cmap("tab20c")

fig, ax = plt.subplots(figsize=(12, 7))
ax = wines["quality"].plot(alpha=0.7, color=colors(0.0))
ax.axhline(wines["quality"].mean(), color=colors(0.2), ls="--", lw=3)
mad = MAD(wines["quality"])
ax.axhline(wines["quality"].mean() + mad, color=colors(0.4), lw=3)
ax.axhline(wines["quality"].mean() - mad, color=colors(0.4), lw=3)
ax.legend(["data", "mean", "MAD"])
ax.set_xlabel("Observations")
ax.set_ylabel("Quality")
plt.tight_layout()
plt.show()
MAD = 0.68318
[Figure: wine quality observations with the mean and ±MAD bands]

A preliminary look at the data

The data consists of 1,599 measurements of eleven physical and chemical characteristics, plus an integer measure of sensory quality whose recorded values range from 3 to 8. Histograms provide insight into the values and variability of the dataset.
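Before plotting, a quick numerical summary can confirm the size of the dataset and the range of the quality scores. The short cell below is a small addition (not part of the original notebook) and assumes the wines dataframe loaded above.

# Numerical summary of the dataset: per-column statistics and
# the distribution of the integer quality scores.
display(wines.describe().T)
print(wines["quality"].value_counts().sort_index())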

plt.rcParams["font.size"] = 13 colors = plt.get_cmap("tab20c") fig, axes = plt.subplots(3, 4, figsize=(12, 8), sharey=True) for ax, column in zip(axes.flatten(), wines.columns): wines[column].hist(ax=ax, bins=30, color=colors(0.0), alpha=0.7) ax.axvline(wines[column].mean(), color=colors(0.2), label="mean") ax.set_title(column) ax.legend() plt.tight_layout()
[Figure: histograms of the eleven physicochemical features and the quality score]

Which features influence reported wine quality?

The art of regression is to identify the features that have explanatory value for a response of interest. This is where a person with deep knowledge of the application area, in this case an experienced oenologist, has a head start over the naive data scientist. In the absence of such experience, we proceed by examining the correlations among the variables in the dataset.

_ = wines.corr()["quality"].plot(kind="bar", grid=True, color=colors(0.0), alpha=0.7)
plt.tight_layout()
[Figure: correlation of each feature with the quality score]
wines[["volatile acidity", "density", "alcohol", "quality"]].corr()

Collectively, these figures suggest that alcohol is a strong correlate of quality and point to several additional features as candidate explanatory variables.
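To make the ranking explicit, the features can also be ordered by the absolute value of their correlation with quality. The snippet below is a small addition (not part of the original notebook) and assumes the wines dataframe defined above.

# Rank the features by the absolute value of their correlation with quality.
corr = wines.corr()["quality"].drop("quality")
print(corr.reindex(corr.abs().sort_values(ascending=False).index))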

LAD line fitting to identify features

An alternative approach is to perform a series of single-feature LAD regressions and determine which features have the largest impact on reducing the mean absolute deviation of the residuals.

$$
\begin{align*}
\min \ \frac{1}{I} \sum_{i\in I} \left| y_i - a x_i - b \right|
\end{align*}
$$

This computation has been presented in a prior notebook.

def lad_fit_1(df, y_col, x_col):
    m = pyo.ConcreteModel("LAD Regression Model")

    N = len(df)
    m.I = pyo.RangeSet(N)

    @m.Param(m.I)
    def y(m, i):
        return df.loc[i - 1, y_col]

    @m.Param(m.I)
    def X(m, i):
        return df.loc[i - 1, x_col]

    m.a, m.b = pyo.Var(), pyo.Var(domain=pyo.Reals)
    m.e_pos = pyo.Var(m.I, domain=pyo.NonNegativeReals)
    m.e_neg = pyo.Var(m.I, domain=pyo.NonNegativeReals)

    @m.Expression(m.I)
    def prediction(m, i):
        return m.a * m.X[i] + m.b

    @m.Constraint(m.I)
    def prediction_error(m, i):
        return m.e_pos[i] - m.e_neg[i] == m.prediction[i] - m.y[i]

    @m.Objective(sense=pyo.minimize)
    def MAD(m):
        return sum(m.e_pos[i] + m.e_neg[i] for i in m.I) / N

    return m


m = lad_fit_1(wines, "quality", "alcohol")
SOLVER.solve(m)
print(f"The mean absolute deviation for a single-feature regression is {m.MAD():0.5f}")
The mean absolute deviation for a single-feature regression is 0.54117

This calculation is repeated for each variable to determine which are the best candidates for explaining deviations in wine quality.

variables = {}
for i in wines.columns:
    m = lad_fit_1(wines, "quality", i)
    SOLVER.solve(m)
    variables[i] = m.MAD()

mad = (wines["alcohol"] - wines["alcohol"].mean()).abs().mean()

colors = plt.get_cmap("tab20c")
fig, ax = plt.subplots()
pd.Series(variables).plot(kind="bar", ax=ax, grid=True, alpha=0.7, color=colors(0.0))
ax.axhline(mad, color=colors(0.4), lw=3)
ax.set_title("MADs for single-feature regressions")
plt.tight_layout()
plt.show()
[Figure: MAD of the single-feature LAD regressions for each variable]
wines["prediction"] = [m.prediction[i]() for i in m.I] fig1, ax1 = plt.subplots(figsize=(6, 5)) wines["quality"].hist(label="data", alpha=0.7, color=plt.get_cmap("tab20c")(0)) ax1.set_xlabel("Quality") ax1.set_ylabel("Counts") plt.tight_layout() plt.show() fig2, ax2 = plt.subplots(figsize=(6, 6)) wines.plot( x="quality", y="prediction", kind="scatter", alpha=0.7, color=plt.get_cmap("tab20c")(0), ax=ax2, ) ax2.set_xlabel("Quality") ax2.set_ylabel("Prediction") ax2.set_aspect("equal", "box") min_val = min(ax2.get_xlim()[0], ax2.get_ylim()[0]) max_val = max(ax2.get_xlim()[1], ax2.get_ylim()[1]) ax2.set_xlim(min_val, max_val) ax2.set_ylim(min_val, max_val) plt.tight_layout() plt.show()
[Figures: histogram of the quality scores and scatter plot of prediction versus quality]

Multivariate $L_1$-regression

Let us now perform a full multivariate $L_1$-regression on the wine dataset to predict the wine quality $y$ using the provided wine features. We aim to find the coefficients $m_j$ and $b$ that minimize the mean absolute deviation (MAD) by solving the following problem:

$$
\begin{align*}
\text{MAD}\,(\hat{y}) = \min_{m, \, b} \quad & \frac{1}{n} \sum_{i=1}^n | y_i - \hat{y}_i| \\
\text{s.t.} \quad & \hat{y}_i = \sum_{j=1}^J x_{i, j} m_j + b \qquad \forall \, i = 1, \dots, n,
\end{align*}
$$

where $x_{i, j}$ are the values of the explanatory variables, i.e., the eleven physical and chemical characteristics of the wines. The absolute value in the objective is handled by expressing each residual as the difference of two nonnegative variables, $\hat{y}_i - y_i = e_i^+ - e_i^-$, and minimizing their sum $e_i^+ + e_i^-$; with this reformulation the problem becomes a linear optimization problem and can be implemented in Pyomo as follows:

def l1_fit(df, y_col, x_cols):
    m = pyo.ConcreteModel("L1 Regression Model")

    N = len(df)
    m.I = pyo.RangeSet(N)
    m.J = pyo.Set(initialize=x_cols)

    @m.Param(m.I)
    def y(m, i):
        return df.loc[i - 1, y_col]

    @m.Param(m.I, m.J)
    def X(m, i, j):
        return df.loc[i - 1, j]

    m.a = pyo.Var(m.J)
    m.b = pyo.Var(domain=pyo.Reals)
    m.e_pos = pyo.Var(m.I, domain=pyo.NonNegativeReals)
    m.e_neg = pyo.Var(m.I, domain=pyo.NonNegativeReals)

    @m.Expression(m.I)
    def prediction(m, i):
        return sum(m.a[j] * m.X[i, j] for j in m.J) + m.b

    @m.Constraint(m.I)
    def prediction_error(m, i):
        return m.e_pos[i] - m.e_neg[i] == m.prediction[i] - m.y[i]

    @m.Objective(sense=pyo.minimize)
    def MAD(m):
        return sum(m.e_pos[i] + m.e_neg[i] for i in m.I) / N

    return m


m = l1_fit(
    wines,
    "quality",
    [
        "alcohol",
        "volatile acidity",
        "citric acid",
        "sulphates",
        "total sulfur dioxide",
        "density",
        "fixed acidity",
    ],
)
SOLVER.solve(m)

print(f"MAD = {m.MAD():0.5f}\n")
for j in m.J:
    print(f"{j: <20} {pyo.value(m.a[j]):9.5f}")

wines["prediction"] = [pyo.value(m.prediction[i]) for i in m.I]

fig1, ax1 = plt.subplots(figsize=(7, 7))
wines["quality"].hist(label="data", alpha=0.7, color=plt.get_cmap("tab20c")(0))
ax1.set_xlabel("Quality")
ax1.set_ylabel("Counts")
plt.tight_layout()
plt.show()

fig2, ax2 = plt.subplots(figsize=(7, 7))
wines.plot(
    x="quality",
    y="prediction",
    kind="scatter",
    alpha=0.7,
    color=plt.get_cmap("tab20c")(0),
    ax=ax2,
)
ax2.set_xlabel("Quality")
ax2.set_ylabel("Prediction")
ax2.set_aspect("equal", "box")
min_val = min(ax2.get_xlim()[0], ax2.get_ylim()[0])
max_val = max(ax2.get_xlim()[1], ax2.get_ylim()[1])
ax2.set_xlim(min_val, max_val)
ax2.set_ylim(min_val, max_val)
plt.tight_layout()
plt.show()
MAD = 0.49980

alcohol                0.34242
volatile acidity      -0.98062
citric acid           -0.28928
sulphates              0.90609
total sulfur dioxide  -0.00219
density              -18.50083
fixed acidity          0.06382
[Figures: histogram of the quality scores and scatter plot of the multivariate prediction versus quality]

How do these models perform?

A successful regression model demonstrates a substantial reduction from $\text{MAD}\,(y)$ to $\text{MAD}\,(\hat{y})$; the value of $\text{MAD}\,(y)$ sets the benchmark for the regression. The linear regression model clearly has some capability to explain the observed deviations in wine quality. Tabulating the results of the regressions using the MAD statistic, we find:

Regressors       MAD
none             0.683
alcohol only     0.541
all              0.500
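These numbers can be reproduced directly from the fitted model. The short check below is a small addition (not part of the original notebook) and assumes the multivariate model m and the wines dataframe with its prediction column from the cells above.

# Sanity check: recompute the mean absolute deviation of the residuals of the
# multivariate model and compare it with the optimal objective value reported above.
residuals = wines["quality"] - wines["prediction"]
print(f"MAD recomputed from residuals: {residuals.abs().mean():.5f}")
print(f"MAD from the optimization:     {m.MAD():.5f}")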

Are these models good enough to replace human judgment of wine quality? The reader can be the judge.