Copyright 2018 The TensorFlow Probability Authors.
Licensed under the Apache License, Version 2.0 (the "License");
Generalized Linear Models
In this notebook we introduce Generalized Linear Models via a worked example. We solve this example in two different ways using two algorithms for efficiently fitting GLMs in TensorFlow Probability: Fisher scoring for dense data, and coordinatewise proximal gradient descent for sparse data. We compare the fitted coefficients to the true coefficients and, in the case of coordinatewise proximal gradient descent, to the output of R's similar glmnet
algorithm. Finally, we provide further mathematical details and derivations of several key properties of GLMs.
Background
A generalized linear model (GLM) is a linear model ($\eta = x^\top \beta$) wrapped in a transformation (link function) and equipped with a response distribution from an exponential family. The choice of link function and response distribution is very flexible, which lends great expressivity to GLMs. The full details, including a sequential presentation of all the definitions and results building up to GLMs in unambiguous notation, are found in "Derivation of GLM Facts" below. We summarize:
In a GLM, a predictive distribution for the response variable $Y$ is associated with a vector of observed predictors $x$. The distribution has the form:

$$
p(y \,|\, x) = m(y, \phi)\, \exp\!\left(\frac{\theta\, T(y) - A(\theta)}{\phi}\right),
\qquad
\theta := h(\eta),
\quad
\eta := x^\top \beta.
$$

Here $\beta$ are the parameters ("weights"), $\phi$ a hyperparameter representing dispersion ("variance"), and $m$, $h$, $T$, $A$ are characterized by the user-specified model family.
The mean of $Y$ depends on $x$ by composition of the linear response $\eta$ and the (inverse) link function, i.e.:

$$
\mu := g^{-1}(\eta)
$$

where $g$ is the so-called link function. In TFP the choice of link function and model family are jointly specified by a tfp.glm.ExponentialFamily subclass. Examples include:
tfp.glm.Normal, aka "linear regression"
tfp.glm.Bernoulli, aka "logistic regression"
tfp.glm.Poisson, aka "Poisson regression"
tfp.glm.BernoulliNormalCDF, aka "probit regression".
TFP prefers to name model families according to the distribution over Y rather than the link function, since tfp.Distributions are already first-class citizens. If the tfp.glm.ExponentialFamily subclass name contains a second word, this indicates a non-canonical link function.
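For instance, a minimal sketch (not a cell from the original notebook) of how the naming convention looks in code:

```python
import tensorflow_probability as tfp

# The model family is a single object; the class name refers to the response
# distribution, and a second word (if any) names a non-canonical link.
linear_regression = tfp.glm.Normal()              # Normal response, identity link
logistic_regression = tfp.glm.Bernoulli()         # Bernoulli response, canonical (logit) link
probit_regression = tfp.glm.BernoulliNormalCDF()  # Bernoulli response, probit link
```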
GLMs have several remarkable properties which permit efficient implementation of the maximum likelihood estimator. Chief among these properties are simple formulas for the gradient of the log-likelihood $\ell$, and for the Fisher information matrix, which is the expected value of the Hessian of the negative log-likelihood under a re-sampling of the response under the same predictors. I.e.:

$$
\nabla_\beta\, \ell(\beta\, ;\, \mathbf{x}, \mathbf{y})
=
\mathbf{x}^\top\, \text{diag}\!\left(\frac{\text{mean\_fn}'(\mathbf{x}\beta)}{\text{var\_fn}(\mathbf{x}\beta)}\right)\left(\mathbf{y} - \text{mean\_fn}(\mathbf{x}\beta)\right)
$$

$$
\mathbb{E}_{\mathbf{Y}\sim p(\cdot\,|\,\mathbf{x})}\!\left[\nabla_\beta^2\, \ell(\beta\, ;\, \mathbf{x}, \mathbf{Y})\right]
=
-\,\mathbf{x}^\top\, \text{diag}\!\left(\frac{\text{mean\_fn}'(\mathbf{x}\beta)^2}{\text{var\_fn}(\mathbf{x}\beta)}\right)\mathbf{x}
$$

where $\mathbf{x}$ is the matrix whose $i$th row is the predictor vector for the $i$th data sample, and $\mathbf{y}$ is the vector whose $i$th coordinate is the observed response for the $i$th data sample. Here (loosely speaking), $\text{mean\_fn}(\eta)$ is the mean of $Y$ given linear response $\eta$ and $\text{var\_fn}(\eta)$ is the corresponding variance; these scalar functions are applied elementwise to the vector $\mathbf{x}\beta$. Full details of what distributions these expectations and variances are over can be found in "Derivation of GLM Facts" below.
An Example
In this section we briefly describe and showcase two built-in GLM fitting algorithms in TensorFlow Probability: Fisher scoring (tfp.glm.fit
) and coordinatewise proximal gradient descent (tfp.glm.fit_sparse
).
Synthetic Data Set
Let's pretend to load some training data set.
Note: Connect to a local runtime.
In this notebook, we share data between Python and R kernels using local files. To enable this sharing, please use runtimes on the same machine where you have permission to read and write local files.
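The notebook's original data-generating cell is not reproduced here; the following is a minimal stand-in sketch (shapes and variable names are illustrative): sparse true coefficients, a dense random design matrix, and Bernoulli responses drawn from a logistic model.

```python
import numpy as np

np.random.seed(0)
num_samples, num_features, num_nonzero = 500, 10, 3

x = np.random.randn(num_samples, num_features)            # model matrix
beta_true = np.zeros(num_features)
beta_true[:num_nonzero] = np.random.randn(num_nonzero)     # sparse true weights
probs = 1.0 / (1.0 + np.exp(-x @ beta_true))               # sigmoid(x beta_true)
y = (np.random.rand(num_samples) < probs).astype(np.float64)  # observed responses
```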
Without L1 Regularization
The function tfp.glm.fit implements Fisher scoring, which takes as some of its arguments:

model_matrix = the matrix $\mathbf{x}$ of predictors
response = the vector $\mathbf{y}$ of observed responses
model = callable which, given argument $\boldsymbol\eta$ (the vector of linear responses), returns the triple $\left(\text{mean\_fn}(\boldsymbol\eta),\ \text{var\_fn}(\boldsymbol\eta),\ \text{mean\_fn}'(\boldsymbol\eta)\right)$, applied elementwise.
We recommend that model
be an instance of the tfp.glm.ExponentialFamily
class. There are several pre-made implementations available, so for most common GLMs no custom code is necessary.
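Putting the pieces together, here is a minimal sketch (not the notebook's original cell). It reuses the synthetic x and y from the sketch above, and it assumes the four return values are the fitted coefficients, the linear response, a convergence flag, and the iteration count:

```python
import tensorflow as tf
import tensorflow_probability as tfp

model_coefficients, linear_response, is_converged, num_iter = tfp.glm.fit(
    model_matrix=tf.constant(x),
    response=tf.constant(y),
    model=tfp.glm.Bernoulli())   # logistic regression

print(is_converged.numpy(), num_iter.numpy())
print(model_coefficients.numpy())  # compare against beta_true
```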
Mathematical Details
Fisher scoring is a modification of Newton's method to find the maximum-likelihood estimate

$$
\hat\beta := \arg\max_\beta\ \ell(\beta\, ;\, \mathbf{x}, \mathbf{y}).
$$

Vanilla Newton's method, searching for zeros of the gradient of the log-likelihood, would follow the update rule

$$
\beta^{(t+1)} = \beta^{(t)} - \alpha\, \left(\nabla^2_\beta\, \ell(\beta\, ;\, \mathbf{x}, \mathbf{y})\Big|_{\beta=\beta^{(t)}}\right)^{-1} \nabla_\beta\, \ell(\beta\, ;\, \mathbf{x}, \mathbf{y})\Big|_{\beta=\beta^{(t)}}
$$

where $\alpha$ is a learning rate used to control the step size.

In Fisher scoring, we replace the Hessian with the negative Fisher information matrix:

$$
\beta^{(t+1)} = \beta^{(t)} + \alpha\, \left(\mathbb{E}_{\mathbf{Y}\sim p(\cdot\,|\,\mathbf{x})}\!\left[-\nabla^2_\beta\, \ell(\beta^{(t)}\, ;\, \mathbf{x}, \mathbf{Y})\right]\right)^{-1} \nabla_\beta\, \ell(\beta^{(t)}\, ;\, \mathbf{x}, \mathbf{y}).
$$

[Note that here $\mathbf{Y}$ is random, whereas $\mathbf{y}$ is still the vector of observed responses.]

By the formulas in "Fitting GLM Parameters To Data" below, this simplifies to

$$
\beta^{(t+1)} = \beta^{(t)} + \alpha\, \left(\mathbf{x}^\top\, \text{diag}\!\left(\frac{\text{mean\_fn}'(\mathbf{x}\beta^{(t)})^2}{\text{var\_fn}(\mathbf{x}\beta^{(t)})}\right)\mathbf{x}\right)^{-1} \mathbf{x}^\top\, \text{diag}\!\left(\frac{\text{mean\_fn}'(\mathbf{x}\beta^{(t)})}{\text{var\_fn}(\mathbf{x}\beta^{(t)})}\right)\left(\mathbf{y} - \text{mean\_fn}(\mathbf{x}\beta^{(t)})\right).
$$
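For intuition, here is a minimal NumPy sketch (not part of the original notebook) of one step of this simplified update for the Bernoulli family with canonical (logit) link, where mean_fn is the sigmoid and mean_fn'/var_fn reduces to 1:

```python
import numpy as np

def fisher_scoring_step(beta, x, y, learning_rate=1.0):
    """One Fisher-scoring step for logistic regression (canonical link).

    For the Bernoulli family with logit link, mean_fn(eta) = sigmoid(eta) and
    mean_fn'(eta) = var_fn(eta), so the diagonal weights reduce to mu * (1 - mu).
    """
    mu = 1.0 / (1.0 + np.exp(-x @ beta))   # mean_fn(x beta)
    w = mu * (1.0 - mu)                    # mean_fn'(x beta)^2 / var_fn(x beta)
    grad = x.T @ (y - mu)                  # mean_fn'/var_fn = 1 here
    fisher = x.T @ (w[:, None] * x)        # x^T diag(w) x
    return beta + learning_rate * np.linalg.solve(fisher, grad)
```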
With L1 Regularization
tfp.glm.fit_sparse
implements a GLM fitter more suited to sparse data sets, based on the algorithm in Yuan, Ho and Lin 2012. Its features include:
L1 regularization
No matrix inversions
Few evaluations of the gradient and Hessian.
We first present an example usage of the code. Details of the algorithm are further elaborated in "Algorithm Details for tfp.glm.fit_sparse
" below.
Note that the learned coefficients have the same sparsity pattern as the true coefficients.
Compare to R's glmnet
We compare the output of coordinatewise proximal gradient descent to that of R's glmnet
, which uses a similar algorithm.
NOTE: To execute this section, you must switch to an R colab runtime.
Compare R, TFP and true coefficients (Note: back to Python kernel)
Algorithm Details for tfp.glm.fit_sparse
We present the algorithm as a sequence of three modifications to Newton's method. In each one, the update rule for $\beta$ is based on a vector approximating the gradient and a matrix approximating the Hessian of the log-likelihood. In step $t$, we choose a coordinate to change, and we update $\beta$ according to a coordinatewise update rule of the form sketched below.

This update is a Newton-like step, scaled by a learning rate. Except for the final piece (L1 regularization), the modifications below differ only in how they update the gradient and Hessian approximations.
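A sketch of this update, in notation chosen here rather than taken from the library: write $s^{(t)}$ for the gradient approximation, $H^{(t)}$ for the Hessian approximation, $j^{(t)}$ for the chosen coordinate, $e_{j^{(t)}}$ for the corresponding standard basis vector, and $\alpha$ for the learning rate. The Newton-like coordinatewise step is then

$$
\beta^{(t+1)} \;=\; \beta^{(t)} \;-\; \alpha\, \frac{s^{(t)}_{j^{(t)}}}{H^{(t)}_{j^{(t)},\, j^{(t)}}}\; e_{j^{(t)}}.
$$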
Starting point: Coordinatewise Newton's method
In coordinatewise Newton's method, we set the gradient and Hessian approximations to the exact gradient and Hessian of the log-likelihood at the current iterate:

$$
s^{(t)} = \nabla_\beta\, \ell(\beta^{(t)}\, ;\, \mathbf{x}, \mathbf{y}),
\qquad
H^{(t)} = \nabla^2_\beta\, \ell(\beta^{(t)}\, ;\, \mathbf{x}, \mathbf{y}).
$$
Fewer evaluations of the gradient and Hessian
The gradient and Hessian of the log-likelihood are often expensive to compute, so it is often worthwhile to approximate them. We can do so as follows:
Usually, approximate the Hessian as locally constant and approximate the gradient to first order using the (approximate) Hessian (see the sketch after this list).
Occasionally, perform a "vanilla" update step as above, setting the gradient approximation to the exact gradient and the Hessian approximation to the exact Hessian of the log-likelihood, evaluated at the current iterate $\beta^{(t)}$.
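In the notation introduced above, a sketch of the cheap ("usually") step: hold the Hessian approximation fixed and push the gradient approximation forward to first order,

$$
H^{(t)} \;=\; H^{(t-1)},
\qquad
s^{(t)} \;=\; s^{(t-1)} \;+\; H^{(t-1)}\!\left(\beta^{(t)} - \beta^{(t-1)}\right).
$$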
Substitute negative Fisher information for Hessian
To further reduce the cost of the vanilla update steps, we can set the Hessian approximation to the negative Fisher information matrix (efficiently computable using the formulas in "Fitting GLM Parameters to Data" below) rather than to the exact Hessian:

$$
H^{(t)} \;=\; -\,\mathbf{x}^\top\, \text{diag}\!\left(\frac{\text{mean\_fn}'(\mathbf{x}\beta^{(t)})^2}{\text{var\_fn}(\mathbf{x}\beta^{(t)})}\right)\mathbf{x}.
$$
L1 Regularization via Proximal Gradient Descent
To incorporate L1 regularization, we replace the plain coordinatewise Newton update rule above with the more general soft-thresholded update rule sketched below, where the L1 regularization coefficient is a supplied constant and $\text{SoftThreshold}$ is the soft thresholding operator, defined by

$$
\text{SoftThreshold}(\gamma, \lambda) \;:=\;
\begin{cases}
\gamma + \lambda, & \gamma < -\lambda \\
0, & -\lambda \le \gamma \le \lambda \\
\gamma - \lambda, & \gamma > \lambda.
\end{cases}
$$
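Concretely, writing $r_{\text{L1}}$ for the L1 regularization coefficient (the symbol is chosen here for illustration) and assuming $H^{(t)}_{j^{(t)},\,j^{(t)}} < 0$ (which holds when the negative log-likelihood is convex), the plain coordinatewise step and its soft-thresholded counterpart are

$$
\beta^{(t+1)}_{j^{(t)}} \;=\; \beta^{(t)}_{j^{(t)}} \;-\; \alpha\, \frac{s^{(t)}_{j^{(t)}}}{H^{(t)}_{j^{(t)},\, j^{(t)}}}
\qquad\text{and}\qquad
\beta^{(t+1)}_{j^{(t)}} \;=\; \text{SoftThreshold}\!\left(
\beta^{(t)}_{j^{(t)}} \;-\; \alpha\, \frac{s^{(t)}_{j^{(t)}}}{H^{(t)}_{j^{(t)},\, j^{(t)}}},\;\;
\frac{\alpha\, r_{\text{L1}}}{-H^{(t)}_{j^{(t)},\, j^{(t)}}}
\right),
$$

with all other coordinates of $\beta$ left unchanged.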
This update rule has the following two inspirational properties, which we explain below:
In the limiting case of a zero L1 regularization coefficient (i.e., no L1 regularization), this update rule is identical to the original update rule.
This update rule can be interpreted as applying a proximity operator whose fixed point is the solution to the L1-regularized minimization problem

$$
\arg\min_\beta\ \left(-\,\ell(\beta\, ;\, \mathbf{x}, \mathbf{y}) + r_{\text{L1}}\, \lVert\beta\rVert_1\right).
$$
Degenerate case recovers the original update rule
To see (1), note that if the L1 regularization coefficient is zero then $\text{SoftThreshold}(\gamma, 0) = \gamma$ for every $\gamma$. Hence the soft-thresholded update rule reduces exactly to the original Newton-like update rule.
Proximity operator whose fixed point is the regularized MLE
To see (2), first note (see Wikipedia [3]) that for any step size $\tau > 0$, a proximal gradient update rule of the form sketched below satisfies (2), where $\text{prox}$ is the proximity operator (see Yu [4]). The right-hand side of that update rule can be computed in closed form, as shown below:
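For reference (following [3] and [4], and writing $r_{\text{L1}}$ for the L1 coefficient as above), the generic proximal gradient update for maximizing $\ell(\beta) - r_{\text{L1}}\lVert\beta\rVert_1$ with step size $\tau > 0$, and the closed form of the L1 proximity operator, are

$$
\beta^{(t+1)} \;=\; \text{prox}_{\,\tau\, r_{\text{L1}} \lVert\cdot\rVert_1}\!\left(\beta^{(t)} + \tau\, \nabla_\beta\, \ell(\beta^{(t)})\right),
\qquad
\left(\text{prox}_{\,\lambda \lVert\cdot\rVert_1}(z)\right)_j \;=\; \text{SoftThreshold}(z_j, \lambda).
$$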
In particular, choosing the step size $\tau$ so that the proximal step matches the Newton-like coordinatewise step (this choice is valid because the relevant diagonal entry of the Hessian approximation is negative as long as the negative log-likelihood is convex), and then replacing the exact gradient of the log-likelihood with its approximation, we recover exactly the soft-thresholded update rule stated above.
Derivation of GLM Facts
In this section we state in full detail and derive the results about GLMs that are used in the preceding sections. Then, we use TensorFlow's gradients
to numerically verify the derived formulas for gradient of the log-likelihood and Fisher information.
Score and Fisher information
Consider a family of probability distributions parameterized by a parameter vector $\theta$, having probability densities $p(\cdot\,|\,\theta)$. The score of an outcome $y$ at parameter vector $\theta_0$ is defined to be the gradient of the log likelihood of $y$ (evaluated at $\theta_0$), that is,

$$
\text{score}(y, \theta_0) := \nabla_\theta\, \log p(y\,|\,\theta)\,\Big|_{\theta=\theta_0}.
$$
Claim: Expectation of the score is zero
Under mild regularity conditions (permitting us to pass differentiation under the integral),

$$
\mathbb{E}_{Y\sim p(\cdot\,|\,\theta_0)}\left[\text{score}(Y, \theta_0)\right] = 0.
$$
Proof
We have

$$
\begin{aligned}
\mathbb{E}_{Y\sim p(\cdot\,|\,\theta_0)}\left[\text{score}(Y, \theta_0)\right]
&\stackrel{\text{(1)}}{=} \mathbb{E}_{Y\sim p(\cdot\,|\,\theta_0)}\left[\frac{\nabla_\theta\, p(Y\,|\,\theta_0)}{p(Y\,|\,\theta_0)}\right] \\
&\stackrel{\text{(2)}}{=} \int \frac{\nabla_\theta\, p(y\,|\,\theta_0)}{p(y\,|\,\theta_0)}\; p(y\,|\,\theta_0)\, dy
 \;=\; \int \nabla_\theta\, p(y\,|\,\theta_0)\, dy \\
&\stackrel{\text{(3)}}{=} \nabla_\theta \int p(y\,|\,\theta_0)\, dy
 \;\stackrel{\text{(4)}}{=}\; \nabla_\theta\, 1 \;=\; 0,
\end{aligned}
$$
where we have used: (1) chain rule for differentiation, (2) definition of expectation, (3) passing differentiation under the integral sign (using the regularity conditions), (4) the integral of a probability density is 1.
Claim (Fisher information): Variance of the score equals negative expected Hessian of the log likelihood
Under mild regularity conditions (permitting us to pass differentiation under the integral),

$$
\text{Var}_{Y\sim p(\cdot\,|\,\theta_0)}\left[\text{score}(Y, \theta_0)\right]
=
-\,\mathbb{E}_{Y\sim p(\cdot\,|\,\theta_0)}\left[\nabla^2_\theta\, \log p(Y\,|\,\theta)\,\Big|_{\theta=\theta_0}\right]
$$

where $\nabla^2_\theta$ denotes the Hessian matrix, whose $(i, j)$ entry is $\frac{\partial^2}{\partial\theta_i\, \partial\theta_j}$.

The left-hand side of this equation is called the Fisher information of the family $p(\cdot\,|\,\theta)$ at parameter vector $\theta_0$.
Proof of claim
We have

$$
\begin{aligned}
\nabla^2_\theta\, \log p(y\,|\,\theta_0)
&\stackrel{\text{(1)}}{=} \nabla_\theta\left(\frac{\nabla_\theta\, p(y\,|\,\theta_0)}{p(y\,|\,\theta_0)}\right) \\
&\stackrel{\text{(2)}}{=} \frac{\nabla^2_\theta\, p(y\,|\,\theta_0)}{p(y\,|\,\theta_0)}
 - \frac{\left(\nabla_\theta\, p(y\,|\,\theta_0)\right)\left(\nabla_\theta\, p(y\,|\,\theta_0)\right)^\top}{p(y\,|\,\theta_0)^2} \\
&\stackrel{\text{(3)}}{=} \frac{\nabla^2_\theta\, p(y\,|\,\theta_0)}{p(y\,|\,\theta_0)}
 - \text{score}(y, \theta_0)\, \text{score}(y, \theta_0)^\top,
\end{aligned}
$$

where we have used (1) chain rule for differentiation, (2) quotient rule for differentiation, (3) chain rule again, in reverse. Taking expectations and using the fact that the score has zero mean (so that $\text{Var}[\text{score}] = \mathbb{E}[\text{score}\,\text{score}^\top]$), this gives

$$
-\,\mathbb{E}_{Y\sim p(\cdot\,|\,\theta_0)}\left[\nabla^2_\theta\, \log p(Y\,|\,\theta_0)\right]
=
\text{Var}_{Y\sim p(\cdot\,|\,\theta_0)}\left[\text{score}(Y, \theta_0)\right]
-
\mathbb{E}_{Y\sim p(\cdot\,|\,\theta_0)}\left[\frac{\nabla^2_\theta\, p(Y\,|\,\theta_0)}{p(Y\,|\,\theta_0)}\right].
$$
To complete the proof, it suffices to show that

$$
\int \nabla^2_\theta\, p(y\,|\,\theta_0)\, dy = 0.
$$

To do that, we pass differentiation under the integral sign twice:

$$
\int \nabla^2_\theta\, p(y\,|\,\theta_0)\, dy
= \nabla_\theta \int \nabla_\theta\, p(y\,|\,\theta_0)\, dy
= \nabla^2_\theta \int p(y\,|\,\theta_0)\, dy
= \nabla^2_\theta\, 1 = 0.
$$
Lemma about the derivative of the log partition function
If $m$, $T$ and $A$ are scalar-valued functions, twice differentiable, such that the family of distributions defined by

$$
p(y\,|\,\theta) = m(y)\, \exp\!\left(\theta\, T(y) - A(\theta)\right)
$$

satisfies the mild regularity conditions that permit passing differentiation with respect to $\theta$ under an integral with respect to $y$, then

$$
\mathbb{E}_{Y\sim p(\cdot\,|\,\theta)}\left[T(Y)\right] = A'(\theta)
$$

and

$$
\text{Var}_{Y\sim p(\cdot\,|\,\theta)}\left[T(Y)\right] = A''(\theta).
$$

(Here $'$ denotes differentiation, so $A'$ and $A''$ are the first and second derivatives of $A$.)
Proof
For this family of distributions, we have $\text{score}(y, \theta) = \nabla_\theta \log p(y\,|\,\theta) = T(y) - A'(\theta)$. The first equation then follows from the fact that the expectation of the score is zero, $\mathbb{E}_{Y\sim p(\cdot\,|\,\theta)}\left[\text{score}(Y, \theta)\right] = 0$. Next, we have

$$
\text{Var}_{Y\sim p(\cdot\,|\,\theta)}\left[T(Y)\right]
= \text{Var}_{Y\sim p(\cdot\,|\,\theta)}\left[\text{score}(Y, \theta)\right]
= -\,\mathbb{E}_{Y\sim p(\cdot\,|\,\theta)}\left[\nabla^2_\theta\, \log p(Y\,|\,\theta)\right]
= A''(\theta),
$$

where the middle equality is the Fisher information claim above and the last equality uses $\nabla^2_\theta\, \log p(y\,|\,\theta) = -A''(\theta)$.
Overdispersed Exponential Family
A (scalar) overdispersed exponential family is a family of distributions whose densities take the form

$$
p(y\,|\,\theta, \phi) = m(y, \phi)\, \exp\!\left(\frac{\theta\, T(y) - A(\theta)}{\phi}\right),
$$

where $m$ and $T$ are known scalar-valued functions, and $\theta$ and $\phi$ are scalar parameters.

[Note that $A$ is overdetermined: for any $\phi_0$, the function $A$ is completely determined by the constraint that $\int p(y\,|\,\theta, \phi_0)\, dy = 1$ for all $\theta$. The $A$'s produced by different values of $\phi_0$ must all be the same, which places a constraint on the functions $m$ and $T$.]
Mean and variance of the sufficient statistic
Under the same conditions as "Lemma about the derivative of the log partition function," we have

$$
\mathbb{E}_{Y\sim p(\cdot\,|\,\theta,\phi)}\left[T(Y)\right] = A'(\theta)
$$

and

$$
\text{Var}_{Y\sim p(\cdot\,|\,\theta,\phi)}\left[T(Y)\right] = \phi\, A''(\theta).
$$
Proof
By "Lemma about the derivative of the log partition function," we have
and
The result then follows from the fact that expectation is linear () and variance is degree-2 homogeneous ().
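As a concrete check (an added example, not part of the original notebook): the Poisson distribution with rate $\lambda$ fits this framework with $\theta = \log\lambda$, $T(y) = y$, $A(\theta) = e^\theta$, $\phi = 1$ and $m(y, \phi) = \frac{1}{y!}$, since

$$
p(y\,|\,\lambda) = \frac{\lambda^y e^{-\lambda}}{y!} = \frac{1}{y!}\, \exp\!\left(y \log\lambda - \lambda\right) = \frac{1}{y!}\, \exp\!\left(\theta\, T(y) - A(\theta)\right).
$$

The formulas above then give $\mathbb{E}[Y] = A'(\theta) = e^\theta = \lambda$ and $\text{Var}[Y] = \phi\, A''(\theta) = e^\theta = \lambda$, recovering the familiar Poisson mean and variance.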
Generalized Linear Model
In a generalized linear model, a predictive distribution for the response variable $Y$ is associated with a vector of observed predictors $x$. The distribution is a member of an overdispersed exponential family, and the parameter $\theta$ is replaced by $h(\eta)$ where $h$ is a known function, $\eta := x^\top \beta$ is the so-called linear response, and $\beta$ is a vector of parameters (regression coefficients) to be learned. In general the dispersion parameter $\phi$ could be learned too, but in our setup we will treat $\phi$ as known. So our setup is

$$
Y \sim p\!\left(\cdot\,\big|\, \theta = h(\eta),\, \phi\right),
\qquad \eta = x^\top \beta,
$$

where the model structure is characterized by the distribution $p(\cdot\,|\,\theta, \phi)$ and the function $h$ which converts linear response to parameters.
Traditionally, the mapping from linear response to mean is denoted

$$
\mu = g^{-1}(\eta).
$$

This mapping is required to be one-to-one, and its inverse, $g$, is called the link function for this GLM. Typically, one describes a GLM by naming its link function and its family of distributions -- e.g., a "GLM with Bernoulli distribution and logit link function" (also known as a logistic regression model). In order to fully characterize the GLM, the function $h$ must also be specified. If $h$ is the identity, then $g$ is said to be the canonical link function.
Claim: Expressing $h'$ in terms of the sufficient statistic
Define

$$
\text{mean\_fn}(\eta) := \mathbb{E}_{Y\sim p(\cdot\,|\,h(\eta),\,\phi)}\left[T(Y)\right]
$$

and

$$
\text{var\_fn}(\eta) := \text{Var}_{Y\sim p(\cdot\,|\,h(\eta),\,\phi)}\left[T(Y)\right].
$$

Then we have

$$
h'(\eta) = \frac{\phi\, \text{mean\_fn}'(\eta)}{\text{var\_fn}(\eta)}.
$$
Proof
By "Mean and variance of the sufficient statistic," we have
Differentiating with the chain rule, we obtain
and by "Mean and variance of the sufficient statistic,"
The conclusion follows.
Fitting GLM Parameters to Data
The properties derived above lend themselves very well to fitting GLM parameters to a data set. Quasi-Newton methods such as Fisher scoring rely on the gradient of the log likelihood and the Fisher information, which we now show can be computed especially efficiently for a GLM.
Suppose we have observed predictor vectors $x_i$ and associated scalar responses $y_i$, for $i = 1, \ldots, N$. In matrix form, we'll say we have observed predictors $\mathbf{x}$ and response $\mathbf{y}$, where $\mathbf{x}$ is the matrix whose $i$th row is $x_i^\top$ and $\mathbf{y}$ is the vector whose $i$th element is $y_i$. The log likelihood of parameters $\beta$ is then

$$
\ell(\beta\, ;\, \mathbf{x}, \mathbf{y}) = \sum_{i=1}^{N} \log p(y_i\,|\,x_i, \beta).
$$
For a single data sample
To simplify the notation, let's first consider the case of a single data point, $N = 1$; then we will extend to the general case by additivity.
Gradient
We have

$$
\log p(y\,|\,x, \beta)
= \frac{h(\eta)\, T(y) - A(h(\eta))}{\phi} + \log m(y, \phi),
\qquad \eta = x^\top \beta.
$$

Hence by the chain rule,

$$
\nabla_\beta\, \log p(y\,|\,x, \beta)
= \frac{\left(T(y) - A'(h(\eta))\right) h'(\eta)}{\phi}\; x.
$$

Separately, by "Mean and variance of the sufficient statistic," we have $A'(h(\eta)) = \text{mean\_fn}(\eta)$. Hence, by "Claim: Expressing $h'$ in terms of the sufficient statistic," we have

$$
\nabla_\beta\, \log p(y\,|\,x, \beta)
= \frac{\text{mean\_fn}'(\eta)}{\text{var\_fn}(\eta)} \left(T(y) - \text{mean\_fn}(\eta)\right) x.
$$
Hessian
Differentiating a second time, by the product rule we obtain

$$
\nabla^2_\beta\, \log p(y\,|\,x, \beta)
= \frac{\left(T(y) - A'(h(\eta))\right) h''(\eta) - A''(h(\eta))\, h'(\eta)^2}{\phi}\; x x^\top.
$$
Fisher information
By "Mean and variance of the sufficient statistic," we have
Hence
For multiple data samples
We now extend to the general case of $N$ data samples. Let $\boldsymbol{\eta} := \mathbf{x}\beta$ denote the vector whose $i$th coordinate is the linear response from the $i$th data sample, and let $T$, $\text{mean\_fn}$ and $\text{var\_fn}$ also denote the broadcasted (vectorized) functions which apply the corresponding scalar-valued functions to each coordinate. Then we have

$$
\nabla_\beta\, \ell(\beta\, ;\, \mathbf{x}, \mathbf{y})
= \sum_{i=1}^{N} \nabla_\beta\, \log p(y_i\,|\,x_i, \beta)
= \mathbf{x}^\top\, \text{diag}\!\left(\frac{\text{mean\_fn}'(\boldsymbol\eta)}{\text{var\_fn}(\boldsymbol\eta)}\right)\left(T(\mathbf{y}) - \text{mean\_fn}(\boldsymbol\eta)\right)
$$

and

$$
\mathbb{E}_{\mathbf{Y}\sim p(\cdot\,|\,\mathbf{x},\beta)}\left[\nabla_\beta^2\, \ell(\beta\, ;\, \mathbf{x}, \mathbf{Y})\right]
= -\,\mathbf{x}^\top\, \text{diag}\!\left(\frac{\text{mean\_fn}'(\boldsymbol\eta)^2}{\text{var\_fn}(\boldsymbol\eta)}\right)\mathbf{x},
$$

where the fractions denote element-wise division.
Verifying the Formulas Numerically
We now verify the above formula for gradient of the log likelihood numerically using tf.gradients
, and verify the formula for Fisher information with a Monte Carlo estimate using tf.hessians
:
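The original verification cells are not reproduced here. Below is a minimal sketch (using tf.GradientTape rather than tf.gradients, and checking only the gradient formula) for a Bernoulli GLM with canonical link, where mean_fn is the sigmoid and mean_fn'/var_fn = 1, so the formula reduces to $\mathbf{x}^\top(\mathbf{y} - \text{sigmoid}(\mathbf{x}\beta))$:

```python
import numpy as np
import tensorflow as tf

np.random.seed(42)
N, d = 100, 3
x = np.random.randn(N, d)
beta_true = np.array([1.0, -2.0, 0.5])
y = (np.random.rand(N) < 1.0 / (1.0 + np.exp(-x @ beta_true))).astype(np.float64)

# Autodiff gradient of the Bernoulli log likelihood, evaluated at beta = 0.
beta = tf.Variable(np.zeros(d, dtype=np.float64))
with tf.GradientTape() as tape:
    eta = tf.linalg.matvec(tf.constant(x), beta)
    log_lik = tf.reduce_sum(y * eta - tf.math.softplus(eta))
auto_grad = tape.gradient(log_lik, beta)

# Closed-form gradient: x^T (y - sigmoid(x beta)); sigmoid(0) = 0.5 at beta = 0.
formula_grad = x.T @ (y - 0.5)
print(np.allclose(auto_grad.numpy(), formula_grad))  # expected: True
```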
References
[1]: Guo-Xun Yuan, Chia-Hua Ho and Chih-Jen Lin. An Improved GLMNET for L1-regularized Logistic Regression. Journal of Machine Learning Research, 13, 2012. http://www.jmlr.org/papers/volume13/yuan12a/yuan12a.pdf
[2]: skd. Derivation of Soft Thresholding Operator. 2018. https://math.stackexchange.com/q/511106
[3]: Wikipedia Contributors. Proximal gradient methods for learning. Wikipedia, The Free Encyclopedia, 2018. https://en.wikipedia.org/wiki/Proximal_gradient_methods_for_learning
[4]: Yao-Liang Yu. The Proximity Operator. https://www.cs.cmu.edu/~suvrit/teach/yaoliang_proximity.pdf