Generalized Linear Models

Licensed under the Apache License, Version 2.0 (the "License");

#@title Licensed under the Apache License, Version 2.0 (the "License"); { display-mode: "form" }
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

In this notebook we introduce Generalized Linear Models via a worked example. We solve this example in two different ways using two algorithms for efficiently fitting GLMs in TensorFlow Probability: Fisher scoring for dense data, and coordinatewise proximal gradient descent for sparse data. We compare the fitted coefficients to the true coefficients and, in the case of coordinatewise proximal gradient descent, to the output of R's glmnet, which implements a similar algorithm. Finally, we provide further mathematical details and derivations of several key properties of GLMs.

Background

A generalized linear model (GLM) is a linear model ($\eta = x^\top \beta$) wrapped in a transformation (link function) and equipped with a response distribution from an exponential family. The choice of link function and response distribution is very flexible, which lends great expressivity to GLMs. The full details, including a sequential presentation of all the definitions and results building up to GLMs in unambiguous notation, are found in "Derivation of GLM Facts" below. We summarize:

In a GLM, a predictive distribution for the response variable $Y$ is associated with a vector of observed predictors $x$. The distribution has the form:


$$
\begin{align*}
p(y \,|\, x) &= m(y, \phi) \exp\left(\frac{\theta\, T(y) - A(\theta)}{\phi}\right) \\
\theta &:= h(\eta) \\
\eta &:= x^\top \beta
\end{align*}
$$

Here $\beta$ are the parameters ("weights"), $\phi$ a hyperparameter representing dispersion ("variance"), and $m$, $h$, $T$, $A$ are characterized by the user-specified model family.

The mean of $Y$ depends on $x$ through the composition of the linear response $\eta$ and the (inverse) link function, i.e.:

$$\mu := g^{-1}(\eta)$$

where $g$ is the so-called link function. In TFP the choice of link function and model family is jointly specified by a tfp.glm.ExponentialFamily subclass. Examples include:

  • tfp.glm.Normal, aka "linear regression"

  • tfp.glm.Bernoulli, aka "logistic regression"

  • tfp.glm.Poisson, aka "Poisson regression"

  • tfp.glm.BernoulliNormalCDF, aka "probit regression".

TFP prefers to name model families according to the distribution over $Y$ rather than the link function, since tfp.Distributions are already first-class citizens. If the tfp.glm.ExponentialFamily subclass name contains a second word, this indicates a non-canonical link function.
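To make the interface concrete, here is a minimal sketch (our own illustration, not part of the original text) of how such a subclass is used: calling the model on a linear response yields the mean, variance, and gradient-of-mean triple that the fitting algorithms below consume.

import tensorflow.compat.v2 as tf
tf.enable_v2_behavior()
import tensorflow_probability as tfp

# A tfp.glm.ExponentialFamily instance is a callable mapping the linear
# response eta to the triple (Mean_T(eta), Var_T(eta), Mean_T'(eta)).
model = tfp.glm.Bernoulli()
mean, variance, grad_mean = model(tf.constant([-1., 0., 2.]))
# For Bernoulli with its canonical (logit) link: mean = sigmoid(eta) and
# variance = mean * (1 - mean).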

GLMs have several remarkable properties which permit efficient implementation of the maximum likelihood estimator. Chief among these properties are simple formulas for the gradient of the log-likelihood $\ell$, and for the Fisher information matrix, which is the expected value of the Hessian of the negative log-likelihood under a re-sampling of the response given the same predictors. I.e.:

$$
\begin{align*}
\nabla_\beta\, \ell(\beta\,;\, \mathbf{x}, \mathbf{y})
&= \mathbf{x}^\top \,\text{diag}\left(\frac{ {\textbf{Mean}_T}'(\mathbf{x} \beta) }{ {\textbf{Var}_T}(\mathbf{x} \beta) }\right) \left(\mathbf{T}(\mathbf{y}) - {\textbf{Mean}_T}(\mathbf{x} \beta)\right) \\
\mathbb{E}_{Y_i \sim \text{GLM} | x_i} \left[ \nabla_\beta^2\, \ell(\beta\,;\, \mathbf{x}, \mathbf{Y}) \right]
&= -\mathbf{x}^\top \,\text{diag}\left( \frac{ \phi\, {\textbf{Mean}_T}'(\mathbf{x} \beta)^2 }{ {\textbf{Var}_T}(\mathbf{x} \beta) }\right)\, \mathbf{x}
\end{align*}
$$

where $\mathbf{x}$ is the matrix whose $i$th row is the predictor vector for the $i$th data sample, and $\mathbf{y}$ is the vector whose $i$th coordinate is the observed response for the $i$th data sample. Here (loosely speaking), ${\text{Mean}_T}(\eta) := \mathbb{E}[T(Y)\,|\,\eta]$ and ${\text{Var}_T}(\eta) := \text{Var}[T(Y)\,|\,\eta]$, and boldface denotes vectorization of these functions. Full details of which distributions these expectations and variances are over can be found in "Derivation of GLM Facts" below.

An Example

In this section we briefly describe and showcase two built-in GLM fitting algorithms in TensorFlow Probability: Fisher scoring (tfp.glm.fit) and coordinatewise proximal gradient descent (tfp.glm.fit_sparse).

Synthetic Data Set

Let's pretend to load some training data set.

import numpy as np
import pandas as pd
import scipy
import tensorflow.compat.v2 as tf
tf.enable_v2_behavior()
import tensorflow_probability as tfp
tfd = tfp.distributions
def make_dataset(n, d, link, scale=1., dtype=np.float32):
  model_coefficients = tfd.Uniform(
      low=-1., high=np.array(1, dtype)).sample(d, seed=42)
  radius = np.sqrt(2.)
  model_coefficients *= radius / tf.linalg.norm(model_coefficients)
  mask = tf.random.shuffle(tf.range(d)) < int(0.5 * d)
  model_coefficients = tf.where(
      mask, model_coefficients, np.array(0., dtype))
  model_matrix = tfd.Normal(
      loc=0., scale=np.array(1, dtype)).sample([n, d], seed=43)
  scale = tf.convert_to_tensor(scale, dtype)
  linear_response = tf.linalg.matvec(model_matrix, model_coefficients)
  if link == 'linear':
    response = tfd.Normal(loc=linear_response, scale=scale).sample(seed=44)
  elif link == 'probit':
    response = tf.cast(
        tfd.Normal(loc=linear_response, scale=scale).sample(seed=44) > 0,
        dtype)
  elif link == 'logit':
    response = tfd.Bernoulli(logits=linear_response).sample(seed=44)
  else:
    raise ValueError('unrecognized true link: {}'.format(link))
  return model_matrix, response, model_coefficients, mask

Note: Connect to a local runtime.

In this notebook, we share data between Python and R kernels using local files. To enable this sharing, please use runtimes on the same machine where you have permission to read and write local files.

x, y, model_coefficients_true, _ = [t.numpy() for t in make_dataset(
    n=int(1e5), d=100, link='probit')]

DATA_DIR = '/tmp/glm_example'
tf.io.gfile.makedirs(DATA_DIR)
with tf.io.gfile.GFile('{}/x.csv'.format(DATA_DIR), 'w') as f:
  np.savetxt(f, x, delimiter=',')
with tf.io.gfile.GFile('{}/y.csv'.format(DATA_DIR), 'w') as f:
  np.savetxt(f, y.astype(np.int32) + 1, delimiter=',', fmt='%d')
with tf.io.gfile.GFile(
    '{}/model_coefficients_true.csv'.format(DATA_DIR), 'w') as f:
  np.savetxt(f, model_coefficients_true, delimiter=',')

Without L1 Regularization

The function tfp.glm.fit implements Fisher scoring. Some of its arguments are:

  • model_matrix = $\mathbf{x}$

  • response = $\mathbf{y}$

  • model = a callable which, given argument $\boldsymbol{\eta}$, returns the triple $\left( {\textbf{Mean}_T}(\boldsymbol{\eta}), {\textbf{Var}_T}(\boldsymbol{\eta}), {\textbf{Mean}_T}'(\boldsymbol{\eta}) \right)$.

We recommend that model be an instance of the tfp.glm.ExponentialFamily class. There are several pre-made implementations available, so for most common GLMs no custom code is necessary.

@tf.function(autograph=False)
def fit_model():
  model_coefficients, linear_response, is_converged, num_iter = tfp.glm.fit(
      model_matrix=x, response=y, model=tfp.glm.BernoulliNormalCDF())
  log_likelihood = tfp.glm.BernoulliNormalCDF().log_prob(y, linear_response)
  return (model_coefficients, linear_response, is_converged,
          num_iter, log_likelihood)

[model_coefficients, linear_response, is_converged, num_iter,
 log_likelihood] = [t.numpy() for t in fit_model()]

print(('is_converged: {}\n'
       '    num_iter: {}\n'
       '    accuracy: {}\n'
       '    deviance: {}\n'
       '||w0-w1||_2 / (1+||w0||_2): {}'
      ).format(
    is_converged,
    num_iter,
    np.mean((linear_response > 0.) == y),
    2. * np.mean(log_likelihood),
    np.linalg.norm(model_coefficients_true - model_coefficients, ord=2) /
        (1. + np.linalg.norm(model_coefficients_true, ord=2))
    ))
is_converged: True
    num_iter: 6
    accuracy: 0.75241
    deviance: -0.992436110973
||w0-w1||_2 / (1+||w0||_2): 0.0231555201462

Mathematical Details

Fisher scoring is a modification of Newton's method to find the maximum-likelihood estimate

$$\hat\beta := \underset{\beta}{\text{arg max}}\ \ \ell(\beta\ ;\ \mathbf{x}, \mathbf{y}).$$

Vanilla Newton's method, searching for zeros of the gradient of the log-likelihood, would follow the update rule

$$\beta^{(t+1)}_{\text{Newton}} := \beta^{(t)} - \alpha \left( \nabla^2_\beta\, \ell(\beta\ ;\ \mathbf{x}, \mathbf{y}) \right)_{\beta = \beta^{(t)}}^{-1} \left( \nabla_\beta\, \ell(\beta\ ;\ \mathbf{x}, \mathbf{y}) \right)_{\beta = \beta^{(t)}}$$

where $\alpha \in (0, 1]$ is a learning rate used to control the step size.

In Fisher scoring, we replace the Hessian with the negative Fisher information matrix:

$$
\begin{align*}
\beta^{(t+1)} &:= \beta^{(t)} - \alpha\, \mathbb{E}_{ Y_i \sim p_{\text{OEF}(m, T)}(\cdot | \theta = h(x_i^\top \beta^{(t)}), \phi) } \left[ \left( \nabla^2_\beta\, \ell(\beta\ ;\ \mathbf{x}, \mathbf{Y}) \right)_{\beta = \beta^{(t)}} \right]^{-1} \left( \nabla_\beta\, \ell(\beta\ ;\ \mathbf{x}, \mathbf{y}) \right)_{\beta = \beta^{(t)}}
\end{align*}
$$

[Note that here $\mathbf{Y} = (Y_i)_{i=1}^{n}$ is random, whereas $\mathbf{y}$ is still the vector of observed responses.]

By the formulas in "Fitting GLM Parameters To Data" below, this simplifies to

$$
\begin{align*}
\beta^{(t+1)} &= \beta^{(t)} + \alpha \left( \mathbf{x}^\top \text{diag}\left( \frac{ \phi\, {\textbf{Mean}_T}'(\mathbf{x} \beta^{(t)})^2 }{ {\textbf{Var}_T}(\mathbf{x} \beta^{(t)}) }\right)\, \mathbf{x} \right)^{-1} \left( \mathbf{x}^\top \text{diag}\left(\frac{ {\textbf{Mean}_T}'(\mathbf{x} \beta^{(t)}) }{ {\textbf{Var}_T}(\mathbf{x} \beta^{(t)}) }\right) \left(\mathbf{T}(\mathbf{y}) - {\textbf{Mean}_T}(\mathbf{x} \beta^{(t)})\right) \right).
\end{align*}
$$
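For intuition, here is a minimal NumPy sketch (ours, not the library implementation) of a single Fisher scoring step implementing the update above. It assumes $T(y) = y$ and $\phi = 1$, and a hypothetical model_fn with the same contract as a tfp.glm.ExponentialFamily, mapping $\boldsymbol{\eta}$ to $({\textbf{Mean}_T}(\boldsymbol{\eta}), {\textbf{Var}_T}(\boldsymbol{\eta}), {\textbf{Mean}_T}'(\boldsymbol{\eta}))$; tfp.glm.fit is the production implementation.

import numpy as np

def fisher_scoring_step(beta, x, y, model_fn, alpha=1.):
  # Linear response: eta_i = x_i^T beta.
  eta = x @ beta
  mean_t, var_t, grad_mean_t = model_fn(eta)
  # Gradient of the log-likelihood (formula above, with T(y) = y).
  grad = x.T @ (grad_mean_t / var_t * (y - mean_t))
  # Fisher information matrix (with phi = 1).
  fim = x.T @ (x * (grad_mean_t ** 2 / var_t)[:, np.newaxis])
  # Newton-like step, using the FIM in place of the negative Hessian.
  return beta + alpha * np.linalg.solve(fim, grad)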

With L1 Regularization

tfp.glm.fit_sparse implements a GLM fitter more suited to sparse data sets, based on the algorithm in Yuan, Ho and Lin 2012 [1]. Its features include:

  • L1 regularization

  • No matrix inversions

  • Few evaluations of the gradient and Hessian.

We first present an example usage of the code. Details of the algorithm are further elaborated in "Algorithm Details for tfp.glm.fit_sparse" below.

model = tfp.glm.Bernoulli()
model_coefficients_start = tf.zeros(x.shape[-1], np.float32)

@tf.function(autograph=False)
def fit_model():
  return tfp.glm.fit_sparse(
      model_matrix=tf.convert_to_tensor(x),
      response=tf.convert_to_tensor(y),
      model=model,
      model_coefficients_start=model_coefficients_start,
      l1_regularizer=800.,
      l2_regularizer=None,
      maximum_iterations=10,
      maximum_full_sweeps_per_iteration=10,
      tolerance=1e-6,
      learning_rate=None)

model_coefficients, is_converged, num_iter = [
    t.numpy() for t in fit_model()]

coefs_comparison = pd.DataFrame({
  'Learned': model_coefficients,
  'True': model_coefficients_true,
})

print(('is_converged: {}\n'
       '    num_iter: {}\n\n'
       'Coefficients:').format(
    is_converged,
    num_iter))
coefs_comparison
is_converged: True
    num_iter: 1

Coefficients:

Note that the learned coefficients have the same sparsity pattern as the true coefficients.

# Save the learned coefficients to a file.
with tf.io.gfile.GFile('{}/model_coefficients_prox.csv'.format(DATA_DIR),
                       'w') as f:
  np.savetxt(f, model_coefficients, delimiter=',')

Compare to R's glmnet

We compare the output of coordinatewise proximal gradient descent to that of R's glmnet, which uses a similar algorithm.

NOTE: To execute this section, you must switch to an R colab runtime.

suppressMessages({ library('glmnet') })
data_dir <- '/tmp/glm_example'

x <- as.matrix(read.csv(paste(data_dir, '/x.csv', sep=''),
                        header=FALSE))
y <- as.matrix(read.csv(paste(data_dir, '/y.csv', sep=''),
                        header=FALSE, colClasses='integer'))
fit <- glmnet(
  x = x,
  y = y,
  family = "binomial",  # Logistic regression
  alpha = 1,  # corresponds to l1_weight = 1, l2_weight = 0
  standardize = FALSE,
  intercept = FALSE,
  thresh = 1e-30,
  type.logistic = "Newton"
)
write.csv(as.matrix(coef(fit, 0.008)),
          paste(data_dir, '/model_coefficients_glmnet.csv', sep=''),
          row.names=FALSE)

Compare R, TFP and true coefficients (Note: back to Python kernel)

DATA_DIR = '/tmp/glm_example'

with tf.io.gfile.GFile('{}/model_coefficients_glmnet.csv'.format(DATA_DIR),
                       'r') as f:
  model_coefficients_glmnet = np.loadtxt(
      f, skiprows=2)  # Skip column name and intercept.

with tf.io.gfile.GFile('{}/model_coefficients_prox.csv'.format(DATA_DIR),
                       'r') as f:
  model_coefficients_prox = np.loadtxt(f)

with tf.io.gfile.GFile(
    '{}/model_coefficients_true.csv'.format(DATA_DIR), 'r') as f:
  model_coefficients_true = np.loadtxt(f)
coefs_comparison = pd.DataFrame({
  'TFP': model_coefficients_prox,
  'R': model_coefficients_glmnet,
  'True': model_coefficients_true,
})
coefs_comparison

Algorithm Details for tfp.glm.fit_sparse

We present the algorithm as a sequence of three modifications to Newton's method. In each one, the update rule for $\beta$ is based on a vector $s$ and a matrix $H$ which approximate the gradient and Hessian of the log-likelihood. In step $t$, we choose a coordinate $j^{(t)}$ to change, and we update $\beta$ according to the update rule:

$$
\begin{align*}
u^{(t)} &:= \frac{ \left( s^{(t)} \right)_{j^{(t)}} }{ \left( H^{(t)} \right)_{j^{(t)},\, j^{(t)}} } \\
\beta^{(t+1)} &:= \beta^{(t)} - \alpha\, u^{(t)} \,\text{onehot}(j^{(t)})
\end{align*}
$$

This update is a Newton-like step with learning rate $\alpha$. Except for the final piece (L1 regularization), the modifications below differ only in how they update $s$ and $H$.
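In code, one step of this scheme might look like the following minimal sketch (our own illustration; s, H, and the rule for selecting the coordinate j are whatever the particular variant below prescribes):

import numpy as np

def coordinate_update(beta, s, H, j, alpha=1.):
  # Newton-like step restricted to coordinate j: u = s_j / H_{jj}.
  u = s[j] / H[j, j]
  beta = beta.copy()
  beta[j] -= alpha * u  # beta <- beta - alpha * u * onehot(j)
  return beta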

Starting point: Coordinatewise Newton's method

In coordinatewise Newton's method, we set $s$ and $H$ to the true gradient and Hessian of the log-likelihood:

$$
\begin{align*}
s^{(t)}_{\text{vanilla}} &:= \left( \nabla_\beta\, \ell(\beta \,;\, \mathbf{x}, \mathbf{y}) \right)_{\beta = \beta^{(t)}} \\
H^{(t)}_{\text{vanilla}} &:= \left( \nabla^2_\beta\, \ell(\beta \,;\, \mathbf{x}, \mathbf{y}) \right)_{\beta = \beta^{(t)}}
\end{align*}
$$

Fewer evaluations of the gradient and Hessian

The gradient and Hessian of the log-likelihood are often expensive to compute, so it is frequently worthwhile to approximate them. We can do so as follows:

  • Usually, approximate the Hessian as locally constant and approximate the gradient to first order using the (approximate) Hessian (see the sketch after this list):

$$
\begin{align*}
H_{\text{approx}}^{(t+1)} &:= H^{(t)} \\
s_{\text{approx}}^{(t+1)} &:= s^{(t)} + H^{(t)} \left( \beta^{(t+1)} - \beta^{(t)} \right)
\end{align*}
$$
  • Occasionally, perform a "vanilla" update step as above, setting $s^{(t+1)}$ to the exact gradient and $H^{(t+1)}$ to the exact Hessian of the log-likelihood, evaluated at $\beta^{(t+1)}$.
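The cheap between-step bookkeeping from the first bullet is just the following (a sketch in the notation above, where beta_prev and beta_next stand for $\beta^{(t)}$ and $\beta^{(t+1)}$):

# Hold the Hessian approximation fixed across the step...
H_next = H_prev
# ...and correct the gradient approximation to first order.
s_next = s_prev + H_prev @ (beta_next - beta_prev)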

Substitute negative Fisher information for Hessian

To further reduce the cost of the vanilla update steps, we can set $H$ to the negative Fisher information matrix (efficiently computable using the formulas in "Fitting GLM Parameters to Data" below) rather than the exact Hessian:

$$
\begin{align*}
H_{\text{Fisher}}^{(t+1)} &:= \mathbb{E}_{Y_i \sim p_{\text{OEF}(m, T)}(\cdot | \theta = h(x_i^\top \beta^{(t+1)}), \phi)} \left[ \left( \nabla_\beta^2\, \ell(\beta\,;\, \mathbf{x}, \mathbf{Y}) \right)_{\beta = \beta^{(t+1)}} \right] \\
&= -\mathbf{x}^\top \,\text{diag}\left( \frac{ \phi\, {\textbf{Mean}_T}'(\mathbf{x} \beta^{(t+1)})^2 }{ {\textbf{Var}_T}(\mathbf{x} \beta^{(t+1)}) }\right)\, \mathbf{x} \\
s_{\text{Fisher}}^{(t+1)} &:= s_{\text{vanilla}}^{(t+1)} \\
&= \mathbf{x}^\top \,\text{diag}\left(\frac{ {\textbf{Mean}_T}'(\mathbf{x} \beta^{(t+1)}) }{ {\textbf{Var}_T}(\mathbf{x} \beta^{(t+1)}) }\right) \left(\mathbf{T}(\mathbf{y}) - {\textbf{Mean}_T}(\mathbf{x} \beta^{(t+1)})\right)
\end{align*}
$$

L1 Regularization via Proximal Gradient Descent

To incorporate L1 regularization, we replace the update rule

$$\beta^{(t+1)} := \beta^{(t)} - \alpha\, u^{(t)} \,\text{onehot}(j^{(t)})$$

with the more general update rule

$$
\begin{align*}
\gamma^{(t)} &:= -\frac{\alpha\, r_{\text{L1}}}{\left(H^{(t)}\right)_{j^{(t)},\, j^{(t)}}} \\
\left(\beta_{\text{reg}}^{(t+1)}\right)_j &:= \begin{cases}
\beta^{(t+1)}_j &\text{if } j \neq j^{(t)} \\
\text{SoftThreshold}\left( \beta^{(t)}_j - \alpha\, u^{(t)},\ \gamma^{(t)} \right) &\text{if } j = j^{(t)}
\end{cases}
\end{align*}
$$

where $r_{\text{L1}} > 0$ is a supplied constant (the L1 regularization coefficient) and $\text{SoftThreshold}$ is the soft thresholding operator, defined by

$$
\text{SoftThreshold}(\beta, \gamma) := \begin{cases}
\beta + \gamma &\text{if } \beta < -\gamma \\
0 &\text{if } -\gamma \leq \beta \leq \gamma \\
\beta - \gamma &\text{if } \beta > \gamma.
\end{cases}
$$
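Equivalently, in NumPy (a one-line sketch of the operator defined above):

import numpy as np

def soft_threshold(beta, gamma):
  # Shrink beta toward zero by gamma; values in [-gamma, gamma] map to exactly 0.
  return np.sign(beta) * np.maximum(np.abs(beta) - gamma, 0.)

It is this clamping to exactly zero that produces sparse fitted coefficients.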

This update rule has the following two motivating properties, which we explain below:

  1. In the limiting case $r_{\text{L1}} \to 0$ (i.e., no L1 regularization), this update rule is identical to the original update rule.

  2. This update rule can be interpreted as applying a proximity operator whose fixed point is the solution to the L1-regularized minimization problem

$$
\underset{\beta - \beta^{(t)} \in \text{span}\{ \text{onehot}(j^{(t)}) \}}{\text{arg min}} \left( -\ell(\beta \,;\, \mathbf{x}, \mathbf{y}) + r_{\text{L1}} \left\lVert \beta \right\rVert_1 \right).
$$

Degenerate case $r_{\text{L1}} = 0$ recovers the original update rule

To see (1), note that if $r_{\text{L1}} = 0$ then $\gamma^{(t)} = 0$, hence

$$
\begin{align*}
\left(\beta_{\text{reg}}^{(t+1)}\right)_{j^{(t)}} &= \text{SoftThreshold}\left( \beta^{(t)}_{j^{(t)}} - \alpha\, u^{(t)},\ 0 \right) \\
&= \beta^{(t)}_{j^{(t)}} - \alpha\, u^{(t)}.
\end{align*}
$$

Hence

$$
\begin{align*}
\beta_{\text{reg}}^{(t+1)} &= \beta^{(t)} - \alpha\, u^{(t)} \,\text{onehot}(j^{(t)}) \\
&= \beta^{(t+1)}.
\end{align*}
$$

Proximity operator whose fixed point is the regularized MLE

To see (2), first note (see [3]) that for any $\gamma > 0$, the update rule

$$
\left(\beta_{\text{exact-prox}, \gamma}^{(t+1)}\right)_{j^{(t)}} := \text{prox}_{\gamma \lVert \cdot \rVert_1} \left( \beta^{(t)}_{j^{(t)}} + \frac{\gamma}{r_{\text{L1}}} \left( \left( \nabla_\beta\, \ell(\beta \,;\, \mathbf{x}, \mathbf{y}) \right)_{\beta = \beta^{(t)}} \right)_{j^{(t)}} \right)
$$

satisfies (2), where $\text{prox}$ is the proximity operator (see Yu [4], where this operator is denoted $\mathsf{P}$). The right-hand side of the above equation is computed in [2]:

$$
\left(\beta_{\text{exact-prox}, \gamma}^{(t+1)}\right)_{j^{(t)}} = \text{SoftThreshold}\left( \beta^{(t)}_{j^{(t)}} + \frac{\gamma}{r_{\text{L1}}} \left( \left( \nabla_\beta\, \ell(\beta \,;\, \mathbf{x}, \mathbf{y}) \right)_{\beta = \beta^{(t)}} \right)_{j^{(t)}},\ \gamma \right).
$$

In particular, setting $\gamma = \gamma^{(t)} = -\frac{\alpha\, r_{\text{L1}}}{\left(H^{(t)}\right)_{j^{(t)}, j^{(t)}}}$ (note that $\gamma^{(t)} > 0$ as long as the negative log-likelihood is convex), we obtain the update rule

$$
\left(\beta_{\text{exact-prox}, \gamma^{(t)}}^{(t+1)}\right)_{j^{(t)}} = \text{SoftThreshold}\left( \beta^{(t)}_{j^{(t)}} - \alpha \frac{ \left( \left( \nabla_\beta\, \ell(\beta \,;\, \mathbf{x}, \mathbf{y}) \right)_{\beta = \beta^{(t)}} \right)_{j^{(t)}} }{ \left(H^{(t)}\right)_{j^{(t)}, j^{(t)}} },\ \gamma^{(t)} \right).
$$

We then replace the exact gradient $\left( \nabla_\beta\, \ell(\beta \,;\, \mathbf{x}, \mathbf{y}) \right)_{\beta = \beta^{(t)}}$ with its approximation $s^{(t)}$, obtaining

$$
\begin{align*}
\left(\beta_{\text{exact-prox}, \gamma^{(t)}}^{(t+1)}\right)_{j^{(t)}} &\approx \text{SoftThreshold}\left( \beta^{(t)}_{j^{(t)}} - \alpha \frac{ \left(s^{(t)}\right)_{j^{(t)}} }{ \left(H^{(t)}\right)_{j^{(t)}, j^{(t)}} },\ \gamma^{(t)} \right) \\
&= \text{SoftThreshold}\left( \beta^{(t)}_{j^{(t)}} - \alpha\, u^{(t)},\ \gamma^{(t)} \right).
\end{align*}
$$

Hence

$$\beta_{\text{exact-prox}, \gamma^{(t)}}^{(t+1)} \approx \beta_{\text{reg}}^{(t+1)}.$$

Derivation of GLM Facts

In this section we state in full detail and derive the results about GLMs that are used in the preceding sections. Then, we use TensorFlow's gradients to numerically verify the derived formulas for gradient of the log-likelihood and Fisher information.

Score and Fisher information

Consider a family of probability distributions parameterized by a parameter vector $\theta$, having probability densities $\left\{p(\cdot | \theta)\right\}_{\theta \in \mathcal{T}}$. The score of an outcome $y$ at parameter vector $\theta_0$ is defined to be the gradient of the log likelihood of $y$ (evaluated at $\theta_0$), that is,

$$\text{score}(y, \theta_0) := \left[\nabla_\theta\, \log p(y | \theta)\right]_{\theta=\theta_0}.$$

Claim: Expectation of the score is zero

Under mild regularity conditions (permitting us to pass differentiation under the integral),

$$\mathbb{E}_{Y \sim p(\cdot | \theta=\theta_0)}\left[\text{score}(Y, \theta_0)\right] = 0.$$

Proof

We have

$$
\begin{align*}
\mathbb{E}_{Y \sim p(\cdot | \theta=\theta_0)}\left[\text{score}(Y, \theta_0)\right]
&:= \mathbb{E}_{Y \sim p(\cdot | \theta=\theta_0)}\left[\left(\nabla_\theta \log p(Y|\theta)\right)_{\theta=\theta_0}\right] \\
&\stackrel{\text{(1)}}{=} \mathbb{E}_{Y \sim p(\cdot | \theta=\theta_0)}\left[\frac{\left(\nabla_\theta p(Y|\theta)\right)_{\theta=\theta_0}}{p(Y|\theta=\theta_0)}\right] \\
&\stackrel{\text{(2)}}{=} \int_{\mathcal{Y}} \left[\frac{\left(\nabla_\theta p(y|\theta)\right)_{\theta=\theta_0}}{p(y|\theta=\theta_0)}\right] p(y | \theta=\theta_0)\, dy \\
&= \int_{\mathcal{Y}} \left(\nabla_\theta p(y|\theta)\right)_{\theta=\theta_0}\, dy \\
&\stackrel{\text{(3)}}{=} \left[\nabla_\theta \left(\int_{\mathcal{Y}} p(y|\theta)\, dy\right) \right]_{\theta=\theta_0} \\
&\stackrel{\text{(4)}}{=} \left[\nabla_\theta\, 1 \right]_{\theta=\theta_0} \\
&= 0,
\end{align*}
$$

where we have used: (1) chain rule for differentiation, (2) definition of expectation, (3) passing differentiation under the integral sign (using the regularity conditions), (4) the integral of a probability density is 1.
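As a quick numeric illustration (our own, not part of the derivation): for the family $\text{Normal}(\theta, 1)$ the score is $\text{score}(y, \theta_0) = y - \theta_0$, whose expectation under $Y \sim \text{Normal}(\theta_0, 1)$ is indeed zero.

import numpy as np

theta0 = 1.7
rng = np.random.default_rng(seed=1)
samples = rng.normal(loc=theta0, scale=1., size=100_000)
# Monte Carlo estimate of E[score(Y, theta0)] = E[Y - theta0]; should be ~0.
print((samples - theta0).mean())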

Claim (Fisher information): Variance of the score equals negative expected Hessian of the log likelihood

Under mild regularity conditions (permitting us to pass differentiation under the integral),

$$\mathbb{E}_{Y \sim p(\cdot | \theta=\theta_0)}\left[ \text{score}(Y, \theta_0)\, \text{score}(Y, \theta_0)^\top \right] = -\mathbb{E}_{Y \sim p(\cdot | \theta=\theta_0)}\left[ \left(\nabla_\theta^2 \log p(Y | \theta)\right)_{\theta=\theta_0} \right]$$

where $\nabla_\theta^2 F$ denotes the Hessian matrix, whose $(i, j)$ entry is $\frac{\partial^2 F}{\partial \theta_i \partial \theta_j}$.

The left-hand side of this equation is called the Fisher information of the family $\left\{p(\cdot | \theta)\right\}_{\theta \in \mathcal{T}}$ at parameter vector $\theta_0$.

Proof of claim

We have

$$
\begin{align*}
\mathbb{E}_{Y \sim p(\cdot | \theta=\theta_0)}\left[ \left(\nabla_\theta^2 \log p(Y | \theta)\right)_{\theta=\theta_0} \right]
&\stackrel{\text{(1)}}{=} \mathbb{E}_{Y \sim p(\cdot | \theta=\theta_0)}\left[ \left(\nabla_\theta^\top \frac{ \nabla_\theta p(Y | \theta) }{ p(Y|\theta) }\right)_{\theta=\theta_0} \right] \\
&\stackrel{\text{(2)}}{=} \mathbb{E}_{Y \sim p(\cdot | \theta=\theta_0)}\left[ \frac{ \left(\nabla^2_\theta p(Y | \theta)\right)_{\theta=\theta_0} }{ p(Y|\theta=\theta_0) } - \left(\frac{ \left(\nabla_\theta\, p(Y|\theta)\right)_{\theta=\theta_0} }{ p(Y|\theta=\theta_0) }\right) \left(\frac{ \left(\nabla_\theta\, p(Y|\theta)\right)_{\theta=\theta_0} }{ p(Y|\theta=\theta_0) }\right)^\top \right] \\
&\stackrel{\text{(3)}}{=} \mathbb{E}_{Y \sim p(\cdot | \theta=\theta_0)}\left[ \frac{ \left(\nabla^2_\theta p(Y | \theta)\right)_{\theta=\theta_0} }{ p(Y|\theta=\theta_0) } - \text{score}(Y, \theta_0) \,\text{score}(Y, \theta_0)^\top \right],
\end{align*}
$$

where we have used (1) chain rule for differentiation, (2) quotient rule for differentiation, (3) chain rule again, in reverse.

To complete the proof, it suffices to show that

$$
\mathbb{E}_{Y \sim p(\cdot | \theta=\theta_0)}\left[ \frac{ \left(\nabla^2_\theta p(Y | \theta)\right)_{\theta=\theta_0} }{ p(Y|\theta=\theta_0) } \right] \stackrel{\text{?}}{=} 0.
$$

To do that, we pass differentiation under the integral sign twice:

$$
\begin{align*}
\mathbb{E}_{Y \sim p(\cdot | \theta=\theta_0)}\left[ \frac{ \left(\nabla^2_\theta p(Y | \theta)\right)_{\theta=\theta_0} }{ p(Y|\theta=\theta_0) } \right]
&= \int_{\mathcal{Y}} \left[ \frac{ \left(\nabla^2_\theta p(y | \theta)\right)_{\theta=\theta_0} }{ p(y|\theta=\theta_0) } \right] p(y | \theta=\theta_0)\, dy \\
&= \int_{\mathcal{Y}} \left(\nabla^2_\theta p(y | \theta)\right)_{\theta=\theta_0}\, dy \\
&= \left[ \nabla_\theta^2 \left( \int_{\mathcal{Y}} p(y | \theta)\, dy \right) \right]_{\theta=\theta_0} \\
&= \left[ \nabla_\theta^2\, 1 \right]_{\theta=\theta_0} \\
&= 0.
\end{align*}
$$

Lemma about the derivative of the log partition function

If $a$, $b$ and $c$ are scalar-valued functions, $c$ twice differentiable, such that the family of distributions $\left\{p(\cdot | \theta)\right\}_{\theta \in \mathcal{T}}$ defined by

$$p(y|\theta) = a(y) \exp\left(b(y)\, \theta - c(\theta)\right)$$

satisfies the mild regularity conditions that permit passing differentiation with respect to $\theta$ under an integral with respect to $y$, then

$$\mathbb{E}_{Y \sim p(\cdot | \theta=\theta_0)} \left[ b(Y) \right] = c'(\theta_0)$$

and

$$\text{Var}_{Y \sim p(\cdot | \theta=\theta_0)} \left[ b(Y) \right] = c''(\theta_0).$$

(Here $'$ denotes differentiation, so $c'$ and $c''$ are the first and second derivatives of $c$.)
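As a quick sanity check of the lemma (our own example, not part of the derivation), take the Poisson family written in this form: $p(y|\theta) = \frac{1}{y!} \exp\left(y\, \theta - e^\theta\right)$, so $b(y) = y$ and $c(\theta) = e^\theta$, and both $c'(\theta_0)$ and $c''(\theta_0)$ equal $e^{\theta_0}$:

import numpy as np

theta0 = 0.3
rng = np.random.default_rng(seed=0)
# Sampling Y ~ Poisson(exp(theta0)) matches p(y|theta0) above.
samples = rng.poisson(lam=np.exp(theta0), size=100_000)
print(samples.mean(), np.exp(theta0))  # E[b(Y)] should be close to c'(theta0)
print(samples.var(), np.exp(theta0))   # Var[b(Y)] should be close to c''(theta0)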

Proof

For this family of distributions, we have $\text{score}(y, \theta_0) = b(y) - c'(\theta_0)$. The first equation then follows from the fact that $\mathbb{E}_{Y \sim p(\cdot | \theta=\theta_0)} \left[ \text{score}(Y, \theta_0) \right] = 0$. Next, we have

$$
\begin{align*}
\text{Var}_{Y \sim p(\cdot | \theta=\theta_0)} \left[ b(Y) \right]
&= \mathbb{E}_{Y \sim p(\cdot | \theta=\theta_0)} \left[ \left(b(Y) - c'(\theta_0)\right)^2 \right] \\
&= \text{the one entry of } \mathbb{E}_{Y \sim p(\cdot | \theta=\theta_0)} \left[ \text{score}(Y, \theta_0)\, \text{score}(Y, \theta_0)^\top \right] \\
&= \text{the one entry of } -\mathbb{E}_{Y \sim p(\cdot | \theta=\theta_0)} \left[ \left(\nabla_\theta^2 \log p(Y | \theta)\right)_{\theta=\theta_0} \right] \\
&= -\mathbb{E}_{Y \sim p(\cdot | \theta=\theta_0)} \left[ -c''(\theta_0) \right] \\
&= c''(\theta_0).
\end{align*}
$$

Overdispersed Exponential Family

A (scalar) overdispersed exponential family is a family of distributions whose densities take the form

$$p_{\text{OEF}(m, T)}(y\,|\, \theta, \phi) = m(y, \phi) \exp\left(\frac{\theta\, T(y) - A(\theta)}{\phi}\right),$$

where $m$ and $T$ are known scalar-valued functions, and $\theta$ and $\phi$ are scalar parameters.

[Note that $A$ is overdetermined: for any $\phi_0$, the function $A$ is completely determined by the constraint that $\int p_{\text{OEF}(m, T)}(y\ |\ \theta, \phi=\phi_0)\, dy = 1$ for all $\theta$. The $A$'s produced by different values of $\phi_0$ must all be the same, which places a constraint on the functions $m$ and $T$.]
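For example (an illustration we add here), the $\text{Normal}(\theta, \sigma^2)$ family with known variance is an overdispersed exponential family with $T(y) = y$, $A(\theta) = \theta^2/2$, $\phi = \sigma^2$, and $m(y, \phi) = e^{-y^2/(2\phi)} / \sqrt{2\pi\phi}$, since

$$m(y, \phi) \exp\left(\frac{\theta\, y - \theta^2/2}{\phi}\right) = \frac{1}{\sqrt{2\pi\phi}} \exp\left(-\frac{(y - \theta)^2}{2\phi}\right).$$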

Mean and variance of the sufficient statistic

Under the same conditions as "Lemma about the derivative of the log partition function," we have

$$\mathbb{E}_{Y \sim p_{\text{OEF}(m, T)}(\cdot | \theta, \phi)} \left[ T(Y) \right] = A'(\theta)$$

and

$$\text{Var}_{Y \sim p_{\text{OEF}(m, T)}(\cdot | \theta, \phi)} \left[ T(Y) \right] = \phi\, A''(\theta).$$

Proof

By "Lemma about the derivative of the log partition function," we have

$$\mathbb{E}_{Y \sim p_{\text{OEF}(m, T)}(\cdot | \theta, \phi)} \left[ \frac{T(Y)}{\phi} \right] = \frac{A'(\theta)}{\phi}$$

and

$$\text{Var}_{Y \sim p_{\text{OEF}(m, T)}(\cdot | \theta, \phi)} \left[ \frac{T(Y)}{\phi} \right] = \frac{A''(\theta)}{\phi}.$$

The result then follows from the fact that expectation is linear ($\mathbb{E}[aX] = a\,\mathbb{E}[X]$) and variance is degree-2 homogeneous ($\text{Var}[aX] = a^2\, \text{Var}[X]$).

Generalized Linear Model

In a generalized linear model, a predictive distribution for the response variable $Y$ is associated with a vector of observed predictors $x$. The distribution is a member of an overdispersed exponential family, and the parameter $\theta$ is replaced by $h(\eta)$ where $h$ is a known function, $\eta := x^\top \beta$ is the so-called linear response, and $\beta$ is a vector of parameters (regression coefficients) to be learned. In general the dispersion parameter $\phi$ could be learned too, but in our setup we will treat $\phi$ as known. So our setup is

$$Y \sim p_{\text{OEF}(m, T)}(\cdot\,|\, \theta = h(\eta), \phi)$$

where the model structure is characterized by the distribution $p_{\text{OEF}(m, T)}$ and the function $h$, which converts the linear response to parameters.

Traditionally, the mapping from linear response $\eta$ to mean $\mu := \mathbb{E}_{Y \sim p_{\text{OEF}(m, T)}(\cdot\,|\, \theta = h(\eta), \phi)}\left[ Y \right]$ is denoted

$$\mu = g^{-1}(\eta).$$

This mapping is required to be one-to-one, and its inverse, $g$, is called the link function for this GLM. Typically, one describes a GLM by naming its link function and its family of distributions -- e.g., a "GLM with Bernoulli distribution and logit link function" (also known as a logistic regression model). In order to fully characterize the GLM, the function $h$ must also be specified. If $h$ is the identity, then $g$ is said to be the canonical link function.
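As a concrete worked instance (our own illustration): for the Bernoulli family written in the form above with $T(y) = y$ and $\phi = 1$, one has $A(\theta) = \log(1 + e^\theta)$. Taking $h$ to be the identity gives

$$\mu = A'(\eta) = \frac{1}{1 + e^{-\eta}},$$

i.e. the sigmoid, whose inverse $g(\mu) = \log\frac{\mu}{1-\mu}$ is the logit; hence logit is the canonical link for the Bernoulli family.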

Claim: Expressing $h'$ in terms of the sufficient statistic

Define

$${\text{Mean}_T}(\eta) := \mathbb{E}_{Y \sim p_{\text{OEF}(m, T)}(\cdot | \theta = h(\eta), \phi)} \left[ T(Y) \right]$$

and

$${\text{Var}_T}(\eta) := \text{Var}_{Y \sim p_{\text{OEF}(m, T)}(\cdot | \theta = h(\eta), \phi)} \left[ T(Y) \right].$$

Then we have

$$h'(\eta) = \frac{\phi\, {\text{Mean}_T}'(\eta)}{{\text{Var}_T}(\eta)}.$$

Proof

By "Mean and variance of the sufficient statistic," we have

$${\text{Mean}_T}(\eta) = A'(h(\eta)).$$

Differentiating with the chain rule, we obtain

$${\text{Mean}_T}'(\eta) = A''(h(\eta))\, h'(\eta),$$

and by "Mean and variance of the sufficient statistic,"

$$\cdots = \frac{1}{\phi}\, {\text{Var}_T}(\eta)\ h'(\eta).$$

The conclusion follows.

Fitting GLM Parameters to Data

The properties derived above lend themselves very well to fitting GLM parameters $\beta$ to a data set. Quasi-Newton methods such as Fisher scoring rely on the gradient of the log likelihood and the Fisher information, which we now show can be computed especially efficiently for a GLM.

Suppose we have observed predictor vectors $x_i$ and associated scalar responses $y_i$. In matrix form, we'll say we have observed predictors $\mathbf{x}$ and response $\mathbf{y}$, where $\mathbf{x}$ is the matrix whose $i$th row is $x_i^\top$ and $\mathbf{y}$ is the vector whose $i$th element is $y_i$. The log likelihood of parameters $\beta$ is then

$$\ell(\beta\,;\, \mathbf{x}, \mathbf{y}) = \sum_{i=1}^{N} \log p_{\text{OEF}(m, T)}(y_i\,|\, \theta = h(x_i^\top \beta), \phi).$$

For a single data sample

To simplify the notation, let's first consider the case of a single data point, $N=1$; then we will extend to the general case by additivity.

Gradient

We have

$$
\begin{align*}
\ell(\beta\,;\, x, y) &= \log p_{\text{OEF}(m, T)}(y\,|\, \theta = h(x^\top \beta), \phi) \\
&= \log m(y, \phi) + \frac{\theta\, T(y) - A(\theta)}{\phi}, \quad\text{where}\ \theta = h(x^\top \beta).
\end{align*}
$$

Hence by the chain rule,

$$\nabla_\beta \ell(\beta\,;\, x, y) = \frac{T(y) - A'(\theta)}{\phi}\, h'(x^\top \beta)\, x.$$

Separately, by "Mean and variance of the sufficient statistic," we have $A'(\theta) = {\text{Mean}_T}(x^\top \beta)$. Hence, by "Claim: Expressing $h'$ in terms of the sufficient statistic," we have

$$\cdots = \left(T(y) - {\text{Mean}_T}(x^\top \beta)\right) \frac{{\text{Mean}_T}'(x^\top \beta)}{{\text{Var}_T}(x^\top \beta)}\, x.$$

Hessian

Differentiating a second time, by the product rule we obtain

$$
\begin{align*}
\nabla_\beta^2 \ell(\beta\,;\, x, y)
&= \left[ -A''(h(x^\top \beta))\, h'(x^\top \beta) \right] h'(x^\top \beta)\, x x^\top + \left[ T(y) - A'(h(x^\top \beta)) \right] h''(x^\top \beta)\, x x^\top \\
&= \left( -{\text{Mean}_T}'(x^\top \beta)\, h'(x^\top \beta) + \left[T(y) - A'(h(x^\top \beta))\right] h''(x^\top \beta) \right) x x^\top.
\end{align*}
$$

Fisher information

By "Mean and variance of the sufficient statistic," we have

$$\mathbb{E}_{Y \sim p_{\text{OEF}(m, T)}(\cdot | \theta = h(x^\top \beta), \phi)} \left[ T(Y) - A'(h(x^\top \beta)) \right] = 0.$$

Hence

$$
\begin{align*}
\mathbb{E}_{Y \sim p_{\text{OEF}(m, T)}(\cdot | \theta = h(x^\top \beta), \phi)} \left[ \nabla_\beta^2 \ell(\beta\,;\, x, Y) \right]
&= -{\text{Mean}_T}'(x^\top \beta)\, h'(x^\top \beta)\, x x^\top \\
&= -\frac{\phi\, {\text{Mean}_T}'(x^\top \beta)^2}{{\text{Var}_T}(x^\top \beta)}\, x x^\top.
\end{align*}
$$

For multiple data samples

We now extend the $N=1$ case to the general case. Let $\boldsymbol{\eta} := \mathbf{x} \beta$ denote the vector whose $i$th coordinate is the linear response from the $i$th data sample. Let $\mathbf{T}$ (resp. ${\textbf{Mean}_T}$, resp. ${\textbf{Var}_T}$) denote the broadcasted (vectorized) function which applies the scalar-valued function $T$ (resp. ${\text{Mean}_T}$, resp. ${\text{Var}_T}$) to each coordinate. Then we have

$$
\begin{align*}
\nabla_\beta \ell(\beta\,;\, \mathbf{x}, \mathbf{y})
&= \sum_{i=1}^{N} \nabla_\beta \ell(\beta\,;\, x_i, y_i) \\
&= \sum_{i=1}^{N} \left(T(y_i) - {\text{Mean}_T}(x_i^\top \beta)\right) \frac{{\text{Mean}_T}'(x_i^\top \beta)}{{\text{Var}_T}(x_i^\top \beta)}\, x_i \\
&= \mathbf{x}^\top \,\text{diag}\left(\frac{ {\textbf{Mean}_T}'(\mathbf{x} \beta) }{ {\textbf{Var}_T}(\mathbf{x} \beta) }\right) \left(\mathbf{T}(\mathbf{y}) - {\textbf{Mean}_T}(\mathbf{x} \beta)\right)
\end{align*}
$$

and

$$
\begin{align*}
\mathbb{E}_{Y_i \sim p_{\text{OEF}(m, T)}(\cdot | \theta = h(x_i^\top \beta), \phi)} \left[ \nabla_\beta^2 \ell(\beta\,;\, \mathbf{x}, \mathbf{Y}) \right]
&= \sum_{i=1}^{N} \mathbb{E}_{Y_i \sim p_{\text{OEF}(m, T)}(\cdot | \theta = h(x_i^\top \beta), \phi)} \left[ \nabla_\beta^2 \ell(\beta\,;\, x_i, Y_i) \right] \\
&= \sum_{i=1}^{N} -\frac{\phi\, {\text{Mean}_T}'(x_i^\top \beta)^2}{{\text{Var}_T}(x_i^\top \beta)}\, x_i x_i^\top \\
&= -\mathbf{x}^\top \,\text{diag}\left( \frac{ \phi\, {\textbf{Mean}_T}'(\mathbf{x} \beta)^2 }{ {\textbf{Var}_T}(\mathbf{x} \beta) }\right)\, \mathbf{x},
\end{align*}
$$

where the fractions denote element-wise division.

Verifying the Formulas Numerically

We now verify the above formula for the gradient of the log likelihood numerically using tf.gradients, and verify the formula for the Fisher information with a Monte Carlo estimate using tf.hessians:

def VerifyGradientAndFIM():
  model = tfp.glm.BernoulliNormalCDF()
  model_matrix = np.array([[1., 5, -2],
                           [8, -1, 8]])

  def _naive_grad_and_hessian_loss_fn(x, response):
    # Computes gradient and Hessian of negative log likelihood using autodiff.
    predicted_linear_response = tf.linalg.matvec(model_matrix, x)
    log_probs = model.log_prob(response, predicted_linear_response)
    grad_loss = tf.gradients(-log_probs, [x])[0]
    hessian_loss = tf.hessians(-log_probs, [x])[0]
    return [grad_loss, hessian_loss]

  def _grad_neg_log_likelihood_and_fim_fn(x, response):
    # Computes gradient of negative log likelihood and Fisher information
    # matrix using the formulas above.
    predicted_linear_response = tf.linalg.matvec(model_matrix, x)
    mean, variance, grad_mean = model(predicted_linear_response)
    v = (response - mean) * grad_mean / variance
    grad_log_likelihood = tf.linalg.matvec(model_matrix, v, adjoint_a=True)
    w = grad_mean**2 / variance
    fisher_info = tf.linalg.matmul(
        model_matrix,
        w[..., tf.newaxis] * model_matrix,
        adjoint_a=True)
    return [-grad_log_likelihood, fisher_info]

  @tf.function(autograph=False)
  def compute_grad_hessian_estimates():
    # Monte Carlo estimate of E[Hessian(-LogLikelihood)], where the
    # expectation is as written in "Claim (Fisher information)" above.
    num_trials = 20
    trial_outputs = []
    np.random.seed(10)
    model_coefficients_ = np.random.random(size=(model_matrix.shape[1],))
    model_coefficients = tf.convert_to_tensor(model_coefficients_)
    for _ in range(num_trials):
      # Sample from the distribution of `model`
      response = np.random.binomial(
          1,
          scipy.stats.norm().cdf(np.matmul(model_matrix, model_coefficients_))
      ).astype(np.float64)
      trial_outputs.append(
          list(_naive_grad_and_hessian_loss_fn(model_coefficients, response)) +
          list(
              _grad_neg_log_likelihood_and_fim_fn(model_coefficients, response))
      )

    naive_grads = tf.stack(
        list(naive_grad for [naive_grad, _, _, _] in trial_outputs), axis=0)
    fancy_grads = tf.stack(
        list(fancy_grad for [_, _, fancy_grad, _] in trial_outputs), axis=0)

    average_hess = tf.reduce_mean(tf.stack(
        list(hess for [_, hess, _, _] in trial_outputs), axis=0), axis=0)
    [_, _, _, fisher_info] = trial_outputs[0]
    return naive_grads, fancy_grads, average_hess, fisher_info

  naive_grads, fancy_grads, average_hess, fisher_info = [
      t.numpy() for t in compute_grad_hessian_estimates()]

  print("Coordinatewise relative error between naively computed gradients and"
        " formula-based gradients (should be zero):\n{}\n".format(
            (naive_grads - fancy_grads) / naive_grads))

  print("Coordinatewise relative error between average of naively computed"
        " Hessian and formula-based FIM (should approach zero as num_trials"
        " -> infinity):\n{}\n".format(
            (average_hess - fisher_info) / average_hess))

VerifyGradientAndFIM()
Coordinatewise relative error between naively computed gradients and formula-based gradients (should be zero):
[[2.08845965e-16 1.67076772e-16 2.08845965e-16]
 [1.96118673e-16 3.13789877e-16 1.96118673e-16]
 [2.08845965e-16 1.67076772e-16 2.08845965e-16]
 [1.96118673e-16 3.13789877e-16 1.96118673e-16]
 [2.08845965e-16 1.67076772e-16 2.08845965e-16]
 [1.96118673e-16 3.13789877e-16 1.96118673e-16]
 [1.96118673e-16 3.13789877e-16 1.96118673e-16]
 [1.96118673e-16 3.13789877e-16 1.96118673e-16]
 [2.08845965e-16 1.67076772e-16 2.08845965e-16]
 [1.96118673e-16 3.13789877e-16 1.96118673e-16]
 [2.08845965e-16 1.67076772e-16 2.08845965e-16]
 [1.96118673e-16 3.13789877e-16 1.96118673e-16]
 [1.96118673e-16 3.13789877e-16 1.96118673e-16]
 [1.96118673e-16 3.13789877e-16 1.96118673e-16]
 [1.96118673e-16 3.13789877e-16 1.96118673e-16]
 [1.96118673e-16 3.13789877e-16 1.96118673e-16]
 [1.96118673e-16 3.13789877e-16 1.96118673e-16]
 [2.08845965e-16 1.67076772e-16 2.08845965e-16]
 [1.96118673e-16 3.13789877e-16 1.96118673e-16]
 [2.08845965e-16 1.67076772e-16 2.08845965e-16]]

Coordinatewise relative error between average of naively computed Hessian and formula-based FIM (should approach zero as num_trials -> infinity):
[[0.00072369 0.00072369 0.00072369]
 [0.00072369 0.00072369 0.00072369]
 [0.00072369 0.00072369 0.00072369]]

References

[1]: Guo-Xun Yuan, Chia-Hua Ho and Chih-Jen Lin. An Improved GLMNET for L1-regularized Logistic Regression. Journal of Machine Learning Research, 13, 2012. http://www.jmlr.org/papers/volume13/yuan12a/yuan12a.pdf

[2]: skd. Derivation of Soft Thresholding Operator. 2018. https://math.stackexchange.com/q/511106

[3]: Wikipedia Contributors. Proximal gradient methods for learning. Wikipedia, The Free Encyclopedia, 2018. https://en.wikipedia.org/wiki/Proximal_gradient_methods_for_learning

[4]: Yao-Liang Yu. The Proximity Operator. https://www.cs.cmu.edu/~suvrit/teach/yaoliang_proximity.pdf