

Logistic regression using PyTorch

We show how to fit a logistic regression model using PyTorch. The negative log likelihood for this model is convex, so we can compute the globally optimal MLE, which makes it easy to compare against sklearn (and other implementations).

import sklearn
import scipy
import scipy.optimize
import matplotlib.pyplot as plt
from mpl_toolkits import mplot3d
from mpl_toolkits.mplot3d import Axes3D
import seaborn as sns
import warnings

warnings.filterwarnings("ignore")
import itertools
import time
from functools import partial
import os

import numpy as np
from scipy.special import logsumexp

np.set_printoptions(precision=3)
np.set_printoptions(formatter={"float": lambda x: "{0:0.3f}".format(x)})
import torch
import torch.nn as nn
import torchvision

print("torch version {}".format(torch.__version__))
if torch.cuda.is_available():
    print(torch.cuda.get_device_name(0))
    print("current device {}".format(torch.cuda.current_device()))
else:
    print("Torch cannot find GPU")


def set_seed(seed):
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)


use_cuda = torch.cuda.is_available()
device = torch.device("cuda:0" if use_cuda else "cpu")
# torch.backends.cudnn.benchmark = True
torch version 1.8.0+cu101
Tesla P100-PCIE-16GB
current device 0

Logistic regression using sklearn

We fit a binary logistic regression model (Iris-Virginica vs. the rest) on the Iris dataset.

# Fit the model using sklearn
import sklearn.datasets
from sklearn.model_selection import train_test_split

iris = sklearn.datasets.load_iris()
X = iris["data"]
y = (iris["target"] == 2).astype(int)  # 1 if Iris-Virginica, else 0
N, D = X.shape  # 150, 4

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)

from sklearn.linear_model import LogisticRegression

# We set C to a large number to turn off regularization.
# We don't fit the bias term to simplify the comparison below.
log_reg = LogisticRegression(solver="lbfgs", C=1e5, fit_intercept=False)
log_reg.fit(X_train, y_train)
w_mle_sklearn = np.ravel(log_reg.coef_)
print(w_mle_sklearn)
[-4.414 -9.111 6.539 12.686]

Automatic differentiation

In this section, we illustrate how to use autograd to compute the gradient of the negative log likelihood (NLL) for binary logistic regression. We first compute the gradient by hand, using the fact that the gradient of the average NLL is (1/N) X^T (mu - y), where mu = sigmoid(X w), and then check that PyTorch's autograd feature gives the same answer. (See also the JAX optimization colab.)

Computing gradients by hand

# Binary cross entropy
def BCE_with_logits(logits, targets):
    N = logits.shape[0]
    logits = logits.reshape(N, 1)
    logits_plus = np.hstack([np.zeros((N, 1)), logits])  # e^0=1
    logits_minus = np.hstack([np.zeros((N, 1)), -logits])
    logp1 = -logsumexp(logits_minus, axis=1)
    logp0 = -logsumexp(logits_plus, axis=1)
    logprobs = logp1 * targets + logp0 * (1 - targets)
    return -np.sum(logprobs) / N


# Compute using numpy
def sigmoid(x):
    return 0.5 * (np.tanh(x / 2.0) + 1)


def predict_logit(weights, inputs):
    return np.dot(inputs, weights)  # Already vectorized


def predict_prob(weights, inputs):
    return sigmoid(predict_logit(weights, inputs))


def NLL(weights, batch):
    X, y = batch
    logits = predict_logit(weights, X)
    return BCE_with_logits(logits, y)


def NLL_grad(weights, batch):
    X, y = batch
    N = X.shape[0]
    mu = predict_prob(weights, X)
    g = np.sum(np.dot(np.diag(mu - y), X), axis=0) / N
    return g
w_np = w_mle_sklearn
y_pred = predict_prob(w_np, X_test)
loss_np = NLL(w_np, (X_test, y_test))
grad_np = NLL_grad(w_np, (X_test, y_test))
print("params {}".format(w_np))
# print("pred {}".format(y_pred))
print("loss {}".format(loss_np))
print("grad {}".format(grad_np))
params [-4.414 -9.111 6.539 12.686]
loss 0.11824007099618833
grad [-0.235 -0.122 -0.198 -0.064]
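
As an extra sanity check (a sketch, not one of the cells above), we can compare NLL_grad against a finite-difference approximation using scipy.optimize.check_grad; the reported discrepancy should be small.

# Sanity check (sketch): finite-difference check of the hand-derived gradient
from scipy.optimize import check_grad

err = check_grad(NLL, NLL_grad, w_np, (X_test, y_test))
print("finite-difference discrepancy {:.2e}".format(err))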

PyTorch code

To compute the gradient using torch, we proceed as follows.

  • declare all the variables that you want to take derivatives with respect to using the requires_grad=True argument.

  • define the (scalar-output) objective function you want to differentiate in terms of these variables, and evaluate it at a point. This builds a computation graph and stores the intermediate tensors needed for the backward pass.

  • call objective.backward() to trigger backpropagation (the chain rule) on this graph.

  • extract the gradients from each variable using its variable.grad field. (These will be torch tensors.)

See the example below.

# data. By default, numpy uses double but torch uses float
X_train_t = torch.tensor(X_train, dtype=torch.float)
y_train_t = torch.tensor(y_train, dtype=torch.float)
X_test_t = torch.tensor(X_test, dtype=torch.float)
y_test_t = torch.tensor(y_test, dtype=torch.float)
# parameters
W = np.reshape(w_mle_sklearn, [D, 1])  # convert 1d vector to 2d matrix
w_torch = torch.tensor(W, requires_grad=True, dtype=torch.float)
# w_torch.requires_grad_()


# binary logistic regression in one line of Pytorch
def predict_t(w, X):
    y_pred = torch.sigmoid(torch.matmul(X, w))[:, 0]
    return y_pred  # returns a vector of N probabilities


y_pred = predict_t(w_torch, X_test_t)

# loss function is average NLL
criterion = torch.nn.BCELoss(reduction="mean")
loss_torch = criterion(y_pred, y_test_t)
print(loss_torch)

# Backprop
loss_torch.backward()
print(w_torch.grad)

# convert to numpy. We have to "detach" the gradient tracing feature
loss_torch = loss_torch.detach().numpy()
grad_torch = w_torch.grad[:, 0].detach().numpy()
tensor(0.1182, grad_fn=<BinaryCrossEntropyBackward>)
tensor([[-0.2353],
        [-0.1223],
        [-0.1976],
        [-0.0638]])
# Test
assert np.allclose(loss_np, loss_torch)
assert np.allclose(grad_np, grad_torch)
print("loss {}".format(loss_torch))
print("grad {}".format(grad_torch))
loss 0.11824005842208862
grad [-0.235 -0.122 -0.198 -0.064]
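
PyTorch also provides a fused, numerically safer loss, torch.nn.BCEWithLogitsLoss, which takes raw logits instead of probabilities. As a quick check (a sketch, not one of the original cells), it should reproduce the same loss without an explicit sigmoid:

# Sketch: same loss computed from raw logits via BCEWithLogitsLoss
logits_t = torch.matmul(X_test_t, torch.tensor(W, dtype=torch.float))[:, 0]
criterion_logits = torch.nn.BCEWithLogitsLoss(reduction="mean")
loss_logits = criterion_logits(logits_t, y_test_t)
print(loss_logits.item())
assert np.allclose(loss_np, loss_logits.item(), atol=1e-5)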

Batch optimization using L-BFGS

We will use L-BFGS (torch.optim.LBFGS) from PyTorch to fit the logistic regression model, and compare the result to sklearn.

set_seed(0)
params = torch.randn((D, 1), requires_grad=True)

optimizer = torch.optim.LBFGS([params], history_size=10)


def closure():
    optimizer.zero_grad()
    y_pred = predict_t(params, X_train_t)
    loss = criterion(y_pred, y_train_t)
    loss.backward()
    return loss


max_iter = 10
for i in range(max_iter):
    loss = optimizer.step(closure)
    print(loss.item())
1.994707465171814
0.06939578056335449
0.06907700002193451
0.06907700002193451
0.06907700002193451
0.06907700002193451
0.06907700002193451
0.06907700002193451
0.06907700002193451
0.06907700002193451
print("parameters from sklearn {}".format(w_mle_sklearn)) print("parameters from torch {}".format(params[:, 0]))
parameters from sklearn [-4.414 -9.111 6.539 12.686]
parameters from torch tensor([-4.4146, -9.1144, 6.5401, 12.6912], grad_fn=<SelectBackward>)
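
Since the NLL is convex, both solvers should reach essentially the same minimum. As an extra check (a sketch, not one of the cells above), we can compare the training loss at the two solutions:

# Sketch: evaluate the training NLL at the sklearn MLE and at the L-BFGS solution
with torch.no_grad():
    w_sk_t = torch.tensor(np.reshape(w_mle_sklearn, [D, 1]), dtype=torch.float)
    loss_sklearn = criterion(predict_t(w_sk_t, X_train_t), y_train_t)
    loss_lbfgs = criterion(predict_t(params, X_train_t), y_train_t)
print("training loss at sklearn MLE {:.6f}".format(loss_sklearn.item()))
print("training loss at L-BFGS solution {:.6f}".format(loss_lbfgs.item()))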
p_pred_np = predict_prob(w_np, X_test)
p_pred_t = predict_t(params, X_test_t)
p_pred = p_pred_t.detach().numpy()
np.set_printoptions(formatter={"float": lambda x: "{0:0.3f}".format(x)})
print(p_pred_np)
print(p_pred)
[0.002 0.000 1.000 0.012 0.002 0.000 0.000 0.979 0.740 0.000 0.706 0.000 0.000 0.000 0.000 0.001 1.000 0.000 0.009 1.000 0.000 0.650 0.000 1.000 0.094 0.998 1.000 1.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.998 0.000 0.000 0.000 0.000 0.999 0.000 0.000 0.000 0.000 0.000 0.281 0.909 0.000 0.999] [0.002 0.000 1.000 0.012 0.002 0.000 0.000 0.979 0.740 0.000 0.706 0.000 0.000 0.000 0.000 0.001 1.000 0.000 0.009 1.000 0.000 0.650 0.000 1.000 0.094 0.998 1.000 1.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.998 0.000 0.000 0.000 0.000 0.999 0.000 0.000 0.000 0.000 0.000 0.281 0.909 0.000 0.999]

Stochastic optimization using SGD

DataLoader

First we need a way to get minibatches of data.

from torch.utils.data import DataLoader, TensorDataset

# To make things interesting, we pick a batch size of B=33, which does not evenly divide N=100
dataset = TensorDataset(X_train_t, y_train_t)
B = 33
dataloader = DataLoader(dataset, batch_size=B, shuffle=True)
print(X_train_t.shape)
print(
    "{} examples divided into {} batches of size {}".format(
        len(dataloader.dataset), len(dataloader), dataloader.batch_size
    )
)

for i, batch in enumerate(dataloader):
    X, y = batch
    print(X.shape)
    print(y.shape)
torch.Size([100, 4])
100 examples divided into 4 batches of size 33
torch.Size([33, 4])
torch.Size([33])
torch.Size([33, 4])
torch.Size([33])
torch.Size([33, 4])
torch.Size([33])
torch.Size([1, 4])
torch.Size([1])
datastream = iter(dataloader)
for i in range(3):
    X, y = next(datastream)
    print(y)
tensor([0., 1., 0., 0., 0., 0., 1., 0., 1., 1., 0., 0., 0., 1., 0., 0., 1., 0., 1., 0., 0., 0., 0., 0., 1., 1., 1., 1., 0., 1., 1., 0., 0.]) tensor([1., 0., 0., 0., 0., 1., 0., 0., 0., 1., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 1., 1., 0., 0., 1., 1., 0., 0., 0.]) tensor([0., 0., 0., 1., 1., 0., 1., 1., 0., 0., 0., 1., 0., 1., 0., 1., 0., 0., 0., 1., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 1., 0., 1.])

Vanilla SGD training loop

set_seed(0)
params = torch.randn((D, 1), requires_grad=True)

nepochs = 100
nbatches = len(dataloader)
criterion = torch.nn.BCELoss(reduction="mean")
learning_rate = 1e-1
loss_trace = []
for epoch in range(nepochs):
    for b, batch in enumerate(dataloader):
        X, y = batch
        if params.grad is not None:
            params.grad.zero_()  # reset gradient to zero
        y_pred = predict_t(params, X)
        loss = criterion(y_pred, y)
        # print(f'epoch {epoch}, batch {b}, loss: {loss.item()}')
        loss_trace.append(loss.item())  # store a float, not the graph-attached tensor
        loss.backward()
        with torch.no_grad():
            params -= learning_rate * params.grad
    # print(f'end of epoch {epoch}, loss: {loss.item()}')

plt.figure()
plt.plot(loss_trace)
[<matplotlib.lines.Line2D at 0x7fccfc7af190>]
[Figure: training loss trace for vanilla SGD, per minibatch update]
# SGD does not converge to a value that is close to the batch solver...
print("parameters from sklearn {}".format(w_mle_sklearn))
print("parameters from torch {}".format(params[:, 0]))
parameters from sklearn [-4.414 -9.111 6.539 12.686]
parameters from torch tensor([-0.8046, -2.3841, 1.3471, 2.9677], grad_fn=<SelectBackward>)
# Predicted probabilities from SGD are very different to sklearn
# although the thresholded labels are similar
p_pred_np = predict_prob(w_np, X_test)
p_pred_t = predict_t(params, X_test_t)
p_pred = p_pred_t.detach().numpy()
print(p_pred_np)
print(p_pred)
[0.002 0.000 1.000 0.012 0.002 0.000 0.000 0.979 0.740 0.000 0.706 0.000 0.000 0.000 0.000 0.001 1.000 0.000 0.009 1.000 0.000 0.650 0.000 1.000 0.094 0.998 1.000 1.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.998 0.000 0.000 0.000 0.000 0.999 0.000 0.000 0.000 0.000 0.000 0.281 0.909 0.000 0.999] [0.156 0.000 0.976 0.227 0.179 0.000 0.062 0.680 0.570 0.092 0.487 0.000 0.000 0.000 0.000 0.135 0.877 0.125 0.207 0.904 0.000 0.471 0.000 0.876 0.298 0.784 0.859 0.842 0.000 0.000 0.000 0.000 0.063 0.000 0.000 0.793 0.094 0.000 0.000 0.000 0.803 0.107 0.119 0.000 0.000 0.100 0.396 0.553 0.085 0.779]
y_pred_np = p_pred_np > 0.5
y_pred = p_pred > 0.5
print(y_pred_np)
print(y_pred)
print(np.sum(y_pred_np == y_pred) / len(y_pred))
[False False True False False False False True True False True False False False False False True False False True False True False True False True True True False False False False False False False True False False False False True False False False False False False True False True] [False False True False False False False True True False False False False False False False True False False True False False False True False True True True False False False False False False False True False False False False True False False False False False False True False True] 0.96

Use Torch SGD optimizer

Instead of writing our own optimizer, we can use a torch optimizer. This should give identical results.

set_seed(0)
params = torch.randn((D, 1), requires_grad=True)

nepochs = 100
nbatches = len(dataloader)
criterion = torch.nn.BCELoss(reduction="mean")
learning_rate = 1e-1
loss_trace = []

# optimizer has a pointer to params, so it can mutate its state
optimizer = torch.optim.SGD([params], lr=learning_rate)

for epoch in range(nepochs):
    for b, batch in enumerate(dataloader):
        X, y = batch
        y_pred = predict_t(params, X)
        loss = criterion(y_pred, y)
        # print(f'epoch {epoch}, batch {b}, loss: {loss.item()}')
        loss_trace.append(loss.item())  # store a float, not the graph-attached tensor
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    # print(f'end of epoch {epoch}, loss: {loss.item()}')

plt.figure()
plt.plot(loss_trace)
[<matplotlib.lines.Line2D at 0x7fccfc091610>]
[Figure: training loss trace for torch.optim.SGD, per minibatch update]
print("parameters from sklearn {}".format(w_mle_sklearn)) print("parameters from torch {}".format(params[:, 0])) p_pred_np = predict_prob(w_np, X_test) p_pred_t = predict_t(params, X_test_t) p_pred = p_pred_t.detach().numpy() print("predictions from sklearn") print(p_pred_np) print("predictions from torch") print(p_pred) y_pred_np = p_pred_np > 0.5 y_pred = p_pred > 0.5 print("fraction of predicted labels that agree ", np.sum(y_pred_np == y_pred) / len(y_pred))
parameters from sklearn [-4.414 -9.111 6.539 12.686] parameters from torch tensor([-0.8046, -2.3841, 1.3471, 2.9677], grad_fn=<SelectBackward>) predictions from sklearn [0.002 0.000 1.000 0.012 0.002 0.000 0.000 0.979 0.740 0.000 0.706 0.000 0.000 0.000 0.000 0.001 1.000 0.000 0.009 1.000 0.000 0.650 0.000 1.000 0.094 0.998 1.000 1.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.998 0.000 0.000 0.000 0.000 0.999 0.000 0.000 0.000 0.000 0.000 0.281 0.909 0.000 0.999] predictions from torch [0.156 0.000 0.976 0.227 0.179 0.000 0.062 0.680 0.570 0.092 0.487 0.000 0.000 0.000 0.000 0.135 0.877 0.125 0.207 0.904 0.000 0.471 0.000 0.876 0.298 0.784 0.859 0.842 0.000 0.000 0.000 0.000 0.063 0.000 0.000 0.793 0.094 0.000 0.000 0.000 0.803 0.107 0.119 0.000 0.000 0.100 0.396 0.553 0.085 0.779] fraction of predicted labels that agree 0.96

Use momentum optimizer

Adding momentum helps a lot, and gives results which are very similar to batch optimization.

set_seed(0)
params = torch.randn((D, 1), requires_grad=True)

nepochs = 100
nbatches = len(dataloader)
criterion = torch.nn.BCELoss(reduction="mean")
learning_rate = 1e-1
loss_trace = []

# optimizer has a pointer to params, so it can mutate its state
optimizer = torch.optim.SGD([params], lr=learning_rate, momentum=0.9)

for epoch in range(nepochs):
    for b, batch in enumerate(dataloader):
        X, y = batch
        y_pred = predict_t(params, X)
        loss = criterion(y_pred, y)
        # print(f'epoch {epoch}, batch {b}, loss: {loss.item()}')
        loss_trace.append(loss.item())  # store a float, not the graph-attached tensor
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    # print(f'end of epoch {epoch}, loss: {loss.item()}')

plt.figure()
plt.plot(loss_trace)
[<matplotlib.lines.Line2D at 0x7fccfbf7c4d0>]
[Figure: training loss trace for SGD with momentum, per minibatch update]
print("parameters from sklearn {}".format(w_mle_sklearn)) print("parameters from torch {}".format(params[:, 0])) p_pred_np = predict_prob(w_np, X_test) p_pred_t = predict_t(params, X_test_t) p_pred = p_pred_t.detach().numpy() print("predictions from sklearn") print(p_pred_np) print("predictions from torch") print(p_pred) y_pred_np = p_pred_np > 0.5 y_pred = p_pred > 0.5 print("fraction of predicted labels that agree ", np.sum(y_pred_np == y_pred) / len(y_pred))
parameters from sklearn [-4.414 -9.111 6.539 12.686] parameters from torch tensor([ -7.1823, -10.0136, 11.5410, 10.1771], grad_fn=<SelectBackward>) predictions from sklearn [0.002 0.000 1.000 0.012 0.002 0.000 0.000 0.979 0.740 0.000 0.706 0.000 0.000 0.000 0.000 0.001 1.000 0.000 0.009 1.000 0.000 0.650 0.000 1.000 0.094 0.998 1.000 1.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.998 0.000 0.000 0.000 0.000 0.999 0.000 0.000 0.000 0.000 0.000 0.281 0.909 0.000 0.999] predictions from torch [0.005 0.000 1.000 0.007 0.001 0.000 0.000 0.841 0.655 0.000 0.619 0.000 0.000 0.000 0.000 0.000 1.000 0.000 0.022 1.000 0.000 0.734 0.000 1.000 0.360 0.995 1.000 1.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.999 0.000 0.000 0.000 0.000 1.000 0.000 0.000 0.000 0.000 0.000 0.698 0.992 0.000 1.000] fraction of predicted labels that agree 0.98

Modules

We can define logistic regression as a multilayer perceptron (MLP) with no hidden layers. This can be defined as a sequential neural network module. Modules hide the parameters inside each layer, which makes it easy to construct complex models, as we will see later on.

Sequential model

# Make an MLP with no hidden layers
model = nn.Sequential(nn.Linear(D, 1, bias=False), nn.Sigmoid())
print(model)
print(model[0].weight)
print(model[0].bias)
Sequential(
  (0): Linear(in_features=4, out_features=1, bias=False)
  (1): Sigmoid()
)
Parameter containing:
tensor([[ 0.4141, 0.1839, -0.4931, 0.1987]], requires_grad=True)
None
# We set the parameters of the MLP by hand to match sklearn.
# Torch linear layer computes X*W' + b (see https://pytorch.org/docs/stable/generated/torch.nn.Linear.html)
# where X is N*Din, so W must be Dout*Din. Here Dout=1.
print(model[0].weight.shape)
print(w_np.shape)
w = np.reshape(w_np, [-1, 1]).transpose()
print(w.shape)
model[0].weight = nn.Parameter(torch.Tensor(w))
print(model[0].weight.shape)
torch.Size([1, 4])
(4,)
(1, 4)
torch.Size([1, 4])
p_pred_np = predict_prob(w_np, X_test)
p_pred_t = model(X_test_t).detach().numpy()[:, 0]
print(p_pred_np)
print(p_pred_t)
assert np.allclose(p_pred_np, p_pred_t)
[0.002 0.000 1.000 0.012 0.002 0.000 0.000 0.979 0.740 0.000 0.706 0.000 0.000 0.000 0.000 0.001 1.000 0.000 0.009 1.000 0.000 0.650 0.000 1.000 0.094 0.998 1.000 1.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.998 0.000 0.000 0.000 0.000 0.999 0.000 0.000 0.000 0.000 0.000 0.281 0.909 0.000 0.999] [0.002 0.000 1.000 0.012 0.002 0.000 0.000 0.979 0.740 0.000 0.706 0.000 0.000 0.000 0.000 0.001 1.000 0.000 0.009 1.000 0.000 0.650 0.000 1.000 0.094 0.998 1.000 1.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.998 0.000 0.000 0.000 0.000 0.999 0.000 0.000 0.000 0.000 0.000 0.281 0.909 0.000 0.999]
# we can assign names to each layer in the sequence
from collections import OrderedDict

model = nn.Sequential(
    OrderedDict([("linear_layer", nn.Linear(D, 1, bias=False)), ("output_activation", nn.Sigmoid())])
)
print(model)
print(model.linear_layer)
print(model.linear_layer.weight)
print(model.output_activation)
Sequential(
  (linear_layer): Linear(in_features=4, out_features=1, bias=False)
  (output_activation): Sigmoid()
)
Linear(in_features=4, out_features=1, bias=False)
Parameter containing:
tensor([[ 0.4850, -0.2075, 0.3661, 0.4047]], requires_grad=True)
Sigmoid()
# some layers define adjustable parameters, which can be optimized.
# we can inspect them thus:
for name, param in model.named_parameters():
    print(name, param.shape)
linear_layer.weight torch.Size([1, 4])
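
Modules also expose their parameters as an ordered dictionary via state_dict(), which is convenient for saving, loading, or copying weights by hand. The following is a minimal sketch (not part of the original cells) that loads the sklearn MLE into the named model above:

# Sketch: copy the sklearn MLE into the module via its state_dict.
# The key name comes from the layer name defined above ('linear_layer').
sd = model.state_dict()
print(list(sd.keys()))
sd["linear_layer.weight"] = torch.Tensor(np.reshape(w_mle_sklearn, [1, D]))
model.load_state_dict(sd)
print(model.linear_layer.weight)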

Subclass the Module class

For more complex models (e.g., non-sequential ones), we can create our own subclass of torch.nn.Module. We just need to define a forward method that maps inputs to outputs, as we show below.

class Model(torch.nn.Module):
    def __init__(self):
        super(Model, self).__init__()
        self.linear = torch.nn.Linear(D, 1, bias=False)

    def forward(self, x):
        y_pred = torch.sigmoid(self.linear(x))
        return y_pred[:, 0]  # (N,1) -> (N)


set_seed(0)
model = Model()
w = np.reshape(w_np, [-1, 1]).transpose()
model.linear.weight = nn.Parameter(torch.Tensor(w))

p_pred_np = predict_prob(w_np, X_test)
p_pred_t = model(X_test_t)  # calls model.__call__ which calls model.forward()
p_pred = p_pred_t.detach().numpy()
print(p_pred_np)
print(p_pred)
assert np.allclose(p_pred_np, p_pred)
[0.002 0.000 1.000 0.012 0.002 0.000 0.000 0.979 0.740 0.000 0.706 0.000 0.000 0.000 0.000 0.001 1.000 0.000 0.009 1.000 0.000 0.650 0.000 1.000 0.094 0.998 1.000 1.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.998 0.000 0.000 0.000 0.000 0.999 0.000 0.000 0.000 0.000 0.000 0.281 0.909 0.000 0.999] [0.002 0.000 1.000 0.012 0.002 0.000 0.000 0.979 0.740 0.000 0.706 0.000 0.000 0.000 0.000 0.001 1.000 0.000 0.009 1.000 0.000 0.650 0.000 1.000 0.094 0.998 1.000 1.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.998 0.000 0.000 0.000 0.000 0.999 0.000 0.000 0.000 0.000 0.000 0.281 0.909 0.000 0.999]

SGD on a module

We can optimize the parameters of a module by passing a reference to them into the optimizer, as we show below. (Here we use the Adam optimizer, a variant of SGD with adaptive step sizes.)

nepochs = 100
nbatches = len(dataloader)
criterion = torch.nn.BCELoss(reduction="mean")
learning_rate = 1e-1
loss_trace = []

set_seed(0)
model = Model()
optimizer = torch.optim.Adam(model.parameters(), lr=learning_rate)

for epoch in range(nepochs):
    for b, batch in enumerate(dataloader):
        X, y = batch
        y_pred = model(X)  # predict / forward function
        loss = criterion(y_pred, y)
        # print(f'epoch {epoch}, batch {b}, loss: {loss.item()}')
        loss_trace.append(loss.item())  # store a float, not the graph-attached tensor
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    # print(f'end of epoch {epoch}, loss: {loss.item()}')

plt.figure()
plt.plot(loss_trace)

y_pred_np = predict_prob(w_np, X_test)
y_pred_t = model(X_test_t)
y_pred = y_pred_t.detach().numpy()
print(y_pred_np)
print(y_pred)
[0.002 0.000 1.000 0.012 0.002 0.000 0.000 0.979 0.740 0.000 0.706 0.000 0.000 0.000 0.000 0.001 1.000 0.000 0.009 1.000 0.000 0.650 0.000 1.000 0.094 0.998 1.000 1.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.998 0.000 0.000 0.000 0.000 0.999 0.000 0.000 0.000 0.000 0.000 0.281 0.909 0.000 0.999] [0.235 0.000 0.994 0.382 0.257 0.000 0.097 0.884 0.705 0.130 0.756 0.000 0.000 0.000 0.000 0.264 0.970 0.168 0.339 0.976 0.000 0.723 0.000 0.965 0.567 0.934 0.945 0.961 0.000 0.000 0.000 0.000 0.094 0.000 0.000 0.917 0.166 0.000 0.000 0.000 0.936 0.226 0.196 0.000 0.000 0.123 0.591 0.795 0.127 0.949]
[Figure: training loss trace for the Adam-trained module, per minibatch update]

Batch optimization on a module

SGD does not match the results of sklearn. However, this is not because of the way we defined the model; it is simply because SGD is a poor optimizer for this problem. Here we show that L-BFGS gives essentially the same results as sklearn.

set_seed(0)
model = Model()
optimizer = torch.optim.LBFGS(model.parameters(), history_size=10)
criterion = torch.nn.BCELoss(reduction="mean")


def closure():
    optimizer.zero_grad()
    y_pred = model(X_train_t)
    loss = criterion(y_pred, y_train_t)
    loss.backward()
    return loss


max_iter = 10
loss_trace = []
for i in range(max_iter):
    loss = optimizer.step(closure)
    # print(loss)

y_pred_np = predict_prob(w_np, X_test)
y_pred_t = model(X_test_t)
y_pred = y_pred_t.detach().numpy()
print(y_pred_np)
print(y_pred)
[0.002 0.000 1.000 0.012 0.002 0.000 0.000 0.979 0.740 0.000 0.706 0.000 0.000 0.000 0.000 0.001 1.000 0.000 0.009 1.000 0.000 0.650 0.000 1.000 0.094 0.998 1.000 1.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.998 0.000 0.000 0.000 0.000 0.999 0.000 0.000 0.000 0.000 0.000 0.281 0.909 0.000 0.999] [0.002 0.000 1.000 0.012 0.002 0.000 0.000 0.979 0.740 0.000 0.706 0.000 0.000 0.000 0.000 0.001 1.000 0.000 0.009 1.000 0.000 0.650 0.000 1.000 0.094 0.998 1.000 1.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.998 0.000 0.000 0.000 0.000 0.999 0.000 0.000 0.000 0.000 0.000 0.281 0.909 0.000 0.999]
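
As a further check (a sketch, not one of the original cells), the fitted module weights should also match the sklearn MLE:

# Sketch: compare the fitted module weights to the sklearn MLE
print("parameters from sklearn {}".format(w_mle_sklearn))
print("parameters from torch {}".format(model.linear.weight[0, :]))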

Multi-class logistic regression

For binary classification problems, we can use a sigmoid as the final layer, so the model returns probabilities. The corresponding loss is the binary cross entropy, nn.BCELoss(pred_prob, true_label), where pred_prob has shape (B,) with entries in [0,1], and true_label has shape (B,) with entries in {0,1}. (Here B is the batch size.) Alternatively, the model can return the logit score, and we can use nn.BCEWithLogitsLoss(pred_score, true_label), which is more numerically stable.

For multiclass classification, the final layer can return log probabilities using a LogSoftmax layer, combined with the negative log likelihood loss, nn.NLLLoss(pred_log_probs, true_label), where pred_log_probs is a B x C matrix and true_label has shape (B,) with entries in {0, 1, ..., C-1}. (Note that the target labels are integers, not one-hot vectors.) Alternatively, we can just return the vector of logit scores and use nn.CrossEntropyLoss(logits, true_label). The two methods give the same results.

# code me
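
As a minimal sketch of what this cell might contain (all names below are illustrative, not from the notebook), we can fit multiclass logistic regression to the full three-class Iris labels by having the model return logits and using nn.CrossEntropyLoss, which is equivalent to a LogSoftmax layer followed by nn.NLLLoss:

# Sketch: multiclass logistic regression on the 3-class Iris labels
y_multi = iris["target"]  # integer class labels in {0,1,2}
Xm_train, Xm_test, ym_train, ym_test = train_test_split(X, y_multi, test_size=0.33, random_state=42)
Xm_train_t = torch.tensor(Xm_train, dtype=torch.float)
ym_train_t = torch.tensor(ym_train, dtype=torch.long)  # class indices must be integers (long)

C = 3
set_seed(0)
model_multi = nn.Linear(D, C, bias=False)  # returns logits of shape (B, C)
criterion_multi = nn.CrossEntropyLoss()
optimizer = torch.optim.LBFGS(model_multi.parameters(), history_size=10)


def closure():
    optimizer.zero_grad()
    loss = criterion_multi(model_multi(Xm_train_t), ym_train_t)
    loss.backward()
    return loss


for i in range(10):
    loss = optimizer.step(closure)
print("final training loss {:.4f}".format(loss.item()))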