Path: blob/master/notebooks/book1/11/LinearRegressionProbML.ipynb
Linear regression from scratch
Let's implement the different components of the linear regression algorithm, specifically model inference, the loss function, optimization, and evaluation measures, in vectorized form from scratch, using basic Python libraries like NumPy.
We will demonstrate the working of our implementation on a couple of synthetic datasets.
Quick recap
A quick recap of the vectorized operations for these components:
- Training data contains features $\mathbf{x} \in \mathbb{R}^m$ and a label $y$ that is a real number.
- Model or inference: $\mathbf{y} = \mathbf{X}\mathbf{w}$
- Loss function: $J(\mathbf{w}) = \frac{1}{2} (\mathbf{X}\mathbf{w} - \mathbf{y})^T (\mathbf{X}\mathbf{w} - \mathbf{y})$
- Optimization:
  - Normal equation: $\mathbf{w} = \left(\mathbf{X}^T \mathbf{X}\right)^{-1} \mathbf{X}^T \mathbf{y}$
  - Gradient descent: $\mathbf{w} := \mathbf{w} - \alpha \, \mathbf{X}^T (\mathbf{X}\mathbf{w} - \mathbf{y})$
- RMSE: $\sqrt{\frac{1}{n} \sum_{i=1}^{n} \left(\hat{y}^{(i)} - y^{(i)}\right)^2}$
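As a reference, here is a minimal NumPy sketch of these vectorized operations on a small synthetic dataset (the variable names and values are illustrative only, not necessarily the ones used later in this notebook):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100                                                    # number of examples
X = np.column_stack([np.ones(n), rng.uniform(-1, 1, n)])   # feature matrix with dummy feature
w_true = np.array([3.0, 2.0])                              # known weight vector
y = X @ w_true + 0.1 * rng.standard_normal(n)              # noisy labels

w = rng.standard_normal(X.shape[1])                        # arbitrary weight vector
y_hat = X @ w                                              # model / inference
loss = 0.5 * (y_hat - y) @ (y_hat - y)                     # squared-error loss
w_ne = np.linalg.pinv(X) @ y                               # normal equation (pseudo-inverse)
w_gd = w - 0.01 * (X.T @ (X @ w - y))                      # one gradient-descent step
rmse = np.sqrt(np.mean((X @ w_ne - y) ** 2))               # RMSE of the fitted model
```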
Let's first import the necessary libraries.
C1: Training Data
Let's examine the shapes of the training data as a sanity check.
Let's divide the data into training and test sets. We will set aside 20% of the examples for testing.
Let's do a quick sanity check to make sure the sizes of the feature and label sets match in both the training and test sets:
Let's quickly check the first few examples and labels.
Let's visualize the training set.
We have a training set consisting of a single feature, so we will fit a simple linear regression model with one feature. Its form is: $y = w_0 + w_1 x_1$.
As discussed in the lectures, we add a special dummy feature $x_0$ and set it to 1. We create a helper function for that.
Let's write a test case to test this function:
For that, let's take two examples and three features. The first example is a feature vector:
And the second example is:
And recall that a feature matrix has shape $n \times m$, corresponding to the features of all $n$ examples.
The corresponding feature matrix appears as follows:
Here the feature vectors are transposed and represented as rows:
- The first row corresponds to the first example, and
- The second row corresponds to the second example.
Once we add the dummy feature, the resulting matrix becomes:
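Below is a minimal sketch of what such a helper and its test could look like (the name `add_dummy_feature` and the example values are assumptions for illustration, not necessarily the notebook's actual implementation):

```python
import numpy as np

def add_dummy_feature(X):
    """Prepend a column of ones (the dummy feature x_0 = 1) to the feature matrix."""
    X = np.asarray(X)
    return np.column_stack([np.ones(X.shape[0]), X])

# Test case: two examples, three features (values are illustrative).
X = np.array([[1., 2., 3.],
              [4., 5., 6.]])
X_dummy = add_dummy_feature(X)
assert X_dummy.shape == (2, 4)            # one extra column
assert np.allclose(X_dummy[:, 0], 1.0)    # first column is the dummy feature
```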
Let's preprocess the training set to add the dummy feature.
C2: Model
The linear regression model uses a linear combination of features to obtain the output labels. In vectorized form, this can be written as:

$$\mathbf{y} = \mathbf{X}\mathbf{w}$$
Note:
- The model is parameterized by its weight vector.
- It is described by its mathematical form and its weight vector.
Implementation
The general vectorized form is as follows:

$$\mathbf{y} = \mathbf{X}\mathbf{w}$$
where
- $n$ is the number of examples in the dataset (train/test/validation),
- $m$ is the number of features,
- $\mathbf{X}$ is a feature matrix containing the features of the $n$ examples along its rows (notice the capital bold $\mathbf{X}$ used for a matrix),
- $\mathbf{w}$ is a weight vector containing one weight for each feature (notice the lowercase bold $\mathbf{w}$),
- $\mathbf{y}$ is a label vector containing the labels of the $n$ examples, with shape $n \times 1$.
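A minimal sketch of such an inference function under these definitions (the name `predict` is an assumption; the notebook's actual implementation may differ):

```python
import numpy as np

def predict(X, w):
    """Vectorized inference: return the label vector y = X @ w.

    X has shape (n, m+1) (features plus the dummy feature), w has shape (m+1,).
    """
    return X @ w
```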
We test this function with the following feature matrix $\mathbf{X}$:
and weight vector $\mathbf{w}$:
Let's perform a matrix-vector multiplication between the feature matrix and the weight vector to obtain labels for all examples:
Demonstration on synthetic dataset
Since we have not yet trained our model, let's use a random weight vector to get predictions from our model for the given dataset:
Let's compare the prediction with the actual value:
The actual labels are:
Since we used a random weight vector here, most of the predicted labels do not match the actual labels.
Comparison of vectorized and non-vectorized versions of model inference
Let's write a test for this function with the same setup as the vectorized implementation.
Let's compare the run time of the vectorized and non-vectorized versions on a dataset with 100 examples.
Let's check the difference in their performance on a large dataset of 1 million data points.
Note that the time required for non-vectorized inference is an order of magnitude more than for vectorized inference.
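One way to make this comparison concrete is sketched below (the helper name and data are illustrative; actual timings depend on the machine):

```python
import time
import numpy as np

def predict_nonvectorized(X, w):
    """Compute each label with explicit Python loops over examples and features."""
    n, m = X.shape
    y = np.zeros(n)
    for i in range(n):
        total = 0.0
        for j in range(m):
            total += X[i, j] * w[j]
        y[i] = total
    return y

rng = np.random.default_rng(0)
X_big = rng.standard_normal((1_000_000, 2))   # 1 million examples, 2 features
w = rng.standard_normal(2)

t0 = time.perf_counter(); y_vec = X_big @ w; t1 = time.perf_counter()
y_loop = predict_nonvectorized(X_big, w); t2 = time.perf_counter()
print(f"vectorized: {t1 - t0:.4f}s, non-vectorized: {t2 - t1:.4f}s")
assert np.allclose(y_vec, y_loop)             # both versions agree
```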
C3: Loss function implementation
Implementation
The loss function is calculated as follows:

$$J(\mathbf{w}) = \frac{1}{2} (\mathbf{X}\mathbf{w} - \mathbf{y})^T (\mathbf{X}\mathbf{w} - \mathbf{y})$$
where
- $\mathbf{X}$ is a feature matrix containing the features of the $n$ examples along its rows,
- $\mathbf{w}$ is a weight vector containing one weight for each feature,
- $\mathbf{y}$ is a label vector containing the labels of the $n$ examples, with shape $n \times 1$.
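A minimal sketch of the loss computation, assuming the squared-error loss with the $\frac{1}{2}$ factor from the recap above (the function name `loss` is an assumption):

```python
import numpy as np

def loss(X, w, y):
    """Squared-error loss J(w) = 1/2 * (Xw - y)^T (Xw - y)."""
    e = X @ w - y          # residual vector
    return 0.5 * (e @ e)
```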
We will test this function with the following configuration:
Feature matrix ($\mathbf{X}$):
Weight vector ($\mathbf{w}$):
Label vector ($\mathbf{y}$):
Let's compute the loss, i.e. $J(\mathbf{w})$:
Since we have not yet trained our model, let's use a random weight vector to calculate the loss of a linear regression model with a single feature on the synthetic dataset.
Demonstration on synthetic dataset
Let's try with the actual weight vector used in the data generation process and examine the loss.
C4: Optimization
Let's implement the optimization component of the linear regression model.
It is implemented with one of the following two methods:
- Normal equation method, which sets the partial derivative of the loss function w.r.t. the weight vector to 0 and solves the resulting equation to obtain the weight vector.
- Gradient descent method, which iteratively adjusts the weight vector based on the learning rate and the gradient of the loss function at the current weight vector.
Let's generate training data with 100 examples from a linear regression model with known parameters.
We will use this data to test the optimization procedures, comparing the estimated weight vectors with the actual weight vector.
Normal equation
The weight vector is estimated by multiplying the pseudo-inverse of the feature matrix with the label vector: $\mathbf{w} = \mathbf{X}^{+} \mathbf{y}$.
The vectorized implementation is fairly straightforward. We make use of `np.linalg.pinv` for calculating the pseudo-inverse of the feature matrix.
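A minimal sketch of this estimator (the function name `normal_equation` matches the one referenced below, but the body shown here is illustrative):

```python
import numpy as np

def normal_equation(X, y):
    """Estimate w by multiplying the pseudo-inverse of X with the label vector y."""
    return np.linalg.pinv(X) @ y
```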
We test this function with the generated training set whose weight vector is known to us.
We set up the test with the feature matrix, the label vector, and the expected weight vector.
Next we estimate the weight vector with the `normal_equation` function. We test (a) the shape of the estimated weight vector and (b) the match between the expected and estimated weight vectors.
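A sketch of such a test, reusing the `normal_equation` sketch above (the synthetic values here are illustrative, not the notebook's actual training set):

```python
import numpy as np

rng = np.random.default_rng(0)
X_train = np.column_stack([np.ones(100), rng.uniform(-1, 1, 100)])
w_expected = np.array([3.0, 2.0])      # known weights used to generate the labels
y_train = X_train @ w_expected         # noise-free labels, so an exact match is expected

w_estimated = normal_equation(X_train, y_train)
assert w_estimated.shape == w_expected.shape   # (a) shape check
assert np.allclose(w_estimated, w_expected)    # (b) value match check
```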
Gradient descent (GD)
GD is implemented as follows:
- Randomly initialize the weight vector $\mathbf{w}$.
- Iterate until convergence:
  - Calculate the partial derivative of the loss w.r.t. the weight vector.
  - Calculate the new values of the weights.
  - Update the weights to the new values simultaneously.
We use the number of epochs as the convergence criterion in this implementation.
Partial derivative of loss function
Let's first implement a function to calculate the partial derivative of the loss function, which is obtained with the following equation:

$$\nabla J(\mathbf{w}) = \mathbf{X}^T (\mathbf{X}\mathbf{w} - \mathbf{y})$$

That is, the multiplication of the transpose of the feature matrix with the difference between the predicted and actual label vectors.
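A minimal sketch of this gradient computation (the function name `gradient` is an assumption):

```python
import numpy as np

def gradient(X, w, y):
    """Gradient of the squared-error loss: X^T (Xw - y)."""
    return X.T @ (X @ w - y)
```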
Let's write a test case for gradient calculation with the following setup:
Feature matrix ($\mathbf{X}$):
Weight vector ($\mathbf{w}$):
Label vector ($\mathbf{y}$):
Let's compute the partial derivative of the loss, i.e. $\mathbf{X}^T (\mathbf{X}\mathbf{w} - \mathbf{y})$:
Weight updates
Next let's implement the weight update part:
We obtain the new weights from the old ones by subtracting the gradient weighted by the learning rate:

$$\mathbf{w}^{(t+1)} = \mathbf{w}^{(t)} - \alpha \, \nabla J\!\left(\mathbf{w}^{(t)}\right)$$
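A minimal sketch of this update step (the name `update_weights` is an assumption):

```python
def update_weights(w, grad, lr):
    """One gradient-descent update: w_new = w - lr * grad."""
    return w - lr * grad
```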
Let's work out a weight update for:
Weight vector ($\mathbf{w}$):
Gradient vector ($\nabla J(\mathbf{w})$):
Learning rate $\alpha = 0.001$
The updated weights are given by $\mathbf{w} - \alpha \, \nabla J(\mathbf{w})$:
With these building blocks in place, let's implement the gradient descent procedure:
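A minimal sketch of the full procedure, combining the gradient and update steps above (the signature, random initialization, and epoch-based stopping are assumptions; the notebook's implementation may differ in details):

```python
import numpy as np

def gradient_descent(X, y, lr=0.01, num_epochs=1000, seed=0):
    """Full-batch gradient descent; returns final weights plus per-epoch history."""
    rng = np.random.default_rng(seed)
    w = rng.standard_normal(X.shape[1])   # random initialization
    w_all, err_all = [], []
    for _ in range(num_epochs):
        e = X @ w - y                     # residuals for the current weights
        err_all.append(0.5 * (e @ e))     # loss at the current weights
        w_all.append(w.copy())
        w = w - lr * (X.T @ e)            # simultaneous weight update
    return w, np.array(w_all), np.array(err_all)
```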
In order to test this function, we will use the synthetic training data that was generated earlier. We know the actual weights, which can be compared against the weights obtained from the gradient descent procedure.
Since we store weights in each iteration, we can use them to plot intermediate models in order to understand the trajectory taken by GD.
In the following code, `err_all` contains the loss values across all iterations and `w_all` contains the weight vectors across all iterations.
Let's plot the learning curves.
Learning rate and convergence
Let's vary the learning rate and observe the change in the convergence characteristics of GD.
We will run GD for 2000 epochs with each of a few different learning rates and compare their convergence characteristics.
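A sketch of how this comparison might look, reusing the `gradient_descent` sketch and the synthetic `X`, `y` from above (the learning-rate values are illustrative, not the ones used in the notebook):

```python
import matplotlib.pyplot as plt

for lr in [0.001, 0.01, 0.05]:   # candidate learning rates (illustrative)
    _, _, err_all = gradient_descent(X, y, lr=lr, num_epochs=2000)
    plt.plot(err_all, label=f"lr={lr}")
plt.xlabel("epoch")
plt.ylabel("loss")
plt.legend()
plt.show()
```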
Let's try a learning rate for which GD doesn't converge:
Linear regression: combining all components
Let's combine all the components that we implemented in the last few sections in a single place and make them available for training a linear regression model on different datasets.
Let's combine the implementations of the different components of linear regression that we have built so far into a single `LinearRegression` class.
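A condensed sketch of what such a class might look like (method names and signatures are assumptions; the notebook's version also supports the mini-batch and stochastic variants compared below):

```python
import numpy as np

class LinearRegression:
    """Minimal sketch combining inference, loss, and the two optimizers."""

    def __init__(self):
        self.w = None

    def predict(self, X):
        return X @ self.w

    def loss(self, X, y):
        e = X @ self.w - y
        return 0.5 * (e @ e)

    def fit_normal_equation(self, X, y):
        self.w = np.linalg.pinv(X) @ y
        return self

    def fit_gradient_descent(self, X, y, lr=0.01, num_epochs=1000):
        self.w = np.zeros(X.shape[1])      # zeros for simplicity
        for _ in range(num_epochs):
            self.w = self.w - lr * (X.T @ (X @ self.w - y))
        return self

    def rmse(self, X, y):
        return np.sqrt(np.mean((self.predict(X) - y) ** 2))
```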
Let's compare the weight vectors obtained by the normal equation, gradient descent (GD), mini-batch gradient descent (MBGD), and stochastic gradient descent (SGD):
Linear regression on multiple features and single label
We will compare it with the weight vectors estimated by the different methods.