GitHub Repository: uob-COMS30035/lab_sheets_public
Path: blob/main/lab2/lab2-neural_nets.ipynb
³⁴⁰ views

Kernel: Python 3 (ipykernel)

Lab 2: Neural networks

In this lab we use scikit-learn to implement a simple MLP and PyTorch for a more complex neural network.

Importing the libraries

This lab requires h5py package to interact with a dataset that is stored in an H5 file and the imageio & PIL packages for image processing. If you are using the lab machines with the coms30035 Anaconda virtual environment then these are already installed. If you are using your own machine then you may need to install these packages using either pip or conda. Make sure you have all the auxiliary files (e.g. utils.py) required for this lab. The easiest way of ensuring you have all of these is just to clone the git respository.

Firstly, we'll import the required packages by running the cell below.

In [ ]:

import numpy as np
import matplotlib.pyplot as plt
from matplotlib.colors import ListedColormap
from sklearn.linear_model import LogisticRegression
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import ConfusionMatrixDisplay
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
from sklearn.svm import SVC
from sklearn.datasets import make_moons

import h5py
import imageio
from PIL import Image 
from utils import * # image processing functions from utils.py

1) Neural Network for Image Classification

In this section we are going to recap the ideas behind deep neural networks and implement a network to classify cat images (I can sense the excitement).

The dataset we will use is a set of labeled images containing cats (label=1) and non-cats (label=0). Run the cell below to load in train and test splits of the dataset from the local .h5 file. Make sure you have downloaded this data file in addition to the notebook from the lab repository.

In [ ]:

train_x_orig, train_y, test_x_orig, test_y, classes = load_data()
print(train_y.shape)

The following cell displays an image in the dataset - change the index and re-run the cell to see other images.

In [ ]:

# Example of a cat picture
index = 200
plt.imshow(train_x_orig[index])
print (f'y = {train_y[0,index]}. It\'s a {classes[train_y[0,index]].decode("utf-8")}!')

1.1) Explore the dataset

Print the values of:

a) number of training examples (num_train)
b) number of test examples (num_test)
c) size of the image (height/width or the number of pixels).

Note, train_x_orig is a numpy-array of shape (num_train, num_px, num_px, 3).

In [ ]:

# write your code here

1.2) Reshape the images

Reshape the training (train_x_orig) and test (test_x_orig) data sets so that each image is flattened into column vector.

Figure 1: Image to vector conversion.

Print the shape of the reshaped training and testing datasets.

In [ ]:

# write your code here

1.3) Standardise the images

The pixel value is a vector of three numbers (representing the RGB channels) ranging from 0 to 255. A common preprocessing step in machine learning is to standardise your dataset (subtract the mean and then divide by the standard deviation). For picture datasets, it is simpler and more convenient to apply min-max normalisation by dividing every value by 255.

Apply min-max normalization to the dataset and check the minimum and maximum are 0 and 1, respectively.

In [ ]:

# write your code here

1.4) Logistic regression classifier

Logistic regression, despite its name, is a linear model for classification. The probabilities describing the possible outcomes of a classification are modeled using a logistic (sigmoid) function.

sigmoid(x) = \frac{1}{1+e^{-x}}

Logistic regression can be thought of as a neural network with a single node with a sigmoid activation function.

Figure 2: Logistic regression classifier.

Use scikit-learn's logistic regression to train a cat classifier. What's the classifier's accuracy on the training and test sets?

Hint, the classifier fit method has the following inputs:

Training data with shape (n_samples, n_features)
Target values with shape (n_samples,). Use .flatten() to collapse a 2-D array to a 1-D array.

In [ ]:

# write your code here

Neural network architecture

We will initially build a fully connected neural network with one hidden layer (i.e. one layer between input and output). When using more than one hidden layer we define it as a deep neural network.

Figure 3: 2-layer neural network.

INPUT: $x = [x_0,x_1,...,x_{12287}] \quad x_i \in [0,1]$

The input is a (64,64,3) image which we have already flattened to a vector of size (12288,1) and standardised.

LINEAR: $z^{[1]} = W^{[1]} x + b^{[1]}$

The input vector is multiplied by the weight matrix $W^{[1]}$ of size $(n^{[1]}, 12288)$ and then a bias term is added in a linear transformation. $n^{[1]}$ is the number of neurons in the hidden layer.

RELU: $a^{[1]} = RELU(z^{[1]}) = max(0,z^{[1]})$

A non-linear activation function is then applied, in this case a rectified linear unit (or ReLU which outputs the maximum of the input and 0).

LINEAR: $z^{[2]} = W^{[2]} a^{[1]} + b^{[2]}$

A linear transformation is applied to the output of the hidden layer $a^{[1]}$ .

SIGMOID: $\hat{y} = a^{[2]} = \sigma(z^{[2]}) = \frac{1}{1 + e^{-z^{[2]}}}$

Given it's a binary classification task (cat or no cat) then the sigmoid or logistic function is the activation of the output layer (this is automatically selected by scikit-learn).

OUTPUT: $\hat{y}$

The output is the probability the photo contains a cat so if the value is greater than 0.5 the prediction is cat.

The overall process from inputs to outputs is known as forward propagation, see Bishop section 5.1 for more information.

Training a neural network

Neural networks are trained by learning the weights $W$ and biases $b$ in the hidden and output layers such that the network outputs the correct labels as optimally as possible. How well the network is performing is defined by a loss function, here the log loss (also called logistic regression loss or cross-entropy loss). The cost function $J$ which drives training is the sum of all the errors (log losses) for all the training examples in the training set.

J = -\frac{1}{m}\sum_{i=1}^{m}y^{(i)}\log(\hat{y}^{(i)})+(1-y^{(i)})\log(1-\hat{y}^{(i)})

The goal of neural network optimisation is to learn weights and biases that minimise the cost function. This is done by backpropagating the cost function error from the output layer, through the network to the first hidden layer. During this process the weights and biases are updated by gradient decent optimisation (using the cost function gradient with respect to all weights and bias).

For a parameter $\theta$ , a simple gradient decent update rule is $\theta = \theta - \eta \text{ } d\theta$ , where $\eta$ is the learning rate (see Bishop section 5.2 and 5.3 for more information). More complex optimisation algorithms exist, for example the popular Adam optimiser.

1.5) Implement the neural network in scikit-learn

Thankfully, we don't need to construct the neural network manually and instead can use scikit-learn's multi-layer perceptron (MLP) classifier.

Construct and train a neural network using MLP classifier with the following hyperparameters:

Single hidden layer with 64 neurons.
RELU activation function
Stochastic gradient descent optimiser
Initial learning rate $\eta = 0.001$ ( $1e-3$ )
No regularisation $\alpha = 0$

Plot the loss curve (using clf.loss_curve_) which shows the network learning as the number of iteration increases.

Has the loss curve flat-lined? Hint, you may have to specify the max_iter and n_iter_no_change parameters.

In [ ]:

# write your code here

1.6) Evaluation

How does the neural network perform on the training and test data? Print the training and testing accuracy. Is classifier overfitting?

A confusion matrix is a way to visualise the performance of a classification model by showing the counts of the predicted and actual labels. The following terms are important metrics in classification tasks:

total number of positives in the dataset i.e. cat images (P)
total number of negatives in the dataset i.e. non cat images (N)
number of correct positive predictions (TP)
number of correct negative predictions (TN)
number of incorrect positive predictions (FP)
number of incorrect negative predictions (FN)
accuracy $= \frac{TP+TN}{P+N}$
sensitivity, recall or true positive rate $= \frac{TP}{P}$
specificity or true negative rate $= \frac{TN}{N}$
precision $= \frac{TP}{TP+FP}$

Plot the confusion matrix for the test data. Where is the classifier making mistakes?

In [ ]:

# write your code here

Hyperparameter tuning using cross-validation

The performance of a neural network after training is highly dependent on how the hyperparameters are chosen. In contrast to the network parameters, $W$ and $b$ , a hyperparameter refers to something that is fixed (usually manually chosen by the person training the model) throughout training and used to control the training process. Hyperparameters include the number of hidden layers, number of neurons in the hidden layers, learning rate, mini-batch size (for stochastic gradient descent optimisers) and regularisation parameter $\alpha$ .

Cross-validation (CV for short) is used to evaluate model performance for model selection and to tune hyperparameters. In k-fold CV the training set is split into k smaller sets (called folds) and each fold is used as a validation set for models trained on all the other folds.

Figure 3: 5 fold cross validation.

The performance of a model is measured by the average score (error) for all the fold. A test set should still be held out for final evaluation. For more information on CV see Bishop section 1.3.

Scikit-learn offers two approaches to search the hyperparameter space using cross validation: GridSearchCV which considers all parameter combinations and RandomizedSearchCV which samples a given number of candidates from a parameter space with a specified distribution.

1.7) Tune the learning rate and regularisation parameter using CV

The learning rate $\eta$ is an important hyperparameter to tune. Choosing a value that is too small will result in training that takes too long to converge and a value too large will cause instabilities in the training that prevent convergence. The regularisation parameter $\alpha$ controls the weighting of L2 regularisation in the cost function to help with overfitting by encouraging smaller weights leading to a smoother decision boundary.

Use RandomizedSearchCV to perform hyperparameter tuning of learning rate and regularisation parameter (limit the number of iterations n_iter ~10 and number of folds cv ~ 3 keep to reasonable training times).

A good methodology is to start with a wide range of hyperparameter values before homing in over a finer range.

Hint, use numpy.logspace to define the range of the hyperparameters spaced over a log scale for example:

alphas = np.logspace(-3, 0, 100)
learning_rates = np.logspace(-4, -2, 100)

In [ ]:

# write your code here

It is common to create deep networks for most tasks with complexity controlled not by the network size (number of layers and nodes per layer) but with regularisation. Regularisation is a general term describing ways to control the complexity of a neural network in order to avoid overfitting. We have already discussed (and tuned) the L2 regularisation term of the error function.

Dropout is another regularisation technique and has been shown as an effective way of preventing overfitting but is unfortunately not implemented scikit-learn MLP. For dropout it's recommended to use either the PyTorch or TensorFlow/Keras frameworks which offer far more flexibility in neural network architectures as opposed to scikit-learn and faster training on GPUs.

Can you improve the performance of the network by tuning the various hyperparameters?

1.8) Just for fun, test your classifier with your own image

Upload a photo into the images folder and change the my_image variable in the cell below.

In [ ]:

my_image = "cat.jpg" # change this to the name of your image file 

fname = "images/" + my_image
image = np.array(imageio.imread(fname))
my_image = np.array(Image.fromarray(image).resize((height,width))).reshape((-1,1))
my_image = my_image/255.
my_predicted_image = nn_clf.predict(my_image.T)

plt.imshow(image)
print ("Your neural network predicts a " + classes[int(np.squeeze(my_predicted_image)),].decode("utf-8") +  " picture.")

2) Using PyTorch for Image Classification

Go to the Learn the Basics PyTorch introduction. Read through the various sections of this introduction, except that you can skip Section 0 "Quickstart" and Section 7 "Save, Load and Use Model". You can 'skim read' most sections but the Sections "Build the Neural Network" and "Optimizing Model Parameters" should be read properly. Note that for each section you can download the assocated Jupyter notebook by using the "Download Notebook" icon at the top of the page.

2.1) Running PyTorch

Download and run the notebook for the "Optimizing Model Parameters" section. Make sure you know what's going on.

2.2) Logistic regression

Alter the neural network so that it is just doing logistic regression. You should do this by altering the __init__ method of the NeuralNetwork class appropriately. Leave all hyperparameters the same and do learning with this simplifed network (just re-run your edited notebook). Compare the test set accuracy you get with that achieved by the original network.

2.3) Improving accuracy

Find a way of improving the (test set) accuracy of both the original network and the network which implements logistic regression. Feel free to experiment!

2.4) Using dropout

Apply dropout to the input layer of the (original) network using a probability of dropout of 0.2. You can do this by adding the following line of code to your __init__ method: self.dropout = nn.Dropout(0.2) and then altering your forward method appropriately. Does dropout help? Explain why it helps or does not help.

Wrap up

This lab has covered quite a bit, let's recap:

We built a fully-connected neural network to classify cats and achieved a test set accuracy of over 70% even before hyperparameter tuning. Note that in the field of computer vision it is common to use convolutional neural network (CNN) architectures instead of fully-connected networks since they show superior performance for image classification. A randomised search cross validation method was presented for hyperparameter tuning although automatic tuning using Bayesian optimisation can produce better results in less time.
We've also explored using PyTorch, again for image classification. We have compared with logistic regression and experimented with using dropout.

References

Bishop Pattern Recognition and Machine Learning: chapter 5 for neural networks.

Materials used to create the lab

University of Edinburgh's Machine Learning and Pattern Recognition (MLPR) course
Andrew Ng's Neural Networks and Deep Learning course on Coursera