Lab 1: Python, Jupyter Notebooks, NumPy and an Introduction to Scikit-learn
This lab introduces Jupyter Notebooks and key Python libraries (such as NumPy and Matplotlib) as well as introducing scikit-learn, a popular machine learning library. It assumes you know the basics of the Python language, so if you need to, check out the University of Bristol Beginning Python course. (That's a course created by Matt Williams: you may find his other online training materials useful.) Finally, this lab will have some exercises where you use Scikit-learn for linear and logistic regression.
The following libraries will be used throughout the unit:
NumPy, for scientific computation
Pandas, for data analysis
Matplotlib, to plot any kind of data
Scikit-learn, for machine learning
The libraries above have thorough, high-quality documentation, which can be used to learn other features of the libraries or consulted for questions and examples. The documentation is available either online (links above) or via Python itself, e.g. help(numpy.array) in the Python interpreter.
If you are using the machines supplied in the lab rooms you can ensure you have access to all these libraries by following the Lab Work instructions given on the unit github page.
If you are using your own machine, note that the following libraries are required for this lab, so make sure they are installed using pip3 or Anaconda (using a virtual environment is recommended):
numpy
pandas
matplotlib
seaborn
scikit-learn
scikit-image
jupyterlab
For example, to install scikit-learn in a new conda environment, run
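A possible sequence of terminal commands (a sketch; the environment name intro_ml is illustrative):

```
conda create -n intro_ml python=3
conda activate intro_ml
conda install scikit-learn
```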
For further help, see the installation guides in the libraries' documentation.
Jupyter Notebook
This module's labs will be run on Jupyter Lab, an interactive coding environment embedded in a webpage, supporting various programming languages (Python, R, Lua, etc.) through the concept of kernels. From a command line, cd to the directory containing your lab notebooks, then call jupyter lab.
Jupyter allows you to enrich your code with complex comments formatted in Markdown and LaTeX, as well as to place the results of your computation right below your code.
Notebooks are organised in cells which can contain either code (in our case, this will be Python code) or text, which can be easily and nicely formatted using the Markdown notation.
To edit an existing cell, simply double-click on it. You can use the toolbar to insert new cells and to edit or delete them (or use keyboard shortcuts, which are very handy for speeding up coding).
Cells can be run by hitting shift+enter when editing a cell or by clicking the Run button at the top. Running a Markdown cell will simply display the formatted text, while running a code cell will execute the code it contains.
Note: when you run a code cell, all the variables created, functions implemented and libraries imported become available to every other code cell. However, notebooks are usually written assuming cells are run in order, so earlier cells provide the prerequisites for later ones. To reset all variables and functions (useful for debugging), simply click Kernel > Restart in the Jupyter menu.
A bit on Markdown language (and a bit of LaTeX and HTML) if you're interested
Markdown cells allow you to write fancy yet simple comments: all of this is written in Markdown - double-click on this cell to see the source. An introduction to Markdown syntax can be found here.
As Markdown is translated to HTML upon display, it also allows you to use pure HTML: more details are available here.
Finally, you can also display simple equations in Markdown thanks to MathJax support. For inline equations, wrap your equation between single $ symbols (e.g. $e^x$ renders inline); for display-mode equations, wrap it between $$ symbols to centre it on its own line.
1) NumPy
NumPy is designed for scientific computing. NumPy defines its own multidimensional array type, which can be created with np.array:
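A minimal sketch of array creation (the values are arbitrary):

```python
import numpy as np

# A 1-D array from a list, and a 2-D array from nested lists.
a = np.array([1, 2, 3])
M = np.array([[1, 2], [3, 4]])
print(a, M.shape)  # [1 2 3] (2, 2)
```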
For more details, type help(np.array) in your Python console or visit the online help here.
1.1) Array operations
Create two arrays, A and B, and perform the following operations, printing the resulting array C in each case:
$C = AB$ (dot product or inner product)
$C = A \circ B$ (Hadamard product or elementwise product)
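A sketch of both operations (the values of A and B are illustrative):

```python
import numpy as np

A = np.array([[1, 2], [3, 4]])
B = np.array([[5, 6], [7, 8]])

C = A @ B  # dot (matrix) product; equivalent to np.dot(A, B)
print(C)

C = A * B  # Hadamard (elementwise) product
print(C)
```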
1.2) More array operations
Calculate now the sum, mean, and variance of the matrix A, using the NumPy functions/array methods sum, mean, and var.
Hint: help(np.sum) or look here.
Hint: help(np.mean) or look here.
Hint: help(np.var) or look here.
Afterwards, calculate the sum of the rows and then the columns of A. Hint: specify the axis parameter.
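A sketch, reusing the matrix A from above:

```python
print(A.sum(), A.mean(), A.var())

# axis=1 operates across each row; axis=0 operates down each column.
print(A.sum(axis=1))  # row sums
print(A.sum(axis=0))  # column sums
```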
1.3) Implement the sigmoid function using numpy
The sigmoid function is a non-linear function used in machine learning (logistic regression) and also in deep learning (as an activation function):
$$\sigma(x) = \frac{1}{1 + e^{-x}}$$
where $x$ could now be either a real number, a vector, or a matrix.
Implement the sigmoid function by defining a function called sigmoid which takes one argument $x$, a scalar or numpy array of any size, and outputs $\sigma(x)$.
1.4) Standardise columns using numpy
A common technique used in machine learning is to standardise the data to ensure all features have values that lie on a comparable scale. Standardisation helps visualise data but also helps with convergence and to achieve high predictive performance for some machine learning algorithms.
To standardise a dataset we centre the data by subtracting the mean of each feature, then scale by dividing by the standard deviation of the feature. Assuming the data is arranged with features in columns and training instances in rows, standardisation will result in each column vector of the data matrix having a mean of 0 and standard deviation of 1.
Implement a standardiseCols(x) function to standardise the columns of a numpy array.
Note, in Python you are able to perform mathematical operations between arrays of different shapes (such as subtracting the row vector of means from a matrix) thanks to broadcasting; for more information read here.
Standardise the columns of the array below by calling the standardiseCols(x) function:
Print the mean and standard deviation of the columns of the standardised array.
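A minimal sketch, assuming a small illustrative array in place of the lab's own (broadcasting handles the shape mismatch between the matrix and the row vectors of statistics):

```python
import numpy as np

def standardiseCols(x):
    # Subtract each column's mean and divide by its standard deviation.
    return (x - x.mean(axis=0)) / x.std(axis=0)

x = np.array([[1.0, 200.0], [2.0, 300.0], [3.0, 400.0]])  # illustrative data
xs = standardiseCols(x)
print(xs.mean(axis=0))  # ~[0. 0.]
print(xs.std(axis=0))   # [1. 1.]
```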
1.5) Reshaping numpy arrays
The shape attribute and the reshape() method are commonly used in machine learning:
X.shape is used to get the shape (dimension) of a matrix/vector X.
X.reshape(...) is used to create a new array containing the elements of X with the provided shape.
For example, in computer vision, an image is represented by a 3D array of shape (height, width, 3), where the last dimension holds the three RGB (red, green, blue) colour channels. Let's first load and plot the image. In order for the image to be given as an input to a machine learning algorithm, the 3D array needs to be reshaped to a vector of shape (height × width × 3, 1); that's your task below.
Reshape the array to a vector and print the shape of the created vector:
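A sketch using a randomly generated stand-in image (the lab itself loads a real image):

```python
import numpy as np

image = np.random.rand(32, 32, 3)  # stand-in for a loaded RGB image
v = image.reshape(-1, 1)           # column vector of shape (height*width*3, 1)
print(v.shape)                     # (3072, 1)
```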
Note on array dimensions
The array below is a 1-dimensional array, which has some slightly non-intuitive effects; for example, its transpose is identical to the original array.
1-D arrays should be avoided; instead, column or row vectors should be used, which can be formed from 1-D arrays using reshape. Note the double square brackets in the output.
The row-vector form of the array is:
The column-vector form of the array is:
You can check the dimensions are what you want by using the assert command:
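A sketch of the distinction (the array a is illustrative):

```python
import numpy as np

a = np.array([1, 2, 3])  # 1-D array: shape (3,)
print(a.T.shape)         # (3,) -- transposing a 1-D array has no effect

row = a.reshape(1, -1)   # row vector: shape (1, 3), printed with double brackets
col = a.reshape(-1, 1)   # column vector: shape (3, 1)

assert row.shape == (1, 3)
assert col.shape == (3, 1)
```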
Note on function and object property
As Python is an object-oriented language, the difference between a function and an object property should be understood.
An object instance, e.g. the NumPy array A = np.array([[1, 2], [3, 4], [5, 6]]), provides all the methods of the class numpy.ndarray. Therefore, to sum all elements of array A we can choose between two approaches:
A.sum(), or
np.sum(A).
The first one is advisable.
Moreover, some objects have properties (e.g. the shape of an array). Instead of calling a shape function, an array object exposes the shape property, i.e.:
A.shape, or
np.shape(A).
Again, the first one is advisable.
2) Scikit-learn
Scikit-learn is an open source machine learning library that supports supervised and unsupervised learning. It provides various tools for model fitting, data preprocessing, model selection and evaluation, as well as many other utilities. The exercises take you through a very simple workflow for training and evaluating machine learning models.
Scikit-learn basics
2.1) Datasets
Scikit-learn can be used to import datasets, using the dataset loader to load small standard, or 'toy', datasets (such as iris classification or California housing prices) or the dataset fetcher to download and load larger datasets from the 'real world'.
Firstly, load the California housing dataset and print the number of examples and features and feature names in the dataset. Note, fetch_california_housing has already been imported from sklearn.datasets.
Then using seaborn create a pairplot (or scatterplot matrix) to show feature joint relationships and individual feature distributions. Note, the data must be in a pandas dataframe.
Pairplots are a quick and effective way to perform exploratory data analysis (EDA) to find patterns, relationships and anomalies to guide subsequent analysis. A pairplot allows us to see both the distribution of single variables and relationships between two variables.
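A sketch of both steps (plotting all eight features can be slow; a subset of columns works just as well for a first look):

```python
import pandas as pd
import seaborn as sns
from sklearn.datasets import fetch_california_housing

housing = fetch_california_housing()
print(housing.data.shape)     # (number of examples, number of features)
print(housing.feature_names)

# Pairplots require a pandas DataFrame.
df = pd.DataFrame(housing.data, columns=housing.feature_names)
sns.pairplot(df)
```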
2.2) Preprocessing
Scikit-learn has a preprocessing package which provides several common functions to change the raw data into something more suitable for a machine learning algorithm. In general, learning algorithms benefit from standardisation of the dataset; if some outliers are present, then robust scalers or transformers are more appropriate.
Standardise the California housing training dataset loaded in the previous exercise using StandardScaler. Check the features have zero mean and unit variance/standard deviation.
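A sketch, assuming the housing data loaded above:

```python
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_scaled = scaler.fit_transform(housing.data)

print(X_scaled.mean(axis=0))  # ~0 for every feature
print(X_scaled.std(axis=0))   # 1 for every feature
```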
2.3) Train test split
Training and evaluating a model on the same dataset will lead to overfitting and poor performance on unseen data. To measure the generalisation ability of a model, it is common practice to hold out part of the available data as a test set. Scikit-learn provides a train_test_split function which performs a random split into training and testing sets.
When evaluating different hyperparameters for machine learning models, there is still a risk of overfitting on the test set because the hyperparameters can be tweaked until the model performs optimally. To solve this problem, yet another part of the dataset can be held out as a so-called validation set, which is used for evaluating hyperparameter values before final evaluation on the test set. However, this can drastically reduce the number of samples available for model training, and the performance depends on the particular training and validation splits. To get around this problem, cross-validation can be used, which trains multiple models on "folds" of the training set; read about cross-validation in scikit-learn here.
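As an aside, a minimal sketch of cross-validation (cross_val_score handles the fold splitting internally; X_scaled is assumed from the preprocessing step):

```python
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# 5-fold cross-validation: fits 5 models, each validated on a different fold.
scores = cross_val_score(LinearRegression(), X_scaled, housing.target, cv=5)
print(scores, scores.mean())
```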
Your task is to split the (scaled) California housing dataset so that 70% of the data is used for training and the remaining 30% for testing.
How many examples are there in the training and test sets?
Should the random_state parameter be specified for the train_test_split function?
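A sketch of the split (fixing random_state makes the split reproducible, which is worth considering when answering the question above):

```python
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X_scaled, housing.target, test_size=0.3, random_state=42)

print(X_train.shape[0], X_test.shape[0])  # training and test set sizes
```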
2.4) Model fitting
Scikit-learn provides many built-in machine learning algorithms and models, called estimators. An estimator is any object that learns from data; it may be a classification, regression or clustering algorithm, or a transformer that extracts/filters useful features from raw data. Each estimator can be fitted to some data using its fit method, as was done in the previous exercise to standardise the raw data.
Fit a simple linear regression model to the training data from the train-test split.
Print the linear model coefficients.
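A sketch, assuming the train/test split above:

```python
from sklearn.linear_model import LinearRegression

model = LinearRegression()
model.fit(X_train, y_train)

print(model.coef_)       # one weight per feature
print(model.intercept_)  # bias term
```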
2.5) Model evaluation
Scikit-learn supports simple and quick evaluation of machine learning models using the metrics module which quantifies the quality of predictions.
Calculate the root mean squared error (RMSE) of both the training and test sets for the linear model. Is the model overfitting?
Also, plot the model predictions on a scatter plot alongside the real values for the test data. All the data points would lie on a diagonal (prediction = real value) if the model were 100% accurate.
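A sketch of the evaluation (mean_squared_error returns the MSE, so take the square root for RMSE):

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import mean_squared_error

rmse_train = np.sqrt(mean_squared_error(y_train, model.predict(X_train)))
rmse_test = np.sqrt(mean_squared_error(y_test, model.predict(X_test)))
print(rmse_train, rmse_test)  # similar values suggest little overfitting

# Predicted vs. real values: a perfect model would follow the diagonal.
y_pred = model.predict(X_test)
plt.scatter(y_test, y_pred, s=2)
plt.plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()], 'r--')
plt.xlabel('Real value')
plt.ylabel('Predicted value')
plt.show()
```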
3) Linear models for regression
Much of machine learning is about fitting functions to data, and we begin with linear models, a class of models that are linear functions of the adjustable parameters. The simplest linear models are also linear functions of the input variables (simply known as linear regression). For example, for a 3-dimensional (D = 3) input $\mathbf{x} = (x_1, x_2, x_3)^T$, linear regression is a linear combination of the input variables plus a constant $b$:
$$f(\mathbf{x}; \mathbf{w}, b) = w_1 x_1 + w_2 x_2 + w_3 x_3 + b = \mathbf{w}^T \mathbf{x} + b \qquad (1)$$
where $\mathbf{w} = (w_1, w_2, w_3)^T$ is a 3-dimensional vector of weights and the constant bias $b$ gives the value of the function at the origin.
We want to find the parameters, $\mathbf{w}$ and $b$, of the linear function that best fits our training dataset of input-output pairs. We will first express our dataset of N examples as an N×D matrix called the design matrix, $X$, and collect the corresponding observed outputs into an N×1 column vector, $\mathbf{y}$.
Expressing the data in the form of a matrix and vector allows us to use the notation of linear algebra to derive the solution. This improves readability and maps more closely to how this is implemented efficiently in code with matrix-vector operations.
We can compute the total square error of the function values above, compared to the observed training set values:
$$E(\mathbf{w}, b) = \sum_{n=1}^{N} \big( y_n - f(\mathbf{x}_n; \mathbf{w}, b) \big)^2 \qquad (2)$$
The least-squares fitting problem is finding the parameters that minimise this error.
Note that there is a notational trick that allows the bias term, $b$, to be omitted from equations 1 and 2 above. If we construct our design matrix to include an additional column/dimension (so that it is now N×(D+1)) containing a vector of 1's, then the bias term can simply be interpreted as another weight (i.e. $w_0 = b$, where the corresponding input component $x_{n,0} = 1$ for all $n$).
3.1) Least squares fitting
We will begin by generating a series of points from a given quadratic (non-linear) equation with normally distributed noise added, i.e. $y = a x^2 + b x + c + \epsilon$ with $\epsilon \sim \mathcal{N}(0, \sigma^2)$.
Fit a linear function to the generated data, print $\mathbf{w}$ and $b$ (or $w_0$), and plot the learnt function.
As a helper (a sketch follows this list):
Create the array $X$ by concatenating a vector of ones to $\mathbf{x}$ (use np.concatenate)
Calculate $\mathbf{w}$ using np.linalg.lstsq
Generate predictions for the fitted function
Plot the learnt linear function alongside the original data points.
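A minimal sketch under assumed data (the quadratic coefficients and noise level are illustrative, not the lab's exact values):

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
x = np.linspace(-1, 1, 50).reshape(-1, 1)
y = 2 * x**2 + 0.5 * x + 1 + rng.normal(0, 0.2, x.shape)  # illustrative quadratic + noise

# Design matrix with a column of ones, so the bias is just another weight.
X = np.concatenate([np.ones_like(x), x], axis=1)
w, *_ = np.linalg.lstsq(X, y, rcond=None)
print(w)  # [w0 (bias), w1]

plt.scatter(x, y, s=10)
plt.plot(x, X @ w, 'r')
plt.show()
```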
Now use scikit-learn to fit the linear model. Do you get the same $\mathbf{w}$ and $b$ (coefficient and intercept)? Is the model overfitting or underfitting?
3.2) Non-linear regression: polynomial fitting
We can extend the class of linear models by considering linear combinations of a set of fixed non-linear functions, or basis functions, applied to the input data (see Bishop section 3.1 for more information on basis functions). Examples of basis functions include the Gaussian basis function and the sigmoid basis function, but here we will use polynomial basis functions. The purpose of doing this is to transform the data into a higher-dimensional space such that a linear function can be fitted to it.
To fit a polynomial function of order M, we use the following design matrix, with each row consisting of the polynomial basis functions applied to one input:
$$\Phi = \begin{pmatrix} 1 & x_1 & x_1^2 & \cdots & x_1^M \\ 1 & x_2 & x_2^2 & \cdots & x_2^M \\ \vdots & \vdots & \vdots & & \vdots \\ 1 & x_N & x_N^2 & \cdots & x_N^M \end{pmatrix}$$
Notice that the function we are fitting is now non-linear in $x$, but we can still apply linear regression in the same way as before because the function is still linear in the weights $\mathbf{w}$ (with $\Phi$ playing the role of the design matrix).
Use scikit-learn PolynomialFeatures to fit a second order polynomial to the data, plot the fit line.
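A sketch, reusing the x and y generated in the previous sketch:

```python
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

poly = PolynomialFeatures(degree=2)
X_poly = poly.fit_transform(x)  # columns: 1, x, x^2

model = LinearRegression()
model.fit(X_poly, y)

plt.scatter(x, y, s=10)
plt.plot(x, model.predict(X_poly), 'r')
plt.show()
```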
Increase the order of the polynomial. When does the model start to overfit? Which order of polynomial would you use?
Wrap up
That's it for lab 1, a revision of Jupyter Notebooks and NumPy as well as an introduction to Scikit-learn and regression. We have covered:
The interactive coding environment of Jupyter Notebooks, which allows Markdown and LaTeX to be added alongside code. Notebooks show the results of your computation right below your code, which is great for quick coding experiments and debugging. These features allow Notebooks to be used to create human-readable documents containing visualisations of results and analysis.
The NumPy package, which we used to perform a range of array operations, standardise the columns of an array, and reshape an image array into a column vector.
The machine learning library Scikit-learn, which can be used to import datasets, perform data preprocessing, and then train and evaluate models. We went through a basic machine learning pipeline with the California housing dataset.
We finally built a linear regression model from scratch by least squares fitting. Then we used scikit-learn to fit the same function with only a couple of lines of code, and used the same methodology to fit a non-linear function after a polynomial transformation.
References
Bishop, Pattern Recognition and Machine Learning: Chapter 3 for linear regression.
Materials used to create the lab
University of Bristol's Symbols, Patterns and Signals course
Andrew Ng's Neural Networks and Deep Learning course on Coursera