Lab 1: Python, Jupyter Notebooks, NumPy and an Introduction to Scikit-learn
This lab introduces Jupyter Notebooks and key Python libraries (such as NumPy and Matplotlib) as well as introducing scikit-learn, a popular machine learning library. It assumes you know the basics of the Python language, so if you need to, check out the University of Bristol Beginning Python course. (That's a course created by Matt Williams: you may find his other online training materials useful.) Finally, this lab will have some exercises where you use Scikit-learn for linear and logistic regression.
The following libraries will be used throughout the unit:
NumPy, for scientific computation
Pandas, for data analysis
Matplotlib, to plot any kind of data
Scikit-learn, for machine learning
The libraries above have thorough, high-quality documentation, which can be used to learn other features of the libraries or consulted for questions and examples. The documentation is available either online (links above) or via Python itself, e.g. help(numpy.array) in the Python interpreter.
If you are using the machines supplied in the lab rooms you can ensure you have access to all these libraries by following the Lab Work instructions given on the unit github page.
If you are using your own machine, note that the following libraries are required for this lab, so make sure they are installed using pip3 or Anaconda (using a virtual environment is recommended):
numpy
pandas
matplotlib
seaborn
scikit-learn
scikit-image
jupyterlab
For example, to install scikit-learn in a new conda environment, run
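A possible sequence of terminal commands (a sketch; the environment name intro_ml is illustrative):

```
conda create -n intro_ml python=3
conda activate intro_ml
conda install scikit-learn
```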
For further help, see the installation guides in the libraries' documentation.
Jupyter Notebook
This module's labs will be run on Jupyter Lab, an interactive coding environment embedded in a webpage, supporting various programming languages (Python, R, Lua, etc.) through the concept of kernels. From a command line, cd to the directory containing your lab notebooks, then call jupyter lab.
Jupyter allows you to enrich your code with complex comments formatted in Markdown and LaTeX, as well as to place the results of your computation right below your code.
Notebooks are organised in cells which can contain either code (in our case, this will be Python code) or text, which can be easily and nicely formatted using the Markdown notation.
To edit an existing cell, simply double-click on it. You can use the toolbar to insert new cells and to edit or delete them (or use keyboard shortcuts, which are very handy for speeding up coding).
Cells can be run by hitting shift+enter when editing a cell or by clicking the Run button at the top. Running a Markdown cell will simply display the formatted text, while running a code cell will execute the code it contains.
Note: when you run a code cell, all the variables created, functions implemented and libraries imported become available to every other code cell. However, notebooks are usually written assuming cells are run in order, so earlier cells provide the prerequisites for later ones. To reset all variables and functions (useful for debugging), simply click Kernel > Restart in the Jupyter menu.
A bit on Markdown language (and a bit of LaTeX and HTML) if you're interested
Markdown cells allow you to write fancy yet simple comments: all of this is written in Markdown - double-click on this cell to see the source. An introduction to Markdown syntax can be found here.
As Markdown is translated to HTML upon display, it also allows you to use pure HTML: more details are available here.
Finally, you can also display simple equations in Markdown thanks to MathJax support. For inline equations, wrap your equation between single $ symbols (e.g. $e^x$ renders inline); for display-mode equations, wrap it between $$ symbols to centre it on its own line.
1) NumPy
NumPy is designed for scientific computing. NumPy defines its own multidimensional array type, which can be created with np.array:
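A minimal sketch of array creation (the values are arbitrary):

```python
import numpy as np

# A 1-D array from a list, and a 2-D array from nested lists.
a = np.array([1, 2, 3])
M = np.array([[1, 2], [3, 4]])
print(a, M.shape)  # [1 2 3] (2, 2)
```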
For more details, type help(np.array) in your Python console or visit the online help here.
1.1) Array operations
Create two arrays, A and B, and perform the following operations, printing the resulting array C in each case:
$C = AB$ (dot product or inner product)
$C = A \circ B$ (Hadamard product or elementwise product)
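A sketch of both operations (the values of A and B are illustrative):

```python
import numpy as np

A = np.array([[1, 2], [3, 4]])
B = np.array([[5, 6], [7, 8]])

C = A @ B  # dot (matrix) product; equivalent to np.dot(A, B)
print(C)

C = A * B  # Hadamard (elementwise) product
print(C)
```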
1.2) More array operations
Calculate now the sum, mean, and variance of the matrix A, using the NumPy functions/array methods sum, mean, and var.
Hint: help(np.sum) or look here.
Hint: help(np.mean) or look here.
Hint: help(np.var) or look here.
Afterwards, calculate the sum of the rows and then the columns of A. Hint: specify the axis parameter.
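A sketch, reusing the matrix A from above:

```python
print(A.sum(), A.mean(), A.var())

# axis=1 operates across each row; axis=0 operates down each column.
print(A.sum(axis=1))  # row sums
print(A.sum(axis=0))  # column sums
```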
1.3) Implement the sigmoid function using numpy
The sigmoid function is a non-linear function used in machine learning (logistic regression) and also in deep learning (as an activation function):
$$\sigma(x) = \frac{1}{1 + e^{-x}}$$
where $x$ could now be either a real number, a vector, or a matrix.
Implement the sigmoid function by defining a function called sigmoid which takes one argument $x$, a scalar or numpy array of any size, and outputs $\sigma(x)$.
1.4) Standardise columns using numpy
A common technique used in machine learning is to standardise the data to ensure all features have values that lie on a comparable scale. Standardisation helps visualise data but also helps with convergence and to achieve high predictive performance for some machine learning algorithms.
To standardise a dataset we centre the data by subtracting the mean of each feature, then scale by dividing by the standard deviation of the feature. Assuming the data is arranged with features in columns and training instances in rows, standardisation will result in each column vector of the data matrix having a mean of 0 and standard deviation of 1.
Implement a standardiseCols(x) function to standardise the columns of a numpy array.
Note, in Python you are able to perform mathematical operations between arrays of different shapes (such as subtracting the row vector of means from a matrix) thanks to broadcasting; for more information read here.
Standardise the columns of the array below by calling the standardiseCols(x) function:
Print the mean and standard deviation of the columns of the standardised array.
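A minimal sketch, assuming a small illustrative array in place of the lab's own (broadcasting handles the shape mismatch between the matrix and the row vectors of statistics):

```python
import numpy as np

def standardiseCols(x):
    # Subtract each column's mean and divide by its standard deviation.
    return (x - x.mean(axis=0)) / x.std(axis=0)

x = np.array([[1.0, 200.0], [2.0, 300.0], [3.0, 400.0]])  # illustrative data
xs = standardiseCols(x)
print(xs.mean(axis=0))  # ~[0. 0.]
print(xs.std(axis=0))   # [1. 1.]
```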
1.5) Reshaping numpy arrays
The shape attribute and the reshape() method are commonly used in machine learning:
X.shape is used to get the shape (dimension) of a matrix/vector X.
X.reshape(...) is used to create a new array containing the elements of X with the provided shape.
For example, in computer vision, an image is represented by a 3D array of shape (height, width, 3), where the last dimension holds the three RGB (red, green, blue) colour channels. Let's first load and plot the image. In order for the image to be given as an input to a machine learning algorithm, the 3D array needs to be reshaped to a vector of shape (height × width × 3, 1); that's your task below.
Reshape the array to a vector and print the shape of the created vector:
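A sketch using a randomly generated stand-in image (the lab itself loads a real image):

```python
import numpy as np

image = np.random.rand(32, 32, 3)  # stand-in for a loaded RGB image
v = image.reshape(-1, 1)           # column vector of shape (height*width*3, 1)
print(v.shape)                     # (3072, 1)
```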
Note on array dimensions
The array below is a 1-dimensional array, which has some slightly non-intuitive effects; for example, its transpose is identical to the original array.
1-D arrays should be avoided; instead, column or row vectors should be used, which can be formed from 1-D arrays using reshape. Note the double square brackets in the output.
The row-vector form of the array is:
The column-vector form of the array is:
You can check the dimensions are what you want by using the assert command:
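A sketch of the distinction (the array a is illustrative):

```python
import numpy as np

a = np.array([1, 2, 3])  # 1-D array: shape (3,)
print(a.T.shape)         # (3,) -- transposing a 1-D array has no effect

row = a.reshape(1, -1)   # row vector: shape (1, 3), printed with double brackets
col = a.reshape(-1, 1)   # column vector: shape (3, 1)

assert row.shape == (1, 3)
assert col.shape == (3, 1)
```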
Note on function and object property
As Python is an object-oriented language, the difference between a function and an object property should be understood.
An object instance, e.g. the NumPy array A = np.array([[1, 2], [3, 4], [5, 6]]), provides all the methods of the class numpy.ndarray. Therefore, to sum all elements of array A we can choose between two approaches:
A.sum(), or
np.sum(A).
The first one is advisable.
Moreover, some objects have properties (e.g. the shape of an array). Instead of calling a shape function, an array object exposes the shape property, i.e.:
A.shape, or
np.shape(A).
Again, the first one is advisable.
2) Scikit-learn
Scikit-learn is an open source machine learning library that supports supervised and unsupervised learning. It provides various tools for model fitting, data preprocessing, model selection and evaluation, as well as many other utilities. The exercises take you through a very simple workflow for training and evaluating machine learning models.
Scikit-learn basics
2.1) Datasets
Scikit-learn can be used to import datasets, using the dataset loader to load small standard, or 'toy', datasets (such as iris classification or California housing prices) or the dataset fetcher to download and load larger datasets from the 'real world'.
Firstly, load the California housing dataset and print the number of examples and features and feature names in the dataset. Note, fetch_california_housing has already been imported from sklearn.datasets.
Then using seaborn create a pairplot (or scatterplot matrix) to show feature joint relationships and individual feature distributions. Note, the data must be in a pandas dataframe.
Pairplots are a quick and effective way to perform exploratory data analysis (EDA) to find patterns, relationships and anomalies to guide subsequent analysis. A pairplot allows us to see both the distribution of single variables and relationships between two variables.
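A sketch of both steps (plotting all eight features can be slow; a subset of columns works just as well for a first look):

```python
import pandas as pd
import seaborn as sns
from sklearn.datasets import fetch_california_housing

housing = fetch_california_housing()
print(housing.data.shape)     # (number of examples, number of features)
print(housing.feature_names)

# Pairplots require a pandas DataFrame.
df = pd.DataFrame(housing.data, columns=housing.feature_names)
sns.pairplot(df)
```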
2.2) Preprocessing
Scikit-learn has a preprocessing package which provides several common functions to change the raw data into something more suitable for a machine learning algorithm. In general, learning algorithms benefit from standardisation of the dataset; if some outliers are present, then robust scalers or transformers are more appropriate.
Standardise the California housing training dataset loaded in the previous exercise using StandardScaler. Check the features have zero mean and unit variance/standard deviation.
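A sketch, assuming the housing data loaded above:

```python
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_scaled = scaler.fit_transform(housing.data)

print(X_scaled.mean(axis=0))  # ~0 for every feature
print(X_scaled.std(axis=0))   # 1 for every feature
```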
2.3) Train test split
Training and evaluating a model on the same dataset will lead to overfitting and poor performance on unseen data. To measure the generalisation ability of a model, it is common practice to hold out part of the available data as a test set. Scikit-learn provides a train_test_split function which performs a random split into training and testing sets.
When evaluating different hyperparameters for machine learning models, there is still a risk of overfitting on the test set because the hyperparameters can be tweaked until the model performs optimally. To solve this problem, yet another part of the dataset can be held out as a so-called validation set, which is used for evaluating hyperparameter values before final evaluation on the test set. However, this can drastically reduce the number of samples available for model training, and the performance depends on the particular training and validation splits. To get around this problem, cross-validation can be used, which trains multiple models on "folds" of the training set; read about cross-validation in scikit-learn here.
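As an aside, a minimal sketch of cross-validation (cross_val_score handles the fold splitting internally; X_scaled is assumed from the preprocessing step):

```python
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# 5-fold cross-validation: fits 5 models, each validated on a different fold.
scores = cross_val_score(LinearRegression(), X_scaled, housing.target, cv=5)
print(scores, scores.mean())
```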
Your task is to split the (scaled) California housing dataset so that 70% of the data is used for training and the remaining 30% for testing.
How many examples are there in the training and test sets?
Should the random_state parameter be specified for the train_test_split function?
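A sketch of the split (fixing random_state makes the split reproducible, which is worth considering when answering the question above):

```python
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X_scaled, housing.target, test_size=0.3, random_state=42)

print(X_train.shape[0], X_test.shape[0])  # training and test set sizes
```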
2.4) Model fitting
Scikit-learn provides many built-in machine learning algorithms and models, called estimators. An estimator is any object that learns from data; it may be a classification, regression or clustering algorithm, or a transformer that extracts/filters useful features from raw data. Each estimator can be fitted to some data using its fit method, as was done in the previous exercise to standardise the raw data.
Fit a simple linear regression model to the training data from the train-test split.
Print the linear model coefficients.
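A sketch, assuming the train/test split above:

```python
from sklearn.linear_model import LinearRegression

model = LinearRegression()
model.fit(X_train, y_train)

print(model.coef_)       # one weight per feature
print(model.intercept_)  # bias term
```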
2.5) Model evaluation
Scikit-learn supports simple and quick evaluation of machine learning models using the metrics module which quantifies the quality of predictions.
Calculate the root mean squared error (RMSE) of both the training and test sets for the linear model. Is the model overfitting?
Also, plot the model predictions on a scatter plot alongside the real values for the test data. All the data points would lie on a diagonal (prediction = real value) if the model were 100% accurate.
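A sketch of the evaluation (mean_squared_error returns the MSE, so take the square root for RMSE):

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import mean_squared_error

rmse_train = np.sqrt(mean_squared_error(y_train, model.predict(X_train)))
rmse_test = np.sqrt(mean_squared_error(y_test, model.predict(X_test)))
print(rmse_train, rmse_test)  # similar values suggest little overfitting

# Predicted vs. real values: a perfect model would follow the diagonal.
y_pred = model.predict(X_test)
plt.scatter(y_test, y_pred, s=2)
plt.plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()], 'r--')
plt.xlabel('Real value')
plt.ylabel('Predicted value')
plt.show()
```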
3) Linear models for regression
Much of machine learning is about fitting functions to data, and we begin with linear models, a class of models that are linear functions of the adjustable parameters. The simplest linear models are also linear functions of the input variables (simply known as linear regression). For example, for a 3-dimensional (D = 3) input $\mathbf{x} = (x_1, x_2, x_3)^T$, linear regression is a linear combination of the input variables plus a constant $b$:
$$f(\mathbf{x}; \mathbf{w}, b) = w_1 x_1 + w_2 x_2 + w_3 x_3 + b = \mathbf{w}^T \mathbf{x} + b \qquad (1)$$
where $\mathbf{w} = (w_1, w_2, w_3)^T$ is a 3-dimensional vector of weights and the constant bias $b$ gives the value of the function at the origin.
We want to find the parameters, $\mathbf{w}$ and $b$, of the linear function that best fits our training dataset of input-output pairs. We will first express our dataset of N examples as an N×D matrix called the design matrix, $X$, and collect the corresponding observed outputs into an N×1 column vector, $\mathbf{y}$.
Expressing the data in the form of a matrix and vector allows us to use the notation of linear algebra to derive the solution. This improves readability and maps more closely to how this is implemented efficiently in code with matrix-vector operations.
We can compute the total square error of the function values above, compared to the observed training set values:
$$E(\mathbf{w}, b) = \sum_{n=1}^{N} \big( y_n - f(\mathbf{x}_n; \mathbf{w}, b) \big)^2 \qquad (2)$$
The least-squares fitting problem is finding the parameters that minimise this error.
Note that there is a notational trick that allows the bias term, $b$, to be omitted from equations 1 and 2 above. If we construct our design matrix to include an additional column/dimension (so that it is now N×(D+1)) containing a vector of 1's, then the bias term can simply be interpreted as another weight (i.e. $w_0 = b$, where the corresponding input component $x_{n,0} = 1$ for all $n$).
3.1) Least squares fitting
We will begin by generating a series of points from a given quadratic (non-linear) equation with normally distributed noise added, i.e. $y = a x^2 + b x + c + \epsilon$ with $\epsilon \sim \mathcal{N}(0, \sigma^2)$.
Fit a linear function to the generated data, print $\mathbf{w}$ and $b$ (or $w_0$), and plot the learnt function.
As a helper (a sketch follows this list):
Create the array $X$ by concatenating a vector of ones to $\mathbf{x}$ (use np.concatenate)
Calculate $\mathbf{w}$ using np.linalg.lstsq
Generate predictions for the fitted function
Plot the learnt linear function alongside the original data points.
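A minimal sketch under assumed data (the quadratic coefficients and noise level are illustrative, not the lab's exact values):

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
x = np.linspace(-1, 1, 50).reshape(-1, 1)
y = 2 * x**2 + 0.5 * x + 1 + rng.normal(0, 0.2, x.shape)  # illustrative quadratic + noise

# Design matrix with a column of ones, so the bias is just another weight.
X = np.concatenate([np.ones_like(x), x], axis=1)
w, *_ = np.linalg.lstsq(X, y, rcond=None)
print(w)  # [w0 (bias), w1]

plt.scatter(x, y, s=10)
plt.plot(x, X @ w, 'r')
plt.show()
```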
Now use scikit-learn to fit the linear model. Do you get the same $\mathbf{w}$ and $b$ (coefficient and intercept)? Is the model overfitting or underfitting?
3.2) Non-linear regression: polynomial fitting
We can extend the class of linear models by considering linear combinations of a set of fixed non-linear functions, or basis functions, applied to the input data (see Bishop section 3.1 for more information on basis functions). Examples of basis functions include the Gaussian basis function and the sigmoid basis function, but here we will use polynomial basis functions. The purpose of doing this is to transform the data into a higher-dimensional space such that a linear function can be fitted to it.
To fit a polynomial function of order M, we use the following design matrix, with each row consisting of the polynomial basis functions applied to one input:
$$\Phi = \begin{pmatrix} 1 & x_1 & x_1^2 & \cdots & x_1^M \\ 1 & x_2 & x_2^2 & \cdots & x_2^M \\ \vdots & \vdots & \vdots & & \vdots \\ 1 & x_N & x_N^2 & \cdots & x_N^M \end{pmatrix}$$
Notice that the function we are fitting is now non-linear in $x$, but we can still apply linear regression in the same way as before because the function is still linear in the weights $\mathbf{w}$ (with $\Phi$ playing the role of the design matrix).
Use scikit-learn PolynomialFeatures to fit a second order polynomial to the data, plot the fit line.
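A sketch, reusing the x and y generated in the previous sketch:

```python
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

poly = PolynomialFeatures(degree=2)
X_poly = poly.fit_transform(x)  # columns: 1, x, x^2

model = LinearRegression()
model.fit(X_poly, y)

plt.scatter(x, y, s=10)
plt.plot(x, model.predict(X_poly), 'r')
plt.show()
```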
Increase the order of the polynomial. When does the model start to overfit? Which order of polynomial would you use?
Wrap up
That's it for lab 1, a revision of Jupyter Notebooks and NumPy as well as an introduction to Scikit-learn and regression. We have covered:
The interactive coding environment of Jupyter Notebooks, which allows Markdown and LaTeX to be added alongside code. Notebooks show the results of your computation right below your code, which is great for quick coding experiments and debugging. These features allow Notebooks to be used to create human-readable documents containing visualisations of results and analysis.
The NumPy package, which we used to perform a range of array operations, standardise the columns of an array, and reshape an image array into a column vector.
The machine learning library Scikit-learn, which can be used to import datasets, perform data preprocessing, and then train and evaluate models. We went through a basic machine learning pipeline with the California housing dataset.
We finally built a linear regression model from scratch by least squares fitting. Then we used scikit-learn to fit the same function with only a couple of lines of code, and used the same methodology to fit a non-linear function after a polynomial transformation.
References
Bishop, Pattern Recognition and Machine Learning: Chapter 3 for linear regression.
Materials used to create the lab
University of Bristol's Symbols, Patterns and Signals course
Andrew Ng's Neural Networks and Deep Learning course on Coursera