The following exercises illustrate different aspects of learning theory by using linear regression in a 2-dimensional space as a working learning model. We won't look at the theory of linear regression; we will just use it as a black box in order to understand some fundamental concepts of the theory of learning. Also, we won't look at the varied mathematical detail of the concepts we are introducing. So this is a tutorial without mathematics or analysis.
Now it's important to realise that in data science you will rarely use this kind of simple 2-dimensional modelling. The problem is stylised, and you rarely get simple Gaussian noise like this. In many ways this is a trivial, unrealistic example of learning and statistics. However, the simple nature of the material means we can carefully study the different aspects of learning, and discuss some of the behaviour that theory tells us about. So this is excellent material for a tutorial.
Now the name linear regression is a bit confusing because the model allows non-linear curves to be fit to data. The linear part refers to the structure of the model. We are only considering the 1-dimensional case for simplicity, so we estimate $y$ in terms of a single input $x$. A simple quadratic model, then, would make an estimate for $y$ using the form: $$a\cdot 1 + b\cdot x + c\cdot x^2$$ where we have a vector of 3 parameters, $(a,b,c)$, to learn, and the function for $y$ is represented as a linear combination of the basis functions $(1,x,x^2)$. The vector of parameters is called the coefficients of the linear regression.
Below we plot four different versions of this function for different values of the coefficients (that is, $(a,b,c)$).
%matplotlib inline
import matplotlib.pyplot as pl
import numpy
# generate a list of points between (0,5)
x = numpy.arange(0., 5., 0.1)
def makelabel(a,b,c):
    return '(' + str(a) + ',' + str(b) + ',' + str(c) + ')'
# now plot some curves with different coefficients
[a,b,c] = [0,1,0.1]
pl.plot(x, a + b*x + c*x**2, label=makelabel(a,b,c))
[a,b,c] = [0,2,-0.1]
pl.plot(x, a + b*x + c*x**2, label=makelabel(a,b,c))
[a,b,c] = [-5,0.5,1]
pl.plot(x, a + b*x + c*x**2, label=makelabel(a,b,c))
[a,b,c] = [20,0,-1]
pl.plot(x, a + b*x + c*x**2, label=makelabel(a,b,c))
# move the legend out of the way
# bbox_to_anchor=(1.4, 1.0) positions the legend relative to
# the top right corner of the axes
pl.legend(bbox_to_anchor=(1.4, 1.0))
pl.ylabel('y')
pl.xlabel('x')
pl.show()
A general purpose 3-term linear model would make an estimate for $y$ using the form: $$a\cdot f(x) + b\cdot g(x) + c\cdot h(x)$$ where we have a vector of 3 parameters, $(a,b,c)$, to learn and the function for $y$ is represented as a linear combination of the basis functions $(f(x),g(x),h(x))$. The example above uses the mapping $$ f(x) \mapsto 1 ~,~~~~ g(x) \mapsto x ~,~~~~ h(x) \mapsto x^2 $$
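To make this concrete, here is a minimal sketch of evaluating such a model for an arbitrary set of basis functions; the names linear_model and basis are just for this illustration and are not part of the tutorial's library.
import numpy
# evaluate a linear model y = a*f(x) + b*g(x) + c*h(x) for given coefficients
def linear_model(x, coeffs, basis):
    # sum of coefficient * basis-function(x) over all terms
    return sum(c * fn(x) for c, fn in zip(coeffs, basis))
# the basis (1, x, x^2) used in the quadratic example above
basis = [lambda x: numpy.ones_like(x),
         lambda x: x,
         lambda x: x**2]
x = numpy.arange(0., 5., 0.1)
y_hat = linear_model(x, [0, 1, 0.1], basis)   # same curve as the first one plotted above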
Now we will use two different black-box regression tools for this exercise. The first does linear regression to the $N$-th order using powers of $x$, so the $N+1$ basis functions are $(1,x,x^2,\ldots,x^{N})$. We call this Regression with Simple Powers. The second does a special version of smoothing or regularisation using Legendre Polynomials that we call Regression with Smoothed Legendre Polynomials. For the purposes of this activity, treat these as black boxes; you do not need to understand how they work.
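As an aside (and not as a substitute for the black boxes used below, whose internals may differ), a plain unregularised least-squares fit with each kind of basis can be sketched with standard numpy routines:
import numpy
from numpy.polynomial import legendre
# illustrative fits only: a power basis via numpy.polyfit and a
# Legendre basis via numpy.polynomial.legendre.legfit, both 4th order
x = numpy.linspace(0., 10., 30)
y = numpy.sin(x*2.0)*numpy.sqrt(x)/3.3      # a noise-free example signal
power_coeffs = numpy.polyfit(x, y, 4)       # coefficients for (x^4, ..., x, 1)
leg_coeffs = legendre.legfit(x, y, 4)       # coefficients for P_0, ..., P_4
y_power = numpy.polyval(power_coeffs, x)    # evaluate the power-basis fit
y_leg = legendre.legval(x, leg_coeffs)      # evaluate the Legendre-basis fit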
The dataset and display have a number of characteristics. You can change the sample size, the range for $x$, the plotted range for $y$, the noise configuration, and the function being estimated by altering the values below.
We restrict the $x$ values to a given range and sample a number of points, uniformly or at random:
The $y$ will take on any value depending on noise level and the true function:
When plotting, we restrict $y$ values to a fixed range for convenience:
First we generate a data set with no noise, so the points lie perfectly on the "true" curve.
# libraries
%matplotlib inline
import matplotlib.pyplot as pl
import numpy
import sys
import os
# do this so it finds our own library
sys.path.append(os.getcwd())
from regressiondemo import *
##############################
# Customize the demo here
# =======================
# the "true" function of x
def truefunc(x):
    return numpy.sin(x*2.0)*numpy.sqrt(x)/3.3
# set range for x
# must be compatible with "true" function
(xmin, xmax) = (0, 10)
# when fitting goes wild, need to constrain what y's are plotted
# must be compatible with "true" function
ydisplaymin = -1.8
ydisplaymax = 1.8
# don't make points more than 100 as demo is O(points^3)
points = 30
# noise level (std.dev)
setSigma(0.2)
##############################
x = makeX(points)
# build the true values matching the sampled data
yt = numpy.array(list(map(truefunc, x)))
# xts and yts store the "true" function for the purposes of plotting
# these have to be high frequency to make the resultant plot look
# like a smooth curve
xts = makeX(200)
yts = numpy.array(list(map(truefunc, xts)))
# now plot
pl.plot(x, yt, 'bo')
pl.plot(xts, yts,label = 'truth')
pl.ylabel('y')
pl.xlabel('x')
pl.legend()
pl.show()
The most common approach is to add the noise to $y$. Here we will add Gaussian noise with a standard deviation of sigma. But you can modify the code to make it Laplacian noise, which has heavier tails and so is a bit wilder.
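For reference, here is a minimal sketch of both kinds of noise using numpy.random; the actual noise below is added by addNoise from regressiondemo, whose implementation may differ.
import numpy
rng = numpy.random.default_rng()
sigma = 0.2
def add_gaussian_noise(yt):
    # Gaussian noise with standard deviation sigma
    return yt + rng.normal(0.0, sigma, size=len(yt))
def add_laplacian_noise(yt):
    # Laplace(0, b) has standard deviation b*sqrt(2), so use b = sigma/sqrt(2)
    # to match the Gaussian's standard deviation
    return yt + rng.laplace(0.0, sigma/numpy.sqrt(2), size=len(yt))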
We will plot the data, the "true" function, and the pure noise (data-"truth") below.
y = addNoise(yt)
# plot noise
pl.plot(x, y-yt, 'bo') # points instead of lines
pl.ylabel('y')
pl.xlabel('x')
pl.suptitle('Noise')
pl.xlim(xmin,xmax)
pl.show()
# plot data and the curve
pl.plot(x, y, 'bo') # for data, plot points instead of lines
pl.ylabel('y')
pl.xlabel('x')
pl.xlim(xmin,xmax)
pl.plot(xts, yts, label='truth')  # the "true" curve drawn as a line
pl.legend()
pl.suptitle('Data + Function')
pl.show()
We will use the standard linear regression function available in numpy, numpy.polyfit().
What this does is fit a function $f(x)$ to a sequence of data points $$(x_1,y_1),\ldots,(x_N,y_N)$$ so as to minimise the squared error between the data and the proposed function, given by
$$\sum_{i=1}^N (y_i-f(x_i))^2 $$
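As a quick illustration, this quantity can be computed directly once a fit is available; the helper name squared_error is just for this sketch.
import numpy
def squared_error(x, y, f):
    # sum of squared residuals between the data (x, y) and a fitted function f
    return numpy.sum((y - f(x))**2)
# e.g. for a 4th-order polynomial fitted with numpy.polyfit:
# coeffs = numpy.polyfit(x, y, 4)
# err = squared_error(x, y, lambda x: numpy.polyval(coeffs, x))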
The underlying algorithm is not important for us currently, but you do need to see how to call the functions in Python. To show how it's done, we fit a simple 4th-order polynomial. Most of the code is for the plotting. Note the "true" function is plotted with a thick line, and the simple polynomial fit is overlaid.
# plot data and the truth
pl.plot(xts, yts,label = 'truth', linewidth=2)
pl.plot(x, y, 'o')
pl.ylabel('y')
pl.xlabel('x')
order = 4
# build the fitted poly curve (xts,ys) from order-th regression
ys = linReg(x,y,xts,order)
# plot fitted curve
pl.plot(xts, ys,label = 'poly ' + str(order), linewidth=1.25 )
# the y range for the plot has to be altered depending on the data
pl.ylim(ydisplaymin,ydisplaymax)
pl.legend()
pl.suptitle('A simple polynomial fit')
pl.show()
Low-order linear regression doesn't work too well for complex functions.
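One way to see this, reusing linReg exactly as it was called above, is to overlay fits of several different orders on the same plot; the particular orders here are just a suggestion.
# Overlay fits of several polynomial orders against the truth.
# Assumes x, y, xts, yts, ydisplaymin, ydisplaymax and linReg are
# defined as in the cells above.
pl.plot(xts, yts, label='truth', linewidth=2)
pl.plot(x, y, 'o')
for order in [2, 4, 8]:
    ys = linReg(x, y, xts, order)
    pl.plot(xts, ys, label='poly ' + str(order))
pl.ylim(ydisplaymin, ydisplaymax)
pl.ylabel('y')
pl.xlabel('x')
pl.legend()
pl.suptitle('Fits of different orders')
pl.show()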
# Student exercises: try 3rd, 5th and 6th order fits.
# Try your own data, e.g. height vs. age: from 0 to 15, is it linear? from 50 to 80, also linear? from 0 to 100, a slight curve?
# Try an S-curve, e.g. world population:
# https://en.wikipedia.org/wiki/World_population
# Other datasets: crickets, patients data.