Real-time collaboration for Jupyter Notebooks, Linux Terminals, LaTeX, VS Code, R IDE, and more,
all in one place. Commercial Alternative to JupyterHub.

| Download

Project: Computational Tutorials

Path: LinearRegression.ipynb

Views: ¹¹

Visibility: Unlisted (only visible to those who know the link)

Image: ubuntu2204

Kernel: Python 3 (system-wide)

Linear Regression Using Python

Theory

Suppose that you want to fit a set data points $(x_i,y_i)$ , where $i = 1,2,\ldots,N$ , to a straight line, $y=ax+b$ . The process of determining the best-fit line is called linear regression. This involves choosing the parameters $a$ and $b$ to minimize the sum of the squares of the differences between the data points and the linear function. How the difference are defined varies. If there are only uncertainties in the y direction, then the differences in the vertical direction (the gray lines in the figure below) are used. If there are uncertainties in both the $x$ and $y$ directions, the orthogonal (perpendicular) distances from the line (the dotted red lines in the figure below) are used.

Image from http://blog.rtwilson.com/orthogonal-distance-regression-in-python/

For the case where there are only uncertainties in the y direction, there is an analytical solution to the problem. If the uncertainty in $y_i$ is $\sigma_i$ , then the difference squared for each point is weighted by $w_i=1/\sigma_i^2$ . If there are no uncertainties, each point is given an equal weight of one. The function to be minimized with respect to variations in the parameters, $a$ and $b$ , is

\chi^2 = \sum_{i=1}^N w_i \left[y_i - \left(ax_i+b\right)\right]^2.

The analytical solutions for the best-fit parameters that minimize $\chi^2$ (see pp. 181-189 of An Introduction to Error Analysis: The Study of Uncertainties in Physical Measurements by John R. Taylor, for example) are

a=\frac{\sum w_i \sum w_i x_i y_i - \sum w_i x_i \sum w_i y_i}{\sum w_i \sum w_i x_i^2 - \left(\sum w_i x_i\right)^2}

and $b=\frac{\sum w_i y_i \sum w_i x_i^2 - \sum w_i x_i \sum w_i x_i y_i}{\sum w_i \sum w_i x_i^2 - \left(\sum w_i x_i\right)^2}.$ The uncertainties in the parameters are

\sigma_a=\sqrt{\frac{\sum w_i}{\sum w_i \sum w_i x_i^2 - \left(\sum w_i x_i\right)^2}}

and $\sigma_b=\sqrt{\frac{\sum w_i x_i^2}{\sum w_i \sum w_i x_i^2 - \left(\sum w_i x_i\right)^2}}.$ All of the sums in the four previous equations are over $i$ from 1 to $N$ .

For the case where there are uncertainties in both $x$ and $y$ , there is no analytical solution. One method that can be used in that case is called orthogonal distance regression (ODR).

Implementation in Python

The linear_fit function that performs these calculations is defined in the file "fitting.py" which must be located in the same directory as the Python program using it. If there are no uncertainties or only uncertainties in the $y$ direction, the analytical expressions above are used. If there are uncertainties in both the $x$ and $y$ directions, the scipy.odr module is used.

An example of performing a linear fit with uncertainties in the $y$ direction is shown below. The first command imports the function. Arrays containing the data points ( $x$ and $y$ ) are sent to the function. If only one array of uncertainties (called $yerr$ in the example) is sent, they are assumed to be in the $y$ direction. In the example, the array function (from the pylab library) is used to turn lists into arrays. It is also possible to read data from a file. The fitting function returns the best-fit parameters (called $a$ and $b$ in the example), their uncertainties (called $sa$ and $sb$ in the example), the reduced chi squared, and the degrees of freedom (called $rchi2$ and $dof$ in the example). The last two quantities are defined in the next section.

In [3]:

from fitting import linear_fit
import pylab as pl

# Turn lists of numbers into arrays
x = pl.array([0.0, 2.0, 4.0, 6.0, 8.0])
y = pl.array([1.1, 1.9, 3.2, 4.0, 5.9])
yerr = pl.array([0.1, 0.2, 0.1, 0.3, 0.3])

a, b, sa, sb, rchi2, dof = linear_fit(x, y, yerr)

print('y = ax + b')
print('a = ', a, ' +/- ', sa)
print('b = ', b, ' +/- ', sb)

Out[3]:

results of linear_fit:
   reduced chi squared =  1.8529522376833425
   degrees of freedom =  3
y = ax + b
a =  0.5475554719819483  +/-  0.027442716158878274
b =  1.0352012034599476  +/-  0.09253847679669566

An example of performing a linear fit with uncertainties in both the $x$ and $y$ direction is shown below. Arrays containing the data points ( $x2$ and $y2$ ) and their uncertainties ( $y2err$ and $x2err$ ) are sent to the function. Note the order of the uncertainties! The uncertainty in $x$ is optional, so it is second. This is also consistent with the errorbar function (see below).

In [4]:

from fitting import linear_fit
import pylab as pl

# Test data and uncertainties
x2 = pl.array([0.0, 0.9, 1.8, 2.6, 3.3, 4.4, 5.2, 6.1, 6.5, 7.4])
x2err = pl.array([0.03162, 0.03162, 0.04472, 0.03536, 0.07071, 0.1118, 0.1291, 0.2236, 0.7454, 1.0])
y2 = pl.array([5.9, 5.4, 4.4, 4.6, 3.5, 3.7, 2.8, 2.8, 2.4, 1.5])
y2err = pl.array([1.0, 0.7454, 0.5, 0.3536, 0.2236, 0.2236, 0.1118, 0.1118, 0.1, 0.04472])

a2, b2, sa2, sb2, rchi2_2, dof2 = linear_fit(x2,y2,y2err,x2err)

print('y = ax + b')
print('a = ', a2, ' +/- ', sa2)
print('b = ', b2, ' +/- ', sb2)

Out[4]:

results of linear_fit:
   reduced chi squared =  1.5306287931648985
   degrees of freedom =  8
y = ax + b
a =  -0.4804893050019209  +/-  0.07149325586292532
b =  5.477489292698448  +/-  0.36500292368632054

Intrepeting the Results

Plotting data with error bars and a best-fit line together gives a rough idea of whether or not the fit is good. If the line passes within most of the error bars, the fit is probably pretty good. The first line below makes a list of 100 points between the minimum and maximum values of $x$ in the data. The second line finds the value of $y$ for the best fit line.

In [5]:

xf = pl.linspace(min(x),max(x),100)
yf = a*xf + b
pl.figure()
pl.scatter(x,y,label='data')  # plot data
pl.errorbar(x, y, yerr, ls='None', capsize=2)  # add error bars to data
pl.plot(xf,yf,"r-",label='best-fit line')  # plot best-fit line
pl.legend(loc='upper left')
pl.show()

Out[5]:

When plotting the data with errorbars in both directions, note that the array of uncertainties in the $y$ direction ( $y2err$ ) comes before the array of uncertainties in the $x$ direction ( $x2err$ ) in the errorbar function. In this case, points were not plotted for the data because they would hide the smallest error bars.

In [6]:

x2f = pl.linspace(min(x2),max(x2),100)
y2f = a2*x2f + b2
pl.figure()
pl.errorbar(x2, y2, y2err, x2err, ls='None', label='data', capsize=2)  # plot data w/o points since some error bars are small
pl.plot(x2f,y2f,"r-",label='best-fit line')  # plot best-fit line
pl.legend(loc='upper right')
pl.show()

Out[6]:

The reduced chi squared and the degrees of freedom can also be used to judge the goodness of the fit. If $N$ is the number of data points and $C$ is the number of parameters (or constraints) in the fit, the number degrees of freedom is

d = N - C.

For a linear fit, $C = 2$ because there two parameters for a line. The reduced chi squared is defined as

\tilde{\chi}^{\, 2} = \frac{\chi^2}{d}.

According to Taylor (p. 271), “If we obtain a value of $\tilde{\chi}^{\, 2}$ of order one or less, then we have no reason to doubt our expected distribution; if we obtain a value of $\tilde{\chi}^{\, 2}$ much larger than one, our expected distribution is unlikely to be correct.” For an observed value (from fitting data) of the reduced chi square ( $\tilde{\chi}^{\, 2}_o$ ), you can look up the probability of randomly getting a larger $\tilde{\chi}^{\, 2}$ with $d$ degrees of freedom on the table below (from Appendix D of Taylor’s book). A typical standard is to reject a fit if

Prob_d\left(\tilde{\chi}^{\, 2} \ge \tilde{\chi}^{\, 2}_o \right) < 5\%.

In other words, if the reduced chi squared for a fit is unlikely to occur randomly, then the fit is not a good one. In the first example above, five data points are fit with a line and $\tilde{\chi}^{\, 2} = 1.85$ . Since $d = 5 - 2 = 3$ , the table gives

Prob_d\left(\tilde{\chi}^{\, 2} \ge \tilde{\chi}^{\, 2}_o \right) \approx 14\%,

and there is no reason to reject the fit.

From Error Analysis by John Taylor

Linearization

Some other functions with two parameters can be linearized so that a linear fit can be used. There are two common examples. First, if data is expected to fit an exponential function

y(x)=Ce^{Dx},

then

\ln y = Dx + \ln C.

Therefore, $\ln y$ vs. $x$ can be fit with a line in order to determine $C$ and $D$ . Don’t forget that the uncertainty of $\ln y$ is not the same as the uncertainty of $y$ . Also, the linear fit gives $\ln C$ and its uncertainty, which is not the same as the uncertainty of $C$ . The uncertainties must be propagated correctly. Second, if data is expected to fit power law of the form

y(x) = kx^n,

then

\log y = n \log x + \log k.

Note that the base 10 logarithm is typically used for power laws. In this case, $\log y$ vs. $\log x$ can be fit with a line. Don't forget that the uncertainties of $\log x$ and $\log y$ are not the same as the uncertainties of $x$ and $y$ . The power $n$ is the slope, so its uncertainty comes directly from the fit. However, the uncertainty of the $\log k$ from the y intercept is not the same as the uncertainty of $k$ . That uncertainty must be propagated.

Additional Documentation

More information is available at: https://docs.scipy.org/doc/scipy/reference/odr.html

Real-time collaboration for Jupyter Notebooks, Linux Terminals, LaTeX, VS Code, R IDE, and more,
all in one place. Commercial Alternative to JupyterHub.

Linear Regression Using Python

Theory

Implementation in Python

Intrepeting the Results

Linearization

Additional Documentation

Product

Resources

Company

Real-time collaboration for Jupyter Notebooks, Linux Terminals, LaTeX, VS Code, R IDE, and more, all in one place. Commercial Alternative to JupyterHub.

Linear Regression Using Python

Theory

Implementation in Python

Intrepeting the Results

Linearization

Additional Documentation

Real-time collaboration for Jupyter Notebooks, Linux Terminals, LaTeX, VS Code, R IDE, and more,
all in one place. Commercial Alternative to JupyterHub.