This notebook provides an introduction to using Python for data analysis including storing and loading data, visualising data and performing regression analysis. Also download the Data folder here https://cocalc.com/share/public_paths/9b2872cc1e75acba6ebb8d9c5a1bf4a226ea55ba.
License: GPL3
ubuntu2004
Cookies and Code
Python fits my data!
A key skill for any scientific problem solver is the analysis and visualisation of data. The data from investigations typically require analysis to determine a key measurement and the respective errors. There are typical two main types of software which we can use:
Spreadsheet approach: Excel or Origin Pro.
Pros: graphical interface is attractive to beginners
Cons: must naviagte menus and use left/right clicks to perform analysis, method is also not easily checkable
Code based approach:
Pros: the analysis procedure is a clear list of commands that are reproducible, efficient use of memory,
Cons: some users not confident with programming
Here in this session, we will introduce the code based approach via:
Useful modules for data analysis
Storing and loading data
Visualising data
Linear regression using the least-squares and chi-squares methods.
Before you start, if needed, increase the font size of this document using the magnifying icon above.
You are also free to write code here in this notebook or create your own notebook by clicki
Useful modules for data analysis
Modules are Python files containing blocks of code defining useful functions. Importing modules enable you to extend the functions available for use in your Python code. There are modules for all sorts of things such as data analysis, game development, and web programming. Commonly used modules relevant to data analysis include:
numpy: numerical python which contains functions for creating arrays of data and perfoming maths on them
pandas: data structures (like spread sheets) and data analysis tools
matplotlib.pyplot for visualising data
seaborn for visualising data
scipy.stats: statistical analysis tools
statsmodels: expands on SciPy and integrates with Pandas
These modules are typically written by the community of programmers. You can also package your own functions into a .py file and import them as a module.
Import modules
In Python we use the import
function to load the contents of a module into memory.
When we import the contents of modules via
the functions are then available to us via:
Code creation: Import some modules
Use the cell below to import:
the
numpy
modulecheck that you can call simple
numpy
functions such as:numpy.sqrt(25)
numpy.cos(2*numpy.pi)
numpy.random()
numpy.array([18, 11.1, 120, 5])
Try assigning variables to the output of these commands
x = numpy.array([18, 11.1, 120, 5])
Try performing operations on these variables:
a = np.sqrt(x)
b = x**2
Try printing the different variables to screen to check the results
We can also shorten the module name using an alias:
and now our funtions
We tend to use standard aliases for modules so we can all understand. For example:
numpy: np
matplotlib.pyplot: plt
Code creation: Shorten the module name using an alias
Let's try importing to a short alias name:
Try importing
numpy
to an aliasnp
Run some numpy commands using the
np.functionName()
sytax such as creating 1D arrays:x = np.array([ 2.3, 11.1, 21.55, 66])
y = x**2
Storing data
One of the most important features of numpy is it's arrays. Arrays are multidimensional data sets (mentioned briefly last session). An important distinction between arrays and lists or tuples is that maths can be performed on arrays. Hence, arrays form the backbone of any data analysis in Python.
Arrays can be visualised as rows and columns. Most users work with simple 1D arrays (i.e. a single row) or 2D arrays. For a 2D array of data, visualise horizontal rows and vertical columns. This array below has 3 rows and 4 columns containing 12 elements:
As you can see from this visual example, indexing of a 2D array requires both row and column numbers. Location of data within an array are defined using arrayname[row, column]
syntax. For example, if we give the above array the name mydata
:
mydata[1, 3]
contains the data7
remember indexing starts at zero
mydata[2, 3]
or indeedmydata[-1, -1]
contains the data9
Arrays can also be sliced like other data sets. For example, for the array shown above:
x = mydata[:, 1]
would assign all of row values from the second columnSlicing the array in this way would give a new 1D array of
x = [8, 11, 5]
Creating and manipulating these multidimensional arrays in Python is completed using numpy, for example:
There are many different functions that can be used to create and manipulate arrays. Here are a few of our favorites:
linspace(start, end, n)
: 1D array from start to end with n elementsarange(start, end, step)
: 1D array with values increasing with valuestep
full([nrows, ncols], value)
: 2D array of rows and columns all filled with somevalue
zeros([nrows, ncols])
: an array filled with zerosrandom.random([nrows, ncols])
: an array filled with random floating point numbers from 0-1
Time to tinker: Play with some numpy functions
Use the cell below to define some simply arrays using the functions described above.
for example, use
linespace()
to define a 1D array of valuesChange the sizes, values
Check that you can build some initive understanding of arrays
Try printing different index ranges i.e. slicing
Loading and saving data to file
So far we have worked with data we created within the code. For big or existing data sets we store the data in files. These are simple text files with extensions of:
.txt
: text.csv
: comma separated variables
The choice of extension is not so important, the contents can be the same text based format.
Our choice of using numpy
to store our data provides a number of functions to load and save data:
loadtxt
: Load data from a text file to an arraysavetxt
: Save an array to a text filegenfromtxt
: Load data from a text file, with missing values handled as specified
Load data from a file
We are going to load a dataset of the world population (in billions) over the last 5 decades.
Next we like to verify what data has been loaded into memory. For this you can use the print
function:
or simply use the %whos
magic function to see which variables we have in memory:
In memory, we can see that the data has been stored as a 6X2 numpy array containing 12 elements requiring 96 bytes of the computers RAM.
Since we wish to perform some analysis of the data, we may like to break it up to specific variables such as:
Alternatively, you can do this when loading the data:
Notice here we need to add the flag 'unpack=True' to unpack columns to to the new variables. Again let's check that the data is as expected:
Time to tinker:
Try loading the data from any of the other data files in the same folder as population.csv
.
Saving data to a file
When using numpy, we can save data with the savetxt
function. Here we have specified saving the data to 5 decimal places i.e it's precision. You can find more information about the syntax of the formating fmt
argument here.
Saving data as columns
It is also possible to specify a header line to explain the data or what delimiter to use. For example:
Saving data with a header line
It is also possible to specify a header line to explain the data or what delimiter to use. For example:
Here, we have made some x, y data and stored it as 2 columns using np.c_[]
with a useful header. The \n
is code for "new line" when defining a string.
Once we have our filename defined, check if it exists, and if so what it contains. To do so, use a file manager to navigate to it and open the file. On CoCalc clicking on Files
in the top left will take you to the file manager. As the file extention states, .txt
, this should be a simple text file.
Time to tinker: Save data sets
Tinker with the savedata
function and its options. Try
different data sets, numbers of columns
different headers
different precision
different delimiters
Refer to the function desciption here if needed.
Block 5: Visualising data
As mentioned earlier, there are a couple of options for visualising data in Python. The standard method is to use the matplotlib
library of modules.
A key module from this library is pyplot
:
Now that we have imported this module, we can avail of the numerous plotting function.
The two main plotting functions are:
plt.plot()
plot data as points or line, specify colours, add a label
plt.errorbar( )
plot data with errobars
There are many functions to format plots including:
plt.xlabel()
plt.ylabel()
label axis with some string
plt.xlim()
plt.ylim()
:limit the axis range with
[lower, upper]
values
legend()
: add a legend using the plot labelssavefig()
: save the plot to a file
There are many sources online to describe the wider functions of matplotlib.pyplot
, here is a good one to start with.
Code creation: Creating and visualising data
Create an array combining the following data in three rows:
time = [0, 1, 2, 3, 4]
distance = [0, 3, 6, 9, 12]
speed = [3, 3, 3, 3, 3]
Visualise
distance
vstime
dataVisualise
speed
vstime
datanow tinker with the appearance of the plots (colours, symbols etc)
Block 6: Least squares analysis
In the least squares analysis method, we attempt to fit a linear function y = mx + c to our data points . For this linear function we need to determine the value of the slope, , and the intercept . These are the fit parameters.

For any fit there will be a residual difference, , between the fit value () and our actual data values (). As the name suggests the least squares algorithm trys to minimise the sum of the squares of the residuals :
here the sigma symbol refers to the sum. When this sum of the squares of the residuals, , is the lowest value possible we have the best fit.
At the best fit, the values of slope, m, and the intercept c are calculated using:
where N is the number of data points and
Code creation: Write a least squares function
You are to write a function that performs a least squares analysis.
Using SciPy
Rather than writing your own analysis functions, you can simply use the ready-made functions in Python. Since the analysis is statistical, many of these functions can be found in the scipy.stats
module.
A function for linear regression is linregress. We typically import just this function rather than the whole module:
Once we have the function imported, we can do a quick test:
As you can see linregress
returns 5 variables. The slope and the error are:
the intercept is:
and the value is:
The latter is referred to as the goodness of fit, and for a perfect linear correlation.
Code creation: use the `linregress` function
Perform an analysis of the population data in `population.csv` using `linregress`.print to screen the slope with the error and units
also check the value
is the data linear?
Code creation: use the `linregress` function some more
Perform an analysis of the thermal expansion data in `LengthTempData.txt` using `linregress`.print to screen the slope with the error and units
also check the value
is the data linear?
Block 7: Chi-squared analysis
Including errorbars in the analysis
You may have noticed that the least squares method does not depend on the uncertainties (or error bars) on each point. It simply minimises the sum of the squares between the data point and the fit point : As you can see there is no information used relating to the errorbars, , on each data point. In fact it assumes that the errors on data points are all the same and therefore play no role in the fitting procedure.
To take non-uniform error bars on points into account we must use a chi-squared () approach. This is similar to least squares but minimises: This type of fit is weighted with the errors on each point (). We note that the bigger that point's error is, the smaller is its contribution to the chi squared weighting. Hence this enables you to propagate uncertainties all the way to the fitted parameters and hence your final measurement (e.g. derived from the slope).
For a quick example, we first define different error values for some data set. Here we will use the thermal expansion of a length of metal with increaing temperature as found in Data/LengthTempData.txt
:
Code creation: Visualise raw data with errorbars
Before we perform the analysis, it may be useful to visualise the data.use
plt.errorbar()
functionensure that the data is plotted just as points rather than as a line
label the axes using
plt.xlabel()
andplt.ylabel()
functions
Using the curve_fit
function
To fit the data using a chi-squared method, we can use the curve_fit
function.This is contained in the scipy.optimise
module.
Code creation: import the `curve_fit function`
The curve_fit
funciton is contained in the scipy.optimize
module.
import the function
The syntax for using curve_fit
is:
There are 4 arguments:
fitfunction
: the name of some function we make that can fit the datathis is a string
x
: the independent data i.e. what we variedthis can be a list or array of numbers
y
: the dependent data i.e. what we measuredthis can be a list or array of numbers
errors
: the measured uncertainty iny
this can be a list or array of numbers
Of course, we can called these variables whatever we wish.
Now, the argument fitfunction
is the name of a function we need to make. For a linear fit we define this function as
Code creation: create a fit function
Define (def
) a function with some name i.e linearfit
give it inputs
x
,m
andc
from which it
returns
a valuemx+c
.
Now we have a linear function defines and we have our data, let us now perform the chi-squared analysis:
Code creation: call the `curve_fit` function
Here use the variable names we have already defined:
linearfit
: name of our fit functiontemp
: our independent datalength
: our independent dataerror
: uncertainties on 'length'
Processing the analysis results
If sucessful, calling the function will return 2 new sets of numbers:
[2.81164900e-05, 1.11547121e+00]
[ [ 1.84166422e-12, -1.47264501e-10], [-1.47264501e-10, 1.75281421e-08] ]
These are 2 arrays. The first contains the fit parameters and the second contains their variances:
slope, intercept
a 2x2 covarient matrix i.e. a measure of the uncertainty on the fit parameters (slope and intercept)
Because we will want to use these analysis results, it is best to assign new variables to the output of curve_fit
.
Code creation: store the output of `curve_fit
Assign 2 new variables to the output of curve_fit
.
typically names are
popt
andpcov
The curve_fit
function now outputs two variables which we have called popt
and pcov
. The optimised fit coefficients are stored in the popt
list and their respective errors can be calculated using the values stored in the pcov
array.
After calling the curve_fit
function we will want to access data stored in the popt
and pcov
variables.
Code creation: get the analysis results
Assign variables to the optimised fit parameters stored in popt
.
use the correct index to select the
slope
andintercept
print the results to screen with units
All statistical analysis results have statistical errors. If our fit is perfect, i.e. our data is very linear, these errors should be very small. The statistical measure of the error in fit values is in terms of standard deviation referred to as sigma or . This represents how spread out the possible fit values are. The curve_fit
function returns variances () rather than standard deviations.
Variance measures the variation of a single random variable (like slope), whereas the covariances are a measure of how much two random variables vary together (like the slope and the intercept). These statistical variances of slope
and intercept
are contained in pcov
. The diagonals of pcov
provide the variances of the fit parameter. The other entries are the covariances.
\begin{Bmatrix} \sigma^2(x,x) & \sigma^2(x, y) \ \sigma^2(y,x) & \sigma^2(y, y) \end{Bmatrix}
We are just interested in the variances. The diagonal, , contain the slope and intercept variances respectively:
\begin{Bmatrix} 1.84166422e-12 & -1.47264501e-10 \ -1.47264501e-10 & 1.75281421e-08 \end{Bmatrix}
We can use the variances to calculate the errors (one standard deviation i.e. 1 sigma or ) on the fit parameters. Since the variance is , we take the square root to determine the error sigma
.
Code creation: calculate the statistical errors
Calculate the statistical errors for slope and the intercept. Use the numpy function np.sqrt()
if needed.
print to screen the values of slope and the intercept together with these errors and units
In the context of this data, the analysis provides the following results:
the slope is the thermal expansion coefficient:
m/C.
so this tells us that for every degree of heating, the metal expands by m
comparing with known values indicates that the metal may be lead!
the intercept is equivalent to the length of the metal at C
which is m.
the relative error on is or
Visualising the results
Now that we have the analysis results we can finish by visualising this together with the measured data for comparison. In order to plot a fit line, we need to create the fit data.
Code creation: create the fit data
xfit
: an array of x-values from tochoose a starting point
choose a end point just bigger than your maximum measured value
use a numpy function such as
np.linspace
ornp.arange
to create thexfit
array
yfit
: and array of y-valuescalculate using
linearfit
functionyfit = linearfit(xfit, slope, intercept)
Now that we have the fit data, we should compare this on a graph with the measured data.
Code creation: visualise measured data and the linear fit
plot the measured data as before using
plt.errorbar()
then plot the fit data using
plt.plot()
on the same graphadd labels with units to the axes
save the graph as a pdf or png file
open the file for viewing
Consolidate code
In your own separate notebook, gather all the code you developed to perform the chi-squared or least squares analysis. Edit the code so that it is all contained in a single cell.
Test that the entire analysis programme works.
For example: