Lectures on scientific computing with python, as IPython notebooks, by J. R. Johansson
Numpy - multidimensional data arrays
J.R. Johansson (jrjohansson at gmail.com)
The latest version of this IPython notebook lecture is available at http://github.com/jrjohansson/scientific-python-lectures.
The other notebooks in this lecture series are indexed at http://jrjohansson.github.io.
Introduction
The numpy package (module) is used in almost all numerical computation using Python. It provides high-performance vector, matrix and higher-dimensional data structures for Python. It is implemented in C and Fortran, so when calculations are vectorized (formulated with vectors and matrices), performance is very good.
To use numpy you need to import the module, using for example:
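A minimal sketch of the import. The alias np is a widespread convention and an assumption here; the original notebook may have imported the names differently (for example with from numpy import *).

```python
import numpy as np

# Confirm that the import worked by printing the installed version
print(np.__version__)
```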
In the numpy package the terminology used for vectors, matrices and higher-dimensional data sets is array.
Creating numpy arrays
There are a number of ways to initialize new numpy arrays, for example from
Python lists or tuples
functions that are dedicated to generating numpy arrays, such as arange, linspace, etc.
reading data from files
From lists
For example, to create new vector and matrix arrays from Python lists we can use the numpy.array function.
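A short sketch of creating a vector and a matrix from Python lists. The names v and M follow the text; the element values are illustrative assumptions.

```python
import numpy as np

# A vector: the argument to np.array is a Python list
v = np.array([1, 2, 3, 4])

# A matrix: the argument is a nested Python list
M = np.array([[1, 2], [3, 4]])

print(type(v), type(M))
```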
The v and M objects are both of the type ndarray that the numpy module provides.
The difference between the v and M arrays is only their shapes. We can get information about the shape of an array by using the ndarray.shape property.
The number of elements in the array is available through the ndarray.size property:
Equivalently, we could use the functions numpy.shape and numpy.size:
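The shape and size queries described above might look like this. The arrays are illustrative assumptions, as is the np alias.

```python
import numpy as np

v = np.array([1, 2, 3, 4])
M = np.array([[1, 2], [3, 4]])

# Properties of the array objects
print(v.shape)  # (4,)
print(M.shape)  # (2, 2)
print(M.size)   # 4

# Equivalent module-level functions
print(np.shape(M))  # (2, 2)
print(np.size(M))   # 4
```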
So far the numpy.ndarray looks awfully much like a Python list (or nested list). Why not simply use Python lists for computations instead of creating a new array type?
There are several reasons:
Python lists are very general. They can contain any kind of object. They are dynamically typed. They do not support mathematical functions such as matrix and dot multiplications, etc. Implementing such functions for Python lists would not be very efficient because of the dynamic typing.
Numpy arrays are statically typed and homogeneous. The type of the elements is determined when the array is created.
Numpy arrays are memory efficient.
Because of the static typing, mathematical functions such as multiplication and addition of numpy arrays can be implemented efficiently in a compiled language (C and Fortran are used).
Using the dtype (data type) property of an ndarray, we can see what type the data of an array has:
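A small sketch of inspecting dtype. The array contents are an assumption; the exact integer width that is reported depends on the platform and NumPy version.

```python
import numpy as np

M = np.array([[1, 2], [3, 4]])

# An array built from Python ints gets an integer dtype
print(M.dtype)
```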
We get an error if we try to assign a value of the wrong type to an element in a numpy array:
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-12-a09d72434238> in <module>()
----> 1 M[0,0] = "hello"
ValueError: invalid literal for long() with base 10: 'hello'
If we want, we can explicitly define the type of the array data when we create it, using the dtype keyword argument:
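For instance, a sketch that requests complex elements at creation time (the values are illustrative):

```python
import numpy as np

# Ask for complex elements instead of the inferred integer type
M = np.array([[1, 2], [3, 4]], dtype=complex)

print(M)
print(M.dtype)  # complex128
```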
Common data types that can be used with dtype are: int, float, complex, bool, object, etc.
We can also explicitly define the bit size of the data types, for example: int64, int16, float128, complex128. Note that float128 is only available on some platforms.
Using array-generating functions
For larger arrays it is impractical to initialize the data manually, using explicit Python lists. Instead we can use one of the many functions in numpy that generate arrays of different forms. Some of the more common are:
arange
linspace and logspace
mgrid
random data
diag
zeros and ones
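The array-generating functions listed above can be sketched as follows. All argument values are illustrative assumptions, not the ones used in the original notebook.

```python
import numpy as np

# arange: start, stop (exclusive), step
x = np.arange(0, 10, 1)

# linspace: 25 evenly spaced points, both end points included
y = np.linspace(0, 10, 25)

# logspace: 10 points from e**0 to e**10, evenly spaced in the exponent
z = np.logspace(0, 10, 10, base=np.e)

# mgrid: coordinate grids, similar to meshgrid in MATLAB
gx, gy = np.mgrid[0:5, 0:5]

# random data: uniform in [0, 1) and standard normal
u = np.random.rand(5, 5)
n = np.random.randn(5, 5)

# diag: a diagonal matrix, optionally offset from the main diagonal
d = np.diag([1, 2, 3])
d1 = np.diag([1, 2, 3], k=1)

# zeros and ones
zz = np.zeros((3, 3))
oo = np.ones((3, 3))
```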
File I/O
Comma-separated values (CSV)
A very common file format for data files is comma-separated values (CSV), or related formats such as TSV (tab-separated values). To read data from such files into Numpy arrays we can use the numpy.genfromtxt function. For example,
1800 1 1 -6.1 -6.1 -6.1 1
1800 1 2 -15.4 -15.4 -15.4 1
1800 1 3 -15.0 -15.0 -15.0 1
1800 1 4 -19.3 -19.3 -19.3 1
1800 1 5 -16.8 -16.8 -16.8 1
1800 1 6 -11.4 -11.4 -11.4 1
1800 1 7 -7.6 -7.6 -7.6 1
1800 1 8 -7.1 -7.1 -7.1 1
1800 1 9 -10.1 -10.1 -10.1 1
1800 1 10 -9.5 -9.5 -9.5 1
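A hedged sketch of reading such data with numpy.genfromtxt. The data file's name is not preserved in this text, so the sketch reads the first three rows of the listing above from an in-memory buffer instead of a file on disk.

```python
import io
import numpy as np

# First three rows of the listing above, held in memory instead of a file
text = """1800 1 1 -6.1 -6.1 -6.1 1
1800 1 2 -15.4 -15.4 -15.4 1
1800 1 3 -15.0 -15.0 -15.0 1
"""

# With no delimiter given, genfromtxt splits columns on whitespace
data = np.genfromtxt(io.StringIO(text))

print(data.shape)  # (3, 7)
```

Passing a file name instead of the buffer works the same way.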
Using numpy.savetxt we can store a Numpy array to a file in CSV format:
7.787257639287014088e-01 4.004357670697732408e-01 6.625401863466899854e-01
6.041006328761111543e-01 4.791373994963619154e-01 8.237105968088237473e-01
9.685631757740569281e-01 1.545964379103705877e-01 9.608239852111523094e-01
0.77873 0.40044 0.66254
0.60410 0.47914 0.82371
0.96856 0.15460 0.96082
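The two listings above (full precision, then five decimals) could be produced along these lines. The file name and the temporary directory are assumptions made to keep the sketch self-contained.

```python
import os
import tempfile
import numpy as np

M = np.random.rand(3, 3)

# Hypothetical output location inside a fresh temporary directory
path = os.path.join(tempfile.mkdtemp(), "random-matrix.csv")

# Default format: full-precision scientific notation
np.savetxt(path, M)

# The fmt argument controls the number format; this overwrites the file
np.savetxt(path, M, fmt="%.5f")

# Read the rounded values back
back = np.genfromtxt(path)
```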
Numpy's native file format
Useful when storing and reading back numpy array data. Use the functions numpy.save and numpy.load:
random-matrix.npy: data
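A minimal round trip through the native format. The temporary directory is an assumption; the file name random-matrix.npy is taken from the output above.

```python
import os
import tempfile
import numpy as np

M = np.random.rand(3, 3)
path = os.path.join(tempfile.mkdtemp(), "random-matrix.npy")

# Store the array in NumPy's binary .npy format
np.save(path, M)

# Load it back; shape, dtype and values are preserved exactly
loaded = np.load(path)
```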
More properties of the numpy arrays
Manipulating arrays
Indexing
We can index elements in an array using square brackets and indices:
If we omit an index of a multidimensional array, it returns the whole row (or, in general, an (N-1)-dimensional array):
The same thing can be achieved by using : instead of an index:
We can assign new values to elements in an array using indexing:
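The indexing rules above, sketched on small illustrative arrays (the values are assumptions):

```python
import numpy as np

v = np.array([1, 2, 3, 4])
M = np.array([[1.0, 2.0], [3.0, 4.0]])

print(v[0])     # 1
print(M[1, 1])  # 4.0
print(M[1])     # [3. 4.]  (omitted index: whole row)
print(M[1, :])  # [3. 4.]  (same, using :)
print(M[:, 1])  # [2. 4.]  (a column)

# Assignment through indexing
M[0, 0] = 10   # single element
M[1, :] = 0    # whole row
M[:, 1] = -1   # whole column
```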
Index slicing
Index slicing is the technical name for the syntax M[lower:upper:step]
to extract part of an array:
Array slices are mutable: if they are assigned a new value the original array from which the slice was extracted is modified:
We can omit any of the three parameters in M[lower:upper:step]
:
Negative indices count from the end of the array (positive indices from the beginning):
Index slicing works exactly the same way for multidimensional arrays:
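A sketch of the slicing rules described in this section. The array names and values are illustrative assumptions.

```python
import numpy as np

A = np.array([1, 2, 3, 4, 5])
print(A[1:3])       # [2 3]

# Assigning to a slice modifies the original array
A[1:3] = [-2, -3]

print(A[::2])       # [ 1 -3  5]  (lower and upper omitted, step 2)
print(A[:3])        # [ 1 -2 -3]  (first three elements)
print(A[-1])        # 5           (last element)
print(A[-3:])       # [-3  4  5]  (last three elements)

# The same syntax applies along each axis of a multidimensional array
B = np.array([[n + m * 10 for n in range(5)] for m in range(5)])
print(B[1:4, 1:4])  # a 3x3 block from the interior
print(B[::2, ::2])  # every second row and column
```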
Fancy indexing
Fancy indexing is the name for when an array or list is used in place of an index:
We can also use index masks: if the index mask is a Numpy array of data type bool, then an element is selected (True) or not (False) depending on the value of the index mask at the position of each element:
This feature is very useful to conditionally select elements from an array, using for example comparison operators:
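Fancy indexing and index masks might be used as in the following sketch. The arrays and index lists are assumptions chosen for illustration.

```python
import numpy as np

B = np.array([[n + m * 10 for n in range(5)] for m in range(5)])

# A list in place of an index selects several rows at once
row_indices = [1, 2, 3]
print(B[row_indices])

# Two index lists are paired up element by element
col_indices = [1, 2, -1]
print(B[row_indices, col_indices])  # [11 22 34]

# A boolean mask built with comparison operators
x = np.arange(0, 10, 0.5)
mask = (5 < x) & (x < 7.5)
print(x[mask])  # [5.5 6.  6.5 7. ]
```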
Functions for extracting data from arrays and creating arrays
where
The index mask can be converted to position indices using the where function:
diag
With the diag function we can also extract the diagonal and subdiagonals of an array:
take
The take function is similar to fancy indexing described above:
But take also works on lists and other objects:
choose
Constructs an array by picking elements from several arrays:
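The four functions in this section (where, diag, take, choose) can be sketched together. The input arrays are illustrative assumptions.

```python
import numpy as np

# where: convert a boolean mask to position indices
x = np.arange(0, 10, 0.5)
mask = (5 < x) & (x < 7.5)
indices = np.where(mask)   # a tuple with one index array per dimension
print(indices[0])          # [11 12 13 14]

# diag: extract the main diagonal or an offset diagonal
B = np.array([[n + m * 10 for n in range(5)] for m in range(5)])
print(np.diag(B))          # [ 0 11 22 33 44]
print(np.diag(B, -1))      # [10 21 32 43]

# take: like fancy indexing, as a method or as a function on plain lists
v2 = np.arange(-3, 3)
print(v2.take([1, 3, 5]))                            # [-2  0  2]
print(np.take([-3, -2, -1, 0, 1, 2], [1, 3, 5]))     # [-2  0  2]

# choose: element i is taken from choices[which[i]][i]
which = [1, 0, 1, 0]
choices = [[-2, -2, -2, -2], [5, 5, 5, 5]]
picked = np.choose(which, choices)
print(picked)              # [ 5 -2  5 -2]
```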
Linear algebra
Vectorizing code is the key to writing efficient numerical calculations with Python/Numpy. That means that as much as possible of a program should be formulated in terms of matrix and vector operations, like matrix-matrix multiplication.
Scalar-array operations
We can use the usual arithmetic operators to multiply, add, subtract, and divide arrays with scalar numbers.
Element-wise array-array operations
When we add, subtract, multiply and divide arrays with each other, the default behaviour is element-wise operations:
If we multiply arrays with compatible shapes, we get an element-wise multiplication of each row:
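Scalar-array and element-wise array-array operations, sketched on illustrative arrays (names and values are assumptions):

```python
import numpy as np

v1 = np.arange(0, 5)
A = np.array([[n + m * 10 for n in range(5)] for m in range(5)])

# Scalar-array operations act on every element
print(v1 * 2)   # [0 2 4 6 8]
print(v1 + 2)   # [2 3 4 5 6]

# Array-array operations are element-wise by default
print(v1 * v1)  # [ 0  1  4  9 16]
print(A * A)    # every element squared, not a matrix product

# Compatible shapes (5, 5) and (5,): each row of A is multiplied by v1
print(A * v1)
```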
Matrix algebra
What about matrix multiplication? There are two ways. We can either use the dot function, which applies a matrix-matrix, matrix-vector, or inner vector multiplication to its two arguments:
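A sketch of the three uses of dot mentioned above, with illustrative arrays:

```python
import numpy as np

v1 = np.arange(0, 5)
A = np.array([[n + m * 10 for n in range(5)] for m in range(5)])

# Matrix-matrix product
AA = np.dot(A, A)

# Matrix-vector product
Av = np.dot(A, v1)
print(Av)  # [ 30 130 230 330 430]

# Inner product of two vectors
vv = np.dot(v1, v1)
print(vv)  # 30
```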
Alternatively, we can cast the array objects to the type matrix. This changes the behavior of the standard arithmetic operators +, -, * to use matrix algebra.
If we try to add, subtract or multiply objects with incompatible shapes we get an error:
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-100-995fb48ad0cc> in <module>()
----> 1 M * v
/Users/rob/miniconda/envs/py27-spl/lib/python2.7/site-packages/numpy/matrixlib/defmatrix.pyc in __mul__(self, other)
339 if isinstance(other, (N.ndarray, list, tuple)) :
340 # This promotes 1-D vectors to row vectors
--> 341 return N.dot(self, asmatrix(other))
342 if isscalar(other) or not hasattr(other, '__rmul__') :
343 return N.dot(self, other)
ValueError: shapes (5,5) and (6,1) not aligned: 5 (dim 1) != 6 (dim 0)
See also the related functions: inner, outer, cross, kron, tensordot. Try for example help(kron).
Array/Matrix transformations
Above we have used the .T to transpose the matrix object v. We could also have used the transpose function to accomplish the same thing.
Other mathematical functions that transform matrix objects are:
Hermitian conjugate: transpose + conjugate
We can extract the real and imaginary parts of complex-valued arrays using real and imag:
Or the complex argument and absolute value
Matrix computations
Inverse
Determinant
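The transformations and matrix computations above can be sketched on a small complex matrix. The matrix C is an illustrative assumption, and the sketch uses plain arrays with numpy.linalg rather than the matrix type.

```python
import numpy as np

C = np.array([[1j, 2j], [3j, 4j]])

# Hermitian conjugate: conjugate, then transpose
C_H = C.conj().T

# Real and imaginary parts
print(np.real(C))
print(np.imag(C))

# Complex argument and absolute value
print(np.angle(C + 1))
print(np.abs(C))

# Inverse: the product with C is the identity up to rounding
Cinv = np.linalg.inv(C)
print(np.dot(Cinv, C))

# Determinant: 1j*4j - 2j*3j = 2, up to rounding
detC = np.linalg.det(C)
print(detC)
```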
Data processing
Often it is useful to store datasets in Numpy arrays. Numpy provides a number of functions to calculate statistics of datasets in arrays.
For example, let's calculate some properties from the Stockholm temperature dataset used above.
mean
The daily mean temperature in Stockholm over the last 200 years has been about 6.2 C.
standard deviations and variance
min and max
sum, prod, and trace
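A sketch of these statistics functions. It uses only the ten temperature values from the listing shown earlier, so its results do not reproduce the full-dataset figures quoted in the text.

```python
import numpy as np

# Daily average temperatures from the ten rows listed earlier
temps = np.array([-6.1, -15.4, -15.0, -19.3, -16.8,
                  -11.4, -7.6, -7.1, -10.1, -9.5])

print(np.mean(temps))                # mean
print(np.std(temps), np.var(temps))  # standard deviation and variance
print(temps.min(), temps.max())      # -19.3 -6.1

# sum and prod over the integers 1..10
d = np.arange(1, 11)
print(np.sum(d))    # 55
print(np.prod(d))   # 3628800

# trace: the sum of the diagonal elements
A = np.array([[1, 2], [3, 4]])
print(np.trace(A))  # 5
```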
Computations on subsets of arrays
We can compute with subsets of the data in an array using indexing, fancy indexing, and the other methods of extracting data from an array (described above).
For example, let's go back to the temperature dataset:
1800 1 1 -6.1 -6.1 -6.1 1
1800 1 2 -15.4 -15.4 -15.4 1
1800 1 3 -15.0 -15.0 -15.0 1
The data format is: year, month, day, daily average temperature, low, high, location.
If we are interested in the average temperature only in a particular month, say February, then we can create an index mask and use it to select only the data for that month using:
With these tools we have very powerful data processing capabilities at our disposal. For example, extracting the average temperature for each month of the year takes only a few lines of code:
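A hedged sketch of the month mask and the per-month averages. The first two rows come from the listing above; the two February rows are hypothetical values invented to make the example self-contained.

```python
import numpy as np

# Columns: year, month, day, daily average temperature, low, high, location
data = np.array([
    [1800, 1, 1, -6.1, -6.1, -6.1, 1],
    [1800, 1, 2, -15.4, -15.4, -15.4, 1],
    [1800, 2, 1, -4.0, -4.0, -4.0, 1],   # hypothetical row
    [1800, 2, 2, -2.0, -2.0, -2.0, 1],   # hypothetical row
])

# Boolean mask that is True for the February rows (month is column 1)
mask_feb = data[:, 1] == 2
feb_mean = np.mean(data[mask_feb, 3])
print(feb_mean)  # -3.0

# Average temperature for each month present in the data
months = np.unique(data[:, 1])
monthly_mean = [np.mean(data[data[:, 1] == m, 3]) for m in months]
print(monthly_mean)
```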
Calculations with higher-dimensional data
When functions such as min, max, etc. are applied to a multidimensional array, it is sometimes useful to apply the calculation to the entire array, and sometimes only on a row or column basis. Using the axis argument we can specify how these functions should behave:
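The effect of the axis argument, sketched on a small illustrative array:

```python
import numpy as np

m = np.array([[1, 5, 3],
              [4, 2, 6]])

# No axis: the maximum over the entire array
print(m.max())        # 6

# axis=0: the maximum of each column
print(m.max(axis=0))  # [4 5 6]

# axis=1: the maximum of each row
print(m.max(axis=1))  # [5 6]
```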
Many other functions and methods in the array and matrix classes accept the same (optional) axis keyword argument.
Reshaping, resizing and stacking arrays
The shape of a Numpy array can be modified without copying the underlying data, which makes it a fast operation even for large arrays.
We can also use the function flatten to make a higher-dimensional array into a vector. But this function creates a copy of the data.
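A sketch contrasting reshape, which here returns a view on the same data, with flatten, which copies. The array is an illustrative assumption.

```python
import numpy as np

A = np.arange(6).reshape(2, 3)

# Reshaping this contiguous array returns a view on the same data
B = A.reshape((1, 6))
B[0, 0] = 100
print(A[0, 0])  # 100, because A and B share memory

# flatten returns a copy, so changes do not propagate back
F = A.flatten()
F[1] = 50
print(A[0, 1])  # 1, unchanged
```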
Adding a new dimension: newaxis
With newaxis, we can insert new dimensions in an array, for example converting a vector to a column or row matrix:
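For example, a sketch using an illustrative three-element vector:

```python
import numpy as np

v = np.array([1, 2, 3])
print(v.shape)    # (3,)

# Column matrix: the new axis comes second
col = v[:, np.newaxis]
print(col.shape)  # (3, 1)

# Row matrix: the new axis comes first
row = v[np.newaxis, :]
print(row.shape)  # (1, 3)
```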
Stacking and repeating arrays
Using the functions repeat, tile, vstack, hstack, and concatenate we can create larger vectors and matrices from smaller ones:
tile and repeat
concatenate
hstack and vstack
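The stacking and repeating functions above can be sketched as follows. The arrays a and b are illustrative assumptions.

```python
import numpy as np

a = np.array([[1, 2], [3, 4]])
b = np.array([[5, 6]])

# repeat: each element is repeated 3 times, and the result is flattened
r = np.repeat(a, 3)     # [1 1 1 2 2 2 3 3 3 4 4 4]

# tile: the whole array is repeated 3 times along the last axis
t = np.tile(a, 3)       # [[1 2 1 2 1 2], [3 4 3 4 3 4]]

# concatenate: join along an existing axis
c0 = np.concatenate((a, b), axis=0)    # b appended as a new row
c1 = np.concatenate((a, b.T), axis=1)  # b.T appended as a new column

# vstack and hstack: shorthand for the two cases above
vs = np.vstack((a, b))
hs = np.hstack((a, b.T))
```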
Copy and "deep copy"
To achieve high performance, assignments in Python usually do not copy the underlying objects. This is important, for example, when objects are passed between functions, to avoid an excessive amount of memory copying when it is not necessary (technical term: pass by reference).
If we want to avoid this behavior, so that we get a new, completely independent object B copied from A, then we need to do a so-called "deep copy" using the function copy:
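A sketch of the difference between plain assignment and copy. The name C for the copied array is an assumption made so that both cases fit in one example.

```python
import numpy as np

A = np.array([[1, 2], [3, 4]])

# Plain assignment: B is another name for the same array
B = A
B[0, 0] = 10
print(A[0, 0])  # 10, A changed as well

# copy creates an independent array
C = np.copy(A)
C[0, 0] = -5
print(A[0, 0])  # still 10
```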
Iterating over array elements
Generally, we want to avoid iterating over the elements of arrays whenever we can. The reason is that in an interpreted language like Python (or MATLAB), iterations are really slow compared to vectorized operations.
However, sometimes iterations are unavoidable. For such cases, the Python for loop is the most convenient way to iterate over an array:
When we need to iterate over each element of an array and modify its elements, it is convenient to use the enumerate function to obtain both the element and its index in the for loop:
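Both loop styles, sketched on a small illustrative matrix:

```python
import numpy as np

M = np.array([[1, 2], [3, 4]])

# Plain iteration: the outer loop yields rows, the inner loop elements
for row in M:
    for element in row:
        print(element)

# enumerate gives the index along with the value, so we can write back
for row_idx, row in enumerate(M):
    for col_idx, element in enumerate(row):
        # Replace each element by its square
        M[row_idx, col_idx] = element ** 2

print(M)
```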
Vectorizing functions
As mentioned several times by now, to get good performance we should try to avoid looping over elements in our vectors and matrices, and instead use vectorized algorithms. The first step in converting a scalar algorithm to a vectorized algorithm is to make sure that the functions we write work with vector inputs.
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-165-6658efdd2f22> in <module>()
----> 1 Theta(array([-3,-2,-1,0,1,2,3]))
<ipython-input-164-9a0cb13d93d4> in Theta(x)
3 Scalar implemenation of the Heaviside step function.
4 """
----> 5 if x >= 0:
6 return 1
7 else:
ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()
OK, that didn't work because we didn't write the Theta function so that it can handle a vector input...
To get a vectorized version of Theta we can use the Numpy function vectorize. In many cases it can automatically vectorize a function:
We can also implement the function to accept a vector input from the beginning (requires more effort but might give better performance):
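A sketch of both approaches. The scalar Theta follows the fragment visible in the traceback above, with the else branch assumed to return 0; the names Theta_vec and Theta_array are assumptions.

```python
import numpy as np

def Theta(x):
    """Scalar implementation of the Heaviside step function."""
    if x >= 0:
        return 1
    else:
        return 0

# vectorize wraps the scalar function so that it accepts arrays
Theta_vec = np.vectorize(Theta)

def Theta_array(x):
    """Array-aware implementation: the comparison works element-wise."""
    return 1 * (x >= 0)

x = np.array([-3, -2, -1, 0, 1, 2, 3])
print(Theta_vec(x))    # [0 0 0 1 1 1 1]
print(Theta_array(x))  # [0 0 0 1 1 1 1]
```

Theta_array also accepts plain scalars, since 1 * (x >= 0) is valid for a single number.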
Using arrays in conditions
When using arrays in conditions, for example if statements and other boolean expressions, one needs to use any or all, which require that any or all elements in the array evaluate to True:
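For example, a sketch with an illustrative matrix and threshold:

```python
import numpy as np

M = np.array([[1, 4], [9, 16]])

# any: True if at least one element satisfies the condition
if (M > 5).any():
    print("at least one element in M is larger than 5")

# all: True only if every element satisfies the condition
if (M > 5).all():
    print("all elements in M are larger than 5")
else:
    print("not all elements in M are larger than 5")
```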
Type casting
Since Numpy arrays are statically typed, the type of an array does not change once created. But we can explicitly cast an array of some type to another using the astype function (see also the similar asarray function). This always creates a new array of the new type:
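A short sketch of casting with astype. The array and the target types are illustrative assumptions.

```python
import numpy as np

M = np.array([[1, 4], [9, 16]])

# Cast to float: M itself keeps its integer type
M2 = M.astype(float)
print(M2.dtype)  # float64

# Cast to bool: nonzero values become True
M3 = M.astype(bool)
print(M3)
```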
Further reading
http://scipy.org/NumPy_for_Matlab_Users - A Numpy guide for MATLAB users.
Versions
Software | Version |
---|---|
Python | 2.7.10 64bit [GCC 4.2.1 (Apple Inc. build 5577)] |
IPython | 3.2.1 |
OS | Darwin 14.1.0 x86_64 i386 64bit |
numpy | 1.9.2 |
Sat Aug 15 11:02:09 2015 JST |