In numpy, arrays are laid out in memory such that, if we iterate over neighboring elements, the rightmost index changes the fastest. This is called row-major order, and it is also used by C and C++, and by libraries such as Eigen and PyTorch. By contrast, other languages (such as Julia, Matlab, R and Fortran) use column-major order. See below for an illustration of the difference.
(Source: https://commons.wikimedia.org/wiki/File:Row_and_column_major_order.svg)
Thus in numpy, for speed reasons, we should always write loops like this:
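A minimal sketch (the array `A` and its shape are illustrative):

```python
import numpy as np

A = np.random.randn(1000, 1000)

# Good: the inner loop varies the last (rightmost) index,
# so consecutive iterations touch consecutive memory locations.
total = 0.0
for i in range(A.shape[0]):
    for j in range(A.shape[1]):
        total += A[i, j]

# Bad: the inner loop varies the first index, striding through memory.
total = 0.0
for j in range(A.shape[1]):
    for i in range(A.shape[0]):
        total += A[i, j]
```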
For similar reasons, data matrices are usually stored in the form $X \in \mathbb{R}^{N \times D}$, where $N$ is the batch size (number of examples) and $D$ is the number of features, so that we can efficiently extract minibatches by slicing blocks of consecutive memory.
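A sketch of minibatch slicing under this layout (the sizes are made up for illustration):

```python
import numpy as np

N, D = 10_000, 64          # number of examples, number of features
X = np.random.randn(N, D)  # rows are examples, so each row is contiguous in memory

batch_size = 32
for start in range(0, N, batch_size):
    minibatch = X[start:start + batch_size]  # a contiguous block of rows
    # ... process minibatch ...
```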
Band diagonal
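A minimal sketch of constructing a band-diagonal (here, tridiagonal) matrix; the sizes and values are illustrative:

```python
import numpy as np

n = 5
main = 2.0 * np.ones(n)       # main diagonal
off = -1.0 * np.ones(n - 1)   # first super- and sub-diagonals

# Band-diagonal matrix: nonzeros only within a narrow band around the diagonal.
A = np.diag(main) + np.diag(off, k=1) + np.diag(off, k=-1)
print(A)
```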
In numpy, the command A * B computes the elementwise multiplication of arrays or tensors A and B. If these arrays have different shapes, they will be automatically converted to have compatible shapes by implicitly replicating certain dimensions; this is called broadcasting. The following conversion rules are applied in order (see the sketch after this list):
If the two arrays differ in their number of dimensions, the shape of the one with fewer dimensions is padded with ones on the left side. For example, a scalar will be converted to a vector, and a vector to a matrix with one row.
If the shape of the two arrays does not match in any dimension, the array with shape equal to 1 in that dimension is stretched to match the other shape, by replicating the corresponding contents.
If in any dimension the sizes disagree and neither is equal to 1, an error is raised.
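A minimal sketch of these rules in action (array contents are arbitrary):

```python
import numpy as np

a = np.arange(3)       # shape (3,)
b = np.ones((2, 3))    # shape (2, 3)

# Rule 1: a is padded on the left to shape (1, 3).
# Rule 2: the size-1 dimension is stretched to 2, so the result has shape (2, 3).
print(a * b)

c = np.ones((3, 2))
# Rule 3: shapes (1, 3) and (3, 2) disagree in the last dimension and neither is 1,
# so this raises an error.
try:
    a * c
except ValueError as e:
    print(e)
```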
Figure generated by the script broadcasting_fig.py, by Jake VanderPlas.
Einstein summation lets us write formulas of the form inputs -> output, which name the dimensions of the input and output tensors; dimensions that are not named in the output are summed over. This is called tensor contraction.
Now consider einsum with a single tensor.
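A few single-tensor examples (the array A is illustrative):

```python
import numpy as np

A = np.random.randn(3, 3)

trace     = np.einsum('ii->', A)    # sum of the diagonal: sum_i A[i,i]
diagonal  = np.einsum('ii->i', A)   # extract the diagonal
transpose = np.einsum('ij->ji', A)  # swap the two indices
row_sums  = np.einsum('ij->i', A)   # sum over j (the index missing from the output)
total_sum = np.einsum('ij->', A)    # sum over both indices
```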
Now consider einsum with 2 tensors.
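And some two-tensor examples:

```python
import numpy as np

A = np.random.randn(2, 3)
B = np.random.randn(3, 4)
u = np.random.randn(3)
v = np.random.randn(3)

inner  = np.einsum('i,i->', u, v)      # inner product: sum_i u[i] v[i]
outer  = np.einsum('i,j->ij', u, v)    # outer product
matvec = np.einsum('ij,j->i', A, u)    # matrix-vector product
matmul = np.einsum('ij,jk->ik', A, B)  # matrix-matrix product

S = np.random.randn(5, 2, 3)           # a batch of 5 matrices
T = np.random.randn(5, 3, 4)
batched = np.einsum('nij,njk->nik', S, T)  # batched matrix multiply
```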
As a more complex example, suppose we have a 3d tensor $S_{ntk}$, where $n$ indexes examples in the batch, $t$ indexes locations in the sequence, and $k$ indexes words in a one-hot representation. Let $W_{kd}$ be an embedding matrix that maps sparse one-hot vectors in $\mathbb{R}^K$ to dense vectors in $\mathbb{R}^D$. We can convert the batch of sequences of one-hots to a batch of sequences of embeddings as follows: $E_{ntd} = \sum_k S_{ntk} W_{kd}$. We can compute the sum of the embedding vectors for each sequence (to get a global representation of each bag of words) as follows: $E_{nd} = \sum_t \sum_k S_{ntk} W_{kd}$. Finally we can pass each sequence's vector representation through another linear transform $V_{dc}$ to map to the logits over a classifier with $C$ labels: $L_{nc} = \sum_d E_{nd} V_{dc} = \sum_d \sum_k \sum_t S_{ntk} W_{kd} V_{dc}$. In einsum notation, we have $L_{nc} = S_{ntk} W_{kd} V_{dc}$. We sum over $k$ and $d$ because those indices occur twice on the RHS. We sum over $t$ because that index does not occur on the LHS.
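A sketch of these operations in code (the tensor names and sizes are illustrative):

```python
import numpy as np

N, T, K, D, C = 4, 7, 100, 16, 5      # batch, sequence length, vocab, embed dim, classes
S = np.random.randn(N, T, K)          # batch of sequences of one-hot vectors
W = np.random.randn(K, D)             # embedding matrix
V = np.random.randn(D, C)             # classifier weights

E  = np.einsum('ntk,kd->ntd', S, W)   # embeddings for every token
E2 = np.einsum('ntk,kd->nd', S, W)    # sum of embeddings per sequence (bag of words)
L  = np.einsum('nd,dc->nc', E2, V)    # logits per sequence

# Or do everything in one contraction:
L2 = np.einsum('ntk,kd,dc->nc', S, W, V)
assert np.allclose(L, L2)
```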
Example: Diagonalizing a rotation matrix
As an example, let us construct $A$ by combining a rotation by 45 degrees about the $z$ axis, a scaling by $\mathrm{diag}(1,2,3)$, followed by another rotation of -45 degrees. These components can be recovered from $A$ using EVD, as we show below.
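A minimal sketch, assuming a rotation about the $z$ axis; since $A = R\,\mathrm{diag}(1,2,3)\,R^\top$ is symmetric, we can use `eigh`:

```python
import numpy as np

theta = np.pi / 4  # 45 degrees
# Rotation about the z axis.
R = np.array([[np.cos(theta), -np.sin(theta), 0.0],
              [np.sin(theta),  np.cos(theta), 0.0],
              [0.0,            0.0,           1.0]])
S = np.diag([1.0, 2.0, 3.0])

# Rotate, scale, then rotate back: A = R S R^{-1} = R S R^T.
A = R @ S @ R.T

# Recover the components with an eigendecomposition.
evals, evecs = np.linalg.eigh(A)
print(evals)   # [1. 2. 3.] -- the scaling factors
print(evecs)   # columns match the columns of R, up to sign and ordering
```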
Example: checking for positive definiteness
A symmetric matrix is positive definite iff all its eigenvalues are positive.
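A sketch of a check based on this fact (the helper name `is_pos_def` is made up):

```python
import numpy as np

def is_pos_def(A):
    """Return True if the symmetric matrix A is positive definite."""
    # eigvalsh assumes A is symmetric (Hermitian) and returns real eigenvalues.
    return np.all(np.linalg.eigvalsh(A) > 0)

A = np.array([[2.0, -1.0], [-1.0, 2.0]])   # positive definite
B = np.array([[1.0,  2.0], [ 2.0, 1.0]])   # indefinite (eigenvalues 3 and -1)
print(is_pos_def(A), is_pos_def(B))        # True False
```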
Power method
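A generic sketch of the power method, which estimates the largest-magnitude eigenvalue and its eigenvector by repeated matrix-vector products (the iteration count is arbitrary):

```python
import numpy as np

def power_method(A, num_iters=100):
    """Estimate the dominant eigenvalue/eigenvector of A by repeated multiplication."""
    v = np.random.randn(A.shape[0])
    for _ in range(num_iters):
        v = A @ v
        v = v / np.linalg.norm(v)   # renormalize to avoid overflow
    lam = v @ A @ v                 # Rayleigh quotient
    return lam, v

A = np.array([[2.0, 1.0], [1.0, 3.0]])
lam, v = power_method(A)
print(lam)                      # close to the largest eigenvalue of A
print(np.linalg.eigvalsh(A))    # compare against the exact eigenvalues
```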
Linear function: multi-input, scalar output.
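For example, for $f(x) = a^\top x$ the gradient is $a$; a quick check with JAX autodiff (the array values are arbitrary):

```python
import jax.numpy as jnp
from jax import grad, random

a = random.normal(random.PRNGKey(0), (3,))
x = random.normal(random.PRNGKey(1), (3,))

f = lambda x: jnp.dot(a, x)   # f(x) = a^T x
print(grad(f)(x))             # gradient is a
print(a)
```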
Linear function: multi-input, multi-output.
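For example, for $f(x) = A x$ the Jacobian is $A$; again a quick check with JAX:

```python
import jax.numpy as jnp
from jax import jacobian, random

A = random.normal(random.PRNGKey(0), (4, 3))
x = random.normal(random.PRNGKey(1), (3,))

f = lambda x: A @ x                        # f(x) = A x
print(jnp.allclose(jacobian(f)(x), A))     # Jacobian is A
```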
Quadratic form.
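For example, for $f(x) = x^\top A x$ the gradient is $(A + A^\top) x$:

```python
import jax.numpy as jnp
from jax import grad, random

A = random.normal(random.PRNGKey(0), (3, 3))
x = random.normal(random.PRNGKey(1), (3,))

f = lambda x: x @ A @ x                             # f(x) = x^T A x
print(jnp.allclose(grad(f)(x), (A + A.T) @ x))      # gradient is (A + A^T) x
```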
If you look at the source code for scipy.linalg.lstsq, you will see that it is just a Python wrapper around LAPACK code written in Fortran. LAPACK offers multiple methods for solving linear least squares problems, including gelsd (the default), gelss, and gelsy. The first two methods use SVD; the latter uses QR decomposition.
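A sketch of selecting the driver explicitly via scipy's `lapack_driver` argument (the problem data is random):

```python
import numpy as np
from scipy.linalg import lstsq

A = np.random.randn(10, 3)
b = np.random.randn(10)

for driver in ['gelsd', 'gelss', 'gelsy']:
    x, residues, rank, sv = lstsq(A, b, lapack_driver=driver)
    print(driver, x)   # all drivers should agree on the solution
```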
A lot of numpy and scipy functions are just wrappers around legacy libraries written in Fortran or C++, since Python itself is too slow for efficient numerical computing. Confusingly, sometimes numpy and scipy offer different wrappers around the same underlying LAPACK functions, but with different interfaces. For example, as of 2018, np.linalg.lstsq and scipy.linalg.lstsq have been modified to behave the same. However, np.linalg.qr and scipy.linalg.qr have slightly different optional arguments and may give different results.
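For instance, the default QR modes differ (a quick illustration; the matrix is arbitrary):

```python
import numpy as np
import scipy.linalg

A = np.random.randn(5, 3)

Q1, R1 = np.linalg.qr(A)      # default mode='reduced': Q has shape (5, 3)
Q2, R2 = scipy.linalg.qr(A)   # default mode='full':    Q has shape (5, 5)
print(Q1.shape, Q2.shape)
```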
Jax does not yet implement lstsq, but it does implement some of the underlying methods, such as the QR and SVD decompositions. Unlike the legacy code, these can run fast on GPUs and TPUs.
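A minimal sketch of solving a least squares problem in JAX via the QR decomposition, given that lstsq itself is described above as unavailable:

```python
import jax.numpy as jnp
from jax import random

A = random.normal(random.PRNGKey(0), (10, 3))
b = random.normal(random.PRNGKey(1), (10,))

# Solve min_x ||A x - b||_2 using the reduced QR decomposition: A = Q R, x = R^{-1} Q^T b.
Q, R = jnp.linalg.qr(A)
x = jnp.linalg.solve(R, Q.T @ b)
print(x)
```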