Path: blob/main/notebooks/quantum-machine-learning/training.ipynb
Training parameterized quantum circuits
In this section we will have a closer look at how to train circuit-based models using gradient-based methods. We'll look at the restrictions these models have, and how we might overcome them.
Introduction
Like classical models, we can train parameterized quantum circuit models to perform data-driven tasks. The task of learning an arbitrary function from data is mathematically expressed as the minimization of a cost or loss function $f(\vec\theta)$, also known as the objective function, with respect to the parameter vector $\vec\theta$. Generally, when training a parameterized quantum circuit model, the function we are trying to minimise is the expectation value
$$f(\vec\theta) = \langle\psi(\vec\theta)|\hat H|\psi(\vec\theta)\rangle.$$
There are many different types of algorithms that we can use to optimise the parameters of a variational circuit, $U(\vec\theta)$ (gradient-based, evolutionary, and gradient-free methods). In this course, we will be discussing gradient-based methods.
Gradients
Say we have a function $f(\vec\theta)$, and we have access to the gradient of the function, $\nabla f(\vec\theta)$, starting from an initial point. The simplest way to minimize the function is to update the parameters towards the direction of steepest descent of the function:
$$\vec\theta_{n+1} = \vec\theta_n - \eta\,\nabla f(\vec\theta_n),$$
where $\eta$ is the learning rate - a small, positive hyperparameter controlling the size of the update. We continue doing this until we converge to a local minimum of the function, $f(\vec\theta^*)$.
This technique is called gradient descent, or vanilla gradient descent, since it uses the plain gradient without any modification.
Qiskit provides different methods to compute gradients of expectation values. Let's explore them!
First, we need to define our parameterized state, $|\psi(\vec\theta)\rangle = U(\vec\theta)|0\rangle$. On this page, $U(\vec\theta)$ is the Qiskit RealAmplitudes circuit on two qubits:
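A minimal sketch of this ansatz (using one repetition, reps=1, which is an assumed choice):

```python
from qiskit.circuit.library import RealAmplitudes

# U(theta): the RealAmplitudes ansatz on two qubits (reps=1 is an assumption)
ansatz = RealAmplitudes(num_qubits=2, reps=1)
ansatz.decompose().draw()
```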
Next we need to define a Hamiltonian, $\hat H$. Let's use a simple two-qubit observable:
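As a concrete (assumed) example, we take $\hat H = Z \otimes Z$, written with the legacy qiskit.opflow operators that the code sketches on this page are based on:

```python
from qiskit.opflow import Z

# assumed example Hamiltonian: H = Z tensor Z
hamiltonian = Z ^ Z
```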
Putting them together, we get the expectation value $\langle\psi(\vec\theta)|\hat H|\psi(\vec\theta)\rangle$.
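In opflow notation, a sketch of this composition is the measurement StateFn of the Hamiltonian composed with the circuit state:

```python
from qiskit.opflow import StateFn

# <psi(theta)| H |psi(theta)> as an unbound opflow expression
expectation = StateFn(hamiltonian, is_measurement=True) @ StateFn(ansatz)
```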
Next we write a function to simulate the measurement of the expectation value:
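A hedged sketch using a shot-based simulator via the legacy opflow CircuitSampler (1000 shots is an arbitrary choice; newer Qiskit versions would use the Estimator primitive instead):

```python
import numpy as np
from qiskit import Aer
from qiskit.utils import QuantumInstance
from qiskit.opflow import PauliExpectation, CircuitSampler

# shot-based simulation of the expectation value
sampler = CircuitSampler(QuantumInstance(Aer.get_backend('qasm_simulator'), shots=1000))

def evaluate_expectation(theta):
    """Sample <psi(theta)|H|psi(theta)> on a shot-based simulator."""
    value_dict = dict(zip(ansatz.parameters, theta))
    return np.real(sampler.convert(PauliExpectation().convert(expectation),
                                   params=value_dict).eval())
```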
To make things concrete, let's fix a point $\vec\theta$ and an index $j$ and ask: what's the derivative of the expectation value with respect to the parameter $\theta_j$ at the point $\vec\theta$?
$$\frac{\partial}{\partial\theta_j}\langle\psi(\vec\theta)|\hat H|\psi(\vec\theta)\rangle$$
We'll choose a random point and index (remember we start counting from 0).
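For example (the seed, and hence the resulting point and index, are arbitrary):

```python
np.random.seed(42)

# a random point in parameter space and a random parameter index
point = np.random.uniform(0, 2 * np.pi, ansatz.num_parameters)
index = np.random.randint(ansatz.num_parameters)
print('point:', point, 'index:', index)
```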
Finite difference gradients
Arguably the simplest way to approximate gradients is with a finite difference scheme. This works independently of the function's inner, possibly very complex, structure.
If we are interested in estimating the gradient at $\vec\theta$, we can choose some small distance $\epsilon$, calculate $f(\vec\theta + \epsilon\vec e_j)$ and $f(\vec\theta - \epsilon\vec e_j)$, and take the difference between the two function values, divided by the distance between the two points ($2\epsilon$):
$$\frac{\partial f}{\partial\theta_j} \approx \frac{f(\vec\theta + \epsilon\vec e_j) - f(\vec\theta - \epsilon\vec e_j)}{2\epsilon},$$
where $\vec e_j$ is the unit vector in the direction of parameter $j$.
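A minimal sketch of this central finite-difference estimate, assuming a shift of $\epsilon = 0.2$:

```python
eps = 0.2                                        # assumed shift
e_i = np.identity(ansatz.num_parameters)[index]  # unit vector along theta_index

# f(theta + eps*e_i) minus f(theta - eps*e_i), divided by the distance 2*eps
finite_difference = (evaluate_expectation(point + eps * e_i)
                     - evaluate_expectation(point - eps * e_i)) / (2 * eps)
print(finite_difference)
```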
Instead of doing this manually, we can use Qiskit's Gradient class for this.
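A sketch using the legacy opflow Gradient converter with the finite-difference method (evaluated exactly here, rather than with shots):

```python
from qiskit.opflow import Gradient

# finite-difference gradient with respect to theta_index
fin_diff = Gradient(grad_method='fin_diff', epsilon=eps).convert(expectation,
                                                                 params=ansatz.parameters[index])

value_dict = dict(zip(ansatz.parameters, point))
print(np.real(fin_diff.bind_parameters(value_dict).eval()))
```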
Finite difference gradients can be volatile on noisy functions, and using the exact formula for the gradient can be more stable. This can be seen above: although these two calculations use the same formula, they yield different results due to shot noise. In the example image below, the "Noisy finite difference gradient" actually points in the opposite direction of the true gradient!
Analytic gradients
Analytic gradients evaluate the analytic formula for the gradient. In general this is fairly difficult, as we have to do a manual calculation, but for circuit-based gradients, Reference 1 introduces a nice theoretical result that gives an easy formula for calculating gradients: the parameter shift rule.
For a simple circuit consisting only of Pauli rotations, without any coefficients, this rule says that the analytic gradient is
$$\frac{\partial f}{\partial\theta_j} = \frac{f(\vec\theta + \frac{\pi}{2}\vec e_j) - f(\vec\theta - \frac{\pi}{2}\vec e_j)}{2},$$
which is very similar to the equation for finite difference gradients.
Let's try calculating it by hand:
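Reusing the same helper as before, but now shifting by $\pm\pi/2$:

```python
# parameter-shift rule: shift theta_index by +/- pi/2 and halve the difference
param_shift = (evaluate_expectation(point + np.pi / 2 * e_i)
               - evaluate_expectation(point - np.pi / 2 * e_i)) / 2
print(param_shift)
```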
And using the Qiskit Gradient class:
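A sketch with the opflow Gradient class, which uses the parameter-shift rule by default:

```python
# analytic (parameter-shift) gradient with respect to theta_index
analytic = Gradient(grad_method='param_shift').convert(expectation,
                                                       params=ansatz.parameters[index])
print(np.real(analytic.bind_parameters(value_dict).eval()))
```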
We see that the calculated analytic gradient is fairly similar to the calculated finite difference gradient.
Now that we know how to calculate gradients, let's try optimizing the expectation value!
First we fix an initial point for reproducibility.
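For instance (the seed, and hence the values, are arbitrary):

```python
# fixed initial point so that every optimizer starts from the same place
np.random.seed(11)
initial_point = np.random.uniform(0, 2 * np.pi, ansatz.num_parameters)
```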
Similar to how we had a function to evaluate the expectation value, we'll need a function to evaluate the gradient.
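A sketch of such a function, again using the opflow parameter-shift Gradient with exact evaluation:

```python
# gradient of the expectation value with respect to all parameters
gradient = Gradient().convert(expectation, params=list(ansatz.parameters))

def evaluate_gradient(theta):
    """Return the gradient vector at the parameter vector theta."""
    value_dict = dict(zip(ansatz.parameters, theta))
    return np.real(gradient.bind_parameters(value_dict).eval())
```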
To compare the convergence of the optimizers, we can keep track of the loss at each step by using a callback function.
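A sketch with the GradientDescent optimizer from qiskit.algorithms.optimizers; the learning rate and iteration count are assumptions:

```python
from qiskit.algorithms.optimizers import GradientDescent

gd_loss = []

def gd_callback(nfev, x, fx, stepsize):
    """Record the loss after every iteration."""
    gd_loss.append(fx)

gd = GradientDescent(maxiter=300, learning_rate=0.01, callback=gd_callback)
```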
And now we start the optimization and plot the loss!
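Running the optimization and plotting the recorded losses (minimize is the interface of recent Qiskit versions; older versions use optimize instead):

```python
import matplotlib.pyplot as plt

gd.minimize(fun=evaluate_expectation, x0=initial_point, jac=evaluate_gradient)

plt.plot(gd_loss, label='vanilla gradient descent')
plt.xlabel('iterations')
plt.ylabel(r'$\langle \hat{H} \rangle$')
plt.legend()
plt.show()
```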
Natural gradients
We see in the above example that we are able to find the minimum of the function using gradient descent. However, gradient descent is not always the best strategy.

For example, if we look at the diagram on the left, given an initial point on the edge of the loss landscape and a learning rate $\eta$, we can approach the minimum in the centre.
However, looking at the diagram on the right, where the loss landscape has been squashed in one parameter direction, we see that using the same initial point and learning rate, we can't find the minimum. This is because we're incorrectly assuming the loss landscape varies at the same rate with respect to each parameter. Both models show the same Euclidean distance between $\vec\theta_n$ and $\vec\theta_{n+1}$, but this is insufficient for Model B because this metric fails to capture the relative sensitivities.
The idea of natural gradients is to change the way we determine $\vec\theta_{n+1}$ from $\vec\theta_n$ by considering the sensitivity of the model. In vanilla gradients, we used the Euclidean distance between them, $d = \lVert\vec\theta_{n+1} - \vec\theta_n\rVert$, but we saw that this doesn't take the loss landscape into account. With natural gradients, we instead use a distance that depends on our model: the distance between the states $|\psi(\vec\theta_n)\rangle$ and $|\psi(\vec\theta_{n+1})\rangle$ rather than between the bare parameter vectors.

This metric is called the Quantum Fisher Information, $g(\vec\theta)$, and allows us to transform the steepest descent in the Euclidean parameter space into the steepest descent in the model space. This is called the Quantum Natural Gradient, introduced in Reference 2, where the update rule becomes
$$\vec\theta_{n+1} = \vec\theta_n - \eta\, g^{-1}(\vec\theta_n)\nabla f(\vec\theta_n).$$

We can evaluate the natural gradient in Qiskit using the NaturalGradient instead of the Gradient.
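A sketch with the legacy opflow NaturalGradient converter; the ridge regularization is an assumption, used to stabilize inverting the QFI:

```python
from qiskit.opflow import NaturalGradient

natural_gradient = NaturalGradient(regularization='ridge').convert(expectation,
                                                                   params=list(ansatz.parameters))
```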
Analogous to the function that computes gradients, we can now write a function that evaluates the natural gradients.
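Mirroring evaluate_gradient, and printing both at the random point from earlier:

```python
def evaluate_natural_gradient(theta):
    """Return the natural gradient vector at the parameter vector theta."""
    value_dict = dict(zip(ansatz.parameters, theta))
    return np.real(natural_gradient.bind_parameters(value_dict).eval())

print('vanilla gradient:', evaluate_gradient(point))
print('natural gradient:', evaluate_natural_gradient(point))
```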
And as you can see they do indeed differ!
Let's look at how this influences the convergence.
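A sketch that reuses GradientDescent but passes the natural gradient as the Jacobian, then compares the two loss histories:

```python
qng_loss = []

def qng_callback(nfev, x, fx, stepsize):
    qng_loss.append(fx)

qng = GradientDescent(maxiter=300, learning_rate=0.01, callback=qng_callback)
qng.minimize(fun=evaluate_expectation, x0=initial_point, jac=evaluate_natural_gradient)

plt.plot(gd_loss, label='vanilla gradient descent')
plt.plot(qng_loss, label='quantum natural gradient')
plt.xlabel('iterations')
plt.ylabel(r'$\langle \hat{H} \rangle$')
plt.legend()
plt.show()
```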
This looks great! We can see that the quantum natural gradient approaches the target faster than vanilla gradient descent. However, this comes at the cost of needing to evaluate many more quantum circuits.
Simultaneous Perturbation Stochastic Approximation
Looking at the gradient of our function as a vector: if we want to evaluate $\nabla f$, we need to calculate the partial derivative of $f$ with respect to each parameter, meaning the number of function evaluations needed to calculate the gradient grows linearly with the number of parameters $d$.
Simultaneous Perturbation Stochastic Approximation (SPSA) is an optimization technique where we randomly sample from the gradient, to reduce the number of evaluations. Since we don't care about the exact values but only about convergence, an unbiased sampling should on average work equally well.
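A minimal NumPy sketch of a single SPSA gradient estimate: all parameters are perturbed simultaneously along a random $\pm 1$ direction, so only two function evaluations are needed regardless of the number of parameters (the perturbation size eps is an assumption):

```python
def spsa_gradient_estimate(f, theta, eps=0.01):
    """One SPSA estimate of the gradient of f at theta."""
    delta = np.random.choice([-1, 1], size=theta.size)  # random perturbation direction
    return (f(theta + eps * delta) - f(theta - eps * delta)) / (2 * eps) * delta

print(spsa_gradient_estimate(evaluate_expectation, point))
```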
In practice, while the exact gradient follows a smooth path to the minimum, SPSA will jump around due to the random sampling, but it will converge, given the same boundary conditions as the gradient.
And how does it perform? We use the SPSA algorithm in Qiskit.
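A sketch with qiskit.algorithms.optimizers.SPSA; the fixed learning rate and perturbation mirror the gradient descent settings above:

```python
from qiskit.algorithms.optimizers import SPSA

spsa_loss = []

def spsa_callback(nfev, x, fx, stepsize, accepted):
    spsa_loss.append(fx)

spsa = SPSA(maxiter=300, learning_rate=0.01, perturbation=0.01, callback=spsa_callback)
spsa.minimize(evaluate_expectation, x0=initial_point)

plt.plot(gd_loss, label='vanilla gradient descent')
plt.plot(spsa_loss, label='SPSA')
plt.xlabel('iterations')
plt.ylabel(r'$\langle \hat{H} \rangle$')
plt.legend()
plt.show()
```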
We can see that SPSA basically follows the gradient descent curve, and at a fraction of the cost!
We can do the same for natural gradients as well, as described in Reference 3. We'll skip the details here, but the idea is to sample not only from the gradient, but to extend this to the quantum Fisher information and thus to the natural gradient.
Qiskit implements this as the QNSPSA algorithm. Let's compare its performance:
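A hedged sketch: QNSPSA additionally needs a fidelity callable, which (in the Qiskit versions this page targets) can be built with QNSPSA.get_fidelity; the API details may differ between versions:

```python
from qiskit.algorithms.optimizers import QNSPSA

fidelity = QNSPSA.get_fidelity(ansatz)  # statevector-based fidelity of the ansatz

qnspsa_loss = []

def qnspsa_callback(nfev, x, fx, stepsize, accepted):
    qnspsa_loss.append(fx)

qnspsa = QNSPSA(fidelity, maxiter=300, learning_rate=0.01, perturbation=0.01,
                callback=qnspsa_callback)
qnspsa.minimize(evaluate_expectation, x0=initial_point)

plt.plot(qng_loss, label='quantum natural gradient')
plt.plot(qnspsa_loss, label='QNSPSA')
plt.xlabel('iterations')
plt.ylabel(r'$\langle \hat{H} \rangle$')
plt.legend()
plt.show()
```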
We can see that QNSPSA somewhat follows the natural gradient descent curve.
The vanilla and natural gradient costs are linear and quadratic in the number of parameters, respectively, while the costs for SPSA and QNSPSA are constant, i.e. independent of the number of parameters. There is a small offset between the costs for SPSA and QNSPSA, as more evaluations are required to approximate the natural gradient.
Training in practice
In this era of near-term quantum computing, circuit evaluations are expensive, and readouts are not perfect due to the noisy nature of the devices. Therefore, in practice, people often resort to using SPSA. To improve convergence, we don't use a constant learning rate but an exponentially decreasing one. The diagram below compares the typical convergence for a constant learning rate (dotted lines) with that for an exponentially decreasing one (solid lines). We see that the convergence for a constant learning rate is a smooth, decreasing line, while the convergence for an exponentially decreasing one is steeper and more staggered. This works well if you know what your loss function looks like.
Qiskit will try to automatically calibrate the learning rate to the model if you don't specify the learning rate.
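A sketch: if neither learning_rate nor perturbation is given, SPSA runs its calibration routine and chooses a decreasing learning rate automatically:

```python
auto_spsa_loss = []

def auto_spsa_callback(nfev, x, fx, stepsize, accepted):
    auto_spsa_loss.append(fx)

# no learning_rate/perturbation given: SPSA calibrates them automatically
auto_spsa = SPSA(maxiter=300, callback=auto_spsa_callback)
auto_spsa.minimize(evaluate_expectation, x0=initial_point)

plt.plot(gd_loss, label='vanilla gradient descent')
plt.plot(qng_loss, label='quantum natural gradient')
plt.plot(spsa_loss, label='SPSA')
plt.plot(auto_spsa_loss, label='calibrated SPSA')
plt.xlabel('iterations')
plt.ylabel(r'$\langle \hat{H} \rangle$')
plt.legend()
plt.show()
```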
We see here that it works the best of all the methods for this small model. For larger models, the convergence will probably be more like the natural gradient.
Limitations
We've seen that training with gradients works well on the small example model. But can we expect the same if we increase the number of qubits? To investigate that, we measure the variance of the gradients for different model sizes. The idea is simple: if the variance is really small, we don't have enough information to update our parameters.
Exponentially vanishing gradients (barren plateaus)
Let's pick a standard parameterized quantum circuit (RealAmplitudes) and see what happens to the gradient as we increase the number of qubits and layers (that is, as we increase the width and depth of the circuit).
Let's plot the variance of the gradient for 2 to 12 qubits.
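A hedged sketch of this experiment: we sample the gradient of a single parameter at random points and compute the variance. The choice of observable (a global $Z^{\otimes n}$ versus a local $Z \otimes Z$ on two qubits), the linear entanglement, and 100 samples per size are all assumptions; the local option is used further below.

```python
from qiskit.opflow import I, StateFn, Gradient

def sample_gradients(num_qubits, reps, local=False, num_samples=100):
    """Sample d<H>/d(theta_0) of a RealAmplitudes circuit at random parameter points.

    local=False uses the global observable Z⊗Z⊗...⊗Z on all qubits,
    local=True a Z⊗Z observable acting on only two of the qubits.
    """
    ansatz = RealAmplitudes(num_qubits, reps=reps, entanglement='linear')
    if local:
        observable = (Z ^ Z) ^ (I ^ (num_qubits - 2)) if num_qubits > 2 else Z ^ Z
    else:
        observable = Z ^ num_qubits
    expectation = StateFn(observable, is_measurement=True) @ StateFn(ansatz)
    grad = Gradient().convert(expectation, params=ansatz.parameters[0])

    samples = []
    for _ in range(num_samples):
        values = np.random.uniform(0, 2 * np.pi, ansatz.num_parameters)
        samples.append(np.real(grad.bind_parameters(dict(zip(ansatz.parameters, values))).eval()))
    return samples

# linear depth: the number of repetitions grows with the number of qubits
qubit_range = range(2, 13)
variances = [np.var(sample_gradients(n, reps=n)) for n in qubit_range]

plt.semilogy(qubit_range, variances, 'o-')
plt.xlabel('number of qubits')
plt.ylabel('variance of the gradient')
plt.show()
```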
Oh no! The variance decreases exponentially! This means our gradients contain less and less information and we'll have a hard time training the model. This is known as the "barren plateau problem", or "exponentially vanishing gradients", discussed in detail in References 4 and 5.
Try it
Do natural gradients have barren plateaus? Try creating the above barren plateau plot for natural gradients instead of vanilla gradients in IBM Quantum Lab. You will need to write a new function, sample_natural_gradients, that computes the natural gradient instead of the gradient.
Is there something we can do about these barren plateaus? It's a hot topic in current research and there are some proposals to mitigate barren plateaus.
Let's have a look at how global and local cost functions, and the depth of the ansatz, influence barren plateaus. First, we'll look at short-depth, single-layer circuits with global operators.
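Reusing sample_gradients from above, with a single repetition but the global observable:

```python
# constant depth (reps=1), global observable
variances_global_shallow = [np.var(sample_gradients(n, reps=1, local=False)) for n in qubit_range]

plt.semilogy(qubit_range, variances_global_shallow, 'o-')
plt.xlabel('number of qubits')
plt.ylabel('variance of the gradient')
plt.show()
```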
We see that short depth, single layer circuits with global operators still give us barren plateaus.
What if we use local operators, but keep the linear depth?
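Again a sketch, keeping the linear depth but switching to the local observable:

```python
# linear depth (reps grows with the number of qubits), local observable
variances_local_linear = [np.var(sample_gradients(n, reps=n, local=True)) for n in qubit_range]

plt.semilogy(qubit_range, variances_local_linear, 'o-')
plt.xlabel('number of qubits')
plt.ylabel('variance of the gradient')
plt.show()
```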
We see that linear-depth circuits with local operators still give us barren plateaus.
How about short depth, single layer, circuits with local operators?
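And finally the combination of constant depth and a local observable:

```python
# constant depth (reps=1), local observable
variances_local_shallow = [np.var(sample_gradients(n, reps=1, local=True)) for n in qubit_range]

plt.semilogy(qubit_range, variances_local_shallow, 'o-')
plt.xlabel('number of qubits')
plt.ylabel('variance of the gradient')
plt.show()
```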
We see that the variance of the gradients for local-operator, constant-depth circuits doesn't vanish, that is, we don't get barren plateaus. However, these circuits are usually easy to simulate, and hence these models won't provide any advantage over classical models.
Quick quiz
The variance of the gradient does not vanish for which types of circuit?
global operator, linear depth
global operator, constant depth
local operator, linear depth
local operator, constant depth
This is the inspiration for layer-wise training, where we start with a basic circuit that may not provide any quantum advantage, with one layer of rotations using local operators. We optimize and fix these parameters, then in the next step, we add a second layer of rotations using local operators, and optimize and fix those, and continue for however many layers we want. This potentially avoids barren plateaus as each optimization step is only using constant depth circuits with local operators.
We can implement this in Qiskit in the following way:
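A hedged sketch of layer-wise training; the circuit size, number of layers, local observable, and SPSA settings below are all assumptions for illustration, not a fixed recipe:

```python
num_qubits = 6
observable = (Z ^ Z) ^ (I ^ (num_qubits - 2))  # local observable on two of the qubits
max_layers = 4

optimized_parameters = np.array([])            # parameters fixed so far
layer_losses = []

for layer in range(max_layers):
    # the full circuit with layer + 1 repetitions; earlier layers keep their trained values
    ansatz_l = RealAmplitudes(num_qubits, reps=layer + 1, entanglement='linear')
    num_new = ansatz_l.num_parameters - optimized_parameters.size

    def loss(new_parameters):
        # bind the already-optimized parameters plus the new layer's parameters
        values = np.concatenate([optimized_parameters, new_parameters])
        bound = StateFn(observable, is_measurement=True) @ StateFn(ansatz_l.bind_parameters(values))
        return np.real(bound.eval())

    result = SPSA(maxiter=50).minimize(loss, x0=np.random.uniform(0, 2 * np.pi, num_new))
    optimized_parameters = np.concatenate([optimized_parameters, result.x])
    layer_losses.append(result.fun)

plt.plot(range(1, max_layers + 1), layer_losses, 'o-')
plt.xlabel('number of layers')
plt.ylabel(r'$\langle \hat{H} \rangle$ after optimizing the newest layer')
plt.show()
```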
We see that as we increase the circuit depth, our loss function decreases towards -1, so we don't see any barren plateaus.
References
1. Maria Schuld, Ville Bergholm, Christian Gogolin, Josh Izaac and Nathan Killoran, Evaluating analytic gradients on quantum hardware, Physical Review A 99, 032331 (2019), doi:10.1103/PhysRevA.99.032331, arXiv:1811.11184.
2. James Stokes, Josh Izaac, Nathan Killoran and Giuseppe Carleo, Quantum Natural Gradient, Quantum 4, 269 (2020), doi:10.22331/q-2020-05-25-269, arXiv:1909.02108.
3. Julien Gacon, Christa Zoufal, Giuseppe Carleo and Stefan Woerner, Simultaneous Perturbation Stochastic Approximation of the Quantum Fisher Information, arXiv:2103.09232.
4. Jarrod R. McClean, Sergio Boixo, Vadim N. Smelyanskiy, Ryan Babbush and Hartmut Neven, Barren plateaus in quantum neural network training landscapes, Nature Communications 9, 4812 (2018), doi:10.1038/s41467-018-07090-4, arXiv:1803.11173.
5. M. Cerezo, Akira Sone, Tyler Volkoff, Lukasz Cincio and Patrick J. Coles, Cost Function Dependent Barren Plateaus in Shallow Parametrized Quantum Circuits, Nature Communications 12, 1791 (2021), doi:10.1038/s41467-021-21728-w, arXiv:2001.00550.