
Training parameterized quantum circuits

In this section we will take a closer look at how to train circuit-based models using gradient-based methods. We'll look at the restrictions these models have, and how we might overcome them.

Introduction

Like classical models, we can train parameterized quantum circuit models to perform data-driven tasks. The task of learning an arbitrary function from data is mathematically expressed as the minimization of a cost or loss function $f(\vec\theta)$, also known as the objective function, with respect to the parameter vector $\vec\theta$. Generally, when training a parameterized quantum circuit model, the function we are trying to minimise is the expectation value: $\langle \psi(\vec\theta) | \hat H | \psi(\vec\theta) \rangle$

Image showing a cycle between 'update parameters (theta)' and 'compute (expectation value of H with state Phi(theta))'. The value of the computation is fed back to the 'update parameters' step.

There are many different types of algorithms that we can use to optimise the parameters of a variational circuit, $U(\vec\theta)$ (gradient-based, evolutionary, and gradient-free methods). In this course, we will be discussing gradient-based methods.

Gradients

Say we have a function $f(\vec\theta)$, and we have access to the gradient of the function, $\vec{\nabla} f(\vec\theta)$, starting from an initial point. The simplest way to minimize the function is to update the parameters towards the direction of steepest descent of the function: $\vec\theta_{n+1} = \vec\theta_n - \eta \vec{\nabla} f(\vec\theta_n)$, where $\eta$ is the learning rate: a small, positive hyperparameter controlling the size of the update. We continue doing this until we converge to a local minimum of the function, $f(\vec\theta^*)$.

graph of f(theta) against theta, multiple dots show different states of a gradient descent algorithm finding the minimum of a curve.

This technique is called gradient descent, or vanilla gradient descent, since it uses the plain gradient without any extra modification.
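To make the update rule concrete, here is a minimal sketch of vanilla gradient descent on a toy function (the function, its gradient, and the hyperparameter values are illustrative and not part of the Qiskit example that follows):

import numpy as np

def toy_loss(theta):
    return np.sum(np.sin(theta))    # a toy loss function

def toy_gradient(theta):
    return np.cos(theta)            # its exact gradient

eta = 0.1                           # learning rate
theta = np.array([0.3, 1.2])        # initial point
for _ in range(100):
    theta = theta - eta * toy_gradient(theta)  # steepest-descent update

print(theta, toy_loss(theta))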

Qiskit provides several methods to compute gradients of expectation values; let's explore them!

First, we need to define our parameterized state, $|\psi(\vec\theta)\rangle = U(\vec\theta)|0\rangle$. On this page, $U(\vec\theta)$ is the Qiskit RealAmplitudes circuit on two qubits:

from qiskit.circuit.library import RealAmplitudes

ansatz = RealAmplitudes(num_qubits=2, reps=1, entanglement='linear').decompose()
ansatz.draw()
Image in a Jupyter notebook

Next we need to define a Hamiltonian; let's use: $\hat H = Z \otimes Z$

from qiskit.opflow import Z, I

hamiltonian = Z ^ Z

Putting them together to make the expectation value: $\langle \psi(\vec\theta) | \hat H | \psi(\vec\theta) \rangle$

from qiskit.opflow import StateFn, PauliExpectation

expectation = StateFn(hamiltonian, is_measurement=True) @ StateFn(ansatz)
pauli_basis = PauliExpectation().convert(expectation)

Next we write a function to simulate the measurement of the expectation value:

from qiskit import Aer
from qiskit.utils import QuantumInstance
from qiskit.opflow import PauliExpectation, CircuitSampler

quantum_instance = QuantumInstance(Aer.get_backend('qasm_simulator'),
                                   # we'll set a seed for reproducibility
                                   shots=8192,
                                   seed_simulator=2718,
                                   seed_transpiler=2718)
sampler = CircuitSampler(quantum_instance)

def evaluate_expectation(theta):
    # pylint: disable=missing-function-docstring
    value_dict = dict(zip(ansatz.parameters, theta))
    result = sampler.convert(pauli_basis, params=value_dict).eval()
    return np.real(result)

To make things concrete, let's fix a point $\vec p$ and an index $i$ and ask: what's the derivative of the expectation value with respect to parameter $\theta_i$ at the point $\vec p$?

$$\frac{\partial}{\partial \theta_i} \langle \psi(\vec\theta) | \hat H | \psi(\vec\theta) \rangle \Big|_{\vec\theta = \vec p}$$

We'll choose a random point $\vec p$ and index $i=2$ (remember we start counting from 0).

import numpy as np

point = np.random.random(ansatz.num_parameters)
INDEX = 2

Finite difference gradients

Arguably the simplest way to approximate gradients is with a finite difference scheme. This works independently of the function's inner, possibly very complex, structure.

graph shows a curve, and dots at three points on the curve. The dots are equally-spaced along the x-axis by epsilon.

If we are interested in estimating the gradient at $f(\vec\theta)$, we can choose some small distance $\epsilon$, calculate $f(\vec\theta+\epsilon)$ and $f(\vec\theta-\epsilon)$, and take the difference between the two function values, divided by the distance: $\vec{\nabla} f(\vec\theta) \approx \frac{1}{2\epsilon}\left(f(\vec\theta+\epsilon) - f(\vec\theta-\epsilon)\right)$.

EPS = 0.2

# make identity vector with a 1 at index ``INDEX``, otherwise 0
e_i = np.identity(point.size)[:, INDEX]

plus = point + EPS * e_i
minus = point - EPS * e_i

finite_difference = (evaluate_expectation(plus) - evaluate_expectation(minus)) / (2 * EPS)
print(finite_difference)
-0.3436279296875

Instead of doing this manually, we can use Qiskit's Gradient class for this.

from qiskit.opflow import Gradient

shifter = Gradient('fin_diff', analytic=False, epsilon=EPS)
grad = shifter.convert(expectation, params=ansatz.parameters[INDEX])
print(grad)

value_dict = dict(zip(ansatz.parameters, point))
sampler.convert(grad, value_dict).eval().real
SummedOp([
  2.5 * ComposedOp([
    OperatorMeasurement(ZZ),
    CircuitStateFn(
         ┌──────────┐     ┌────────────────┐
    q_0: ┤ Ry(θ[0]) ├──■──┤ Ry(θ[2] + 0.2) ├
         ├──────────┤┌─┴─┐└───┬──────────┬─┘
    q_1: ┤ Ry(θ[1]) ├┤ X ├────┤ Ry(θ[3]) ├──
         └──────────┘└───┘    └──────────┘
    )
  ]),
  -2.5 * ComposedOp([
    OperatorMeasurement(ZZ),
    CircuitStateFn(
         ┌──────────┐     ┌────────────────┐
    q_0: ┤ Ry(θ[0]) ├──■──┤ Ry(θ[2] - 0.2) ├
         ├──────────┤┌─┴─┐└───┬──────────┬─┘
    q_1: ┤ Ry(θ[1]) ├┤ X ├────┤ Ry(θ[3]) ├──
         └──────────┘└───┘    └──────────┘
    )
  ])
])
-0.3686523437500002

Finite difference gradients can be volatile on noisy functions, so using the exact formula for the gradient can be more stable. This can be seen above: although the two calculations use the same formula, they yield different results due to shot noise. In the example image below, we can see the "noisy finite difference gradient" actually points in the opposite direction of the true gradient!

Image showing a noisy function, the true (average) gradient of the function, and the incorrectly calculated gradient from the finite difference method

Analytic gradients

Analytic gradients evaluate the analytic formula for the gradient. In general this is fairly difficult, as we have to do a manual calculation, but for circuit-based gradients, Reference 1 introduces a nice theoretical result that gives an easy formula for calculating gradients: the parameter-shift rule.

For a simple circuit consisting only of Pauli rotations, without any coefficients, this rule says that the analytic gradient is:

Image of simplified parameter shift equation
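For reference, the rule in the image can be written as follows (a standard form of the parameter-shift rule, where $\vec e_i$ is the unit vector along parameter $i$):

$$\frac{\partial}{\partial \theta_i} f(\vec\theta) = \frac{f\!\left(\vec\theta + \tfrac{\pi}{2}\vec e_i\right) - f\!\left(\vec\theta - \tfrac{\pi}{2}\vec e_i\right)}{2}$$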

which is very similar to the equation for finite difference gradients.

Let's try calculating it by hand:

EPS = np.pi / 2

e_i = np.identity(point.size)[:, INDEX]

plus = point + EPS * e_i
minus = point - EPS * e_i

finite_difference = (evaluate_expectation(plus) - evaluate_expectation(minus)) / 2
print(finite_difference)
-0.34521484375

And using the Qiskit Gradient class:

shifter = Gradient()  # parameter-shift rule is the default
grad = shifter.convert(expectation, params=ansatz.parameters[INDEX])
sampler.convert(grad, value_dict).eval().real
-0.35302734375

We see that the calculated analytic gradient is fairly similar to the calculated finite difference gradient.

Now that we know how to calculate gradients, let's try optimizing the expectation value!

First we fix an initial point for reproducibility.

# initial_point = np.random.random(ansatz.num_parameters)
initial_point = np.array([0.43253681, 0.09507794, 0.42805949, 0.34210341])

Similar to how we had a function to evaluate the expectation value, we'll need a function to evaluate the gradient.

gradient = Gradient().convert(expectation)
gradient_in_pauli_basis = PauliExpectation().convert(gradient)
sampler = CircuitSampler(quantum_instance)

def evaluate_gradient(theta):
    # pylint: disable=missing-function-docstring
    value_dict = dict(zip(ansatz.parameters, theta))
    result = sampler.convert(gradient_in_pauli_basis, params=value_dict).eval()
    return np.real(result)

To compare the convergence of the optimizers, we can keep track of the loss at each step by using a callback function.

This is the image from the top of the page (the 'update parameter' / 'compute expectation value' cycle) with two arrows to the left of the image pointing vertically up and down. The arrows are labelled 'callback' and 'information' respectively.

class OptimizerLog:
    # pylint: disable=too-few-public-methods
    """Log to store optimizer's intermediate results"""
    def __init__(self):
        self.loss = []

    def update(self, _nfevs, _theta, ftheta, *_):
        """Save intermediate results. Optimizers pass many values
        but we only store the third."""
        self.loss.append(ftheta)

from qiskit.algorithms.optimizers import GradientDescent

gd_log = OptimizerLog()
gd = GradientDescent(maxiter=300,
                     learning_rate=0.01,
                     callback=gd_log.update)

And now we start the optimization and plot the loss!

result = gd.minimize(
    fun=evaluate_expectation,  # function to minimize
    x0=initial_point,          # initial point
    jac=evaluate_gradient      # function to evaluate gradient
)

import matplotlib.pyplot as plt

plt.figure(figsize=(7, 3))
plt.plot(gd_log.loss, label='vanilla gradient descent')
plt.axhline(-1, ls='--', c='C3', label='target')
plt.ylabel('loss')
plt.xlabel('iterations')
plt.legend();
Image in a Jupyter notebook

Natural gradients

We see in the above example that we are able to find the minimum of the function using gradient descent. However, gradient descent is not always the best strategy.

Two images, the left one with a circular landscape, and the path of a gradient descent algorithm shown finding the minimum. The landscape on the right is squashed horizontally, and the path of a gradient descent algorithm is shown overshooting the minimum a few times.

For example, if we look at the diagram on the left, given the initial point on the edge of the loss landscape, $\vec\theta_0 = (x_0, y_0)$, and learning rate $\eta$, we can approach the minimum in the centre.

However, looking at the diagram on the right, where the loss landscape has been squashed in the $x$-dimension, we see that using the same initial point and learning rate, we can't find the minimum. This is because we're incorrectly assuming the loss landscape varies at the same rate with respect to each parameter. Both models show the same Euclidean distance between $(x_0, y_0)$ and $(x_1, y_1)$, but this is insufficient for Model B (the squashed landscape on the right) because this metric fails to capture the relative sensitivities.

The idea of natural gradients is to change the way we determine $\vec\theta_{n+1}$ from $\vec\theta_n$, by considering the sensitivity of the model. In vanilla gradients, we used the Euclidean distance between them: $d = \lVert \vec\theta_{n+1} - \vec\theta_n \rVert_2$, but we saw that this doesn't take the loss landscape into account. With natural gradients, we instead use a distance that depends on our model: $d = \lVert\langle \Psi(\vec\theta_n) | \Psi(\vec\theta_{n+1}) \rangle \rVert^2$.
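As a small illustration (not part of the original notebook), we can compute both quantities for a candidate gradient-descent step using exact statevectors; Statevector and state_fidelity from qiskit.quantum_info are used here purely for demonstration:

from qiskit.quantum_info import Statevector, state_fidelity

theta_n = np.random.random(ansatz.num_parameters)
theta_np1 = theta_n - 0.1 * np.array(evaluate_gradient(theta_n))  # a vanilla GD step

# exact states before and after the candidate update
psi_n = Statevector(ansatz.bind_parameters(dict(zip(ansatz.parameters, theta_n))))
psi_np1 = Statevector(ansatz.bind_parameters(dict(zip(ansatz.parameters, theta_np1))))

print('Euclidean distance:', np.linalg.norm(theta_np1 - theta_n))
print('State overlap |<psi_n|psi_n+1>|^2:', state_fidelity(psi_n, psi_np1))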

Two images, the left one with a circular landscape, and a line drawn between two points on the landscape. One point is near the edge of the landscape, and the other near the minimum at the centre. The landscape on the right is squashed horizontally, but there is a line drawn between two points on the landscape. One point is near the edge of the landscape, and the other near the minimum at the centre.

This metric is called the Quantum Fisher Information, $g_{ij}(\vec\theta)$, and allows us to transform the steepest descent in the Euclidean parameter space into steepest descent in the model space. This is called the Quantum Natural Gradient, introduced in Reference 2, where $\vec\theta_{n+1} = \vec\theta_n - \eta\, g^{-1}(\vec\theta_n) \vec{\nabla} f(\vec\theta_n)$.

Two images, the left one with a circular landscape, and the path of a gradient descent algorithm shown finding the minimum. The landscape on the right is squashed horizontally, and the path of a gradient descent algorithm is shown, with each step also squashed horizontally, allowing the path to find the minimum.

We can evaluate the natural gradient in Qiskit using the NaturalGradient instead of the Gradient.

from qiskit.opflow import NaturalGradient

Analogous to the function that computes gradients, we can now write a function that evaluates the natural gradients.

natural_gradient = (NaturalGradient(regularization='ridge')
                    .convert(expectation))
natural_gradient_in_pauli_basis = PauliExpectation().convert(natural_gradient)
sampler = CircuitSampler(quantum_instance, caching="all")

def evaluate_natural_gradient(theta):
    # pylint: disable=missing-function-docstring
    value_dict = dict(zip(ansatz.parameters, theta))
    result = sampler.convert(natural_gradient, params=value_dict).eval()
    return np.real(result)

print('Vanilla gradient:', evaluate_gradient(initial_point))
print('Natural gradient:', evaluate_natural_gradient(initial_point))
Vanilla gradient: [ 0.13989258 -0.35095215 -0.25402832 -0.22497559]
Natural gradient: [ 0.7158704  -0.86457346 -0.98086467 -0.33820315]

And as you can see they do indeed differ!

Let's look at how this influences the convergence.

qng_log = OptimizerLog()
qng = GradientDescent(maxiter=300,
                      learning_rate=0.01,
                      callback=qng_log.update)

result = qng.minimize(evaluate_expectation,
                      initial_point,
                      evaluate_natural_gradient)

# Plot loss
plt.figure(figsize=(7, 3))
plt.plot(gd_log.loss, 'C0', label='vanilla gradient descent')
plt.plot(qng_log.loss, 'C1', label='quantum natural gradient')
plt.axhline(-1, c='C3', ls='--', label='target')
plt.ylabel('loss')
plt.xlabel('iterations')
plt.legend();
Image in a Jupyter notebook

This looks great! We can see that the quantum natural gradient approaches the target faster than vanilla gradient descent. However, this comes at the cost of needing to evaluate many more quantum circuits.


Simultaneous Perturbation Stochastic Approximation

If we want to evaluate the gradient $\vec{\nabla} f(\vec\theta)$ of our function $f(\vec\theta)$ as a vector, we need to calculate the partial derivative of $f(\vec\theta)$ with respect to each parameter, meaning we would need $2N$ function evaluations for $N$ parameters to calculate the gradient.

Simultaneous Perturbation Stochastic Approximation (SPSA) is an optimization technique where we randomly sample from the gradient, to reduce the number of evaluations. Since we don't care about the exact values but only about convergence, an unbiased sampling should on average work equally well.

In practice, while the exact gradient follows a smooth path to the minimum, SPSA will jump around due to the random sampling, but it will converge, given the same boundary conditions as the gradient.
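Here is a minimal sketch of a single SPSA gradient estimate (an illustration, not Qiskit's internal implementation): all parameters are perturbed at once along a random ±1 direction, so only two function evaluations are needed regardless of the number of parameters.

def spsa_gradient_estimate(f, theta, eps=0.01):
    # random perturbation direction with entries +1 or -1
    delta = np.random.choice([-1, 1], size=theta.size)
    # two evaluations give an unbiased estimate of the full gradient
    return (f(theta + eps * delta) - f(theta - eps * delta)) / (2 * eps) * delta

print(spsa_gradient_estimate(evaluate_expectation, initial_point))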

And how does it perform? Let's use the SPSA algorithm in Qiskit.

from qiskit.algorithms.optimizers import SPSA

spsa_log = OptimizerLog()
spsa = SPSA(maxiter=300,
            learning_rate=0.01,
            perturbation=0.01,
            callback=spsa_log.update)

result = spsa.minimize(evaluate_expectation, initial_point)

# Plot loss
plt.figure(figsize=(7, 3))
plt.plot(gd_log.loss, 'C0', label='vanilla gradient descent')
plt.plot(qng_log.loss, 'C1', label='quantum natural gradient')
plt.plot(spsa_log.loss, 'C0', ls='--', label='SPSA')
plt.axhline(-1, c='C3', ls='--', label='target')
plt.ylabel('loss')
plt.xlabel('iterations')
plt.legend();
Image in a Jupyter notebook

We can see that SPSA basically follows the gradient descent curve, and at a fraction of the cost!

We can do the same for natural gradients as well, as described in Reference 3. We'll skip the details here, but the idea is to sample not only from the gradient, but to extend this to the quantum Fisher information and thus to the natural gradient.

Qiskit implements this as the QNSPSA algorithm. Let's compare its performance:

from qiskit.algorithms.optimizers import QNSPSA

qnspsa_log = OptimizerLog()
fidelity = QNSPSA.get_fidelity(ansatz,
                               quantum_instance,
                               expectation=PauliExpectation())
qnspsa = QNSPSA(fidelity,
                maxiter=300,
                learning_rate=0.01,
                perturbation=0.01,
                callback=qnspsa_log.update)

result = qnspsa.minimize(evaluate_expectation, initial_point)

# Plot loss
plt.figure(figsize=(7, 3))
plt.plot(gd_log.loss, 'C0', label='vanilla gradient descent')
plt.plot(qng_log.loss, 'C1', label='quantum natural gradient')
plt.plot(spsa_log.loss, 'C0', ls='--', label='SPSA')
plt.plot(qnspsa_log.loss, 'C1', ls='--', label='QN-SPSA')
plt.axhline(-1, c='C3', ls='--', label='target')
plt.ylabel('loss')
plt.xlabel('iterations')
plt.legend();
Image in a Jupyter notebook

We can see that QNSPSA somewhat follows the natural gradient descent curve.

Graph of 'evaluations per iteration' on the y-axis and 'number of parameters in ansatz' on x-axis. Two solid lines are plotted, one labelled "natural gradient" and the other "vanilla gradient", the vanilla gradient line is lower than the natural at all points on the graph and grows at a slower rate. Two dotted horizontal lines are plotted. The higher horizontal line is labelled "QN-SPSA" and intersects the y-axis just above the "natural gradient" line. The lower dotted line is labelled "SPSA" and intersects just above the "vanilla gradient" line

The vanilla and natural gradient costs are linear and quadratic, respectively, in the number of parameters, while the costs for SPSA and QN-SPSA are constant, i.e. independent of the number of parameters. There is a small offset between the costs for SPSA and QN-SPSA, as more evaluations are required to approximate the natural gradient.
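As a rough, back-of-the-envelope comparison (the counts below are assumptions for illustration: two parameter-shift evaluations per parameter for the gradient, on the order of $N^2$ evaluations for the quantum Fisher information, and constant per-iteration costs for SPSA and QN-SPSA):

def evaluations_per_iteration(n):
    """Approximate circuit evaluations per optimizer iteration for n parameters."""
    return {
        'vanilla gradient': 2 * n,         # parameter shift: 2 evaluations per parameter
        'natural gradient': 2 * n + n**2,  # gradient plus O(N^2) QFI entries (rough estimate)
        'SPSA': 2,                         # one random perturbation, two evaluations
        'QN-SPSA': 6,                      # loss perturbations plus fidelity evaluations (Reference 3)
    }

print(evaluations_per_iteration(ansatz.num_parameters))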

Quick quiz

q-carousel
  q-quiz(goal="qml-training-quizousel-0")
    .question.md The _Gradient_ training method...
    .option(x) updates the circuit parameters using the gradient of the loss function
    .option updates the circuit parameters using the quantum natural gradient of the loss function
    .option updates the circuit parameters using the approximate gradient of the loss function, calculated using random sampling
    .option updates the circuit parameters using the approximate quantum natural gradient of the loss function, calculated using random sampling
  q-quiz(goal="qml-training-quizousel-1")
    .question.md The _Natural Gradient_ training method...
    .option updates the circuit parameters using the gradient of the loss function
    .option(x) updates the circuit parameters using the quantum natural gradient of the loss function
    .option updates the circuit parameters using the approximate gradient of the loss function, calculated using random sampling
    .option updates the circuit parameters using the approximate quantum natural gradient of the loss function, calculated using random sampling
  q-quiz(goal="qml-training-quizousel-2")
    .question.md The _SPSA_ training method...
    .option updates the circuit parameters using the gradient of the loss function
    .option updates the circuit parameters using the quantum natural gradient of the loss function
    .option(x) updates the circuit parameters using the approximate gradient of the loss function, calculated using random sampling
    .option updates the circuit parameters using the approximate quantum natural gradient of the loss function, calculated using random sampling
  q-quiz(goal="qml-training-quizousel-3")
    .question.md The _QNSPSA_ training method...
    .option updates the circuit parameters using the gradient of the loss function
    .option updates the circuit parameters using the quantum natural gradient of the loss function
    .option updates the circuit parameters using the approximate gradient of the loss function, calculated using random sampling
    .option(x) updates the circuit parameters using the approximate quantum natural gradient of the loss function, calculated using random sampling

Training in practice

In this era of near-term quantum computing, circuit evaluations are expensive, and readouts are not perfect due to the noisy nature of the devices. Therefore, in practice, people often resort to using SPSA. To improve convergence, we don't use a constant learning rate, but an exponentially decreasing one. The diagram below shows the typical convergence for a constant learning rate (dotted lines) versus an exponentially decreasing one (solid lines). We see that the convergence for a constant learning rate is a smooth, decreasing line, while the convergence for an exponentially decreasing one is steeper and more staggered. This works well if you know what your loss function looks like.

Graph with "Loss" on the y-xis and "Iterations" on the x-axis. There is a green horizontal line labelled "target" towards the bottom of the graph. A dotted curve labelled "fixed eta" starts high on the y-axis and tends slowly down towards the target line. A different, noisy curve labelled "power-law eta" starts at the same point on the y-axis as "fixed eta", but descends faster towards the "Target line".

Graph with "Learning rate (eta)" on the y-xis and "Iterations" on the x-axis. There is a green, dotted, horizontal line labelled "eta = const" roughly 30% up the y-axis. A dotted curve labelled "eta proportional to alpha^n" starts high on the y-axis and tends slowly down past the "eta = const" line.

Qiskit will try to automatically calibrate the learning rate to the model if you don't specify one.

autospsa_log = OptimizerLog()
autospsa = SPSA(maxiter=300,
                learning_rate=None,
                perturbation=None,
                callback=autospsa_log.update)

result = autospsa.minimize(evaluate_expectation, initial_point)

# Plot loss
plt.figure(figsize=(7, 3))
plt.plot(gd_log.loss, 'C0', label='vanilla gradient descent')
plt.plot(qng_log.loss, 'C1', label='quantum natural gradient')
plt.plot(spsa_log.loss, 'C0', ls='--', label='SPSA')
plt.plot(qnspsa_log.loss, 'C1', ls='--', label='QN-SPSA')
plt.plot(autospsa_log.loss, 'C3', label='Power-law SPSA')
plt.axhline(-1, c='C3', ls='--', label='target')
plt.ylabel('loss')
plt.xlabel('iterations')
plt.legend();
Image in a Jupyter notebook

We see here that it works the best of all the methods for this small model. For larger models, the convergence will probably be more like that of the natural gradient.

Limitations

We've seen that training with gradients works well on the small example model. But can we expect the same if we increase the number of qubits? To investigate that, we measure the variance of the gradients for different model sizes. The idea is simple: if the variance is really small, we don't have enough information to update our parameters.

Exponentially vanishing gradients (barren plateaus)

Let's pick a standard parameterized quantum circuit (RealAmplitudes) and see what happens if we increase the number of qubits and layers (that is, increase the width and depth of the circuit) as we calculate the gradient.

Image of an 'n' qubit quantum circuit. An RY(theta_n) acts on each qubit, followed by some CNOTs and more RY gates (with different parameters). The CNOTs and other RY gates are repeated 'n' times.

from qiskit.opflow import I

def sample_gradients(num_qubits, reps, local=False):
    """Sample the gradient of our model for ``num_qubits`` qubits
    and ``reps`` repetitions.

    We sample 100 times for random parameters and compute the
    gradient of the first RY rotation gate.
    """
    index = num_qubits - 1

    # local or global operator
    if local:
        operator = Z ^ Z ^ (I ^ (num_qubits - 2))
    else:
        operator = Z ^ num_qubits

    # real amplitudes ansatz
    ansatz = RealAmplitudes(num_qubits, entanglement='linear', reps=reps)

    # construct Gradient we want to evaluate for different values
    expectation = StateFn(operator, is_measurement=True).compose(StateFn(ansatz))
    grad = Gradient().convert(expectation, params=ansatz.parameters[index])

    # evaluate for 100 different, random parameter values
    num_points = 100
    grads = []
    for _ in range(num_points):
        # points are uniformly chosen from [0, pi]
        point = np.random.uniform(0, np.pi, ansatz.num_parameters)
        value_dict = dict(zip(ansatz.parameters, point))
        grads.append(sampler.convert(grad, value_dict).eval())

    return grads

Let's plot from 2 to 12 qubits.

num_qubits = list(range(2, 13))
reps = num_qubits  # number of layers = number of qubits
gradients = [sample_gradients(n, r) for n, r in zip(num_qubits, reps)]

fit = np.polyfit(num_qubits, np.log(np.var(gradients, axis=1)), deg=1)
x = np.linspace(num_qubits[0], num_qubits[-1], 200)

plt.figure(figsize=(7, 3))
plt.semilogy(num_qubits, np.var(gradients, axis=1), 'o-', label='measured variance')
plt.semilogy(x, np.exp(fit[0] * x + fit[1]), '--', c='C3',
             label=f'exponential fit w/ {fit[0]:.2f}')
plt.xlabel('number of qubits')
plt.ylabel(r'$\mathrm{Var}[\partial_{\theta 1}\langle E(\theta)\rangle]$')
plt.legend(loc='best');
Image in a Jupyter notebook

Oh no! The variance decreases exponentially! This means our gradients contain less and less information and we'll have a hard time training the model. This is known as the "barren plateau problem", or "exponentially vanishing gradients", discussed in detail in References 4 and 5.

Try it

Do natural gradients have barren plateaus? Try creating the above barren plateau plot for natural gradients instead of vanilla gradients in IBM Quantum Lab. You will need to write a new function sample_natural_gradients that computes the natural gradient instead of the gradient.

Is there something we can do about these barren plateaus? It's a hot topic in current research and there are some proposals to mitigate barren plateaus.

Let's have a look at how global and local cost functions and the depth of the ansatz influence the barren plateaus. First, we'll look at short-depth, single-layer circuits with global operators.

num_qubits = list(range(2, 13))
fixed_depth_global_gradients = [sample_gradients(n, 1) for n in num_qubits]

fit = np.polyfit(num_qubits,
                 np.log(np.var(fixed_depth_global_gradients, axis=1)),
                 deg=1)
x = np.linspace(num_qubits[0], num_qubits[-1], 200)

plt.figure(figsize=(7, 3))
plt.semilogy(num_qubits, np.var(gradients, axis=1), 'o-',
             label='global cost, linear depth')
plt.semilogy(num_qubits, np.var(fixed_depth_global_gradients, axis=1), 'o-',
             label='global cost, constant depth')
plt.semilogy(x, np.exp(fit[0] * x + fit[1]), '--', c='C3',
             label=f'exponential fit w/ {fit[0]:.2f}')
plt.xlabel('number of qubits')
plt.ylabel(r'$\mathrm{Var}[\partial_{\theta 1}\langle E(\theta)\rangle]$')
plt.legend(loc='best');
Image in a Jupyter notebook

We see that short depth, single layer circuits with global operators still give us barren plateaus.

What if we use local operators?

num_qubits = list(range(2, 13))
linear_depth_local_gradients = [sample_gradients(n, n, local=True) for n in num_qubits]

fit = np.polyfit(num_qubits,
                 np.log(np.var(linear_depth_local_gradients, axis=1)),
                 deg=1)
x = np.linspace(num_qubits[0], num_qubits[-1], 200)

plt.figure(figsize=(7, 3))
plt.semilogy(num_qubits, np.var(gradients, axis=1), 'o-',
             label='global cost, linear depth')
plt.semilogy(num_qubits, np.var(fixed_depth_global_gradients, axis=1), 'o-',
             label='global cost, constant depth')
plt.semilogy(num_qubits, np.var(linear_depth_local_gradients, axis=1), 'o-',
             label='local cost, linear depth')
plt.semilogy(x, np.exp(fit[0] * x + fit[1]), '--', c='C3',
             label=f'exponential fit w/ {fit[0]:.2f}')
plt.xlabel('number of qubits')
plt.ylabel(r'$\mathrm{Var}[\partial_{\theta 1}\langle E(\theta)\rangle]$')
plt.legend(loc='best');
Image in a Jupyter notebook

We see that linear-depth circuits with local operators still give us barren plateaus.

How about short-depth, single-layer circuits with local operators?

num_qubits = list(range(2, 13))
fixed_depth_local_gradients = [sample_gradients(n, 1, local=True) for n in num_qubits]

fit = np.polyfit(num_qubits,
                 np.log(np.var(fixed_depth_local_gradients, axis=1)),
                 deg=1)
x = np.linspace(num_qubits[0], num_qubits[-1], 200)

plt.figure(figsize=(7, 3))
plt.semilogy(num_qubits, np.var(gradients, axis=1), 'o-',
             label='global cost, linear depth')
plt.semilogy(num_qubits, np.var(fixed_depth_global_gradients, axis=1), 'o-',
             label='global cost, constant depth')
plt.semilogy(num_qubits, np.var(linear_depth_local_gradients, axis=1), 'o-',
             label='local cost, linear depth')
plt.semilogy(num_qubits, np.var(fixed_depth_local_gradients, axis=1), 'o-',
             label='local cost, constant depth')
plt.semilogy(x, np.exp(fit[0] * x + fit[1]), '--', c='C3',
             label=f'exponential fit w/ {fit[0]:.2f}')
plt.xlabel('number of qubits')
plt.ylabel(r'$\mathrm{Var}[\partial_{\theta 1}\langle E(\theta)\rangle]$')
plt.legend(loc='best');
Image in a Jupyter notebook

We see that the variance of the gradients for the local-operator, constant-depth circuit doesn't vanish; that is, we don't get barren plateaus. However, these circuits are usually easy to simulate, and hence these models won't provide any advantage over classical models.

Quick quiz

The variance of the gradient does not vanish for which types of circuit?

  1. global operator, linear depth

  2. global operator, constant depth

  3. local operator, linear depth

  4. local operator, constant depth

This is the inspiration for layer-wise training, where we start with a basic circuit that may not provide any quantum advantage, with one layer of rotations using local operators. We optimize and fix these parameters, then in the next step, we add a second layer of rotations using local operators, and optimize and fix those, and continue for however many layers we want. This potentially avoids barren plateaus as each optimization step is only using constant depth circuits with local operators.

There are three 3-qubit quantum circuits, one for each of three iterations. In the first circuit, there is only one layer of parameterized gates, which we train in iteration 1. Each following iteration gradually adds more parameterized gates and trains only these new gates.

We can implement this in Qiskit in the following way:

NUM_QUBITS = 6
OPERATOR = Z ^ Z ^ (I ^ (NUM_QUBITS - 4))

def minimize(circuit, optimizer):
    """
    Args:
        circuit (QuantumCircuit): (Partially bound) ansatz circuit to train
        optimizer (Optimizer): Algorithm to use to minimize exp. value
    Returns:
        OptimizerResult: Result of minimization
    """
    initial_point = np.random.random(circuit.num_parameters)

    exp = StateFn(OPERATOR, is_measurement=True) @ StateFn(circuit)
    grad = Gradient().convert(exp)

    exp = PauliExpectation().convert(exp)
    grad = PauliExpectation().convert(grad)

    sampler = CircuitSampler(quantum_instance, caching="all")

    def loss(theta):
        values_dict = dict(zip(circuit.parameters, theta))
        return np.real(sampler.convert(exp, values_dict).eval())

    def gradient(theta):
        values_dict = dict(zip(circuit.parameters, theta))
        return np.real(sampler.convert(grad, values_dict).eval())

    return optimizer.minimize(loss, initial_point, gradient)

def layerwise_training(ansatz, max_num_layers, optimizer):
    """
    Args:
        ansatz (QuantumCircuit): Single circuit layer to train & repeat
        max_num_layers (int): Maximum number of layers
        optimizer (Optimizer): Algorithm to use to minimize exp. value
    Returns:
        float: Lowest value achieved
        list[float]: Best parameters found
    """
    optimal_parameters = []
    for reps in range(max_num_layers):
        ansatz.reps = reps

        # fix the already optimized parameters
        values_dict = dict(zip(ansatz.parameters, optimal_parameters))
        partially_bound = ansatz.bind_parameters(values_dict)

        result = minimize(partially_bound, optimizer)
        optimal_parameters += list(result.x)
        print('Layer:', reps, ' Best Value:', result.fun)

    return result.fun, optimal_parameters

ansatz = RealAmplitudes(4, entanglement='linear')
optimizer = GradientDescent(maxiter=50)

np.random.seed(12)  # for reproducibility
fopt, optimal_parameters = layerwise_training(ansatz, 4, optimizer)
Layer: 0  Best Value: 0.6162109375
Layer: 1  Best Value: 0.25537109375000006
Layer: 2  Best Value: -0.12353515625
Layer: 3  Best Value: -0.546875

We see that as we increase the circuit depth, our loss function decreases towards -1, so we don't see any barren plateaus.

References

  1. Maria Schuld, Ville Bergholm, Christian Gogolin, Josh Izaac and Nathan Killoran, Evaluating analytic gradients on quantum hardware, Physical Review A 99, 032331 (2019), doi:10.1103/PhysRevA.99.032331, arXiv:1811.11184.

  2. James Stokes, Josh Izaac, Nathan Killoran and Giuseppe Carleo, Quantum Natural Gradient, Quantum 4, 269 (2020), doi:10.22331/q-2020-05-25-269, arXiv:1909.02108.

  3. Julien Gacon, Christa Zoufal, Giuseppe Carleo and Stefan Woerner, Simultaneous Perturbation Stochastic Approximation of the Quantum Fisher Information, arXiv:2103.09232.

  4. Jarrod R. McClean, Sergio Boixo, Vadim N. Smelyanskiy, Ryan Babbush and Hartmut Neven, Barren plateaus in quantum neural network training landscapes, Nature Communications, Volume 9, 4812 (2018), doi:10.1038/s41467-018-07090-4, arXiv:1803.11173.

  5. M. Cerezo, Akira Sone, Tyler Volkoff, Lukasz Cincio and Patrick J. Coles, Cost Function Dependent Barren Plateaus in Shallow Parametrized Quantum Circuits, Nature Communications 12, 1791 (2021), doi:10.1038/s41467-021-21728-w, arXiv:2001.00550.

# pylint: disable=unused-import
import qiskit.tools.jupyter
%qiskit_version_table