Path: blob/main/notebooks/quantum-machine-learning/training.ipynb
Training parameterized quantum circuits
In this section we will have a closer look at how to train circuit-based models using gradient-based methods. We'll look at the restrictions these models have, and how we might overcome them.
Introduction
Like classical models, we can train parameterized quantum circuit models to perform data-driven tasks. The task of learning an arbitrary function from data is mathematically expressed as the minimization of a cost or loss function $f(\vec\theta)$, also known as the objective function, with respect to the parameter vector $\vec\theta$. Generally, when training a parameterized quantum circuit model, the function we are trying to minimise is the expectation value
$$f(\vec\theta) = \langle\psi(\vec\theta)|\hat H|\psi(\vec\theta)\rangle.$$
There are many different types of algorithms that we can use to optimise the parameters of a variational circuit, $U(\vec\theta)$ (gradient-based, evolutionary, and gradient-free methods). In this course, we will be discussing gradient-based methods.
Gradients
Say we have a function $f(\vec\theta)$, and we have access to the gradient of the function, $\nabla f(\vec\theta)$, starting from an initial point. The simplest way to minimize the function is to update the parameters towards the direction of steepest descent of the function:
$$\vec\theta_{n+1} = \vec\theta_n - \eta\,\nabla f(\vec\theta_n),$$
where $\eta$ is the learning rate - a small, positive hyperparameter controlling the size of the update. We continue doing this until we converge to a local minimum of the function, $f(\vec\theta^*)$.
This technique is called gradient descent, or vanilla gradient descent, since it uses the plain gradient without any modification.
Qiskit provides different methods to compute gradients of expectation values. Let's explore them!
First, we need to define our parameterized state, $|\psi(\vec\theta)\rangle = U(\vec\theta)|0\rangle$. On this page, $U(\vec\theta)$ is the Qiskit RealAmplitudes circuit on two qubits:
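A minimal sketch of this ansatz (using one repetition, reps=1, which is an assumed choice):

```python
from qiskit.circuit.library import RealAmplitudes

# U(theta): the RealAmplitudes ansatz on two qubits (reps=1 is an assumption)
ansatz = RealAmplitudes(num_qubits=2, reps=1)
ansatz.decompose().draw()
```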
Next we need to define a Hamiltonian, $\hat H$. Let's use a simple two-qubit observable:
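As a concrete (assumed) example, we take $\hat H = Z \otimes Z$, written with the legacy qiskit.opflow operators that the code sketches on this page are based on:

```python
from qiskit.opflow import Z

# assumed example Hamiltonian: H = Z tensor Z
hamiltonian = Z ^ Z
```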
Putting them together, we get the expectation value $\langle\psi(\vec\theta)|\hat H|\psi(\vec\theta)\rangle$.
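In opflow notation, a sketch of this composition is the measurement StateFn of the Hamiltonian composed with the circuit state:

```python
from qiskit.opflow import StateFn

# <psi(theta)| H |psi(theta)> as an unbound opflow expression
expectation = StateFn(hamiltonian, is_measurement=True) @ StateFn(ansatz)
```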
Next we write a function to simulate the measurement of the expectation value:
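A hedged sketch using a shot-based simulator via the legacy opflow CircuitSampler (1000 shots is an arbitrary choice; newer Qiskit versions would use the Estimator primitive instead):

```python
import numpy as np
from qiskit import Aer
from qiskit.utils import QuantumInstance
from qiskit.opflow import PauliExpectation, CircuitSampler

# shot-based simulation of the expectation value
sampler = CircuitSampler(QuantumInstance(Aer.get_backend('qasm_simulator'), shots=1000))

def evaluate_expectation(theta):
    """Sample <psi(theta)|H|psi(theta)> on a shot-based simulator."""
    value_dict = dict(zip(ansatz.parameters, theta))
    return np.real(sampler.convert(PauliExpectation().convert(expectation),
                                   params=value_dict).eval())
```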
To make things concrete, let's fix a point $\vec\theta$ and an index $j$ and ask: what's the derivative of the expectation value with respect to the parameter $\theta_j$ at the point $\vec\theta$?
$$\frac{\partial}{\partial\theta_j}\langle\psi(\vec\theta)|\hat H|\psi(\vec\theta)\rangle$$
We'll choose a random point and index (remember we start counting from 0).
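For example (the seed, and hence the resulting point and index, are arbitrary):

```python
np.random.seed(42)

# a random point in parameter space and a random parameter index
point = np.random.uniform(0, 2 * np.pi, ansatz.num_parameters)
index = np.random.randint(ansatz.num_parameters)
print('point:', point, 'index:', index)
```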
Finite difference gradients
Arguably the simplest way to approximate gradients is with a finite difference scheme. This works independently of the function's inner, possibly very complex, structure.
If we are interested in estimating the gradient at $\vec\theta$, we can choose some small distance $\epsilon$, calculate $f(\vec\theta + \epsilon\vec e_j)$ and $f(\vec\theta - \epsilon\vec e_j)$, and take the difference between the two function values, divided by the distance between the two points ($2\epsilon$):
$$\frac{\partial f}{\partial\theta_j} \approx \frac{f(\vec\theta + \epsilon\vec e_j) - f(\vec\theta - \epsilon\vec e_j)}{2\epsilon},$$
where $\vec e_j$ is the unit vector in the direction of parameter $j$.
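A minimal sketch of this central finite-difference estimate, assuming a shift of $\epsilon = 0.2$:

```python
eps = 0.2                                        # assumed shift
e_i = np.identity(ansatz.num_parameters)[index]  # unit vector along theta_index

# f(theta + eps*e_i) minus f(theta - eps*e_i), divided by the distance 2*eps
finite_difference = (evaluate_expectation(point + eps * e_i)
                     - evaluate_expectation(point - eps * e_i)) / (2 * eps)
print(finite_difference)
```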
Instead of doing this manually, we can use Qiskit's Gradient class for this.
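A sketch using the legacy opflow Gradient converter with the finite-difference method (evaluated exactly here, rather than with shots):

```python
from qiskit.opflow import Gradient

# finite-difference gradient with respect to theta_index
fin_diff = Gradient(grad_method='fin_diff', epsilon=eps).convert(expectation,
                                                                 params=ansatz.parameters[index])

value_dict = dict(zip(ansatz.parameters, point))
print(np.real(fin_diff.bind_parameters(value_dict).eval()))
```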
Finite difference gradients can be volatile on noisy functions, and using the exact formula for the gradient can be more stable. This can be seen above: although these two calculations use the same formula, they yield different results due to shot noise. In the example image below, the "Noisy finite difference gradient" actually points in the opposite direction of the true gradient!
Analytic gradients
Analytic gradients evaluate the analytic formula for the gradient. In general this is fairly difficult, as we have to do a manual calculation, but for circuit-based gradients, Reference 1 introduces a nice theoretical result that gives an easy formula for calculating gradients: the parameter shift rule.
For a simple circuit consisting only of Pauli rotations, without any coefficients, this rule says that the analytic gradient is
$$\frac{\partial f}{\partial\theta_j} = \frac{f(\vec\theta + \frac{\pi}{2}\vec e_j) - f(\vec\theta - \frac{\pi}{2}\vec e_j)}{2},$$
which is very similar to the equation for finite difference gradients.
Let's try calculating it by hand:
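Reusing the same helper as before, but now shifting by $\pm\pi/2$:

```python
# parameter-shift rule: shift theta_index by +/- pi/2 and halve the difference
param_shift = (evaluate_expectation(point + np.pi / 2 * e_i)
               - evaluate_expectation(point - np.pi / 2 * e_i)) / 2
print(param_shift)
```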
And using the Qiskit Gradient class:
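A sketch with the opflow Gradient class, which uses the parameter-shift rule by default:

```python
# analytic (parameter-shift) gradient with respect to theta_index
analytic = Gradient(grad_method='param_shift').convert(expectation,
                                                       params=ansatz.parameters[index])
print(np.real(analytic.bind_parameters(value_dict).eval()))
```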
We see that the calculated analytic gradient is fairly similar to the calculated finite difference gradient.
Now that we know how to calculate gradients, let's try optimizing the expectation value!
First we fix an initial point for reproducibility.
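For instance (the seed, and hence the values, are arbitrary):

```python
# fixed initial point so that every optimizer starts from the same place
np.random.seed(11)
initial_point = np.random.uniform(0, 2 * np.pi, ansatz.num_parameters)
```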
Similar to how we had a function to evaluate the expectation value, we'll need a function to evaluate the gradient.
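A sketch of such a function, again using the opflow parameter-shift Gradient with exact evaluation:

```python
# gradient of the expectation value with respect to all parameters
gradient = Gradient().convert(expectation, params=list(ansatz.parameters))

def evaluate_gradient(theta):
    """Return the gradient vector at the parameter vector theta."""
    value_dict = dict(zip(ansatz.parameters, theta))
    return np.real(gradient.bind_parameters(value_dict).eval())
```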
To compare the convergence of the optimizers, we can keep track of the loss at each step by using a callback function.
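A sketch with the GradientDescent optimizer from qiskit.algorithms.optimizers; the learning rate and iteration count are assumptions:

```python
from qiskit.algorithms.optimizers import GradientDescent

gd_loss = []

def gd_callback(nfev, x, fx, stepsize):
    """Record the loss after every iteration."""
    gd_loss.append(fx)

gd = GradientDescent(maxiter=300, learning_rate=0.01, callback=gd_callback)
```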
And now we start the optimization and plot the loss!
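Running the optimization and plotting the recorded losses (minimize is the interface of recent Qiskit versions; older versions use optimize instead):

```python
import matplotlib.pyplot as plt

gd.minimize(fun=evaluate_expectation, x0=initial_point, jac=evaluate_gradient)

plt.plot(gd_loss, label='vanilla gradient descent')
plt.xlabel('iterations')
plt.ylabel(r'$\langle \hat{H} \rangle$')
plt.legend()
plt.show()
```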
Natural gradients
We see in the above example that we are able to find the minimum of the function using gradient descent. However, gradient descent is not always the best strategy.

For example, if we look at the diagram on the left, given an initial point on the edge of the loss landscape and a learning rate $\eta$, we can approach the minimum in the centre.
However, looking at the diagram on the right, where the loss landscape has been squashed in one parameter direction, we see that using the same initial point and learning rate, we can't find the minimum. This is because we're incorrectly assuming the loss landscape varies at the same rate with respect to each parameter. Both models show the same Euclidean distance between $\vec\theta_n$ and $\vec\theta_{n+1}$, but this is insufficient for Model B because this metric fails to capture the relative sensitivities.
The idea of natural gradients is to change the way we determine $\vec\theta_{n+1}$ from $\vec\theta_n$ by considering the sensitivity of the model. In vanilla gradients, we used the Euclidean distance between them, $d = \lVert\vec\theta_{n+1} - \vec\theta_n\rVert$, but we saw that this doesn't take the loss landscape into account. With natural gradients, we instead use a distance that depends on our model: the distance between the states $|\psi(\vec\theta_n)\rangle$ and $|\psi(\vec\theta_{n+1})\rangle$ rather than between the bare parameter vectors.

This metric is called the Quantum Fisher Information, $g(\vec\theta)$, and allows us to transform the steepest descent in the Euclidean parameter space into the steepest descent in the model space. This is called the Quantum Natural Gradient, introduced in Reference 2, where the update rule becomes
$$\vec\theta_{n+1} = \vec\theta_n - \eta\, g^{-1}(\vec\theta_n)\nabla f(\vec\theta_n).$$

We can evaluate the natural gradient in Qiskit using the NaturalGradient instead of the Gradient.
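A sketch with the legacy opflow NaturalGradient converter; the ridge regularization is an assumption, used to stabilize inverting the QFI:

```python
from qiskit.opflow import NaturalGradient

natural_gradient = NaturalGradient(regularization='ridge').convert(expectation,
                                                                   params=list(ansatz.parameters))
```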
Analogous to the function that computes gradients, we can now write a function that evaluates the natural gradients.
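Mirroring evaluate_gradient, and printing both at the random point from earlier:

```python
def evaluate_natural_gradient(theta):
    """Return the natural gradient vector at the parameter vector theta."""
    value_dict = dict(zip(ansatz.parameters, theta))
    return np.real(natural_gradient.bind_parameters(value_dict).eval())

print('vanilla gradient:', evaluate_gradient(point))
print('natural gradient:', evaluate_natural_gradient(point))
```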
And as you can see they do indeed differ!
Let's look at how this influences the convergence.
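A sketch that reuses GradientDescent but passes the natural gradient as the Jacobian, then compares the two loss histories:

```python
qng_loss = []

def qng_callback(nfev, x, fx, stepsize):
    qng_loss.append(fx)

qng = GradientDescent(maxiter=300, learning_rate=0.01, callback=qng_callback)
qng.minimize(fun=evaluate_expectation, x0=initial_point, jac=evaluate_natural_gradient)

plt.plot(gd_loss, label='vanilla gradient descent')
plt.plot(qng_loss, label='quantum natural gradient')
plt.xlabel('iterations')
plt.ylabel(r'$\langle \hat{H} \rangle$')
plt.legend()
plt.show()
```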
This looks great! We can see that the quantum natural gradient approaches the target faster than vanilla gradient descent. However, this comes at the cost of needing to evaluate many more quantum circuits.
Simultaneous Perturbation Stochastic Approximation
Looking at the gradient of our function as a vector: if we want to evaluate $\nabla f$, we need to calculate the partial derivative of $f$ with respect to each parameter, meaning the number of function evaluations needed to calculate the gradient grows linearly with the number of parameters $d$.
Simultaneous Perturbation Stochastic Approximation (SPSA) is an optimization technique where we randomly sample from the gradient, to reduce the number of evaluations. Since we don't care about the exact values but only about convergence, an unbiased sampling should on average work equally well.
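A minimal NumPy sketch of a single SPSA gradient estimate: all parameters are perturbed simultaneously along a random $\pm 1$ direction, so only two function evaluations are needed regardless of the number of parameters (the perturbation size eps is an assumption):

```python
def spsa_gradient_estimate(f, theta, eps=0.01):
    """One SPSA estimate of the gradient of f at theta."""
    delta = np.random.choice([-1, 1], size=theta.size)  # random perturbation direction
    return (f(theta + eps * delta) - f(theta - eps * delta)) / (2 * eps) * delta

print(spsa_gradient_estimate(evaluate_expectation, point))
```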
In practice, while the exact gradient follows a smooth path to the minimum, SPSA will jump around due to the random sampling, but it will converge, given the same boundary conditions as the gradient.
And how does it perform? We use the SPSA algorithm in Qiskit.
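A sketch with qiskit.algorithms.optimizers.SPSA; the fixed learning rate and perturbation mirror the gradient descent settings above:

```python
from qiskit.algorithms.optimizers import SPSA

spsa_loss = []

def spsa_callback(nfev, x, fx, stepsize, accepted):
    spsa_loss.append(fx)

spsa = SPSA(maxiter=300, learning_rate=0.01, perturbation=0.01, callback=spsa_callback)
spsa.minimize(evaluate_expectation, x0=initial_point)

plt.plot(gd_loss, label='vanilla gradient descent')
plt.plot(spsa_loss, label='SPSA')
plt.xlabel('iterations')
plt.ylabel(r'$\langle \hat{H} \rangle$')
plt.legend()
plt.show()
```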
We can see that SPSA basically follows the gradient descent curve, and at a fraction of the cost!
We can do the same for natural gradients as well, as described in Reference 3. We'll skip the details here, but the idea is to sample not only from the gradient, but to extend this to the quantum Fisher information and thus to the natural gradient.
Qiskit implements this as the QNSPSA algorithm. Let's compare its performance:
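A hedged sketch: QNSPSA additionally needs a fidelity callable, which (in the Qiskit versions this page targets) can be built with QNSPSA.get_fidelity; the API details may differ between versions:

```python
from qiskit.algorithms.optimizers import QNSPSA

fidelity = QNSPSA.get_fidelity(ansatz)  # statevector-based fidelity of the ansatz

qnspsa_loss = []

def qnspsa_callback(nfev, x, fx, stepsize, accepted):
    qnspsa_loss.append(fx)

qnspsa = QNSPSA(fidelity, maxiter=300, learning_rate=0.01, perturbation=0.01,
                callback=qnspsa_callback)
qnspsa.minimize(evaluate_expectation, x0=initial_point)

plt.plot(qng_loss, label='quantum natural gradient')
plt.plot(qnspsa_loss, label='QNSPSA')
plt.xlabel('iterations')
plt.ylabel(r'$\langle \hat{H} \rangle$')
plt.legend()
plt.show()
```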
We can see that QNSPSA somewhat follows the natural gradient descent curve.
The vanilla and natural gradient costs are linear and quadratic in the number of parameters, respectively, while the costs for SPSA and QNSPSA are constant, i.e. independent of the number of parameters. There is a small offset between the costs for SPSA and QNSPSA, as more evaluations are required to approximate the natural gradient.
Training in practice
In this era of near-term quantum computing, circuit evaluations are expensive, and readouts are not perfect due to the noisy nature of the devices. Therefore, in practice, people often resort to using SPSA. To improve convergence, we don't use a constant learning rate but an exponentially decreasing one. The diagram below compares the typical convergence for a constant learning rate (dotted lines) with that for an exponentially decreasing one (solid lines). We see that the convergence for a constant learning rate is a smooth, decreasing line, while the convergence for an exponentially decreasing one is steeper and more staggered. This works well if you know what your loss function looks like.
Qiskit will try to automatically calibrate the learning rate to the model if you don't specify the learning rate.
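A sketch: if neither learning_rate nor perturbation is given, SPSA runs its calibration routine and chooses a decreasing learning rate automatically:

```python
auto_spsa_loss = []

def auto_spsa_callback(nfev, x, fx, stepsize, accepted):
    auto_spsa_loss.append(fx)

# no learning_rate/perturbation given: SPSA calibrates them automatically
auto_spsa = SPSA(maxiter=300, callback=auto_spsa_callback)
auto_spsa.minimize(evaluate_expectation, x0=initial_point)

plt.plot(gd_loss, label='vanilla gradient descent')
plt.plot(qng_loss, label='quantum natural gradient')
plt.plot(spsa_loss, label='SPSA')
plt.plot(auto_spsa_loss, label='calibrated SPSA')
plt.xlabel('iterations')
plt.ylabel(r'$\langle \hat{H} \rangle$')
plt.legend()
plt.show()
```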
We see here that it works the best of all the methods for this small model. For larger models, the convergence will probably be more like the natural gradient.
Limitations
We've seen that training with gradients works well on the small example model. But can we expect the same if we increase the number of qubits? To investigate that, we measure the variance of the gradients for different model sizes. The idea is simple: if the variance is really small, we don't have enough information to update our parameters.
Exponentially vanishing gradients (barren plateaus)
Let's pick a standard parameterized quantum circuit (RealAmplitudes) and see what happens to the gradient as we increase the number of qubits and layers (that is, as we increase the width and depth of the circuit).
Let's plot the variance of the gradient for 2 to 12 qubits.
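A hedged sketch of this experiment: we sample the gradient of a single parameter at random points and compute the variance. The choice of observable (a global $Z^{\otimes n}$ versus a local $Z \otimes Z$ on two qubits), the linear entanglement, and 100 samples per size are all assumptions; the local option is used further below.

```python
from qiskit.opflow import I, StateFn, Gradient

def sample_gradients(num_qubits, reps, local=False, num_samples=100):
    """Sample d<H>/d(theta_0) of a RealAmplitudes circuit at random parameter points.

    local=False uses the global observable Z⊗Z⊗...⊗Z on all qubits,
    local=True a Z⊗Z observable acting on only two of the qubits.
    """
    ansatz = RealAmplitudes(num_qubits, reps=reps, entanglement='linear')
    if local:
        observable = (Z ^ Z) ^ (I ^ (num_qubits - 2)) if num_qubits > 2 else Z ^ Z
    else:
        observable = Z ^ num_qubits
    expectation = StateFn(observable, is_measurement=True) @ StateFn(ansatz)
    grad = Gradient().convert(expectation, params=ansatz.parameters[0])

    samples = []
    for _ in range(num_samples):
        values = np.random.uniform(0, 2 * np.pi, ansatz.num_parameters)
        samples.append(np.real(grad.bind_parameters(dict(zip(ansatz.parameters, values))).eval()))
    return samples

# linear depth: the number of repetitions grows with the number of qubits
qubit_range = range(2, 13)
variances = [np.var(sample_gradients(n, reps=n)) for n in qubit_range]

plt.semilogy(qubit_range, variances, 'o-')
plt.xlabel('number of qubits')
plt.ylabel('variance of the gradient')
plt.show()
```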
Oh no! The variance decreases exponentially! This means our gradients contain less and less information and we'll have a hard time training the model. This is known as the "barren plateau problem", or "exponentially vanishing gradients", discussed in detail in References 4 and 5.
Try it
Do natural gradients have barren plateaus? Try creating the above barren plateau plot for natural gradients instead of vanilla gradients in IBM Quantum Lab. You will need to write a new function, sample_natural_gradients, that computes the natural gradient instead of the gradient.
Is there something we can do about these barren plateaus? It's a hot topic in current research and there are some proposals to mitigate barren plateaus.
Let's have a look at how global and local cost functions, and the depth of the ansatz, influence barren plateaus. First, we'll look at short-depth, single-layer circuits with global operators.
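Reusing sample_gradients from above, with a single repetition but the global observable:

```python
# constant depth (reps=1), global observable
variances_global_shallow = [np.var(sample_gradients(n, reps=1, local=False)) for n in qubit_range]

plt.semilogy(qubit_range, variances_global_shallow, 'o-')
plt.xlabel('number of qubits')
plt.ylabel('variance of the gradient')
plt.show()
```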
We see that short depth, single layer circuits with global operators still give us barren plateaus.
What if we use local operators, but keep the linear depth?
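Again a sketch, keeping the linear depth but switching to the local observable:

```python
# linear depth (reps grows with the number of qubits), local observable
variances_local_linear = [np.var(sample_gradients(n, reps=n, local=True)) for n in qubit_range]

plt.semilogy(qubit_range, variances_local_linear, 'o-')
plt.xlabel('number of qubits')
plt.ylabel('variance of the gradient')
plt.show()
```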
We see that linear-depth circuits with local operators still give us barren plateaus.
How about short depth, single layer, circuits with local operators?
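And finally the combination of constant depth and a local observable:

```python
# constant depth (reps=1), local observable
variances_local_shallow = [np.var(sample_gradients(n, reps=1, local=True)) for n in qubit_range]

plt.semilogy(qubit_range, variances_local_shallow, 'o-')
plt.xlabel('number of qubits')
plt.ylabel('variance of the gradient')
plt.show()
```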
We see that the variance of the gradients for local-operator, constant-depth circuits doesn't vanish, that is, we don't get barren plateaus. However, these circuits are usually easy to simulate, and hence these models won't provide any advantage over classical models.
Quick quiz
The variance of the gradient does not vanish for which types of circuit?
global operator, linear depth
global operator, constant depth
local operator, linear depth
local operator, constant depth
This is the inspiration for layer-wise training, where we start with a basic circuit that may not provide any quantum advantage, with one layer of rotations using local operators. We optimize and fix these parameters, then in the next step, we add a second layer of rotations using local operators, and optimize and fix those, and continue for however many layers we want. This potentially avoids barren plateaus as each optimization step is only using constant depth circuits with local operators.
We can implement this in Qiskit in the following way:
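A hedged sketch of layer-wise training; the circuit size, number of layers, local observable, and SPSA settings below are all assumptions for illustration, not a fixed recipe:

```python
num_qubits = 6
observable = (Z ^ Z) ^ (I ^ (num_qubits - 2))  # local observable on two of the qubits
max_layers = 4

optimized_parameters = np.array([])            # parameters fixed so far
layer_losses = []

for layer in range(max_layers):
    # the full circuit with layer + 1 repetitions; earlier layers keep their trained values
    ansatz_l = RealAmplitudes(num_qubits, reps=layer + 1, entanglement='linear')
    num_new = ansatz_l.num_parameters - optimized_parameters.size

    def loss(new_parameters):
        # bind the already-optimized parameters plus the new layer's parameters
        values = np.concatenate([optimized_parameters, new_parameters])
        bound = StateFn(observable, is_measurement=True) @ StateFn(ansatz_l.bind_parameters(values))
        return np.real(bound.eval())

    result = SPSA(maxiter=50).minimize(loss, x0=np.random.uniform(0, 2 * np.pi, num_new))
    optimized_parameters = np.concatenate([optimized_parameters, result.x])
    layer_losses.append(result.fun)

plt.plot(range(1, max_layers + 1), layer_losses, 'o-')
plt.xlabel('number of layers')
plt.ylabel(r'$\langle \hat{H} \rangle$ after optimizing the newest layer')
plt.show()
```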
We see that as we increase the circuit depth, our loss function decreases towards -1, so we don't see any barren plateaus.
References
1. Maria Schuld, Ville Bergholm, Christian Gogolin, Josh Izaac and Nathan Killoran, Evaluating analytic gradients on quantum hardware, Physical Review A 99, 032331 (2019), doi:10.1103/PhysRevA.99.032331, arXiv:1811.11184.
2. James Stokes, Josh Izaac, Nathan Killoran and Giuseppe Carleo, Quantum Natural Gradient, Quantum 4, 269 (2020), doi:10.22331/q-2020-05-25-269, arXiv:1909.02108.
3. Julien Gacon, Christa Zoufal, Giuseppe Carleo and Stefan Woerner, Simultaneous Perturbation Stochastic Approximation of the Quantum Fisher Information, arXiv:2103.09232.
4. Jarrod R. McClean, Sergio Boixo, Vadim N. Smelyanskiy, Ryan Babbush and Hartmut Neven, Barren plateaus in quantum neural network training landscapes, Nature Communications 9, 4812 (2018), doi:10.1038/s41467-018-07090-4, arXiv:1803.11173.
5. M. Cerezo, Akira Sone, Tyler Volkoff, Lukasz Cincio and Patrick J. Coles, Cost Function Dependent Barren Plateaus in Shallow Parametrized Quantum Circuits, Nature Communications 12, 1791 (2021), doi:10.1038/s41467-021-21728-w, arXiv:2001.00550.