
Training Parameterized Quantum Circuits
In this lab session we will take a closer look at how to train circuit-based models. At the end of this session you should know
how to build variational quantum classifiers
how to use different training techniques, especially gradient-based ones
what restrictions variational models have and how we might overcome them
Where do we need to train parameterized circuits?
Our task

Gradients
Qiskit provides different methods to compute gradients of expectation values. Let's explore them!
The parameterized ansatz state is $|\psi(\theta)\rangle = U(\theta)|0\rangle$, where $U(\theta)$ is given by the following circuit
and our Hamiltonian $H$ in this example is
Plugged together, the expectation value is $E(\theta) = \langle\psi(\theta)|H|\psi(\theta)\rangle$:
To make things concrete, let's fix a point $\theta^*$ and an index $j$ and ask: what is the derivative of the expectation value with respect to the parameter $\theta_j$ at the point $\theta^*$?
We'll choose a random point and index (remember we start counting from 0).
Throughout this session we'll use a shot-based simulator with 8192 shots.
Computing expectation values
We'll be using expectation values a lot, so let's recap how that works in Qiskit.
With the qiskit.opflow module we can write and evaluate expectation values. The general structure for an expectation value $\langle\psi|H|\psi\rangle$, where the state $|\psi\rangle$ is prepared by a circuit, is
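A minimal sketch of this structure, using a hypothetical two-qubit example (the Hamiltonian `Z ^ Z`, the `RealAmplitudes` ansatz and the parameter values are illustrative choices, not necessarily the ones used in the lab):

```python
import numpy as np
from qiskit.circuit.library import RealAmplitudes
from qiskit.opflow import StateFn, Z

# hypothetical example: Hamiltonian Z ⊗ Z and a two-qubit ansatz
hamiltonian = Z ^ Z
ansatz = RealAmplitudes(2, reps=1)

# <psi|H|psi> as an opflow expression: a measurement of H composed with the state
expectation = StateFn(hamiltonian, is_measurement=True) @ StateFn(ansatz)

# bind concrete parameter values and evaluate via plain matrix multiplication
values = np.array([0.1, 0.2, 0.3, 0.4])
bound = expectation.bind_parameters(dict(zip(ansatz.parameters, values)))
print(bound.eval().real)
```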
The above code uses plain matrix multiplication to evaluate the expectation value, which is inefficient for large numbers of qubits. Instead, we can use a simulator (or a real quantum device) to evaluate the circuits by using a CircuitSampler in conjunction with an expectation converter like PauliExpectation:
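For instance, continuing the sketch above (`q_instance` is the QuantumInstance mentioned later in the notebook; here is one way it might be defined):

```python
from qiskit import Aer
from qiskit.utils import QuantumInstance
from qiskit.opflow import CircuitSampler, PauliExpectation

# shot-based simulator with 8192 shots, as used throughout this session
q_instance = QuantumInstance(Aer.get_backend('qasm_simulator'), shots=8192)

# rewrite the expectation as Pauli-basis measurements and sample the circuits
in_pauli_basis = PauliExpectation().convert(bound)
sampled = CircuitSampler(q_instance).convert(in_pauli_basis)
print(sampled.eval().real)
```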
Exercise 1: Calculate the expectation value of the following Hamiltonian H and state prepared by the circuit U with plain matrix multiplication.
Exercise 2: Evaluate the expectation value with the QASM simulator by using the QuantumInstance object q_instance defined in the beginning of the notebook.
Finite difference gradients
Arguably the simplest way to approximate gradients is with a finite difference scheme. This works independently of the function's inner, possibly very complex, structure.
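A minimal NumPy sketch of a central finite-difference gradient; here `f` is assumed to be any function (such as an expectation-value evaluator) that maps a parameter vector to a float:

```python
import numpy as np

def finite_difference_gradient(f, point, eps=1e-2):
    """Central finite-difference approximation of the gradient of f at point.

    f     -- callable mapping a parameter vector to a float (e.g. the energy)
    point -- 1D numpy array of parameter values
    eps   -- shift size; on a shot-based simulator a very small eps amplifies noise
    """
    point = np.asarray(point, dtype=float)
    gradient = np.zeros_like(point)
    for j in range(len(point)):
        shift = np.zeros_like(point)
        shift[j] = eps
        gradient[j] = (f(point + shift) - f(point - shift)) / (2 * eps)
    return gradient
```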
Instead of doing this manually, you can use Qiskit's Gradient class for this.
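For example, reusing the `expectation`, `ansatz` and `values` from the sketch above (hypothetical names):

```python
from qiskit.opflow import Gradient

# finite-difference gradient of the (unbound) expectation value
fin_diff_grad = Gradient(grad_method='fin_diff').convert(expectation, params=ansatz.parameters)

# bind a concrete point and evaluate
point = dict(zip(ansatz.parameters, values))
print(np.real(np.array(fin_diff_grad.bind_parameters(point).eval())))
```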
Finite difference gradients can be volatile on noisy functions and using the exact formula for the gradient can be more stable.
Analytic gradients
Luckily there's the parameter shift rule, a simple formula for circuit gradients:
$$\frac{\partial E}{\partial \theta_j} = \frac{E\!\left(\theta + \tfrac{\pi}{2} e_j\right) - E\!\left(\theta - \tfrac{\pi}{2} e_j\right)}{2}$$
(Note: stated here only for Pauli rotations without coefficients; see Evaluating analytic gradients on quantum hardware.)

Based on the same principle, there's the linear combination of unitaries approach, that uses only one circuit but an additional auxiliary qubit.
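Both analytic approaches are available through the same Gradient class; a sketch, assuming the `expectation`, `ansatz` and `point` from above:

```python
param_shift_grad = Gradient(grad_method='param_shift').convert(expectation, params=ansatz.parameters)
lin_comb_grad = Gradient(grad_method='lin_comb').convert(expectation, params=ansatz.parameters)

print(np.real(np.array(param_shift_grad.bind_parameters(point).eval())))
print(np.real(np.array(lin_comb_grad.bind_parameters(point).eval())))
```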
Let's try optimizing!
We fixed an initial point for reproducibility.
Similar to how we have a function to evaluate the expectation value, we'll need a function to evaluate the gradient.
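One possible sketch of such a helper, assuming the `expectation`, `ansatz` and `q_instance` defined earlier (the function name `evaluate_gradient` is illustrative):

```python
from qiskit.opflow import Gradient, PauliExpectation, CircuitSampler

# parameter-shift gradient, sampled on the shot-based backend
grad = Gradient(grad_method='param_shift').convert(expectation, params=ansatz.parameters)
grad_in_pauli_basis = PauliExpectation().convert(grad)
sampler = CircuitSampler(q_instance)

def evaluate_gradient(theta):
    """Evaluate the gradient of the expectation value at the point theta."""
    value_dict = dict(zip(ansatz.parameters, theta))
    result = sampler.convert(grad_in_pauli_basis, params=value_dict).eval()
    return np.real(np.array(result))
```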
To compare the convergence of the optimizers, we can keep track of the loss at each step by using a callback function.
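A minimal loss-tracking callback (a hypothetical helper, used by the optimization loop below):

```python
gd_losses = []

def loss_callback(loss):
    """Record the loss of the current iteration so we can plot the convergence later."""
    gd_losses.append(loss)
```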

And now we start the optimization and plot the loss!
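A sketch of plain vanilla gradient descent with a fixed, hypothetical learning rate; `evaluate_expectation` stands for the expectation-value function defined earlier in the notebook and `initial_point` for the fixed starting point:

```python
import matplotlib.pyplot as plt

eta = 0.1                         # learning rate (illustrative choice)
theta = np.array(initial_point)
for _ in range(100):
    loss_callback(evaluate_expectation(theta))
    theta = theta - eta * evaluate_gradient(theta)

plt.plot(gd_losses)
plt.xlabel('iterations')
plt.ylabel('loss')
plt.show()
```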
Does this always work?

Natural gradients
The natural gradient rescales the ordinary gradient by the inverse of the quantum Fisher information $g(\theta)$, i.e. it uses the update direction $g(\theta)^{-1} \nabla E(\theta)$, which takes the geometry of the parameterized state into account.
With Qiskit, we can evaluate the natural gradient by using the NaturalGradient class instead of the Gradient!
Analogously to the function to compute gradients, we can now write a function to evaluate the natural gradients.
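A sketch of such a helper, assuming the `expectation`, `ansatz` and `initial_point` from above (the 'ridge' regularization is one option to stabilize the inversion of the quantum Fisher information):

```python
from qiskit.opflow import NaturalGradient

nat_grad = NaturalGradient(regularization='ridge').convert(expectation, params=ansatz.parameters)

def evaluate_natural_gradient(theta):
    """Evaluate the natural gradient at theta (here via matrix multiplication)."""
    value_dict = dict(zip(ansatz.parameters, theta))
    return np.real(np.array(nat_grad.bind_parameters(value_dict).eval()))

print('gradient:        ', evaluate_gradient(initial_point))
print('natural gradient:', evaluate_natural_gradient(initial_point))
```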
And as you can see they do indeed differ!
Let's look at how this influences the convergence.
This sounds great! But if we think about how many circuits we need to evaluate, these gradients can become costly!

Simultaneous Perturbation Stochastic Approximation
Idea: avoid the expensive gradient evaluation by sampling from the gradients. Since we don't care about the exact values but only about convergence, unbiased sampling should on average work equally well!
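The core of the idea in a few lines of NumPy (a minimal sketch, where `f` is any function returning the loss for a parameter vector):

```python
import numpy as np

def spsa_gradient_estimate(f, theta, eps=0.1):
    """One SPSA gradient sample: perturb all parameters simultaneously along a
    random +/-1 direction and estimate the gradient from just two evaluations."""
    delta = 2 * np.random.randint(2, size=theta.size) - 1   # random +/-1 vector
    return (f(theta + eps * delta) - f(theta - eps * delta)) / (2 * eps) * delta
```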

This optimization technique is known as SPSA. How does it perform?
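One way to run it with Qiskit's SPSA implementation, assuming the `ansatz`, `evaluate_expectation` and `initial_point` from above (the learning rate and perturbation are illustrative values):

```python
from qiskit.algorithms.optimizers import SPSA

spsa = SPSA(maxiter=100, learning_rate=0.1, perturbation=0.1)
result = spsa.optimize(
    ansatz.num_parameters,
    evaluate_expectation,
    initial_point=initial_point,
)
print(result)  # (optimal parameters, optimal value, number of function evaluations)
```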
And at just a fraction of the cost!
Can we do the same for natural gradients?


It turns out -- yes, we can!
We'll skip the details here, but the idea is to sample not only the gradient but also the quantum Fisher information, and thus the natural gradient.
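If your Qiskit version provides it, this sampled natural-gradient optimizer is available as QNSPSA; a sketch under the same assumptions as the SPSA example above:

```python
from qiskit.algorithms.optimizers import QNSPSA

# QNSPSA additionally samples the quantum Fisher information; it needs a
# fidelity function for the ansatz, evaluated on the same backend
fidelity = QNSPSA.get_fidelity(ansatz, q_instance)
qnspsa = QNSPSA(fidelity, maxiter=100, learning_rate=0.1, perturbation=0.1)
result = qnspsa.optimize(ansatz.num_parameters, evaluate_expectation, initial_point=initial_point)
```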
What are the costs?

Training in practice
Today, evaluating gradients is really expensive, and the improved accuracy is not that valuable since we have noisy readout anyway. Therefore, in practice, people often resort to SPSA. To improve convergence, however, we do not use a constant learning rate but an exponentially decreasing one.
Qiskit will try to automatically calibrate the learning rate to the model if you don't specify the learning rate.
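In practice this can be as simple as the following sketch (same hypothetical helpers as before):

```python
from qiskit.algorithms.optimizers import SPSA

# if learning_rate and perturbation are omitted, SPSA calibrates them
# automatically at the initial point before the optimization starts
spsa = SPSA(maxiter=300)
result = spsa.optimize(ansatz.num_parameters, evaluate_expectation, initial_point=initial_point)
```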
Training a new loss function.
In this exercise we'll train with a different loss function than before: a transverse-field Ising Hamiltonian on a linear chain of 3 spins
or, spelling out all operations
Exercise 3: Define the Hamiltonian with Opflow's operators. Make sure to add brackets around the tensors, e.g. (X ^ X) + (Z ^ I) for $X \otimes X + Z \otimes I$.
Exercise 4: We'll use the EfficientSU2 variational form as ansatz, since now we care about complex amplitudes. Use the evaluate_tfi function, which evaluates the energy of the transverse-field Ising Hamiltonian, to find the optimal parameters with SPSA.
In this cell, use the SPSA optimizer to find the minimum.
Hint: Use the autocalibration (by not specifying the learning rate and perturbation) and 300 iterations to get close enough to the minimum.
How can you use it in VQC/QNNs?
Generating data
We're using an artificial dataset provided by Qiskit's ML package. There are other datasets available, like Iris or Wine, but we'll keep it simple and 2D so we can plot the data.
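One possible choice is the ad_hoc_data dataset from qiskit-machine-learning (the sizes and gap below are illustrative):

```python
from qiskit_machine_learning.datasets import ad_hoc_data

# 2D artificial dataset with 0/1 labels
train_features, train_labels, test_features, test_labels = ad_hoc_data(
    training_size=20, test_size=5, n=2, gap=0.3, one_hot=False
)
```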
Building a variational quantum classifier
Let's get the constituents we need:
a Feature Map to encode the data
an Ansatz to train
an Observable to evaluate
We'll use a standard feature map from the circuit library.
We don't care about complex amplitudes so let's use an ansatz with only real amplitudes.
Put together, our circuit is
And we'll use global operators for the expectation values.
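A sketch of these constituents; the specific feature map, ansatz and observable are common choices and may differ from the lab's exact ones:

```python
from qiskit.circuit.library import ZZFeatureMap, RealAmplitudes
from qiskit.opflow import Z

feature_map = ZZFeatureMap(2)        # standard feature map from the circuit library
ansatz = RealAmplitudes(2, reps=1)   # ansatz with real amplitudes only

# full circuit: encode the data, then apply the trainable ansatz
circuit = feature_map.compose(ansatz)

# global observable for the expectation value
observable = Z ^ Z
```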
Classifying our data
Now that we understand all the parts, it's time to classify the data. We're using Qiskit's ML package for that and its OpflowQNN class to describe the circuit and expectation value, along with a NeuralNetworkClassifier for the training.
First, we'll do vanilla gradient descent.
Now we can define the classifier, which takes the QNN, the loss and the optimizer.
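A sketch of how this might look, reusing the `feature_map`, `ansatz`, `circuit`, `observable` and `q_instance` from above (hypothetical setup):

```python
from qiskit.opflow import StateFn, Gradient
from qiskit.algorithms.optimizers import GradientDescent
from qiskit_machine_learning.neural_networks import OpflowQNN
from qiskit_machine_learning.algorithms.classifiers import NeuralNetworkClassifier

# expectation value of the observable on the data-plus-ansatz circuit
expectation = StateFn(observable, is_measurement=True) @ StateFn(circuit)

qnn = OpflowQNN(
    expectation,
    input_params=list(feature_map.parameters),   # data-encoding parameters
    weight_params=list(ansatz.parameters),       # trainable parameters
    gradient=Gradient(),                         # plain (analytic) gradient
    quantum_instance=q_instance,
)

classifier = NeuralNetworkClassifier(
    qnn,
    optimizer=GradientDescent(maxiter=100, learning_rate=0.01),
)
```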
... and train!
To predict the new labels for the test features we can use the predict method.
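For example (assuming the 0/1-labelled dataset from above; mapping the labels to {-1, +1} is one common convention so they match the range of the Z ⊗ Z expectation value):

```python
classifier.fit(train_features, 2 * train_labels - 1)        # train

predictions = classifier.predict(test_features)             # predict labels for the test set
print(classifier.score(test_features, 2 * test_labels - 1)) # classification accuracy
```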
And since we know how to use natural gradients, we can now train the QNN with natural gradient descent by replacing Gradient with NaturalGradient.
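For instance, under the same assumptions as the previous sketch:

```python
from qiskit.opflow import NaturalGradient

natural_qnn = OpflowQNN(
    expectation,
    input_params=list(feature_map.parameters),
    weight_params=list(ansatz.parameters),
    gradient=NaturalGradient(regularization='ridge'),   # natural instead of plain gradient
    quantum_instance=q_instance,
)
```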
Limits in training circuits
Nothing in life is for free... that includes sufficiently large gradients for all models!
We've seen that training with gradients works well on the small models we tested. But can we expect the same if we increase the number of qubits? To investigate that we measure the variance of the gradients for different model sizes. The idea is simple: if the variance is really small, we don't have enough information to update our parameters.
Exponentially vanishing gradients (Barren plateaus)
Let's pick our favorite example from the gradients and see what happens if we increase the number of qubits and layers.
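A sketch of such an experiment; the EfficientSU2 ansatz, the global all-Z observable and the gradient of only the first parameter are illustrative choices:

```python
import numpy as np
from qiskit.circuit.library import EfficientSU2
from qiskit.opflow import Z, StateFn, Gradient

def sample_gradients(num_qubits, reps, num_samples=100):
    """Sample the gradient of the first parameter at random points in parameter space."""
    # global observable Z ⊗ Z ⊗ ... ⊗ Z (a hypothetical choice)
    observable = Z
    for _ in range(num_qubits - 1):
        observable = observable ^ Z

    ansatz = EfficientSU2(num_qubits, reps=reps)
    expectation = StateFn(observable, is_measurement=True) @ StateFn(ansatz)
    grad = Gradient().convert(expectation, params=ansatz.parameters[0])

    values = []
    for _ in range(num_samples):
        point = np.random.uniform(0, 2 * np.pi, ansatz.num_parameters)
        bound = grad.bind_parameters(dict(zip(ansatz.parameters, point)))
        values.append(np.real(bound.eval()))
    return np.array(values)
```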

Let's plot from 2 to 12 qubits.
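For example (here the number of layers grows with the number of qubits, which is an assumption; on a log scale an exponential decay shows up as a straight line):

```python
import numpy as np
import matplotlib.pyplot as plt

num_qubits = list(range(2, 13))
variances = [np.var(sample_gradients(n, reps=n)) for n in num_qubits]

plt.semilogy(num_qubits, variances, 'o-')
plt.xlabel('number of qubits')
plt.ylabel('variance of the gradient')
plt.show()
```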
Oh no! The variance decreases exponentially! This means our gradients contain less and less information and we'll have a hard time training the model. This is known as the Barren plateau problem or exponentially vanishing gradients.
Exploratory Exercise: Do natural gradients suffer from vanishing gradients?
Repeat the Barren plateau plot for natural gradients instead of standard gradients. For this, write a new function sample_natural_gradients that computes the natural gradient instead of the gradient.
Now we'll repeat the experiment to check the variance of the gradients (this cell takes some time to run!)
and plot the results. What do you observe?
What about shorter circuits?
Is there something we can do about these Barren plateaus? It's a hot topic in current research and there are some proposals to mitigate Barren plateaus.
Let's have a look at how global and local cost functions and the depth of the ansatz influence Barren plateaus.
And what if we use local operators?
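To make the distinction concrete, a small sketch of what "global" and "local" observables mean here (the size is an arbitrary example):

```python
from qiskit.opflow import I, Z

num_qubits = 6   # hypothetical size

# global observable: Z on every qubit
global_op = Z
for _ in range(num_qubits - 1):
    global_op = global_op ^ Z

# local observable: Z only on the first qubit, identity elsewhere
local_op = Z
for _ in range(num_qubits - 1):
    local_op = local_op ^ I
```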
Layerwise training

Useful feature: Qiskit's ansatz circuits are mutable!
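For instance, the number of repetitions of an ansatz from the circuit library can be changed in place, which is handy for layerwise training:

```python
from qiskit.circuit.library import EfficientSU2

ansatz = EfficientSU2(3, reps=1)
print(ansatz.num_parameters)   # parameters for one repetition

# grow the circuit layer by layer without rebuilding it from scratch
ansatz.reps = 2
print(ansatz.num_parameters)   # more parameters after adding a layer
```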
Summary
