quantum-kittens
GitHub Repository: quantum-kittens/platypus
Path: blob/main/notebooks/quantum-machine-learning/encoding.ipynb
Kernel: Python 3

Data encoding

On this page, we introduce the problem of data encoding for quantum machine learning, then describe and implement various data encoding methods.

Introduction

Data representation is crucial for the success of machine learning models. For classical machine learning, the problem is how to represent the data numerically, so that it can be best processed by a classical machine learning algorithm.

For quantum machine learning, this question is similar, but more fundamental: how to represent and efficiently input the data into a quantum system, so that it can be processed by a quantum machine learning algorithm. This is usually referred to as data encoding, but is also called data embedding or loading.

This process is a critical part of quantum machine learning algorithms and directly affects their computational power.

Methods

Let's consider a classical dataset $\mathscr{X}$ consisting of $M$ samples, each with $N$ features:

$$\mathscr{X} = \{x^{(1)}, \dots, x^{(m)}, \dots, x^{(M)}\}$$

where $x^{(m)}$ is an $N$-dimensional vector for $m = 1, \dots, M$. To represent this dataset in a qubit system, we can use various embedding techniques, some of which are briefly explained and implemented below, as per References 1 and 2.

Basis encoding

Basis encoding associates a classical $N$-bit string with a computational basis state of an $N$-qubit system. For example, if $x = 5$, this can be represented as the $4$-bit string $0101$, and by a $4$-qubit system as the quantum state $|0101\rangle$. More generally, for an $N$-bit string $x = (b_1, b_2, \dots, b_N)$, the corresponding $N$-qubit state is $|x\rangle = |b_1, b_2, \dots, b_N\rangle$ with $b_n \in \{0, 1\}$ for $n = 1, \dots, N$.
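The mapping from an integer to a computational basis state can be sketched in plain Python. The helper names `to_bitstring` and `basis_state` are hypothetical, and we assume the leftmost bit is the most significant one, which is a choice of this sketch rather than something the text fixes.

```python
def to_bitstring(x, num_bits):
    """Return the num_bits-wide binary representation of a non-negative integer."""
    if x < 0 or x >= 2 ** num_bits:
        raise ValueError("x does not fit in the requested number of bits")
    return format(x, f"0{num_bits}b")


def basis_state(bitstring):
    """Return the state vector of |bitstring> as a list with a single 1."""
    state = [0.0] * (2 ** len(bitstring))
    # Assumption of this sketch: the leftmost bit is the most significant
    state[int(bitstring, 2)] = 1.0
    return state


bits = to_bitstring(5, 4)
state = basis_state(bits)
print(bits)              # 0101
print(state.index(1.0))  # 5
```

The state vector has $2^4 = 16$ entries, of which only one is non-zero.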

For the classical dataset $\mathscr{X}$ described above, to use basis encoding, each data point must be an $N$-bit string $x^{(m)} = (b_1, b_2, \dots, b_N)$, which can then be mapped directly to the quantum state $|x^{(m)}\rangle = |b_1, b_2, \dots, b_N\rangle$ with $b_n \in \{0, 1\}$ for $n = 1, \dots, N$ and $m = 1, \dots, M$. We can represent the entire dataset as a superposition of computational basis states:

$$|\mathscr{X}\rangle = \frac{1}{\sqrt{M}} \sum_{m=1}^{M} |x^{(m)}\rangle$$
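As a rough illustration of the dataset superposition, the following sketch builds the state vector classically. The function name `basis_encode` is hypothetical, and the sketch assumes the bit strings are distinct and that the leftmost bit is the most significant.

```python
import math


def basis_encode(bitstrings):
    """Equal superposition of the basis states named by distinct bit strings."""
    num_qubits = len(bitstrings[0])
    if any(len(b) != num_qubits for b in bitstrings):
        raise ValueError("all bit strings must have the same length")
    if len(set(bitstrings)) != len(bitstrings):
        # Assumption of this sketch: each data point appears only once
        raise ValueError("bit strings must be distinct")
    amplitude = 1 / math.sqrt(len(bitstrings))
    state = [0.0] * (2 ** num_qubits)
    for b in bitstrings:
        state[int(b, 2)] = amplitude
    return state


state = basis_encode(["101", "111"])
print([i for i, a in enumerate(state) if a != 0])  # [5, 7]
```

Only $M$ of the $2^N$ amplitudes are non-zero, which is why the state vector becomes sparse as $N$ grows.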


In Qiskit, once we calculate what state will encode our dataset, we can use the initialize function to prepare it. For example, the dataset $\mathscr{X} = \{x^{(1)}=101, x^{(2)}=111\}$ is encoded as the state $|\mathscr{X}\rangle = \frac{1}{\sqrt{2}}(|101\rangle + |111\rangle)$:

```python
import math
from qiskit import QuantumCircuit

# Amplitudes of |000>, |001>, ..., |111>; only |101> and |111> are populated
desired_state = [
    0,
    0,
    0,
    0,
    0,
    1 / math.sqrt(2),
    0,
    1 / math.sqrt(2)]

qc = QuantumCircuit(3)
qc.initialize(desired_state, [0, 1, 2])
qc.decompose().decompose().decompose().decompose().decompose().draw()
```

This example illustrates a couple of disadvantages of basis encoding. While it is simple to understand, the state vectors can become quite sparse, and schemes to implement it are usually not efficient.

Amplitude encoding

Amplitude encoding encodes data into the amplitudes of a quantum state. It represents a normalised classical $N$-dimensional data point, $x$, as the amplitudes of an $n$-qubit quantum state, $|\psi_x\rangle$:

$$|\psi_x\rangle = \sum_{i=1}^N x_i |i\rangle$$

where $N = 2^n$, $x_i$ is the $i^{th}$ element of $x$ and $|i\rangle$ is the $i^{th}$ computational basis state.

To encode the classical dataset $\mathscr{X}$ described above, we concatenate all $M$ $N$-dimensional data points into one amplitude vector, of length $N \times M$:

$$\alpha = A_{\text{norm}}\left(x_1^{(1)}, \dots, x_N^{(1)}, \dots, x_1^{(m)}, \dots, x_N^{(m)}, \dots, x_1^{(M)}, \dots, x_N^{(M)}\right)$$

where $A_{\text{norm}}$ is a normalisation constant, such that $|\alpha|^2 = 1$. The dataset $\mathscr{X}$ can now be represented in the computational basis as:

$$|\mathscr{X}\rangle = \sum_{i=1}^{NM} \alpha_i |i\rangle$$

where $\alpha_i$ are elements of the amplitude vector and $|i\rangle$ are the computational basis states. The number of amplitudes to be encoded is $N \times M$. As a system of $n$ qubits provides $2^n$ amplitudes, amplitude embedding requires $n \ge \log_2(NM)$ qubits.
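The qubit requirement can be made concrete with a small calculation. This is a sketch only: the function name is hypothetical, and it returns the smallest integer $n$ with $2^n \ge NM$, using integer arithmetic to avoid floating-point rounding in the logarithm.

```python
def qubits_for_amplitude_encoding(num_features, num_samples):
    """Smallest n such that 2**n >= num_features * num_samples."""
    total = num_features * num_samples
    if total < 1:
        raise ValueError("need at least one amplitude")
    # (total - 1).bit_length() equals ceil(log2(total)) for total >= 1
    return (total - 1).bit_length()


print(qubits_for_amplitude_encoding(2, 2))        # 2
print(qubits_for_amplitude_encoding(3, 5))        # 4
print(qubits_for_amplitude_encoding(1000, 1000))  # 20
```

A dataset of a million values therefore fits into the amplitudes of 20 qubits, which is the logarithmic scaling the inequality expresses.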


As an example, let's encode the dataset $\mathscr{X} = \{x^{(1)}=(1.5, 0), x^{(2)}=(-2, 3)\}$ using amplitude encoding. Concatenating both data points and normalizing the resulting vector, we get:

$$\alpha = \frac{1}{\sqrt{15.25}}(1.5, 0, -2, 3)$$

and the resulting 2-qubit quantum state would be:

$$|\mathscr{X}\rangle = \frac{1}{\sqrt{15.25}}(1.5|00\rangle - 2|10\rangle + 3|11\rangle)$$

In the example above, the total number of elements of the amplitude vector, $N \times M$, is a power of 2. When $N \times M$ is not a power of 2, we can simply choose a value for $n$ such that $2^n \geq MN$ and pad the amplitude vector with uninformative constants.
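The concatenate, pad and normalise steps can be sketched as follows. The function name `amplitude_encode` and its `pad_value` parameter are hypothetical, and padding with zeros is an assumption of this sketch; the text only asks for uninformative constants.

```python
import math


def amplitude_encode(dataset, pad_value=0.0):
    """Concatenate data points, pad to a power of 2, and normalise."""
    flat = [float(v) for point in dataset for v in point]
    if not flat:
        raise ValueError("dataset is empty")
    num_qubits = (len(flat) - 1).bit_length()
    # Assumption of this sketch: pad with a constant (zero by default)
    flat += [pad_value] * (2 ** num_qubits - len(flat))
    norm = math.sqrt(sum(v * v for v in flat))
    if norm == 0:
        raise ValueError("cannot normalise an all-zero vector")
    return [v / norm for v in flat], num_qubits


amplitudes, n = amplitude_encode([(1.5, 0), (-2, 3)])
print(n)  # 2
```

For the example dataset this gives the amplitude vector $\frac{1}{\sqrt{15.25}}(1.5, 0, -2, 3)$, and a single three-feature point such as $(1, 2, 2)$ is padded to four amplitudes.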

As in basis encoding, once we calculate what state will encode our dataset, we can use the initialize function in Qiskit to prepare it:

```python
import math
from qiskit import QuantumCircuit

# Normalised amplitudes of |00>, |01>, |10>, |11>
desired_state = [
    1 / math.sqrt(15.25) * 1.5,
    0,
    1 / math.sqrt(15.25) * -2,
    1 / math.sqrt(15.25) * 3]

qc = QuantumCircuit(2)
qc.initialize(desired_state, [0, 1])
qc.decompose().decompose().decompose().decompose().decompose().draw()
```

The advantage of amplitude encoding is that it only requires $\log_2(NM)$ qubits to encode. However, subsequent algorithms must operate on the amplitudes of a quantum state, and methods to prepare and measure the quantum states tend not to be efficient.

Angle encoding

Angle encoding encodes $N$ features into the rotation angles of $n$ qubits, where $N \le n$. For example, the data point $x = (x_1, \dots, x_N)$ can be encoded as:

$$|x\rangle = \bigotimes_{i=1}^{N} \left(\cos(x_i)|0\rangle + \sin(x_i)|1\rangle\right)$$

This is different from the previous two encoding methods, as it only encodes one data point at a time, rather than a whole dataset. It does, however, use only $N$ qubits and a constant-depth quantum circuit, making it amenable to current quantum hardware.
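To see what this product state looks like as a state vector, here is a minimal classical sketch. The function name `angle_encode` is hypothetical, and the ordering (first feature on the most significant qubit) is an assumption of the sketch; Qiskit itself orders qubits the other way round.

```python
import math


def angle_encode(x):
    """State vector of the product of cos(x_i)|0> + sin(x_i)|1> over all features."""
    state = [1.0]
    for angle in x:
        qubit = (math.cos(angle), math.sin(angle))
        # Kronecker product: the new qubit becomes the least significant bit
        state = [a * q for a in state for q in qubit]
    return state


state = angle_encode([0, math.pi / 4, math.pi / 2])
print(len(state))  # 8
```

Each feature adds one qubit, so $N$ features give a state vector of length $2^N$ that always factorises into single-qubit states.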

We can specify angle encoding as a unitary:

$$S_{x_j} = \bigotimes_{i=1}^{N} U(x_j^{(i)})$$

where:

$$U(x_j^{(i)}) = \begin{bmatrix} \cos(x_j^{(i)}) & -\sin(x_j^{(i)}) \\ \sin(x_j^{(i)}) & \cos(x_j^{(i)}) \end{bmatrix}$$

Remembering that a single-qubit rotation around the $Y$-axis is:

$$RY(\theta) = \exp\left(-i \frac{\theta}{2} Y\right) = \begin{pmatrix} \cos\frac{\theta}{2} & -\sin\frac{\theta}{2} \\ \sin\frac{\theta}{2} & \cos\frac{\theta}{2} \end{pmatrix}$$

We note that $U(x_j^{(i)}) = RY(2x_j^{(i)})$, and as an example, encode the data point $x = (0, \pi/4, \pi/2)$ using Qiskit:

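```python
import math
from qiskit import QuantumCircuit

qc = QuantumCircuit(3)
# U(x) = RY(2x), so each rotation angle is twice the feature value
qc.ry(0, 0)
qc.ry(2*math.pi/4, 1)
qc.ry(2*math.pi/2, 2)
qc.draw()
```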
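The identity $U(x) = RY(2x)$ can be checked numerically by writing out both matrices. This is a small sketch in plain Python; the helper names `u`, `ry` and `matrices_close` are hypothetical.

```python
import math


def u(x):
    """The encoding matrix U(x) from the text."""
    return [[math.cos(x), -math.sin(x)],
            [math.sin(x), math.cos(x)]]


def ry(theta):
    """Matrix of a rotation by theta around the Y-axis."""
    half = theta / 2
    return [[math.cos(half), -math.sin(half)],
            [math.sin(half), math.cos(half)]]


def matrices_close(a, b, tol=1e-12):
    """Element-wise comparison of two matrices given as nested lists."""
    return all(abs(p - q) <= tol
               for row_a, row_b in zip(a, b)
               for p, q in zip(row_a, row_b))


for x in (0, math.pi / 4, math.pi / 2):
    print(matrices_close(u(x), ry(2 * x)))  # True each time
```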

Dense angle encoding is a slight generalization of angle encoding that encodes two features per qubit using the relative phase, where the data point $x = (x_1, \dots, x_N)$ can be encoded as:

$$|x\rangle = \bigotimes_{i=1}^{N/2} \left(\cos(x_{2i-1})|0\rangle + e^{i x_{2i}} \sin(x_{2i-1})|1\rangle\right)$$
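A classical sketch of dense angle encoding follows. It assumes the form in which each qubit is in the state $\cos(x_{2i-1})|0\rangle + e^{i x_{2i}} \sin(x_{2i-1})|1\rangle$, an even number of features, and the first pair on the most significant qubit; the function name `dense_angle_encode` is hypothetical.

```python
import cmath
import math


def dense_angle_encode(x):
    """State vector with two features per qubit: one amplitude angle, one phase."""
    if len(x) % 2 != 0:
        # Assumption of this sketch: features come in complete pairs
        raise ValueError("this sketch assumes an even number of features")
    state = [1.0 + 0j]
    for i in range(0, len(x), 2):
        amplitude_angle, phase_angle = x[i], x[i + 1]
        qubit = (math.cos(amplitude_angle),
                 cmath.exp(1j * phase_angle) * math.sin(amplitude_angle))
        state = [a * q for a in state for q in qubit]
    return state


state = dense_angle_encode([math.pi / 4, math.pi / 2, 0.3, 0.7])
print(len(state))  # 4
```

Four features fit on two qubits here, compared with four qubits for plain angle encoding.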

Although the angle and dense angle encoding use sinusoids and exponentials, there is nothing special about these functions. We can easily abstract these to a general class of qubit encodings that use arbitrary functions, or define the encodings as arbitrary unitaries, implemented as parameterized quantum circuits.

Arbitrary encoding

Arbitrary encoding encodes $N$ features as rotations on $N$ parameterized gates on $n$ qubits, where $n \leq N$. Like angle encoding, it only encodes one data point at a time, rather than a whole dataset. It also uses a constant-depth quantum circuit and $n \leq N$ qubits, meaning it can be run on current quantum hardware.

For example, using the Qiskit EfficientSU2 circuit to encode 12 features requires only 3 qubits:

```python
from qiskit.circuit.library import EfficientSU2

circuit = EfficientSU2(num_qubits=3, reps=1, insert_barriers=True)
circuit.decompose().draw()
```

Here we encode the data point $x = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0, 1.1, 1.2]$ with 12 features, using each of the parameterized gates to encode a different feature.

```python
x = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0, 1.1, 1.2]
# assign_parameters replaces bind_parameters, which newer Qiskit releases removed
encode = circuit.assign_parameters(x)
encode.decompose().draw()
```
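The count of 12 parameters on 3 qubits can be reproduced with a short calculation. This sketch assumes the default EfficientSU2 structure, with an RY and an RZ rotation on every qubit in each of `reps + 1` rotation layers; the function name is hypothetical and the formula does not cover customised gate sets.

```python
def efficient_su2_parameter_count(num_qubits, reps, gates_per_layer=2):
    """Parameters in reps + 1 rotation layers with gates_per_layer gates per qubit."""
    # Assumption: default rotation blocks (RY then RZ), hence gates_per_layer=2
    return gates_per_layer * num_qubits * (reps + 1)


print(efficient_su2_parameter_count(3, 1))  # 12
```

Under the same assumption, increasing `reps` adds parameters without adding qubits, which is how more features can be packed onto a fixed register.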

The Qiskit ZZFeatureMap circuit with 3 qubits encodes a data point of only 3 features, despite having 6 parameterized gates:

```python
from qiskit.circuit.library import ZZFeatureMap

circuit = ZZFeatureMap(3, reps=1, insert_barriers=True)
circuit.decompose().draw()
```
```python
x = [0.1, 0.2, 0.3]
# assign_parameters replaces bind_parameters, which newer Qiskit releases removed
encode = circuit.assign_parameters(x)
encode.decompose().draw()
```
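To see how 3 features can drive 6 parameterized gates, we can list the gate angles classically. This sketch assumes what we understand to be the default ZZFeatureMap data mapping with full entanglement and one repetition: a phase angle $2x_i$ on each qubit and $2(\pi - x_i)(\pi - x_j)$ on each pair of qubits. The function name `zz_feature_map_angles` is hypothetical.

```python
import math
from itertools import combinations


def zz_feature_map_angles(x):
    """Phase-gate angles keyed by the qubits they act on (assumed default mapping)."""
    # Single-qubit terms: angle 2 * x_i on qubit i
    angles = {(i,): 2 * xi for i, xi in enumerate(x)}
    # Two-qubit terms: angle 2 * (pi - x_i) * (pi - x_j) on each pair (i, j)
    for i, j in combinations(range(len(x)), 2):
        angles[(i, j)] = 2 * (math.pi - x[i]) * (math.pi - x[j])
    return angles


angles = zz_feature_map_angles([0.1, 0.2, 0.3])
print(len(angles))  # 6
```

Three of the six angles depend on a single feature and three on a product of two, so the gates share the same 3 inputs rather than each taking a new one.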

Quick quiz

A parameterized quantum circuit has 16 parameters. What is the largest number of features it can encode?

  1. 4

  2. 8

  3. 16

  4. 32

The performance of different parameterized quantum circuits on different types of data is an active area of investigation.


References

  1. Maria Schuld and Francesco Petruccione, Supervised Learning with Quantum Computers, Springer 2018, doi:10.1007/978-3-319-96424-9.

  2. Ryan LaRose and Brian Coyle, Robust data encodings for quantum classifiers, Physical Review A 102, 032420 (2020), doi:10.1103/PhysRevA.102.032420, arXiv:2003.01695.
