Path: blob/main/ai-for-quantum/01_compiling_unitaries_using_diffusion_models.ipynb
AI for Quantum: Compiling Unitaries Using Diffusion Models
AI is a powerful tool for enabling some of the hardest aspects of a hybrid quantum-classical workflow, including QEC, compilation, and calibration (see the review paper here for more AI for Quantum use cases). Compiling quantum algorithms is an enormous challenge that involves identifying a target unitary, finding an appropriate circuit representation, and then efficiently running the circuit on highly constraining hardware.
The recent papers Quantum circuit synthesis with diffusion models and Synthesis of discrete-continuous quantum circuits with multimodal diffusion models demonstrated how diffusion models can be used for unitary synthesis. This lab explores the problem of unitary synthesis, introduces the diffusion model used in that work, and lets you compile circuits of your own using AI.
Prerequisites: No experience with diffusion models is necessary. However, this notebook does not provide a detailed discussion of diffusion models or their construction; for curious readers, we suggest NVIDIA's Deep Learning Institute course on diffusion models. On the quantum side, familiarity with the basics of quantum computing (gates, state vectors, etc.) is required. If you are not familiar with these concepts, please complete the Quick Start to Quantum Computing course first.
What you'll do:
Learn the basics of unitary synthesis and try to compile a unitary by hand
Encode quantum circuits as inputs for the diffusion model
Synthesize quantum circuits corresponding to a given unitary matrix with a diffusion model
Evaluate if the obtained circuit is accurate
Filter better quantum circuits
Sample a circuit using a noise model
🎥 You can watch a recording of the presentation of this notebook from a GTC DC tutorial in October 2025.
Let's begin with installing the relevant packages.
The Challenge of Unitary Synthesis and Compilation
In a sense, quantum computing is extremely simple: it corresponds to multiplying a unitary matrix by a state vector to produce the desired quantum state that solves a problem. In the example below, the initial state vector is multiplied by the unitary matrix to produce the output state shown.
The quantum circuit drawn below represents a synthesis of this unitary matrix. That is, it produces the same result as multiplying by the unitary matrix above, regardless of the initial state.

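To make this concrete, here is a minimal sketch (an illustrative single-qubit example, not the unitary from the figure above) showing that applying a unitary is nothing more than a matrix-vector product:

```python
import numpy as np

# A single-qubit Hadamard unitary (illustrative; not the unitary shown above)
H = np.array([[1, 1],
              [1, -1]]) / np.sqrt(2)

# Initial state |0> as a state vector
psi0 = np.array([1, 0], dtype=complex)

# Quantum computation = unitary matrix times state vector
psi1 = H @ psi0
print(psi1)  # [0.70710678+0.j 0.70710678+0.j], i.e. (|0> + |1>)/sqrt(2)
```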
Wrapped up in this simple picture of unitary matrices and quantum circuits is incredible complexity which makes quantum computing so difficult.
Scaling: First, the unitary matrix corresponding to a quantum circuit is huge, with $2^n \times 2^n$ entries, where $n$ is the number of qubits in the circuit. The matrix cannot be stored naively in its entirety on any classical computer for more than about 25 qubits: at $n = 25$, storing $(2^{25})^2$ double-precision complex entries would already require roughly 18 petabytes.
Identifying the unitary: Second, it is far from obvious in many cases what particular unitary matrix will solve a problem. Consider methods like VQE where the entire goal is to identify what sort of parameterized circuit (unitary matrix) solves the given problem.
Executing on a quantum device: Finally, even if the required unitary is known, implementing it on a physical QPU requires it to be synthesized (or compiled) into a set of discrete gate operations compatible with the device. Furthermore, decisions need to be made about how these gates are performed and in which order, to ensure good performance and avoid bottlenecks.
This is extremely challenging and gets even worse when considering that different QPUs have different gate sets and hardware constraints, quantum error correction protocols add additional overhead, and time constraints require not only that an accurate circuit be synthesized, but that it be as simple as possible.
It is no wonder that circuit synthesis is considered a leading AI-for-quantum use case, as AI's aptitude for complex pattern recognition could provide a powerful means of compiling the unitaries necessary to run quantum algorithms at scale.
In this lab, you will explore unitary synthesis and learn how to generate valid circuits given a target unitary.
Exercise 1
To get a sense of how difficult compilation is, try to compile the state of a single qubit by hand with this interactive game. Instructions for the game: you are given a random unitary and presented with two Bloch spheres depicting its action on the $|0\rangle$ and $|1\rangle$ states. Your job is to apply gate operations to get as close as possible to the target unitary. You will notice that even when you can see exactly what each gate does, it is not obvious how to match the target states exactly. And even if the action on a single state is correct, the unitary may still be incorrect, since it must act correctly on all basis states.
An Overview of the Diffusion Model
Though many AI techniques have been explored for circuit synthesis, the rest of this lab will look at recent work (Fürrutter, et al., 2024) that used diffusion models for the task. We'll begin with a general overview of diffusion models and then discuss the specific advantages they offer for circuit synthesis.
This section is not a comprehensive or particularly deep lesson on diffusion models for which we point the reader to this course. To build intuition, we will first use the common example of image generation before applying these concepts to our main topic of unitary synthesis.
The Core Idea: Denoising Images
The primary objective of a diffusion model is to generate high-quality samples by learning to reverse a noise-adding process, rather than learning the data distribution directly. The training begins with a clean dataset — in this case, images — to which Gaussian noise is incrementally added in a "forward process."
To reverse this, the model employs a U-net architecture to learn the "reverse process" of denoising. The U-net is trained to take a noisy image as input and predict the specific noise pattern that was added to it. The model's parameters are optimized by minimizing a loss function that measures the difference between the predicted noise and the actual noise.
It's important to note that the U-net is named for its U-shaped layer structure; this is purely an architectural descriptor and has no relation to the mathematical symbol for a unitary matrix. In the final stage, called inference, the model acts like an artist who starts with a block of static and "chisels away" the noise to reveal a clear image.
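As a rough illustration of the forward process described above, here is a minimal sketch (assuming a standard DDPM-style noise schedule; the actual model's schedule and hyperparameters may differ) of how Gaussian noise is mixed into a clean sample at timestep t:

```python
import torch

def add_noise(x0: torch.Tensor, t: int, alphas_bar: torch.Tensor):
    """Forward diffusion step: corrupt a clean sample x0 at timestep t.

    alphas_bar[t] is the cumulative product of (1 - beta) up to step t,
    as in standard DDPM formulations.
    """
    eps = torch.randn_like(x0)                        # Gaussian noise
    x_t = alphas_bar[t].sqrt() * x0 + (1 - alphas_bar[t]).sqrt() * eps
    return x_t, eps                                   # the model learns to predict eps

# Example with a toy linear beta schedule
betas = torch.linspace(1e-4, 0.02, 1000)
alphas_bar = torch.cumprod(1 - betas, dim=0)
x0 = torch.randn(1, 3, 32, 32)                        # a "clean image" stand-in
x_noisy, eps = add_noise(x0, t=500, alphas_bar=alphas_bar)
```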
Exercise 2
Try this widget to get some hands-on experience with the diffusion model process. The widget is grossly oversimplified, but it gives a visual representation of what happens in the training and inference stages of a diffusion model. You'll first see how an image is deliberately corrupted with noise for training. Then, you'll watch the trained model take a fresh patch of random noise and reverse the process, generating a clean new image from scratch.
Applying Diffusion to Unitary Synthesis
The Core Idea: Denoising Circuits
Now, let's apply the same concepts of noising and denoising to our primary goal: unitary compilation. The process follows the diagram below. First, training circuits are embedded into a data structure amenable to the neural network. Then, just like with the images, noise is added to the training data and it is input into the U-net model. The model is also given the target unitary matrix and any specific constraints (e.g., which gates to use). The output of the U-Net model is the predicted noise, and it is trained until its prediction is as accurate as possible.

The inference step (shown below) then uses this trained model. It takes a target unitary, compilation instructions, and random noise as input. The model then "denoises" this input to produce candidate circuits that implement the target unitary.

In a sense, the process is simple and can be treated as a black box. But there are also many challenges, such as ensuring sufficient quality and quantity of training data, choosing the right model architecture, and deciding how data is encoded.
The primary advantage of this approach for quantum circuit compilation is that the diffusion model learns how to denoise corrupted samples, not the distribution of the circuits themselves. Most other approaches require generating sample circuits and then comparing their behavior to the target. Such a requirement is extremely expensive, as it would require running many quantum circuit simulations, which limits scalability.
Preparing Quantum Circuit Data for the Model
An important consideration for all AI applications is how the data is preprocessed before being input to the model. In this section we will explore a piece of this process related to encoding the quantum circuit, that is, representing the quantum circuit in a way that is amenable to the AI model. Note that the target unitary and text prompt inputs are themselves prepared with distinct neural networks, which will not be discussed here.
The figure below explains how we translate a quantum circuit diagram into a numerical, or tokenized, matrix. Think of the matrix as a timeline of the circuit: each row is a dedicated qubit, and each column is a step in time, read from left to right. We fill the matrix using a codebook, or vocabulary, where each gate type is assigned a unique integer token. For gates involving multiple qubits, a negative sign marks the control qubit: a single-qubit gate places its token in the row of the qubit it acts on, while a CNOT places the negative of its token on the control qubit's row and the positive token on the target qubit's row, all in the same column. The example circuit shown results in a matrix with one column per gate operation plus trailing columns of zeroes as padding to signify the end of the circuit.

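As a concrete (and hypothetical) illustration of this encoding, the sketch below builds a tokenized matrix for a small two-qubit circuit, assuming a toy vocabulary in which h maps to 1 and cx maps to 2; the actual token values in the model's vocabulary may differ:

```python
import numpy as np

# Hypothetical vocabulary: {"h": 1, "cx": 2}; 0 is padding / "no gate"
num_qubits, num_steps = 2, 4
tokens = np.zeros((num_qubits, num_steps), dtype=int)

# Step 0: Hadamard on qubit 0
tokens[0, 0] = 1
# Step 1: CNOT with control on qubit 0 (negative token) and target on qubit 1
tokens[0, 1] = -2
tokens[1, 1] = 2
# Remaining columns stay 0, padding out the end of the circuit

print(tokens)
# [[ 1 -2  0  0]
#  [ 0  2  0  0]]
```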
For improved numerical stability during model training, the discrete tokenized matrix is embedded into a continuous tensor. The idea is to replace every integer token in the matrix, including the padding token 0, with a vector drawn from a specially prepared set of orthonormal basis vectors. This conversion is vital for the diffusion model to perform well.
To illustrate, fix an orthonormal basis $\{e_0, e_1, e_2, \dots\}$ of the embedding space, one basis vector per token magnitude. A token with value $k$ is then mapped to the signed basis vector $\mathrm{sign}(k)\, e_{|k|}$, so a tokenized column representing, say, a Hadamard gate becomes a small stack of such signed basis vectors.
Exercise 3
Write a function to encode the following circuit as a tensor, using the gate vocabulary defined above (one integer token per gate type, with a negative sign marking control qubits). Signal the end of the circuit with two columns of zeroes.

Decoding the Generated Tensors
The diffusion model is trained to generate new tensors of this form. The generated entries are continuous vectors rather than exact basis vectors, so each one must be decoded back into an integer token, like those in a tokenized matrix, before it can be interpreted as a quantum circuit.
This decoding is performed on each vector of the output tensor in a two-step process to determine the corresponding integer token. First, we identify the best-matching basis vector from the vocabulary by finding which one maximizes the absolute value of the cosine similarity with the generated vector. The index $i$ of this basis vector gives us the magnitude of the token.
Second, we determine the token's sign by computing the standard cosine similarity between the generated vector and the winning basis vector $e_i$. The sign of this result becomes the sign of the token.
For example, if a generated vector is found to be closest to basis vector $e_i$ and their cosine similarity is negative, the decoded entry in the tokenized matrix becomes $-i$.
By repeating this for every vector in the generated tensor, we reconstruct the entire tokenized matrix, which gives us the blueprint for a new quantum circuit, as depicted below.

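A minimal sketch of this decoding rule for a single generated vector might look like the following (assuming the basis vectors are stored as rows of a matrix called basis; this is one possible implementation, not the library's own):

```python
import numpy as np

def decode_vector(v: np.ndarray, basis: np.ndarray) -> int:
    """Decode one generated embedding vector back to a signed integer token.

    basis: array of shape (num_tokens, dim) whose i-th row is the basis vector e_i.
    """
    # Cosine similarity between v and every basis vector
    sims = basis @ v / (np.linalg.norm(basis, axis=1) * np.linalg.norm(v) + 1e-12)
    i = int(np.argmax(np.abs(sims)))   # magnitude of the token
    sign = 1 if sims[i] >= 0 else -1   # sign of the token
    return sign * i
```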
Exercise 4
Write a function below to decode your tensor from Exercise 3 and recover the original tokenized matrix.
The genQC library then translates this decoded matrix into a quantum kernel, using the specified mapping between gates and integers stored as the vocab vector.
Similar pre- and postprocessing steps are present in all AI applications. When developing AI-for-quantum applications, a key task is to find clever ways to encode information so that it can be processed effectively by the AI model.
Generating Circuits with the Diffusion Model
Now that we've covered the problem setup and data processing, let's put the theory into practice using a pretrained model. While the training process itself is a fascinating topic, we'll focus on using the model here. You can explore training in more detail in these courses on the basics of AI and diffusion models.
The first step is to select a unitary to compile. This model has been trained to compile unitaries arising from circuits composed of the gates ['h', 'cx', 'z', 'x', 'ccx', 'swap']. Although this is a universal gate set, meaning it contains enough operations to construct any possible quantum circuit, performing arbitrary computations can require an enormous number of gates. For this tutorial, we use a model trained to generate kernels with at most 12 gates, so we can only expect it to work for unitaries that fit within this constraint. Let's consider the compilation of one such unitary.
We start by defining our unitary as a numpy.array:
Next, run the cell below to prepare a torch device with CUDA if available.
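If you are following along outside the prepared environment, a minimal version of that cell could look like this (standard PyTorch; the notebook's own cell may include additional setup):

```python
import torch

# Use a CUDA GPU when available, otherwise fall back to the CPU
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Using device: {device}")
```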
Then, load the pretrained model from Hugging Face.
Next, we set the parameters the model was trained on. Note that these are fixed and depend on the pre-trained model. The gate types are pulled from pipeline.gate_pool and are used to build a vocabulary.
The model can compile circuits composed of any subset of these gates, as long as the proper "Compile using: [...]" prompt format is used.
The code below will now use this prompt and the unitary (U) defined above to sample (or generate) 128 circuits. Because the neural network can only process real numbers, we first split the unitary matrix U into its real and imaginary components and then combine them into a single input tensor. The infer_comp.generate_comp_tensors command calls the inference procedure and produces a set of output matrices (out_matrices) representing the circuit samples.
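The real/imaginary split mentioned above can be done with a couple of lines of standard numpy/torch; the following is a sketch of that preprocessing step only (the call to infer_comp.generate_comp_tensors and its arguments are part of the genQC pipeline and are not reproduced here):

```python
import numpy as np
import torch

# U is the complex target unitary defined earlier as a numpy.array.
# Stack its real and imaginary parts into a single real-valued tensor,
# since the neural network only processes real numbers.
U = np.eye(8, dtype=complex)  # placeholder; use the unitary defined above
U_tensor = torch.stack([
    torch.tensor(np.real(U), dtype=torch.float32),
    torch.tensor(np.imag(U), dtype=torch.float32),
])
print(U_tensor.shape)  # (2, 8, 8) for a 3-qubit unitary
```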
The matrix for the first circuit generated by the model is printed below.
Converting matrices to CUDA-Q kernels
Next, we convert each generated matrix into a cudaq.kernel.
It is possible that some of the generated matrices do not correspond to a valid kernel. For example, a generated matrix might encode a CNOT gate with two control qubits and no target (a column containing only negative control tokens), or it might encode two different gates applied separately but simultaneously to two qubits within the same time step. Neither of these is a meaningful quantum kernel. Therefore, in the next code block, we filter out only the valid matrices.
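As an illustration of the kind of structural check involved, here is a hypothetical (and deliberately simplified) validity filter over tokenized matrices; the genQC pipeline performs its own, more complete validation when converting matrices to kernels:

```python
import numpy as np

def column_is_valid(col: np.ndarray) -> bool:
    """Very simplified structural check for one time step of a tokenized circuit.

    Assumptions (hypothetical, for illustration): negative entries are controls,
    positive entries are gate/target tokens, and a single time step may contain
    at most one gate, so all nonzero magnitudes in a column must match.
    """
    nonzero = col[col != 0]
    if nonzero.size == 0:
        return True                        # padding column
    if np.unique(np.abs(nonzero)).size > 1:
        return False                       # two different gates in one time step
    if (nonzero < 0).any() and not (nonzero > 0).any():
        return False                       # controls with no target
    return True

def matrix_is_valid(tokens: np.ndarray) -> bool:
    return all(column_is_valid(tokens[:, t]) for t in range(tokens.shape[1]))
```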
For example, the following generated matrix
corresponds to the following cudaq.kernel
Our first filter removed circuits that were structurally invalid, but this doesn't guarantee the remaining ones are correct. Think of it as checking for spelling errors before checking for meaning. Now, in the next section, we'll perform that second check: filtering for the circuits that actually approximate the target unitary.
Evaluating Sampled Unitaries
As mentioned earlier, one of the key advantages of using diffusion models (DMs) as a unitary compiler is the ability to rapidly sample many circuits. However, as is common in machine learning, the model has a certain accuracy, meaning not all generated circuits are expected to exactly compile the specified unitary. In this section, you will evaluate how many of the generated circuits are indeed correct and then perform post-selection to identify (at least) one circuit that successfully performs the desired unitary operation.
First, calculate the unitary matrix implemented by each of the kernels. The elements of this matrix are the transition amplitudes between computational basis states, $U_{ij} = \langle i | U | j \rangle$, where $|i\rangle$ and $|j\rangle$ are computational basis states (typically in the $Z$-basis), with $|j\rangle$ represented by the standard basis vector of dimension $2^n$ that has a $1$ in the $j$-th position and $0$ elsewhere.
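One possible approach (a sketch only; the hypothetical helper get_output_state is something you would implement, for example with cudaq.get_state on a kernel that first prepares the basis state) is to assemble the unitary column by column, since column $j$ is just $U|j\rangle$:

```python
import numpy as np

def kernel_to_unitary(get_output_state, dim: int) -> np.ndarray:
    """Assemble a unitary column by column from transition amplitudes.

    get_output_state(j) is a hypothetical helper that runs the kernel on the
    basis state |j> and returns the resulting state vector of length dim.
    """
    U = np.zeros((dim, dim), dtype=complex)
    for j in range(dim):
        U[:, j] = get_output_state(j)   # column j = U |j>
    return U
```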
Exercise 5
Write a function to compute the expression above from the CUDA-Q kernel. Compute the unitaries for each of the 128 sampled circuits.
For example, the circuit printed above corresponds to the following unitary:
Now that we have the unitaries for each of the kernels, we compare them to the user-provided unitary matrix, U. To do so, we compute the infidelity between the exact unitary and each generated one. A common global-phase-invariant definition of the infidelity between two $n$-qubit unitaries $U$ and $\tilde{U}$ is $\mathcal{I}(U, \tilde{U}) = 1 - \frac{1}{2^n}\left|\mathrm{Tr}\!\left(U^\dagger \tilde{U}\right)\right|$.
The infidelity is a value between 0 and 1, where 0 indicates that the unitaries are identical (up to a global phase).
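A minimal numpy implementation of this quantity (assuming the definition given above; the notebook's helper may use an equivalent convention) looks like:

```python
import numpy as np

def infidelity(U: np.ndarray, V: np.ndarray) -> float:
    """Global-phase-invariant infidelity between two unitaries of equal dimension."""
    dim = U.shape[0]
    fidelity = np.abs(np.trace(U.conj().T @ V)) / dim
    return 1.0 - fidelity

# Identical up to a global phase -> infidelity ~0
U = np.eye(4)
V = np.exp(1j * 0.7) * np.eye(4)
print(infidelity(U, V))  # ~0.0
```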
Exercise 6
Compute the infidelities for each sampled unitary and plot a histogram based on infidelity. How many circuits had a near zero infidelity?
The circuit with the lowest infidelity is printed below.
which, as we can see, exactly compiled our targeted unitary:
Select a circuit that meets specific criteria
As you have seen above, you now have almost 30 kernels that compile the desired unitary! This is particularly valuable when dealing with hardware constraints, where, for instance, you might want to avoid using certain qubits or specific gates. Here are a few scenarios where these sorts of choices matter. The rest of the notebook will work through the first case, but you can come back and work through any of these preferences.
A common practice for reducing circuit overhead is to minimize the number of Toffoli gates, as they are particularly costly and error-prone due to the large number of non-Clifford T gates required for their implementation.
Certain QPUs, such as neutral atom and superconducting processors, can implement $Z$ gates trivially using software control, while $X$ gates require a more error-prone physical pulse. For these modalities it is therefore favorable to produce circuits biased towards $Z$ gates over $X$ gates, holding the number of two-qubit gates constant.
When considering quantum error correction (QEC), the type of QEC code can dictate which gates are transversal, meaning they can be applied trivially to all data qubits to produce the corresponding logical gate. This varies from code to code, so selecting circuits that maximize the number of transversal gates is ideal. This can even mean favoring transversal CNOT gates over non-transversal single-qubit gates.
Going back to the first scenario, minimizing the number of Toffoli gates (ccx), let's sort our valid circuits to find those with the fewest ccx gates.
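One simple way to do this kind of sorting is sketched below, working directly on the tokenized matrices and assuming (for illustration) that a Toffoli appears as a column with exactly two control entries; the notebook's own code may instead count gates in the generated kernels:

```python
import numpy as np

def count_ccx(tokens: np.ndarray) -> int:
    """Count Toffoli gates in a tokenized circuit matrix.

    Assumption (for illustration): a ccx column has exactly two negative
    (control) entries and one positive (target) entry.
    """
    return sum(
        int((tokens[:, t] < 0).sum() == 2)
        for t in range(tokens.shape[1])
    )

# valid_matrices: list of tokenized matrices that passed the validity filter
# sorted_by_ccx = sorted(valid_matrices, key=count_ccx)
```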
It appears that the diffusion model requires at least one Toffoli gate to compile the unitary. You can now print a few of these circuits to select the one that best suits the situation or to identify any noteworthy patterns the model employs for this specific unitary.
Compiling Noisy Circuits
In this section, we'll define a noise_model and verify that a lower number of ccx gates yields better results under this noise model. For more details, see the Noisy Simulation example in CUDA-Q documentation.
The cell below defines a depolarizing noise channel and applies it to all ccx and cx gates.
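For orientation, here is a hedged sketch of what such a cell could look like with the CUDA-Q Python API; the channel choice, error probability, and exact calls here are placeholders and may differ from the notebook's own cell:

```python
import cudaq

# Build a noise model that attaches a depolarizing channel to controlled-X gates
noise_model = cudaq.NoiseModel()
depol = cudaq.DepolarizationChannel(0.01)  # placeholder error probability

# Apply the channel to every cx (one control) and ccx (two controls) gate
noise_model.add_all_qubit_channel("x", depol, num_controls=1)   # cx
noise_model.add_all_qubit_channel("x", depol, num_controls=2)   # ccx
```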
To simulate a noisy circuit, it is convenient to use a density matrix simulator. To select it, simply change the target with cudaq.set_target("density-matrix-cpu").
The cudaq.sample function can take a noise model as an argument to perform a simulation with noise: cudaq.sample(kernel, noise_model=noise_model).
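Putting those two pieces together, a minimal noisy-sampling call (assuming best_kernel is one of the valid kernels produced earlier, takes no arguments, and noise_model is the model defined above) looks like:

```python
import cudaq

# Switch to the density matrix simulator, which supports noise models
cudaq.set_target("density-matrix-cpu")

# Sample the kernel under the noise model defined above
counts = cudaq.sample(best_kernel, noise_model=noise_model)
print(counts)
```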
This histogram illustrates why unitary compilation is so important. Even in this small three-qubit example, running three circuits that implement exactly the same unitary, the sampled circuits with more multi-qubit gates produce noticeably worse results under noise. For a fully scaled-up application, good compilation might be the difference between success and a meaningless output or an infeasible runtime.
Summary
AI has the potential to be a powerful tool for compilation especially at scale. Researchers might be able to use such a tool to better understand the impacts of device noise or identify patterns which make for more favorable circuits. The AI workflow you explored today is also highly flexible. It can consider different gate sets, circuit lengths, and many other refinements to improve results. Keep an eye out for future research in this space as different AI techniques are applied to more complex quantum circuit compilation tasks.