Real-time collaboration for Jupyter Notebooks, Linux Terminals, LaTeX, VS Code, R IDE, and more,
all in one place. Commercial Alternative to JupyterHub.
Path: blob/main/diffusers/stable_diffusion.ipynb
Stable Diffusion 🎨
...using 🧨diffusers
Stable Diffusion is a text-to-image latent diffusion model created by the researchers and engineers from CompVis, Stability AI and LAION. It's trained on 512x512 images from a subset of the LAION-5B database. This model uses a frozen CLIP ViT-L/14 text encoder to condition the model on text prompts. With its 860M UNet and 123M text encoder, the model is relatively lightweight and can run on many consumer GPUs. See the model card for more information.
This Colab notebook shows how to use Stable Diffusion with the 🤗 Hugging Face 🧨 Diffusers library.
Let's get started!
1. How to use StableDiffusionPipeline
Before diving into the theoretical aspects of how Stable Diffusion functions, let's try it out a bit 🤗.
In this section, we show how you can run text to image inference in just a few lines of code!
Setup
First, please make sure you are using a GPU runtime to run this notebook, so inference is much faster. If the following command fails, use the Runtime menu above and select Change runtime type.
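The command cell itself is not reproduced in this text. As a rough stand-in, the sketch below checks for a GPU from Python by calling the nvidia-smi tool; the helper name gpu_summary is hypothetical, and the check assumes an NVIDIA driver that ships nvidia-smi.

```python
import shutil
import subprocess


def gpu_summary():
    """Return the output of `nvidia-smi`, or None if no NVIDIA GPU is visible."""
    exe = shutil.which("nvidia-smi")
    if exe is None:
        return None  # tool not installed, so most likely a CPU-only runtime
    result = subprocess.run([exe], capture_output=True, text=True)
    return result.stdout if result.returncode == 0 else None


summary = gpu_summary()
print(summary if summary else "No GPU detected: switch to a GPU runtime.")
```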
Next, you should install diffusers as well as scipy, ftfy and transformers. The accelerate package is used to achieve much faster loading.
Stable Diffusion Pipeline
StableDiffusionPipeline is an end-to-end inference pipeline that you can use to generate images from text with just a few lines of code.
First, we load the pre-trained weights of all components of the model. In this notebook we use Stable Diffusion version 1.4 (CompVis/stable-diffusion-v1-4), but there are other variants that you may want to try:
stabilityai/stable-diffusion-2-1. This version can produce images with a resolution of 768x768, while the others work at 512x512.
In addition to the model id CompVis/stable-diffusion-v1-4, we're also passing a specific revision and torch_dtype to the from_pretrained method.
We want to ensure that every free Google Colab can run Stable Diffusion, hence we load the weights from the half-precision branch fp16 and also tell diffusers to expect the weights in float16 precision by passing torch_dtype=torch.float16.
If you want to ensure the highest possible precision, remove torch_dtype=torch.float16, at the cost of higher memory usage.
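To put a rough number on that memory cost, the sketch below estimates the size of the raw weights from the parameter counts quoted in the introduction (860M for the UNet, 123M for the text encoder). It is a back-of-the-envelope estimate only: the VAE and all activations are left out, and the helper name weight_gib is hypothetical.

```python
# Parameter counts quoted in the introduction; the VAE is omitted because
# the text does not state its size.
UNET_PARAMS = 860e6
TEXT_ENCODER_PARAMS = 123e6


def weight_gib(num_params, bytes_per_param):
    """Approximate size of the raw weights in GiB."""
    return num_params * bytes_per_param / 2**30


total = UNET_PARAMS + TEXT_ENCODER_PARAMS
fp32_gib = weight_gib(total, 4)  # float32 stores 4 bytes per parameter
fp16_gib = weight_gib(total, 2)  # float16 stores 2 bytes per parameter
print(f"float32: {fp32_gib:.2f} GiB, float16: {fp16_gib:.2f} GiB")
```

Under these assumptions, half precision halves the weight footprint, which is what makes the difference on smaller GPUs.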
Next, let's move the pipeline to GPU to have faster inference.
And we are ready to generate images:
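The loading, device placement and generation steps described above can be sketched as follows. The code cells are not reproduced in this text, so treat this as one plausible version: the function names are hypothetical, the prompt is assumed from the astronaut and horse mentioned elsewhere in this notebook, and newer diffusers releases may prefer a variant argument over the fp16 revision.

```python
MODEL_ID = "CompVis/stable-diffusion-v1-4"


def load_pipeline(model_id=MODEL_ID, device="cuda"):
    """Load the half-precision weights and move the pipeline to `device`."""
    # Imported lazily so the helpers can be defined without a GPU stack.
    import torch
    from diffusers import StableDiffusionPipeline

    pipe = StableDiffusionPipeline.from_pretrained(
        model_id,
        revision="fp16",            # half-precision branch of the weights
        torch_dtype=torch.float16,  # tell diffusers to expect float16
    )
    return pipe.to(device)


def text_to_image(pipe, prompt):
    """Run one text-to-image call and return the first generated image."""
    return pipe(prompt).images[0]


# Hypothetical usage (downloads the weights on first run):
#   pipe = load_pipeline()
#   image = text_to_image(pipe, "a photograph of an astronaut riding a horse")
#   image.save("astronaut_rides_horse.png")
```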
Running the above cell multiple times will give you a different image every time. If you want deterministic output you can pass a random seed to the pipeline. Every time you use the same seed you'll have the same image result.
You can change the number of inference steps using the num_inference_steps argument. In general, results are better the more steps you use. Stable Diffusion, being one of the latest models, works great with a relatively small number of steps, so we recommend using the default of 50. If you want faster results you can use a smaller number.
The following cell uses the same seed as before, but with fewer steps. Note how some details, such as the horse's head or the helmet, are less realistic and less defined than in the previous image:
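The cells referred to above are not reproduced in this text. A minimal sketch of seeding the pipeline and lowering the step count could look like this; the helper names are hypothetical, and the seed 1024 and the 15 steps in the usage comment are arbitrary illustrative values.

```python
def make_generator(seed, device="cuda"):
    """Create a seeded torch random generator on `device`."""
    import torch  # imported lazily; only needed when a seed is used

    return torch.Generator(device).manual_seed(seed)


def generate(pipe, prompt, generator=None, num_inference_steps=50):
    """Call the pipeline and return the first image."""
    result = pipe(
        prompt,
        num_inference_steps=num_inference_steps,
        generator=generator,
    )
    return result.images[0]


# Hypothetical usage, with `pipe` a StableDiffusionPipeline already on the GPU:
#   prompt = "a photograph of an astronaut riding a horse"
#   full = generate(pipe, prompt, generator=make_generator(1024))
#   fast = generate(pipe, prompt, generator=make_generator(1024),
#                   num_inference_steps=15)
```

Recreating the generator with the same seed before each call is what makes the two runs start from the same noise, so only the step count differs.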
The other parameter in the pipeline call is guidance_scale. It is a way to increase the adherence to the conditional signal, which in this case is text, as well as overall sample quality. In simple terms, classifier-free guidance forces the generation to better match the prompt. Numbers like 7 or 8.5 give good results; if you use a very large number the images might look good, but will be less diverse.
You can learn about the technical details of this parameter in the last section of this notebook.
To generate multiple images for the same prompt, we simply use a list with the same prompt repeated several times. We'll send the list to the pipeline instead of the string we used before.
Let's first write a helper function to display a grid of images. Just run the following cell to create the image_grid function, or expand the code if you are interested in how it's done.
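The helper cell is not shown in this text. One way such an image_grid function could be written with Pillow is sketched below; the separate grid_boxes helper is a hypothetical addition that keeps the layout arithmetic easy to check on its own.

```python
def grid_boxes(count, rows, cols, width, height):
    """Top-left paste position of each tile, filled row by row."""
    if count != rows * cols:
        raise ValueError("need exactly rows * cols images")
    return [((i % cols) * width, (i // cols) * height) for i in range(count)]


def image_grid(imgs, rows, cols):
    """Paste equally sized PIL images into one rows x cols image."""
    from PIL import Image  # Pillow, imported lazily

    w, h = imgs[0].size
    grid = Image.new("RGB", size=(cols * w, rows * h))
    for img, box in zip(imgs, grid_boxes(len(imgs), rows, cols, w, h)):
        grid.paste(img, box=box)
    return grid


# Hypothetical usage with a loaded pipeline `pipe`:
#   images = pipe(["a photograph of an astronaut riding a horse"] * 3).images
#   grid = image_grid(images, rows=1, cols=3)
```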
Now we can generate a grid image after running the pipeline with a list of 3 prompts.
And here's how to generate a grid of n × m images.
Generate non-square images
Stable Diffusion produces images of 512 × 512 pixels by default. But it's very easy to override the default using the height and width arguments, so you can create rectangular images in portrait or landscape ratios.
These are some recommendations to choose good image sizes:
Make sure height and width are both multiples of 8.
Going below 512 might result in lower quality images.
Going over 512 in both directions will repeat image areas (global coherence is lost).
The best way to create non-square images is to use 512 in one dimension, and a value larger than that in the other one.
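These recommendations can be turned into a small pre-flight check. The sketch below is purely illustrative: size_warnings is a hypothetical helper, not part of diffusers, and it only restates the rules listed above.

```python
def size_warnings(height, width, base=512):
    """Return the recommendations that a requested image size violates."""
    notes = []
    if height % 8 or width % 8:
        notes.append("height and width should both be multiples of 8")
    if height < base or width < base:
        notes.append(f"going below {base} might lower image quality")
    if height > base and width > base:
        notes.append(f"going over {base} in both directions may repeat image areas")
    return notes


print(size_warnings(512, 768))  # landscape size that follows all the rules

# Hypothetical usage with a loaded pipeline `pipe`:
#   image = pipe(prompt, height=512, width=768).images[0]
```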
2. What is Stable Diffusion
Now, let's go into the theoretical part of Stable Diffusion 👩‍🎓.
Stable Diffusion is based on a particular type of diffusion model called Latent Diffusion, proposed in High-Resolution Image Synthesis with Latent Diffusion Models.
General diffusion models are machine learning systems that are trained to denoise random Gaussian noise step by step, to get to a sample of interest, such as an image. For a more detailed overview of how they work, check this colab.
Diffusion models have been shown to achieve state-of-the-art results for generating image data. But one downside of diffusion models is that the reverse denoising process is slow. In addition, these models consume a lot of memory because they operate in pixel space, which becomes unreasonably expensive when generating high-resolution images. Therefore, it is challenging to train these models and also to use them for inference.
Latent diffusion can reduce the memory and compute complexity by applying the diffusion process over a lower dimensional latent space, instead of using the actual pixel space. This is the key difference between standard diffusion and latent diffusion models: in latent diffusion the model is trained to generate latent (compressed) representations of the images.
There are three main components in latent diffusion.
An autoencoder (VAE).
A U-Net.
A text-encoder, e.g. CLIP's Text Encoder.
1. The autoencoder (VAE)
The VAE model has two parts, an encoder and a decoder. The encoder is used to convert the image into a low dimensional latent representation, which will serve as the input to the U-Net model. The decoder, conversely, transforms the latent representation back into an image.
During latent diffusion training, the encoder is used to get the latent representations (latents) of the images for the forward diffusion process, which applies more and more noise at each step. During inference, the denoised latents generated by the reverse diffusion process are converted back into images using the VAE decoder. As we will see, during inference we only need the VAE decoder.
2. The U-Net
The U-Net has an encoder part and a decoder part, both composed of ResNet blocks. The encoder compresses an image representation into a lower resolution image representation, and the decoder decodes the lower resolution image representation back to the original higher resolution image representation that is supposedly less noisy. More specifically, the U-Net output predicts the noise residual, which can be used to compute the predicted denoised image representation.
To prevent the U-Net from losing important information while downsampling, short-cut connections are usually added between the downsampling ResNets of the encoder and the upsampling ResNets of the decoder. Additionally, the Stable Diffusion U-Net is able to condition its output on text embeddings via cross-attention layers. The cross-attention layers are added to both the encoder and decoder part of the U-Net, usually between ResNet blocks.
3. The Text-encoder
The text-encoder is responsible for transforming the input prompt, e.g. "An astronaut riding a horse", into an embedding space that can be understood by the U-Net. It is usually a simple transformer-based encoder that maps a sequence of input tokens to a sequence of latent text-embeddings.
Inspired by Imagen, Stable Diffusion does not train the text-encoder during training and simply uses CLIP's already trained text encoder, CLIPTextModel.
Why is latent diffusion fast and efficient?
Since the U-Net of latent diffusion models operates on a low dimensional space, it greatly reduces the memory and compute requirements compared to pixel-space diffusion models. For example, the autoencoder used in Stable Diffusion has a reduction factor of 8. This means that an image of shape (3, 512, 512) becomes (4, 64, 64) in latent space: each spatial dimension shrinks by a factor of 8, so there are 8 × 8 = 64 times fewer spatial positions and, counting channels, 48 times fewer values to store.
This is why it's possible to generate 512 × 512 images so quickly, even on 16GB Colab GPUs!
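The arithmetic behind this saving can be checked directly. The sketch below assumes the 4 latent channels used by the Stable Diffusion v1 autoencoder and the reduction factor of 8 mentioned above; latent_shape is a hypothetical helper.

```python
from math import prod


def latent_shape(image_shape, factor=8, latent_channels=4):
    """Latent shape for an image of shape (channels, height, width)."""
    _, height, width = image_shape
    return (latent_channels, height // factor, width // factor)


image = (3, 512, 512)
latent = latent_shape(image)

# Ratio of spatial positions (height * width) between image and latent.
spatial_ratio = (image[1] * image[2]) // (latent[1] * latent[2])
# Ratio of stored values once the channel counts are included.
value_ratio = prod(image) // prod(latent)

print(latent, spatial_ratio, value_ratio)
```

The spatial ratio is the 8 × 8 = 64 figure; the ratio of stored values is somewhat smaller because the latent has one more channel than an RGB image.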
Stable Diffusion during inference
Putting it all together, let's now take a closer look at how the model works in inference by illustrating the logical flow.
The Stable Diffusion model takes both a latent seed and a text prompt as an input. The latent seed is then used to generate random latent image representations of size 64 × 64, whereas the text prompt is transformed to text embeddings of size 77 × 768 via CLIP's text encoder.
Next the U-Net iteratively denoises the random latent image representations while being conditioned on the text embeddings. The output of the U-Net, being the noise residual, is used to compute a denoised latent image representation via a scheduler algorithm. Many different scheduler algorithms can be used for this computation, each having its pros and cons. For Stable Diffusion, we recommend using one of:
PNDM scheduler (used by default).
DPM Solver Multistep scheduler. This scheduler is able to achieve great quality in fewer steps. You can try with 25 instead of the default 50!
The theory of how scheduler algorithms function is out of scope for this notebook, but in short one should remember that they compute the predicted denoised image representation from the previous noise representation and the predicted noise residual. For more information, we recommend looking into Elucidating the Design Space of Diffusion-Based Generative Models.
The denoising process is repeated ca. 50 times to step-by-step retrieve better latent image representations. Once complete, the latent image representation is decoded by the decoder part of the variational autoencoder.
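To make the role of the scheduler more concrete, here is a deliberately simplified Euler-style update in the noise-level (sigma) formulation of the paper cited above. It is not the PNDM, DPM Solver or K-LMS algorithm, which combine several past predictions; it only illustrates how a noisy sample and a predicted noise residual give both a denoised estimate and the next, less noisy sample. Plain Python lists stand in for tensors, and the noisy sample is assumed to equal the clean sample plus sigma times the noise.

```python
def euler_step(sample, noise_pred, sigma, sigma_next):
    """One Euler step from noise level `sigma` down to `sigma_next`."""
    # Denoised estimate implied by the predicted noise residual.
    denoised = [x - sigma * e for x, e in zip(sample, noise_pred)]
    # Follow the predicted noise direction to the next noise level.
    next_sample = [x + (sigma_next - sigma) * e for x, e in zip(sample, noise_pred)]
    return next_sample, denoised


clean = [1.0, -2.0]
noise = [0.5, 0.25]
sigma = 4.0
noisy = [c + sigma * n for c, n in zip(clean, noise)]

# With a perfect noise prediction, stepping to sigma_next = 0 recovers `clean`.
final, estimate = euler_step(noisy, noise, sigma, 0.0)
print(final, estimate)
```

In a real sampler the noise prediction comes from the U-Net and is imperfect, which is why the step is repeated over a decreasing sequence of noise levels instead of jumping straight to zero.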
After this brief introduction to Latent and Stable Diffusion, let's see how to make advanced use of 🤗 Hugging Face Diffusers!
3. How to write your own inference pipeline with diffusers
Finally, we show how you can create custom diffusion pipelines with diffusers. This is often very useful to dig a bit deeper into certain functionalities of the system and to potentially switch out certain components.
In this section, we will demonstrate how to use Stable Diffusion with a different scheduler, namely Katherine Crowson's K-LMS scheduler that was added in this PR.
Let's go through the StableDiffusionPipeline step by step to see how we could have written it ourselves.
We will start by loading the individual models involved.
The pre-trained model includes all the components required to set up a complete diffusion pipeline. They are stored in the following folders:
text_encoder: Stable Diffusion uses CLIP, but other diffusion models may use other encoders such as BERT.
tokenizer: It must match the one used by the text_encoder model.
scheduler: The scheduling algorithm used to progressively add noise to the image during training.
unet: The model used to generate the latent representation of the input.
vae: Autoencoder module that we'll use to decode latent representations into real images.
We can load the components by referring to the folder they were saved in, using the subfolder argument to from_pretrained.
Now, instead of loading the pre-defined scheduler, we'll use the K-LMS scheduler.
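Loading the individual components and swapping in the K-LMS scheduler might look like the sketch below. The function name is hypothetical, and the beta settings passed to LMSDiscreteScheduler are the values commonly used with Stable Diffusion v1; they are an assumption here, since the notebook's own cell is not shown.

```python
MODEL_ID = "CompVis/stable-diffusion-v1-4"

# Assumed noise-schedule settings for the K-LMS scheduler.
KLMS_KWARGS = dict(
    beta_start=0.00085,
    beta_end=0.012,
    beta_schedule="scaled_linear",
    num_train_timesteps=1000,
)


def load_components(model_id=MODEL_ID):
    """Load VAE, tokenizer, text encoder and U-Net, plus a K-LMS scheduler."""
    from diffusers import AutoencoderKL, LMSDiscreteScheduler, UNet2DConditionModel
    from transformers import CLIPTextModel, CLIPTokenizer

    vae = AutoencoderKL.from_pretrained(model_id, subfolder="vae")
    tokenizer = CLIPTokenizer.from_pretrained(model_id, subfolder="tokenizer")
    text_encoder = CLIPTextModel.from_pretrained(model_id, subfolder="text_encoder")
    unet = UNet2DConditionModel.from_pretrained(model_id, subfolder="unet")
    # Built directly instead of loading the "scheduler" folder.
    scheduler = LMSDiscreteScheduler(**KLMS_KWARGS)
    return vae, tokenizer, text_encoder, unet, scheduler


# Hypothetical usage, including the move to the GPU:
#   vae, tokenizer, text_encoder, unet, scheduler = load_components()
#   vae, text_encoder, unet = (m.to("cuda") for m in (vae, text_encoder, unet))
```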
Next we move the models to the GPU.
We now define the parameters we'll use to generate images.
Note that guidance_scale is defined analogously to the guidance weight w of equation (2) in the Imagen paper. guidance_scale == 1 corresponds to doing no classifier-free guidance. Here we set it to 7.5, as done previously.
In contrast to the previous examples, we set num_inference_steps to 100 to get an even more defined image.
First, we get the text_embeddings for the prompt. These embeddings will be used to condition the UNet model.
We'll also get the unconditional text embeddings for classifier-free guidance, which are just the embeddings for the padding token (empty text). They need to have the same shape as the conditional text_embeddings (batch_size and seq_length).
For classifier-free guidance, we need to do two forward passes. One with the conditioned input (text_embeddings), and another with the unconditional embeddings (uncond_embeddings). In practice, we can concatenate both into a single batch to avoid doing two forward passes.
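A sketch of how both sets of embeddings could be computed is shown below. It assumes a CLIP tokenizer and text encoder from transformers; the function name encode_prompts is hypothetical. Padding every prompt to the tokenizer's maximum length is what gives the conditional and unconditional embeddings the same shape.

```python
def encode_prompts(tokenizer, text_encoder, prompts, device="cuda"):
    """Return (text_embeddings, uncond_embeddings) for a list of prompts."""
    import torch  # imported lazily

    def embed(texts):
        tokens = tokenizer(
            texts,
            padding="max_length",  # pad so every sequence has equal length
            max_length=tokenizer.model_max_length,
            truncation=True,
            return_tensors="pt",
        )
        with torch.no_grad():
            # The first output is the sequence of per-token hidden states.
            return text_encoder(tokens.input_ids.to(device))[0]

    text_embeddings = embed(list(prompts))
    # Empty prompts of the same batch size give the unconditional embeddings.
    uncond_embeddings = embed([""] * len(prompts))
    return text_embeddings, uncond_embeddings
```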
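The way the two predictions are combined afterwards can be illustrated without any model. In the sketch below, plain Python lists stand in for the noise tensors, and both function names are hypothetical; the batch is assumed to hold the unconditional half first and the text-conditioned half second.

```python
def split_batch(batch):
    """Split a concatenated [uncond..., text...] batch back into two halves."""
    half = len(batch) // 2
    return batch[:half], batch[half:]


def classifier_free_guidance(noise_uncond, noise_text, guidance_scale):
    """Push the unconditional prediction towards the text-conditioned one."""
    return [u + guidance_scale * (t - u) for u, t in zip(noise_uncond, noise_text)]


batch = [0.0, 1.0, 1.0, 3.0]  # two unconditional values, then two conditioned
uncond, text = split_batch(batch)

print(classifier_free_guidance(uncond, text, 1.0))  # same as `text`
print(classifier_free_guidance(uncond, text, 7.5))  # exaggerates the difference
```

With a scale of 1 the result is just the text-conditioned prediction, which is why that value corresponds to no guidance.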
Generate the initial random noise.
Cool, 64 × 64 is expected. The model will transform this latent representation (pure noise) into a 512 × 512 image later on.
Next, we initialize the scheduler with our chosen num_inference_steps. This will compute the sigmas and exact time step values to be used during the denoising process.
The K-LMS scheduler needs to multiply the latents by its sigma values. Let's do this here:
We are ready to write the denoising loop.
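Putting the previous steps together, a denoising loop could be sketched as follows. It assumes a recent diffusers release (init_noise_sigma and scale_model_input on the scheduler), text embeddings stacked as unconditional first and conditional second, and an arbitrary default seed; the function names are hypothetical.

```python
def latent_shape(batch_size, height, width, channels=4, factor=8):
    """Shape of the latent batch for images of height x width pixels."""
    return (batch_size, channels, height // factor, width // factor)


def denoise(unet, scheduler, text_embeddings, height=512, width=512,
            num_inference_steps=100, guidance_scale=7.5, seed=0, device="cuda"):
    """Return denoised latents; `text_embeddings` is [uncond, text] stacked."""
    import torch  # imported lazily

    batch_size = text_embeddings.shape[0] // 2
    shape = latent_shape(batch_size, height, width, unet.config.in_channels)
    generator = torch.manual_seed(seed)
    latents = torch.randn(shape, generator=generator)
    latents = latents.to(device=device, dtype=text_embeddings.dtype)

    scheduler.set_timesteps(num_inference_steps)
    latents = latents * scheduler.init_noise_sigma  # scale to the first sigma

    for t in scheduler.timesteps:
        # One batch carries both the unconditional and the conditioned pass.
        model_input = torch.cat([latents] * 2)
        model_input = scheduler.scale_model_input(model_input, t)
        with torch.no_grad():
            noise_pred = unet(
                model_input, t, encoder_hidden_states=text_embeddings
            ).sample
        noise_uncond, noise_text = noise_pred.chunk(2)
        noise_pred = noise_uncond + guidance_scale * (noise_text - noise_uncond)
        # The scheduler computes the less noisy latents for the next step.
        latents = scheduler.step(noise_pred, t, latents).prev_sample
    return latents
```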
We now use the vae to decode the generated latents back into the image.
And finally, let's convert the image to PIL so we can display or save it.
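Decoding and conversion could be sketched as below. The scaling factor 0.18215 is the value associated with the Stable Diffusion v1 autoencoder and is an assumption here, as is the function name; the decoder output is taken to lie in the range [-1, 1].

```python
def decode_latents(vae, latents, scaling_factor=0.18215):
    """Decode latents with the VAE and return a list of PIL images."""
    import torch
    from PIL import Image

    with torch.no_grad():
        # Undo the latent scaling before decoding.
        image = vae.decode(latents / scaling_factor).sample
    image = (image / 2 + 0.5).clamp(0, 1)  # map [-1, 1] to [0, 1]
    # Reorder from (batch, channels, H, W) to (batch, H, W, channels).
    image = image.cpu().permute(0, 2, 3, 1).float().numpy()
    arrays = (image * 255).round().astype("uint8")
    return [Image.fromarray(array) for array in arrays]


# Hypothetical usage:
#   images = decode_latents(vae, latents)
#   images[0].save("result.png")
```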
Now you have all the pieces to build your own pipelines or use diffusers components as you like 🔥.