GitHub Repository: huggingface/notebooks
Path: blob/main/diffusers/unidiffuser.ipynb
Kernel: Python 3 (ipykernel)

Generating images and text with UniDiffuser

UniDiffuser was introduced in the paper "One Transformer Fits All Distributions in Multi-Modal Diffusion at Scale".

In this notebook, we will show how the UniDiffuser pipeline in 🧨 diffusers can be used for:

  • Unconditional image generation

  • Unconditional text generation

  • Text-to-image generation

  • Image-to-text generation

  • Image variation

  • Text variation

One pipeline to rule six use cases 🤯

Let's start!

Setup

!pip install -q git+https://github.com/huggingface/diffusers
!pip install transformers accelerate -q
Installing build dependencies ... done
Getting requirements to build wheel ... done
Preparing metadata (pyproject.toml) ... done
Building wheel for diffusers (pyproject.toml) ... done
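
Before loading the model, it can help to confirm the environment: the pipeline below assumes a CUDA GPU and a recent diffusers build. A minimal sanity check (nothing UniDiffuser-specific is assumed here):

import torch
import diffusers

# Print library versions and confirm a CUDA device is visible before loading the pipeline.
print("diffusers:", diffusers.__version__)
print("torch:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())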

Unconditional image and text generation

Throughout this notebook, we'll be using the "thu-ml/unidiffuser-v1" checkpoint. UniDiffuser comes with two checkpoints: "thu-ml/unidiffuser-v0" and "thu-ml/unidiffuser-v1".

import torch

from diffusers import UniDiffuserPipeline

device = "cuda"
model_id_or_path = "thu-ml/unidiffuser-v1"
pipe = UniDiffuserPipeline.from_pretrained(model_id_or_path, torch_dtype=torch.float16)
pipe.to(device)

# Unconditional image and text generation. The generation task is automatically inferred.
sample = pipe(num_inference_steps=20, guidance_scale=8.0)
image = sample.images[0]
text = sample.text[0]
image.save("unidiffuser_joint_sample_image.png")
print(text)
No inputs or latents have been supplied, and mode has not been manually set, defaulting to mode 'joint'.
/usr/local/lib/python3.10/dist-packages/transformers/models/gpt2/modeling_gpt2.py:202: UserWarning: where received a uint8 condition tensor. This behavior is deprecated and will be removed in a future version of PyTorch. Use a boolean condition instead. (Triggered internally at ../aten/src/ATen/native/TensorCompare.cpp:493.) attn_weights = torch.where(causal_mask, attn_weights.to(attn_weights.dtype), mask_value)
A small white car parked up in parking lot
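
The joint sample above changes on every run. For reproducible outputs, you can pass a seeded torch.Generator to the pipeline call, as with other diffusers pipelines; this is a minimal sketch (the seed value 0 is arbitrary):

# Seed the sampler for reproducible joint generation.
generator = torch.Generator(device=device).manual_seed(0)
sample = pipe(num_inference_steps=20, guidance_scale=8.0, generator=generator)
sample.images[0].save("unidiffuser_joint_sample_image_seed0.png")
print(sample.text[0])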

You can also generate only an image or only text (which the UniDiffuser paper calls “marginal” generation since we sample from the marginal distribution of images and text, respectively):

# Unlike other generation tasks, image-only and text-only generation don't use classifier-free guidance

# Image-only generation
pipe.set_image_mode()
sample_image = pipe(num_inference_steps=20).images[0]

# Text-only generation
pipe.set_text_mode()
sample_text = pipe(num_inference_steps=20).text[0]

To reset a mode, call: pipe.reset_mode().
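
After resetting, the pipeline falls back to inferring the mode from the inputs of the next call, just as in the first example above. A minimal sketch (the prompt here is only a placeholder):

# After reset_mode(), the mode is inferred from the call arguments:
# a prompt without an image is treated as text-to-image generation.
pipe.reset_mode()
sample = pipe(prompt="an oil painting of a lighthouse", num_inference_steps=20, guidance_scale=8.0)
inferred_image = sample.images[0]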

Text-to-image generation

The UniDiffuserPipeline can infer the right mode of execution from the inputs provided to the pipeline call. If a mode has been set explicitly (for example with set_joint_mode()), subsequent calls will be executed in that mode. Now we want to generate images from text, so we set the mode accordingly.

pipe.set_text_to_image_mode()
# Text-to-image generation
prompt = "an elephant under the sea"
sample = pipe(prompt=prompt, num_inference_steps=20, guidance_scale=8.0)
t2i_image = sample.images[0]
t2i_image.save("unidiffuser_text2img_sample_image.png")

Image-to-text generation

pipe.set_image_to_text_mode()
from diffusers.utils import load_image

# Image-to-text generation
image_url = "https://huggingface.co/datasets/hf-internal-testing/diffusers-images/resolve/main/unidiffuser/unidiffuser_example_image.jpg"
init_image = load_image(image_url).resize((512, 512))
sample = pipe(image=init_image, num_inference_steps=20, guidance_scale=8.0)
i2t_text = sample.text[0]
print(i2t_text)
An image of an astronaut flying over the Earth

Image variation

For image variation, we follow the "round-trip" method suggested in the paper: we first generate a caption from a given image, and then use that caption to generate a new image.

# Image variation can be performed with an image-to-text generation followed by a text-to-image generation:
# 1. Image-to-text generation
image_url = "https://huggingface.co/datasets/hf-internal-testing/diffusers-images/resolve/main/unidiffuser/unidiffuser_example_image.jpg"
init_image = load_image(image_url).resize((512, 512))

pipe.set_image_to_text_mode()
sample = pipe(image=init_image, num_inference_steps=20, guidance_scale=8.0)
i2t_text = sample.text[0]
print(i2t_text)

# 2. Text-to-image generation
pipe.set_text_to_image_mode()
sample = pipe(prompt=i2t_text, num_inference_steps=20, guidance_scale=8.0)
final_image = sample.images[0]
final_image.save("unidiffuser_image_variation_sample.png")
An astronaut floating in
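
To compare the input image with its round-trip variation, you can place the two side by side with PIL; a minimal sketch (the file name matches the one saved above):

from PIL import Image

# Paste the input image and its variation onto one canvas for a quick visual comparison.
variation = Image.open("unidiffuser_image_variation_sample.png")
canvas = Image.new("RGB", (init_image.width + variation.width, max(init_image.height, variation.height)))
canvas.paste(init_image, (0, 0))
canvas.paste(variation, (init_image.width, 0))
canvas.save("unidiffuser_image_variation_comparison.png")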

Text variation

The same round-trip methodology can be applied here.

# Text variation can be performed with a text-to-image generation followed by an image-to-text generation:
# 1. Text-to-image generation
prompt = "an elephant under the sea"

pipe.set_text_to_image_mode()
sample = pipe(prompt=prompt, num_inference_steps=20, guidance_scale=8.0)
t2i_image = sample.images[0]
t2i_image.save("unidiffuser_text2img_sample_image.png")

# 2. Image-to-text generation
pipe.set_image_to_text_mode()
sample = pipe(image=t2i_image, num_inference_steps=20, guidance_scale=8.0)
final_prompt = sample.text[0]
print(final_prompt)
A baby elephant in a aquarium with a fish