GitHub Repository: huggingface/notebooks
Path: blob/main/diffusers/unidiffuser.ipynb
Kernel: Python 3 (ipykernel)

Generating images and text with UniDiffuser

UniDiffuser was introduced in the paper "One Transformer Fits All Distributions in Multi-Modal Diffusion at Scale".

In this notebook, we will show how the UniDiffuser pipeline in 🧨 diffusers can be used for:

  • Unconditional image generation

  • Unconditional text generation

  • Text-to-image generation

  • Image-to-text generation

  • Image variation

  • Text variation

One pipeline to rule six use cases 🤯

Let's start!

Setup

!pip install -q git+https://github.com/huggingface/diffusers
!pip install transformers accelerate -q
Installing build dependencies ... done
Getting requirements to build wheel ... done
Preparing metadata (pyproject.toml) ... done
Building wheel for diffusers (pyproject.toml) ... done
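
Before loading the model, it can help to confirm the environment: the pipeline below assumes a CUDA GPU and a recent diffusers build. A minimal sanity check (nothing UniDiffuser-specific is assumed here):

import torch
import diffusers

# Print library versions and confirm a CUDA device is visible before loading the pipeline.
print("diffusers:", diffusers.__version__)
print("torch:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())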

Unconditional image and text generation

Throughout this notebook, we'll be using the "thu-ml/unidiffuser-v1" checkpoint. UniDiffuser comes with two checkpoints: "thu-ml/unidiffuser-v0" and "thu-ml/unidiffuser-v1".

import torch

from diffusers import UniDiffuserPipeline

device = "cuda"
model_id_or_path = "thu-ml/unidiffuser-v1"
pipe = UniDiffuserPipeline.from_pretrained(model_id_or_path, torch_dtype=torch.float16)
pipe.to(device)

# Unconditional image and text generation. The generation task is automatically inferred.
sample = pipe(num_inference_steps=20, guidance_scale=8.0)
image = sample.images[0]
text = sample.text[0]
image.save("unidiffuser_joint_sample_image.png")
print(text)
No inputs or latents have been supplied, and mode has not been manually set, defaulting to mode 'joint'.
/usr/local/lib/python3.10/dist-packages/transformers/models/gpt2/modeling_gpt2.py:202: UserWarning: where received a uint8 condition tensor. This behavior is deprecated and will be removed in a future version of PyTorch. Use a boolean condition instead. (Triggered internally at ../aten/src/ATen/native/TensorCompare.cpp:493.) attn_weights = torch.where(causal_mask, attn_weights.to(attn_weights.dtype), mask_value)
A small white car parked up in parking lot
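
The joint sample above changes on every run. For reproducible outputs, you can pass a seeded torch.Generator to the pipeline call, as with other diffusers pipelines; this is a minimal sketch (the seed value 0 is arbitrary):

# Seed the sampler for reproducible joint generation.
generator = torch.Generator(device=device).manual_seed(0)
sample = pipe(num_inference_steps=20, guidance_scale=8.0, generator=generator)
sample.images[0].save("unidiffuser_joint_sample_image_seed0.png")
print(sample.text[0])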

You can also generate only an image or only text (which the UniDiffuser paper calls “marginal” generation since we sample from the marginal distribution of images and text, respectively):

# Unlike other generation tasks, image-only and text-only generation don't use classifier-free guidance

# Image-only generation
pipe.set_image_mode()
sample_image = pipe(num_inference_steps=20).images[0]

# Text-only generation
pipe.set_text_mode()
sample_text = pipe(num_inference_steps=20).text[0]

To reset a mode, call: pipe.reset_mode().
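
After resetting, the pipeline falls back to inferring the mode from the inputs of the next call, just as in the first example above. A minimal sketch (the prompt here is only a placeholder):

# After reset_mode(), the mode is inferred from the call arguments:
# a prompt without an image is treated as text-to-image generation.
pipe.reset_mode()
sample = pipe(prompt="an oil painting of a lighthouse", num_inference_steps=20, guidance_scale=8.0)
inferred_image = sample.images[0]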

Text-to-image generation

The UniDiffuserPipeline can infer the right mode of execution from the inputs provided to the pipeline call. If a mode has been set explicitly (for example with set_joint_mode()), subsequent calls will be executed in that mode. Now we want to generate images from text, so we set the mode accordingly.

pipe.set_text_to_image_mode()
# Text-to-image generation
prompt = "an elephant under the sea"
sample = pipe(prompt=prompt, num_inference_steps=20, guidance_scale=8.0)
t2i_image = sample.images[0]
t2i_image.save("unidiffuser_text2img_sample_image.png")

Image-to-text generation

pipe.set_image_to_text_mode()
from diffusers.utils import load_image

# Image-to-text generation
image_url = "https://huggingface.co/datasets/hf-internal-testing/diffusers-images/resolve/main/unidiffuser/unidiffuser_example_image.jpg"
init_image = load_image(image_url).resize((512, 512))
sample = pipe(image=init_image, num_inference_steps=20, guidance_scale=8.0)
i2t_text = sample.text[0]
print(i2t_text)
An image of an astronaut flying over the Earth

Image variation

For image variation, we follow the "round-trip" method suggested in the paper: we first generate a caption from a given image, and then use that caption to generate a new image.

# Image variation can be performed with an image-to-text generation followed by a text-to-image generation:
# 1. Image-to-text generation
image_url = "https://huggingface.co/datasets/hf-internal-testing/diffusers-images/resolve/main/unidiffuser/unidiffuser_example_image.jpg"
init_image = load_image(image_url).resize((512, 512))

pipe.set_image_to_text_mode()
sample = pipe(image=init_image, num_inference_steps=20, guidance_scale=8.0)
i2t_text = sample.text[0]
print(i2t_text)

# 2. Text-to-image generation
pipe.set_text_to_image_mode()
sample = pipe(prompt=i2t_text, num_inference_steps=20, guidance_scale=8.0)
final_image = sample.images[0]
final_image.save("unidiffuser_image_variation_sample.png")
An astronaut floating in
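
To compare the input image with its round-trip variation, you can place the two side by side with PIL; a minimal sketch (the file name matches the one saved above):

from PIL import Image

# Paste the input image and its variation onto one canvas for a quick visual comparison.
variation = Image.open("unidiffuser_image_variation_sample.png")
canvas = Image.new("RGB", (init_image.width + variation.width, max(init_image.height, variation.height)))
canvas.paste(init_image, (0, 0))
canvas.paste(variation, (init_image.width, 0))
canvas.save("unidiffuser_image_variation_comparison.png")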

Text variation

The same round-trip methodology can be applied here.

# Text variation can be performed with a text-to-image generation followed by an image-to-text generation:
# 1. Text-to-image generation
prompt = "an elephant under the sea"

pipe.set_text_to_image_mode()
sample = pipe(prompt=prompt, num_inference_steps=20, guidance_scale=8.0)
t2i_image = sample.images[0]
t2i_image.save("unidiffuser_text2img_sample_image.png")

# 2. Image-to-text generation
pipe.set_image_to_text_mode()
sample = pipe(image=t2i_image, num_inference_steps=20, guidance_scale=8.0)
final_prompt = sample.text[0]
print(final_prompt)
A baby elephant in a aquarium with a fish