CLIP Guided Stable Diffusion using d🧨ffusers

This notebook shows how to do CLIP guidance with Stable Diffusion using the diffusers library. This lets you use the newly released CLIP models by LAION AI.

This notebook is based on the following amazing repos, all credits to the original authors!

Initial Setup

In [1]:

#@title Install dependencies
!pip install -qqq diffusers==0.11.1 transformers ftfy gradio accelerate


Authenticate with Hugging Face Hub

To use private and gated models on 🤗 Hugging Face Hub, login is required. If you are only using a public checkpoint (such as CompVis/stable-diffusion-v1-4 in this notebook), you can skip this step.

In [2]:

#@title Login
from huggingface_hub import notebook_login

notebook_login()

Out[2]:

Login successful
Your token has been saved to /root/.huggingface/token
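
When the notebook runs without an interactive prompt, the login step can be made conditional on a token being available. The sketch below is one way to do that. It assumes the token is stored in an environment variable named `HF_TOKEN` (an assumed convention, not something this notebook sets up), and `login_from_env` is a hypothetical helper name.

```python
import os


def login_from_env(var_name="HF_TOKEN"):
    """Log in to the Hugging Face Hub if a token is set in the environment.

    Returns True if a login was attempted, False if no token was found.
    `var_name` is an assumed convention, not something this notebook defines.
    """
    token = os.environ.get(var_name, "").strip()
    if not token:
        # Nothing to do: public checkpoints can be downloaded anonymously.
        return False

    # Imported lazily so the function is harmless when no token is present.
    from huggingface_hub import login

    login(token=token)
    return True
```

Calling `login_from_env()` before loading the pipeline leaves public checkpoints such as CompVis/stable-diffusion-v1-4 unaffected when no token is set.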

CLIP Guided Stable Diffusion
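
For intuition, CLIP guidance scores how far the embedding of the current image estimate is from the embedding of the text prompt, and uses the gradient of that score (scaled by `clip_guidance_scale`) to nudge the sample. One loss commonly used for this is the squared spherical distance between the two normalized embeddings. The pure-Python sketch below illustrates that loss on toy 3-dimensional vectors. That the community pipeline uses exactly this form is an assumption here, and the vectors are made up for illustration.

```python
import math


def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def spherical_dist_loss(a, b):
    """Squared great-circle distance between two directions, times 2."""
    a, b = normalize(a), normalize(b)
    # Straight-line (chord) distance between the two unit vectors.
    chord = math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    # Clamp to guard against rounding slightly above 1 for opposite vectors.
    half = min(1.0, chord / 2.0)
    return 2.0 * math.asin(half) ** 2


text_emb = [1.0, 0.0, 0.0]                               # toy "text" embedding
print(spherical_dist_loss([2.0, 0.0, 0.0], text_emb))   # 0.0: same direction
print(spherical_dist_loss([0.0, 3.0, 0.0], text_emb))   # about pi**2 / 8: orthogonal
```

The loss depends only on direction, not magnitude, which is why both vectors are normalized first.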

In [ ]:

#@title Load the pipeline
import torch
from PIL import Image

from diffusers import LMSDiscreteScheduler, DiffusionPipeline, PNDMScheduler
from transformers import CLIPFeatureExtractor, CLIPModel

model_id = "CompVis/stable-diffusion-v1-4" #@param {type: "string"}
clip_model_id = "laion/CLIP-ViT-B-32-laion2B-s34B-b79K" #@param ["laion/CLIP-ViT-B-32-laion2B-s34B-b79K", "laion/CLIP-ViT-L-14-laion2B-s32B-b82K", "laion/CLIP-ViT-H-14-laion2B-s32B-b79K", "laion/CLIP-ViT-g-14-laion2B-s12B-b42K", "openai/clip-vit-base-patch32", "openai/clip-vit-base-patch16", "openai/clip-vit-large-patch14"] {allow-input: true}
scheduler = "plms" #@param ['plms', 'lms']


def image_grid(imgs, rows, cols):
    # Paste `rows * cols` equally sized images into one grid, filling row by row.
    assert len(imgs) == rows*cols

    w, h = imgs[0].size
    grid = Image.new('RGB', size=(cols*w, rows*h))

    for i, img in enumerate(imgs):
        grid.paste(img, box=(i%cols*w, i//cols*h))
    return grid

if scheduler == "lms":
    scheduler = LMSDiscreteScheduler(beta_start=0.00085, beta_end=0.012, beta_schedule="scaled_linear")
else:
    # Load the scheduler config from the model repo's "scheduler" subfolder.
    scheduler = PNDMScheduler.from_pretrained(model_id, subfolder="scheduler")


feature_extractor = CLIPFeatureExtractor.from_pretrained(clip_model_id)
clip_model = CLIPModel.from_pretrained(clip_model_id, torch_dtype=torch.float16)


guided_pipeline = DiffusionPipeline.from_pretrained(
    model_id,
    custom_pipeline="clip_guided_stable_diffusion",
    custom_revision="main",  # TODO: remove if diffusers>=0.12.0
    clip_model=clip_model,
    feature_extractor=feature_extractor,
    scheduler=scheduler,
    torch_dtype=torch.float16,
)
guided_pipeline = guided_pipeline.to("cuda")
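
The `lms` branch above builds its scheduler with `beta_schedule="scaled_linear"`. As a rough illustration of what that schedule means, the sketch below spaces the square roots of the betas linearly between `beta_start` and `beta_end` and then squares them. The default of 1000 training timesteps is an assumption, and `scaled_linear_betas` is a hypothetical helper written in plain Python rather than a diffusers function.

```python
def scaled_linear_betas(beta_start=0.00085, beta_end=0.012, num_train_timesteps=1000):
    """Betas whose square roots are evenly spaced between the two endpoints."""
    lo, hi = beta_start ** 0.5, beta_end ** 0.5
    step = (hi - lo) / (num_train_timesteps - 1)
    # Interpolate linearly in sqrt-space, then square back to beta-space.
    return [(lo + i * step) ** 2 for i in range(num_train_timesteps)]


betas = scaled_linear_betas()
print(len(betas), betas[0], betas[-1])
```

Compared with a plain linear schedule between the same endpoints, this one keeps the betas smaller for the intermediate timesteps.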

In [ ]:

#@title Generate with Gradio Demo

import gradio as gr

import torch


def create_clip_guided_pipeline(model_id, clip_model_id):
    # Same construction as the "Load the pipeline" cell, for a different CLIP checkpoint.
    feature_extractor = CLIPFeatureExtractor.from_pretrained(clip_model_id)
    clip_model = CLIPModel.from_pretrained(clip_model_id, torch_dtype=torch.float16)
    return DiffusionPipeline.from_pretrained(
        model_id,
        custom_pipeline="clip_guided_stable_diffusion",
        custom_revision="main",  # TODO: remove if diffusers>=0.12.0
        clip_model=clip_model,
        feature_extractor=feature_extractor,
        scheduler=scheduler,
        torch_dtype=torch.float16,
    )


# CLIP checkpoint currently loaded in `guided_pipeline`
last_model = clip_model_id

# Argument order matches the `inputs` lists passed to `text.submit` and `btn.click`.
def infer(prompt, clip_prompt, samples, steps, scale, clip_scale, seed, clip_model, use_cutouts, num_cutouts):
    global last_model, guided_pipeline
    # Reload the pipeline only when a different CLIP model is selected.
    if clip_model != last_model:
        guided_pipeline = create_clip_guided_pipeline(model_id, clip_model)
        guided_pipeline = guided_pipeline.to("cuda")
        last_model = clip_model

    # Both models stay unfrozen, as in the "Generate on Colab" cell defaults.
    guided_pipeline.unfreeze_unet()
    guided_pipeline.unfreeze_vae()

    # Sliders may return floats; the generator and the loop need integers.
    generator = torch.Generator(device="cuda").manual_seed(int(seed))

    images = []
    for i in range(int(samples)):
        image = guided_pipeline(
            prompt,
            clip_prompt=clip_prompt if clip_prompt.strip() != "" else None,
            num_inference_steps=int(steps),
            guidance_scale=scale,
            clip_guidance_scale=clip_scale,
            num_cutouts=int(num_cutouts),
            use_cutouts=bool(use_cutouts),
            generator=generator,
        ).images[0]
        images.append(image)

    return images
    
css = """
        .gradio-container {
            font-family: 'IBM Plex Sans', sans-serif;
        }
        .gr-button {
            color: white;
            border-color: black;
            background: black;
        }
        input[type='range'] {
            accent-color: black;
        }
        .dark input[type='range'] {
            accent-color: #dfdfdf;
        }
        .container {
            max-width: 730px;
            margin: auto;
            padding-top: 1.5rem;
        }
        #gallery {
            min-height: 22rem;
            margin-bottom: 15px;
            margin-left: auto;
            margin-right: auto;
            border-bottom-right-radius: .5rem !important;
            border-bottom-left-radius: .5rem !important;
        }
        #gallery>div>.h-full {
            min-height: 20rem;
        }
        .details:hover {
            text-decoration: underline;
        }
        .gr-button {
            white-space: nowrap;
        }
        .gr-button:focus {
            border-color: rgb(147 197 253 / var(--tw-border-opacity));
            outline: none;
            box-shadow: var(--tw-ring-offset-shadow), var(--tw-ring-shadow), var(--tw-shadow, 0 0 #0000);
            --tw-border-opacity: 1;
            --tw-ring-offset-shadow: var(--tw-ring-inset) 0 0 0 var(--tw-ring-offset-width) var(--tw-ring-offset-color);
            --tw-ring-shadow: var(--tw-ring-inset) 0 0 0 calc(3px + var(--tw-ring-offset-width)) var(--tw-ring-color);
            --tw-ring-color: rgb(191 219 254 / var(--tw-ring-opacity));
            --tw-ring-opacity: .5;
        }
        #advanced-btn {
            font-size: .7rem !important;
            line-height: 19px;
            margin-top: 12px;
            margin-bottom: 12px;
            padding: 2px 8px;
            border-radius: 14px !important;
        }
        #advanced-options {
            display: none;
            margin-bottom: 20px;
        }
        .footer {
            margin-bottom: 45px;
            margin-top: 35px;
            text-align: center;
            border-bottom: 1px solid #e5e5e5;
        }
        .footer>p {
            font-size: .8rem;
            display: inline-block;
            padding: 0 10px;
            transform: translateY(10px);
            background: white;
        }
        .dark .footer {
            border-color: #303030;
        }
        .dark .footer>p {
            background: #0b0f19;
        }
        .acknowledgments h4{
            margin: 1.25em 0 .25em 0;
            font-weight: bold;
            font-size: 115%;
        }
"""

block = gr.Blocks(css=css)

examples = [
    [
        'A high tech solarpunk utopia in the Amazon rainforest',
        2,
        45,
        7.5,
        1024,
    ],
    [
        'A pikachu fine dining with a view to the Eiffel Tower',
        2,
        45,
        7,
        1024,
    ],
    [
        'A mecha robot in a favela in expressionist style',
        2,
        45,
        7,
        1024,
    ],
    [
        'an insect robot preparing a delicious meal',
        2,
        45,
        7,
        1024,
    ],
    [
        "A small cabin on top of a snowy mountain in the style of Disney, artstation",
        2,
        45,
        7,
        1024,
    ],
]

with block:
    gr.HTML(
        """
            <div style="text-align: center; max-width: 650px; margin: 0 auto;">
              <div
                style="
                  display: inline-flex;
                  align-items: center;
                  gap: 0.8rem;
                  font-size: 1.75rem;
                "
              >
                <svg
                  width="0.65em"
                  height="0.65em"
                  viewBox="0 0 115 115"
                  fill="none"
                  xmlns="http://www.w3.org/2000/svg"
                >
                  <rect width="23" height="23" fill="white"></rect>
                  <rect y="69" width="23" height="23" fill="white"></rect>
                  <rect x="23" width="23" height="23" fill="#AEAEAE"></rect>
                  <rect x="23" y="69" width="23" height="23" fill="#AEAEAE"></rect>
                  <rect x="46" width="23" height="23" fill="white"></rect>
                  <rect x="46" y="69" width="23" height="23" fill="white"></rect>
                  <rect x="69" width="23" height="23" fill="black"></rect>
                  <rect x="69" y="69" width="23" height="23" fill="black"></rect>
                  <rect x="92" width="23" height="23" fill="#D9D9D9"></rect>
                  <rect x="92" y="69" width="23" height="23" fill="#AEAEAE"></rect>
                  <rect x="115" y="46" width="23" height="23" fill="white"></rect>
                  <rect x="115" y="115" width="23" height="23" fill="white"></rect>
                  <rect x="115" y="69" width="23" height="23" fill="#D9D9D9"></rect>
                  <rect x="92" y="46" width="23" height="23" fill="#AEAEAE"></rect>
                  <rect x="92" y="115" width="23" height="23" fill="#AEAEAE"></rect>
                  <rect x="92" y="69" width="23" height="23" fill="white"></rect>
                  <rect x="69" y="46" width="23" height="23" fill="white"></rect>
                  <rect x="69" y="115" width="23" height="23" fill="white"></rect>
                  <rect x="69" y="69" width="23" height="23" fill="#D9D9D9"></rect>
                  <rect x="46" y="46" width="23" height="23" fill="black"></rect>
                  <rect x="46" y="115" width="23" height="23" fill="black"></rect>
                  <rect x="46" y="69" width="23" height="23" fill="black"></rect>
                  <rect x="23" y="46" width="23" height="23" fill="#D9D9D9"></rect>
                  <rect x="23" y="115" width="23" height="23" fill="#AEAEAE"></rect>
                  <rect x="23" y="69" width="23" height="23" fill="black"></rect>
                </svg>
                <h1 style="font-weight: 900; margin-bottom: 7px;">
                  CLIP Guided Stable Diffusion Demo
                </h1>
              </div>
              <p style="margin-bottom: 10px; font-size: 94%">
               This demo lets you use the newly released <a href="https://huggingface.co/laion" style="text-decoration: underline">CLIP models by LAION AI</a> with Stable Diffusion.
              </p>
            </div>
        """
    )
    with gr.Group():
        with gr.Box():
            with gr.Row().style(mobile_collapse=False, equal_height=True):
                text = gr.Textbox(
                    label="Enter your prompt",
                    show_label=False,
                    max_lines=1,
                    placeholder="Enter your prompt",
                ).style(
                    border=(True, False, True, True),
                    rounded=(True, False, False, True),
                    container=False,
                )
                btn = gr.Button("Generate image").style(
                    margin=False,
                    rounded=(False, True, True, False),
                )

        gallery = gr.Gallery(
            label="Generated images", show_label=False, elem_id="gallery"
        ).style(grid=[2], height="auto")

        advanced_button = gr.Button("Advanced options", elem_id="advanced-btn")

        with gr.Row(elem_id="advanced-options"):
            with gr.Column():
              clip_prompt = gr.Textbox(
                      label="Enter a CLIP prompt if you want it to differ",
                      show_label=False,
                      max_lines=1,
                      placeholder="Enter a CLIP prompt if you want it to differ",
              )
              with gr.Row():
                samples = gr.Slider(label="Images", minimum=1, maximum=2, value=1, step=1)
                steps = gr.Slider(label="Steps", minimum=1, maximum=50, value=45, step=1)
              with gr.Row():
                use_cutouts = gr.Checkbox(label="Use cutouts?")
                num_cutouts = gr.Slider(label="Cutouts", minimum=1, maximum=16, value=4, step=1)
            with gr.Row():
              with gr.Column():
                clip_model = gr.Dropdown(["laion/CLIP-ViT-B-32-laion2B-s34B-b79K", "laion/CLIP-ViT-L-14-laion2B-s32B-b82K", "laion/CLIP-ViT-H-14-laion2B-s32B-b79K", "laion/CLIP-ViT-g-14-laion2B-s12B-b42K", "openai/clip-vit-base-patch32", "openai/clip-vit-base-patch16", "openai/clip-vit-large-patch14"], value="laion/CLIP-ViT-B-32-laion2B-s34B-b79K", show_label=False)
                with gr.Row():
                  scale = gr.Slider(
                      label="Guidance Scale", minimum=0, maximum=50, value=7.5, step=0.1
                  )
                  seed = gr.Slider(
                      label="Seed",
                      minimum=0,
                      maximum=2147483647,
                      step=1,
                      randomize=True,
                  )
                  clip_scale = gr.Slider(
                      label="CLIP Guidance Scale", minimum=0, maximum=5000, value=100, step=1
                  )

        # Each example row holds five values: prompt, images, steps, guidance scale, seed.
        ex = gr.Examples(examples=examples, fn=infer, inputs=[text, samples, steps, scale, seed], outputs=gallery, cache_examples=False)
        ex.dataset.headers = [""]

        
        text.submit(infer, inputs=[text, clip_prompt, samples, steps, scale, clip_scale, seed, clip_model, use_cutouts, num_cutouts], outputs=gallery)
        btn.click(infer, inputs=[text, clip_prompt, samples, steps, scale, clip_scale, seed, clip_model, use_cutouts, num_cutouts], outputs=gallery)
        advanced_button.click(
            None,
            [],
            text,
            _js="""
            () => {
                const options = document.querySelector("body > gradio-app").querySelector("#advanced-options");
                options.style.display = ["none", ""].includes(options.style.display) ? "flex" : "none";
            }""",
        )
        gr.HTML(
            """
                <div class="footer">
                    <p>Model by <a href="https://huggingface.co/CompVis" style="text-decoration: underline;" target="_blank">CompVis</a> and <a href="https://huggingface.co/stabilityai" style="text-decoration: underline;" target="_blank">Stability AI</a> - Gradio Demo by 🤗 Hugging Face
                    </p>
                </div>
                <div class="acknowledgments">
                    <p><h4>LICENSE</h4>
The model is licensed with a <a href="https://huggingface.co/spaces/CompVis/stable-diffusion-license" style="text-decoration: underline;" target="_blank">CreativeML Open RAIL-M</a> license. The authors claim no rights on the outputs you generate; you are free to use them and are accountable for their use, which must not go against the provisions set in this license. The license forbids you from sharing any content that violates any laws, causes harm to a person, disseminates personal information meant to cause harm, spreads misinformation, or targets vulnerable groups. For the full list of restrictions, please <a href="https://huggingface.co/spaces/CompVis/stable-diffusion-license" target="_blank" style="text-decoration: underline;">read the license</a>.</p>
                    <p><h4>Biases and content acknowledgment</h4>
Despite how impressive it is to turn text into images, be aware that this model may output content that reinforces or exacerbates societal biases, as well as realistic faces, pornography and violence. The model was trained on the <a href="https://laion.ai/blog/laion-5b/" style="text-decoration: underline;" target="_blank">LAION-5B dataset</a>, which scraped non-curated image-text pairs from the internet (the exception being the removal of illegal content) and is meant for research purposes. You can read more in the <a href="https://huggingface.co/CompVis/stable-diffusion-v1-4" style="text-decoration: underline;" target="_blank">model card</a>.</p>
               </div>
           """
        )

block.launch(debug=True)
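
The CLIP model dropdown in the demo implies a reload-on-change pattern: building a pipeline is expensive, so it should happen only when the selected checkpoint differs from the one already loaded. The sketch below isolates that pattern. `PipelineCache` and `fake_loader` are hypothetical names used for illustration, and the loader returns a plain dict in place of a real pipeline.

```python
class PipelineCache:
    """Keep one loaded pipeline and rebuild it only when the key changes."""

    def __init__(self, loader):
        self._loader = loader      # callable: clip_model_id -> pipeline
        self._key = None           # checkpoint id of the cached pipeline
        self._pipeline = None
        self.loads = 0             # number of (expensive) loads so far

    def get(self, clip_model_id):
        if clip_model_id != self._key:
            self._pipeline = self._loader(clip_model_id)
            self._key = clip_model_id
            self.loads += 1
        return self._pipeline


def fake_loader(clip_model_id):
    # Stand-in for the real pipeline construction.
    return {"clip_model_id": clip_model_id}


cache = PipelineCache(fake_loader)
cache.get("openai/clip-vit-base-patch32")
cache.get("openai/clip-vit-base-patch32")   # reused, no reload
cache.get("openai/clip-vit-base-patch16")   # different checkpoint, reload
print(cache.loads)  # 2
```

Holding a single pipeline rather than a dictionary of them is a deliberate choice in this sketch, since several half-precision pipelines may not fit in GPU memory at once.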

In [ ]:

#@title Generate on Colab

prompt = "fantasy book cover, full moon, fantasy forest landscape, golden vector elements, fantasy magic, dark light night, intricate, elegant, sharp focus, illustration, highly detailed, digital painting, concept art, matte, art by WLOP and Artgerm and Albert Bierstadt, masterpiece" #@param {type: "string"}
#@markdown `clip_prompt` is optional. If you leave it blank, the same prompt is sent to both Stable Diffusion and CLIP.
clip_prompt = "" #@param {type: "string"}
num_samples = 1 #@param {type: "number"}
num_inference_steps = 50 #@param {type: "number"}
guidance_scale = 7.5 #@param {type: "number"}
clip_guidance_scale = 100 #@param {type: "number"}
num_cutouts = 4 #@param {type: "number"}
use_cutouts = "False" #@param ["False", "True"]
unfreeze_unet = "True" #@param ["False", "True"]
unfreeze_vae = "True" #@param ["False", "True"]
seed = 3788086447 #@param {type: "number"}

if unfreeze_unet == "True":
  guided_pipeline.unfreeze_unet()
else:
  guided_pipeline.freeze_unet()

if unfreeze_vae == "True":
  guided_pipeline.unfreeze_vae()
else:
  guided_pipeline.freeze_vae()

generator = torch.Generator(device="cuda").manual_seed(seed)

images = []
for i in range(num_samples):
    image = guided_pipeline(
        prompt,
        clip_prompt=clip_prompt if clip_prompt.strip() != "" else None,
        num_inference_steps=num_inference_steps,
        guidance_scale=guidance_scale, 
        clip_guidance_scale=clip_guidance_scale,
        num_cutouts=num_cutouts,
        use_cutouts=use_cutouts == "True",
        generator=generator,
    ).images[0]
    images.append(image)

image_grid(images, 1, num_samples)
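
To see how `image_grid` lays images out, the placement arithmetic can be checked without any images at all. The sketch below reproduces only the offset computation: image `i` goes to column `i % cols` and row `i // cols`. `grid_boxes` is a hypothetical helper, and the 512x512 tile size is just an example value.

```python
def grid_boxes(n_images, rows, cols, w, h):
    """Top-left paste offsets for n_images tiles of size (w, h), row by row."""
    if n_images != rows * cols:
        raise ValueError("number of images must equal rows * cols")
    return [((i % cols) * w, (i // cols) * h) for i in range(n_images)]


print(grid_boxes(4, 2, 2, 512, 512))
# [(0, 0), (512, 0), (0, 512), (512, 512)]
```

With `rows=1`, as in the cell above, every image lands on the same row and only the horizontal offset changes.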
