Video-text-to-text
Video-text-to-text models, also known as video language models, are models that take video input and generate text. These models can tackle various tasks, from video question answering to video captioning.
These models have nearly the same architecture as image-text-to-text models except for some changes to accept video data, since video data is essentially image frames with temporal dependencies. Some image-text-to-text models take in multiple images, but this alone is inadequate for a model to accept videos.
Moreover, video-text-to-text models are often trained on all vision modalities: each example might contain a single video, multiple videos, a single image, or multiple images. Some of these models can also take interleaved inputs. For example, you can refer to a specific video inside a string of text by adding a video token to the text, like "What is happening in this video? <video>".
Note that these models process videos without audio; any-to-any models, on the other hand, can also handle a video's audio track.
In this guide, we provide a brief overview of video LMs and show how to use them with Transformers for inference.
To begin with, there are multiple types of video LMs:
base models used for fine-tuning
chat fine-tuned models for conversation
instruction fine-tuned models
This guide focuses on inference with an instruction-tuned model, llava-hf/llava-onevision-qwen2-0.5b-ov-hf, which can take in interleaved data. Alternatively, you can try llava-interleave-qwen-0.5b-hf if your hardware doesn't allow running a larger model.
Let's begin by installing the dependencies.
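Something like the following should work in a notebook cell; accelerate is an assumption here, useful for device placement but not strictly required.

```python
# Install a recent Transformers release with video processor support.
# accelerate is optional but enables device_map-based weight placement.
!pip install -q transformers accelerate
```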
Let's initialize the model and the processor.
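A minimal initialization sketch, assuming the llava-onevision checkpoint above; the choice of AutoProcessor and LlavaOnevisionForConditionalGeneration, half precision, and device_map="auto" are assumptions rather than the notebook's exact code.

```python
import torch
from transformers import AutoProcessor, LlavaOnevisionForConditionalGeneration

model_id = "llava-hf/llava-onevision-qwen2-0.5b-ov-hf"

# Load the processor (tokenizer + image/video processors) and the model.
processor = AutoProcessor.from_pretrained(model_id)
model = LlavaOnevisionForConditionalGeneration.from_pretrained(
    model_id,
    torch_dtype=torch.float16,  # half precision to save memory
    device_map="auto",          # place weights on the available GPU(s)
)
```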
We will run inference on two videos, both of which feature cats.
Videos are series of image frames. Depending on your hardware limitations, you may need to downsample the frames; however, if too few frames are sampled, the predictions will be low quality.
Video-text-to-text models ship with processors that have a video processor abstracted inside them. You can pass video-related inference arguments to the apply_chat_template() function.
[!WARNING] You can learn more about video processors here.
We can define our chat history, passing in a video with a URL as shown below.
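A sketch of the chat history follows; the video URLs are placeholders rather than the clips used in the original notebook, and the "video"/"url" content format assumes a recent Transformers chat template.

```python
# Placeholder URLs: replace these with links to your own short cat videos.
video_url_1 = "https://example.com/cats_playing.mp4"
video_url_2 = "https://example.com/cat_sleeping.mp4"

messages = [
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "What is happening in this video?"},
            {"type": "video", "url": video_url_1},
        ],
    }
]
```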
You can preprocess the videos by passing in the messages, setting do_sample_frames to True, and specifying num_frames. Here we sample 10 frames.
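A possible preprocessing call, assuming do_sample_frames and num_frames are forwarded from apply_chat_template() to the underlying video processor as described above.

```python
inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
    do_sample_frames=True,  # sample frames instead of using every frame
    num_frames=10,          # keep 10 frames per video
).to(model.device, torch.float16)
```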
The inputs contain input_ids for the tokenized text, pixel_values_videos for the 10 sampled frames, and attention_mask indicating which tokens the model should attend to.
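You can inspect what the processor produced; the exact tensor shapes depend on the checkpoint's resolution and tokenizer, so treat the layout noted in the comment as an assumption for this model family.

```python
print(inputs.keys())
# Typically (num_videos, num_frames, channels, height, width) for this model family.
print(inputs["pixel_values_videos"].shape)
```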
We can now run generation with the preprocessed inputs and decode the output.
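A generation-and-decoding sketch; max_new_tokens=100 is an arbitrary choice.

```python
output_ids = model.generate(**inputs, max_new_tokens=100)

# Decode only the newly generated tokens, skipping the prompt portion.
generated_text = processor.batch_decode(
    output_ids[:, inputs["input_ids"].shape[1]:],
    skip_special_tokens=True,
)
print(generated_text[0])
```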
You can also interleave multiple videos with text directly in the chat template, as shown below.
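Reusing the placeholder URLs from above, an interleaved prompt might look like this.

```python
messages = [
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "What are the cats in these two videos doing?"},
            {"type": "video", "url": video_url_1},
            {"type": "video", "url": video_url_2},
        ],
    }
]
```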
The inference remains the same as the previous example.