VideoTetris: Towards Compositional
Text-To-Video Generation

1Peking University, 2Kuaishou Technology
*Equal Contribution, Corresponding Authors

VideoTetris is a novel framework that enables compositional text-to-video generation. We consider compositional video generation in two scenarios: video generation with compositional prompts, and long video generation with progressive compositional prompts. We compare VideoTetris with both short and long video generation models; it shows superior compositional generation, precisely following positional information, producing consistent scene transitions, and preserving the distinct features of different sub-objects.

Abstract

Diffusion models have demonstrated great success in text-to-video (T2V) generation. However, existing methods may face challenges when handling complex (long) video generation scenarios that involve multiple objects or dynamic changes in object numbers. To address these limitations, we propose VideoTetris, a novel framework that enables compositional T2V generation. Specifically, we propose spatio-temporal compositional diffusion, which precisely follows complex textual semantics by manipulating and composing the attention maps of denoising networks spatially and temporally. Moreover, we introduce an enhanced video data preprocessing pipeline that improves the training data in terms of motion dynamics and prompt understanding, together with a new reference frame attention mechanism that improves the consistency of auto-regressive video generation. Extensive experiments demonstrate that VideoTetris achieves impressive qualitative and quantitative results in compositional T2V generation.

Method

The overall pipeline of VideoTetris. We introduce a Spatio-Temporal Compositional Diffusion module for compositional video generation and a Reference Frame Attention mechanism for cross-frame consistency. For longer videos, a ControlNet-like branch can be adopted for auto-regressive generation.
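To make the reference frame attention idea concrete, the sketch below lets features of the current frame attend to keys and values taken from a reference frame (e.g., the first frame of a clip), so appearance information can propagate across auto-regressively generated chunks. This is a minimal single-head sketch under our own assumptions; the function name, shapes, and formulation are illustrative, not the paper's actual implementation.

```python
import numpy as np

def reference_frame_attention(curr, ref):
    """Minimal single-head attention where the current frame attends
    to a reference frame.

    curr: (n_tokens, dim) features of the current frame (queries)
    ref:  (m_tokens, dim) features of the reference frame (keys/values)
    Returns (n_tokens, dim) features mixed with reference appearance.
    """
    d = curr.shape[-1]
    scores = curr @ ref.T / np.sqrt(d)              # scaled dot-product
    scores -= scores.max(axis=-1, keepdims=True)    # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax over ref tokens
    return weights @ ref                            # weighted sum of reference values
```

In practice this would run inside the denoising network with learned query/key/value projections; the sketch only shows the attention pattern, i.e., that queries come from the frame being generated while keys and values come from the reference frame.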

Spatio-Temporal Compositional Diffusion

Illustration of Spatio-Temporal Compositional Diffusion. Given the story "A little dolphin starts exploring an old city under the sea, she first found a green turtle at the bottom, then her huge father comes along to accompany her at the right side.", we first decompose it temporally into Text Prompts #1, #2, and #3; we then decompose each prompt spatially and compute the cross-attention maps of each sub-region. Finally, we compose the maps spatio-temporally to form a natural story.
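The decompose-then-compose steps above can be sketched as follows: temporal sub-prompts are assigned frame ranges, and per-region attention maps are blended with spatial masks. All names, the equal-length frame split, and the mask-averaging rule are our own simplifying assumptions for illustration, not the authors' implementation.

```python
import numpy as np

def temporal_decompose(sub_prompts, total_frames):
    """Assign each temporal sub-prompt a contiguous frame range
    (equal split assumed here for simplicity)."""
    n = len(sub_prompts)
    bounds = np.linspace(0, total_frames, n + 1, dtype=int)
    return [(p, range(bounds[i], bounds[i + 1])) for i, p in enumerate(sub_prompts)]

def compose_attention(region_maps, masks):
    """Compose per-sub-prompt cross-attention maps with spatial masks.

    region_maps: list of (H, W) attention maps, one per spatial sub-prompt
    masks:       list of (H, W) binary masks marking each sub-region
    Returns a single (H, W) map, averaging where masks overlap.
    """
    composed = np.zeros_like(region_maps[0])
    weight = np.zeros_like(region_maps[0])
    for amap, mask in zip(region_maps, masks):
        composed += amap * mask
        weight += mask
    return composed / np.maximum(weight, 1)
```

For the dolphin story, `temporal_decompose` would map "dolphin alone", "dolphin + turtle", and "dolphin + turtle + father" to successive frame ranges, while `compose_attention` would merge, say, a left-region map for the dolphin with a right-region map for her father within each frame.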

Comparison of Video Generation with Compositional Prompts

Gen-2
(Commercial)
Pika.art
(Commercial)
Open Sora Plan V1.1
AnimateDiff
VideoCrafter2
VideoTetris
A heroic robot on the left and a magical girl on the right are saving the day.
A brave knight and a wise wizard are journeying through a forest.
A cute brown dog on the left and a sleepy cat on the right are napping in the sun.
A talking sponge on the left and a superhero baby on the right are having an adventure.

Comparison of Long Video Generation with Progressive Compositional Prompts

FreeNoise
StreamingT2V
VideoTetris
A brave knight is journeying through a forest.
------> (transitions to)
A brave knight and a wise wizard are journeying through a forest.
@80 Frames
A cute brown squirrel in Antarctica, on a pile of hazelnuts cinematic.
------> (transitions to)
A cute brown squirrel and a cute white squirrel in Antarctica, on a pile of hazelnuts cinematic.
@240 Frames

More examples on the way!

BibTeX

@article{tian2024videotetris,
  title={VideoTetris: Towards Compositional Text-to-Video Generation},
  author={Tian, Ye and Yang, Ling and Yang, Haotian and Gao, Yuan and Deng, Yufan and Chen, Jingmin and Wang, Xintao and Yu, Zhaochen and Tao, Xin and Wan, Pengfei and Zhang, Di and Cui, Bin},
  journal={arXiv preprint arXiv:2406.04277},
  year={2024}
}