Compositional 3D-aware Video Generation

In a Magician’s magical cabin alone in a serene forest, an alien walking on the floor, starting from the cabin’s door to the mow near the bottom right corner of this image.
Four characters stood on the stage. In front of the stage, a man and a woman are performing China Kung Fu and dancing respectively. On the right side of the stage, a skeleton man is dancing, and behind them, a clown is performing.
In a long anime-style road with anime-blocks and little anime-grass, anime-houses and anime-tree on the side of the anime-style road, an alchemist is walking while swinging his arms.
In an idyllic park with vibrant flowers, lush greenery, and a serene lake, a person with a curly beard and a black hoodie is walking with touching on tiptoe.

Abstract

Significant progress has been made in text-to-video generation through the use of powerful generative models and large-scale internet data. However, substantial challenges remain in precisely controlling individual concepts within the generated video, such as the motion and appearance of specific characters and the movement of viewpoints. In this work, we propose a novel paradigm that generates each concept separately in a 3D representation and then composes them with priors from Large Language Models (LLMs) and 2D diffusion models. Specifically, given an input textual prompt, our scheme consists of three stages: 1) We use an LLM as the director to first decompose the complex query into several sub-prompts that indicate individual concepts within the video (e.g., scene, objects, motions), and then let the LLM invoke pre-trained expert models to obtain the corresponding 3D representations of those concepts. 2) To compose these representations, we prompt a multi-modal LLM to produce coarse guidance on the scales and trajectory coordinates of the objects. 3) To make the generated frames adhere to the natural image distribution, we further leverage 2D diffusion priors and use Score Distillation Sampling to refine the composition. Extensive experiments demonstrate that our method can generate high-fidelity videos from text with diverse motion and flexible control over each concept.


Method

Our method consists of three stages: 1) The LLM decomposes the input textual prompt into individual concepts, and each concept is then generated as a 3D representation by the corresponding pre-trained expert model. 2) We leverage the knowledge in a multi-modal LLM to estimate the 2D trajectory of the objects step by step. 3) After lifting the estimated 2D trajectory into 3D as initialization, we refine the scales, locations, and rotations of the objects within the 3D scene using 2D diffusion priors.
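As a rough illustration of the initialization in stage 3, the sketch below lifts a 2D pixel trajectory into 3D camera coordinates. This page does not say how the lifting is done, so the pinhole back-projection used here, the `PinholeCamera` class, the `lift_trajectory` function, and the per-point depth values are all assumptions made for illustration, not the authors' implementation.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple


@dataclass(frozen=True)
class PinholeCamera:
    """Hypothetical camera intrinsics in pixels (not taken from the paper)."""
    fx: float  # focal length along x
    fy: float  # focal length along y
    cx: float  # principal point x
    cy: float  # principal point y


def lift_trajectory(
    points_2d: Sequence[Tuple[float, float]],
    depths: Sequence[float],
    cam: PinholeCamera,
) -> List[Tuple[float, float, float]]:
    """Back-project pixel coordinates (u, v) with depth z into camera space.

    Assumes one depth value is available for every 2D trajectory point.
    """
    if len(points_2d) != len(depths):
        raise ValueError("need exactly one depth per 2D point")
    lifted = []
    for (u, v), z in zip(points_2d, depths):
        if z <= 0:
            raise ValueError("depth must be positive")
        # Invert the pinhole projection u = fx * x / z + cx (same for v).
        x = (u - cam.cx) * z / cam.fx
        y = (v - cam.cy) * z / cam.fy
        lifted.append((x, y, z))
    return lifted


# Two trajectory points at a constant, assumed depth of 2 units.
cam = PinholeCamera(fx=100.0, fy=100.0, cx=50.0, cy=50.0)
trajectory_3d = lift_trajectory([(50.0, 50.0), (150.0, 50.0)], [2.0, 2.0], cam)
```

Under these assumptions the lifted points only serve as a starting guess; the scales, locations, and rotations would still be refined afterwards with the 2D diffusion priors, as described above.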


Extension: Object Editing


Extension: Motion Editing


Extension: Scene Editing


Comparisons

In a Magician’s magical cabin alone in a serene forest, an alien is walking on the floor, starting from the cabin’s door to the mow near the bottom right corner of this image.
Four characters stood on the stage. In front of the stage, a man and a woman are performing Kung Fu and dancing respectively. On the right side of the stage, a skeleton man is dancing, and behind them, a clown is performing.

Ethics Statement

C3V is exclusively a research initiative with no current plans for product integration or public access. We are committed to adhering to Microsoft AI principles during the ongoing development of our models. The model is trained on AI-generated content, which has been thoroughly reviewed to ensure that it does not include personally identifiable information or offensive content. Nonetheless, because this generated data is sourced from the Internet, it may still carry inherent biases. To address this, we have implemented a rigorous filtering process on the data to minimize the potential for the model to generate inappropriate content.


Acknowledgments

Our work is based on the following excellent works: LucidDreamer: Domain-free Generation of 3D Gaussian Splatting Scenes, HumanGaussian: Text-Driven 3D Human Generation with Gaussian Splatting, and Motion-X: A Large-scale 3D Expressive Whole-body Human Motion Dataset.


BibTeX

@article{zhu2024compositional,
  title={Compositional 3D-aware Video Generation with LLM Director},
  author={Zhu, Hanxin and He, Tianyu and Tang, Anni and Guo, Junliang and Chen, Zhibo and Bian, Jiang},
  journal={Advances in Neural Information Processing Systems},
  year={2024}
}

This work was done at Microsoft, April 2024.