In-context learning for vision data has been underexplored compared with its counterpart in natural language. Previous works studied image in-context learning, prompting models to generate a single image guided by demonstrations. In this project, we propose and study video in-context learning, where the model starts from an existing video clip and generates diverse potential future sequences, each semantically guided by the prompted video demonstrations. To achieve this, we provide a clear definition of the task and train an autoregressive Transformer on video datasets: each frame is represented as discrete tokens, which are modeled by next-token prediction. We also thoroughly analyze the effect of different training datasets.
As a result, the obtained vision Transformer is able to generate, for a query video clip, a subsequent video sequence that is semantically aligned with the demonstration video. Demonstration videos are highly versatile and can convey a wide range of information, such as examples of various tasks including moving or grabbing objects, or camera movements in an ego-centric video. This allows video in-context learning to address multiple downstream tasks, such as embodied planning and simulation, e.g., by letting a query robot imitate actions demonstrated by other robots, as shown below. Video in-context learning serves as a new and crucial interface for models to interact with the real world, as videos excel at describing low-level details (where language may fall short) and temporal dynamics (where images are insufficient).
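To make the interface concrete, below is a minimal sketch of video in-context inference, assuming a VQ-style frame tokenizer and an autoregressive Transformer over discrete visual tokens. The tokenizer, the logits function, and all names, shapes, and vocabulary sizes are placeholders (random stubs) chosen for illustration, not the released implementation.

import torch

VOCAB = 1024            # size of the discrete visual codebook (assumed)
TOKENS_PER_FRAME = 256  # e.g. a 16x16 token grid per frame (assumed)

def tokenize(frames: torch.Tensor) -> torch.LongTensor:
    """Placeholder for a VQ-style image tokenizer: frames -> discrete tokens."""
    return torch.randint(0, VOCAB, (frames.shape[0] * TOKENS_PER_FRAME,))

def logits_fn(tokens: torch.LongTensor) -> torch.Tensor:
    """Placeholder for the autoregressive Transformer's next-token logits."""
    return torch.randn(VOCAB)

@torch.no_grad()
def generate_future(demo_frames, query_frames, num_future_frames=1, temperature=1.0):
    # In-context prompt: demonstration tokens followed by query tokens.
    context = torch.cat([tokenize(demo_frames), tokenize(query_frames)])
    generated = []
    for _ in range(num_future_frames * TOKENS_PER_FRAME):
        logits = logits_fn(context)
        probs = torch.softmax(logits / temperature, dim=-1)
        next_token = torch.multinomial(probs, 1)   # sample one visual token
        context = torch.cat([context, next_token])
        generated.append(next_token)
    # The generated tokens would be decoded back to frames by the tokenizer's decoder.
    return torch.cat(generated).view(num_future_frames, TOKENS_PER_FRAME)

demo = torch.rand(8, 3, 128, 128)   # dummy demonstration frames
query = torch.rand(4, 3, 128, 128)  # dummy query frames
future_tokens = generate_future(demo, query, num_future_frames=2)
print(future_tokens.shape)          # (2, 256)

The key point of the sketch is the prompt layout: the demonstration clip and the query clip are flattened into one token sequence, and the future is produced purely by next-token sampling conditioned on that sequence.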
Generated Samples on Something-Something v2
[Video grid. Columns: Demonstration | DeLVM | Vid-ICL (700M, pt) | Vid-ICL (1.1B, pt) | Vid-ICL (700M, ft) | Vid-ICL (1.1B, ft)]
Qualitative results of the samples generated by Vid-ICL on the Something-Something v2 dataset, a dataset of real-world human activities. The first column shows the demonstration video, the second column shows the samples generated by DeLVM, and the third to sixth columns show the samples generated by Vid-ICL with different pretraining and finetuning strategies. Vid-ICL generates more diverse and plausible samples than DeLVM, and sample quality improves with larger model sizes and in-domain finetuning.
Generated Samples on Robotics Transformer Dataset
[Video grid. Columns: Demonstration | Vid-ICL (700M, pt) | Vid-ICL (1.1B, pt) | Vid-ICL (300M, ft) | Vid-ICL (700M, ft) | Vid-ICL (1.1B, ft)]
Qualitative results of the samples generated by Vid-ICL on the Robotics Transformer (RT-1) dataset, a dataset of robotic manipulation tasks. Vid-ICL offers more precise control over various robotic tasks through the demonstration video.
Contrastive Demonstration
[Video grid. Columns: Demonstration | Generation | Contrastive Demonstration | Contrastive Generation]
In this section, the generations in the second and fourth columns of each row start from the same query clip, while the demonstrations in the first and third columns carry contrastive semantics. The results highlight an important property of Vid-ICL: given the same query, conditioning on contrastive demonstrations yields contrary generations, showing that Vid-ICL precisely understands the demonstration dynamics and can generate diverse future sequences.
Text Conditioned Generation
[Video grid. Columns: Text | Demonstration | Generation w/o Text | Generation w/ Text. Text prompts: "Push [something] from left to right"; "Turning the camera left while filming [something]"; "Moving [something] closer to [something]"]
We show that text can serve as an additional in-context conditioning signal to augment Vid-ICL's generation. After aligning Vid-ICL with text, the generated samples are more consistent with both the text instructions and the demonstration videos.
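As a rough illustration of how text can act as one more in-context condition, the sketch below prepends text tokens to the demonstration and query video tokens before autoregressive generation. The text tokenizer, the shared vocabulary, and the prompt ordering (text, then demonstration, then query) are assumptions made for illustration, not the released interface.

import torch

VOCAB = 1024  # shared discrete vocabulary size (assumed)

def tokenize_text(text: str) -> torch.LongTensor:
    """Placeholder text tokenizer mapping words into the shared vocabulary."""
    return torch.tensor([hash(w) % VOCAB for w in text.lower().split()])

def tokenize_video(frames: torch.Tensor) -> torch.LongTensor:
    """Placeholder VQ tokenizer, as in the earlier sketch."""
    return torch.randint(0, VOCAB, (frames.shape[0] * 256,))

def build_text_conditioned_prompt(text, demo_frames, query_frames):
    # Text tokens are one more in-context condition, placed before the
    # demonstration and query video tokens; future tokens are generated after.
    return torch.cat([
        tokenize_text(text),
        tokenize_video(demo_frames),
        tokenize_video(query_frames),
    ])

prompt = build_text_conditioned_prompt(
    "Push [something] from left to right",
    torch.rand(8, 3, 128, 128),   # dummy demonstration frames
    torch.rand(4, 3, 128, 128),   # dummy query frames
)
print(prompt.shape)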
On Reinforcement Learning Tasks
We demonstrate that Vid-ICL can also function as a simulator in reinforcement learning tasks by evaluating it on RoboDesk, a benchmark containing several reinforcement learning tasks. Vid-ICL generates future frames that accomplish the same task as the demonstration, and the actions that correctly interact with the environment can be inversely recovered from the generated frames. Evaluated on the Push_red task, Vid-ICL provides more precise control over the environment interaction.
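The sketch below illustrates this simulator-style usage, assuming an inverse dynamics model that recovers an action from a pair of consecutive frames. The video model, the inverse dynamics model, the action dimensionality, and the environment are all random placeholders; the loop only conveys the idea that imagined future frames are translated into actions before being executed.

import torch

def generate_next_frame(demo_frames, history_frames):
    """Placeholder for Vid-ICL: predict the next frame given demo + history."""
    return torch.rand_like(history_frames[-1:])

def inverse_dynamics(frame_t, frame_t1):
    """Placeholder inverse dynamics model: (o_t, o_{t+1}) -> action."""
    return torch.randn(4)  # e.g. a 4-dim continuous action (assumed)

def step_env(action):
    """Placeholder environment step returning the next observation."""
    return torch.rand(1, 3, 128, 128)

demo = torch.rand(8, 3, 128, 128)   # demonstration of the target task
obs = torch.rand(1, 3, 128, 128)    # current observation
history = obs.clone()
for t in range(10):
    next_frame = generate_next_frame(demo, history)      # imagined future frame
    action = inverse_dynamics(history[-1:], next_frame)  # recover the action
    obs = step_env(action)                               # act in the environment
    history = torch.cat([history, obs])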
Ethics Statement
Vid-ICL is exclusively a research initiative with no current plans for product integration or public access. We are committed to adhering to Microsoft AI principles during the ongoing development of our models. The datasets utilized in this study are publicly available and have been thoroughly reviewed to ensure they do not include personally identifiable information or offensive content. Nonetheless, as these datasets are sourced from the Internet, there may still be inherent biases. To address this, we have implemented a rigorous filtering process on the training data to minimize the potential for the model to generate inappropriate content.
Citation
@article{zhang2024video,
title={Video In-Context Learning},
author={Zhang, Wentao and Guo, Junliang and He, Tianyu and Zhao, Li and Xu, Linli and Bian, Jiang},
journal={arXiv preprint arXiv:2407.07356},
year={2024}
}