{"id":1052988,"date":"2024-07-10T18:31:31","date_gmt":"2024-07-11T01:31:31","guid":{"rendered":"https:\/\/www.microsoft.com\/en-us\/research\/?post_type=msr-project&p=1052988"},"modified":"2024-07-10T18:31:35","modified_gmt":"2024-07-11T01:31:35","slug":"video-in-context-learning","status":"publish","type":"msr-project","link":"https:\/\/www.microsoft.com\/en-us\/research\/project\/video-in-context-learning\/","title":{"rendered":"Video In-Context Learning"},"content":{"rendered":"
\n\t
\n\t\t
\n\t\t\t\t\t<\/div>\n\t\t\n\t\t
\n\t\t\t\n\t\t\t
\n\t\t\t\t\n\t\t\t\t
\n\t\t\t\t\t\n\t\t\t\t\t
\n\t\t\t\t\t\t
\n\t\t\t\t\t\t\t\n\t\t\t\t\t\t\t\n\n

Video In-Context Learning

Driving large vision models with video demonstrations

Paper
In-context learning for vision data has been underexplored compared with its counterpart in natural language. Previous works studied image in-context learning, prompting models to generate a single image guided by demonstrations. In this project, we propose and study video in-context learning, where the model starts from an existing video clip and generates diverse potential future sequences, each semantically guided by the prompted video demonstrations. To achieve this, we give a clear definition of the task, represent frames as discrete tokens, model them by next-token prediction, and train an autoregressive Transformer on video datasets; we also thoroughly analyze the effect of different training datasets.
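To make the training recipe concrete, here is a minimal sketch, not the project's actual code. It assumes frames have already been mapped to discrete tokens by a visual tokenizer (e.g., a VQ-style encoder, an assumption here) and flattened in temporal order; a decoder-only Transformer is then trained with a standard next-token-prediction loss. All names and sizes (VideoTokenTransformer, vocab_size, d_model, etc.) are illustrative placeholders.

```python
# Minimal sketch (assumptions noted above): a causal Transformer over
# flattened video-frame tokens, trained by next-token prediction.
import torch
import torch.nn as nn

class VideoTokenTransformer(nn.Module):
    """Decoder-only Transformer over discrete video-frame tokens (illustrative sizes)."""
    def __init__(self, vocab_size=8192, d_model=512, n_layers=8, n_heads=8, max_len=4096):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab_size, d_model)
        self.pos_emb = nn.Embedding(max_len, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads,
                                           dim_feedforward=4 * d_model,
                                           batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, n_layers)
        self.head = nn.Linear(d_model, vocab_size)

    def forward(self, tokens):                        # tokens: (B, T) discrete frame tokens
        _, T = tokens.shape
        pos = torch.arange(T, device=tokens.device)
        x = self.tok_emb(tokens) + self.pos_emb(pos)
        # Causal mask: each position may only attend to itself and earlier tokens.
        causal = torch.full((T, T), float("-inf"), device=tokens.device).triu(1)
        x = self.blocks(x, mask=causal)
        return self.head(x)                           # (B, T, vocab_size) logits

def next_token_loss(model, clip_tokens):
    """clip_tokens: (B, T) tokens of a continuous training clip."""
    logits = model(clip_tokens[:, :-1])               # predict token t+1 from tokens <= t
    targets = clip_tokens[:, 1:]
    return nn.functional.cross_entropy(logits.reshape(-1, logits.size(-1)),
                                       targets.reshape(-1))
```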

As a result, the trained vision Transformer can generate, for a query video clip, a subsequent video sequence that is semantically aligned with the demonstration video. Demonstration videos are highly versatile and can convey a wide range of information, such as examples of various tasks (e.g., moving or grabbing objects) or camera movements in an egocentric video. This allows video in-context learning to address multiple downstream tasks, such as embodied planning and simulation, by letting a query robot imitate the actions demonstrated by other robots, as shown below. Video in-context learning thus serves as a new and crucial interface for models to interact with the real world, since videos are good at describing low-level details (where language may fall short) and temporal dynamics (where images are insufficient).

\"Model
The pipeline of Vid-ICL. Left: Training of Vid-ICL. The data used for training are continuous video clips and the Transformer is trained by next token prediction. Right: In-context Inference of Vid-ICL. The model is conditioned on demonstration videos and generates future frames.<\/figcaption><\/figure>\n\n\n\n
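The right half of the figure corresponds roughly to the sketch below. This is an illustrative sketch rather than the project's actual inference code: it reuses the hypothetical VideoTokenTransformer above, concatenates demonstration tokens and query-clip tokens into one prompt, samples future-frame tokens autoregressively, and assumes the visual tokenizer's decoder then turns the sampled tokens back into frames.

```python
# Minimal sketch of in-context inference (assumptions noted above).
import torch

@torch.no_grad()
def generate_future_tokens(model, demo_tokens, query_tokens, n_new, temperature=1.0):
    """Sample n_new future-frame tokens conditioned on demonstration + query tokens.

    demo_tokens, query_tokens: (1, T_demo) and (1, T_query) LongTensors of discrete tokens.
    Returns a (1, n_new) LongTensor; the tokenizer's decoder maps it back to frames.
    """
    context = torch.cat([demo_tokens, query_tokens], dim=1)  # prompt = demonstration + query clip
    for _ in range(n_new):
        logits = model(context)[:, -1, :] / temperature      # distribution over the next token
        next_tok = torch.multinomial(logits.softmax(dim=-1), num_samples=1)
        context = torch.cat([context, next_tok], dim=1)      # append and continue autoregressively
    return context[:, -n_new:]
```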


Generated Samples on Something-Something v2


[Video samples not reproduced here. Each example shows the columns: Demonstration, DeLVM, Vid-ICL (700M, pt), Vid-ICL (1.1B, pt), Vid-ICL (700M, ft), Vid-ICL (1.1B, ft).]