{"id":1092564,"date":"2024-10-17T05:30:23","date_gmt":"2024-10-17T12:30:23","guid":{"rendered":"https:\/\/www.microsoft.com\/en-us\/research\/?post_type=msr-project&p=1092564"},"modified":"2024-11-04T21:27:38","modified_gmt":"2024-11-05T05:27:38","slug":"igor-image-goal-representations","status":"publish","type":"msr-project","link":"https:\/\/www.microsoft.com\/en-us\/research\/project\/igor-image-goal-representations\/","title":{"rendered":"IGOR: Image-GOal Representations"},"content":{"rendered":"\n
IGOR: Image-GOal Representations

Atomic Control Units for Foundation Models in Embodied AI

Xiaoyu Chen†,♢, Junliang Guo†, Tianyu He†, Chuheng Zhang†, Pushi Zhang†, Derek Cathera Yang, Li Zhao†, Jiang Bian†

†Microsoft Research, ♢Tsinghua University

Read Paper

We introduce IGOR, a framework that learns latent actions from Internet-scale videos, enabling cross-embodiment and cross-task generalization.


IGOR Framework

IGOR learns a unified latent action space for humans and robots by compressing the visual changes between an image and its goal state, using data from both robot and human activities. By labeling this data with latent actions, IGOR enables the learning of foundation policy and world models from Internet-scale human video data covering a diverse range of embodied AI tasks. Because the latent action space is semantically consistent, IGOR supports human-to-robot generalization. The foundation policy model acts as a high-level controller at the latent action level and is integrated with a low-level policy to achieve effective robot control.
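To make this hierarchy concrete, below is a minimal sketch of the control flow described above. The MLP modules, embedding sizes, and random placeholder inputs are illustrative assumptions, not IGOR's actual architecture.

```python
# Minimal sketch of the hierarchy described above; module architectures,
# embedding sizes, and inputs are illustrative assumptions, not IGOR's design.
import torch
import torch.nn as nn

OBS_DIM, TEXT_DIM, LATENT_ACTION_DIM, ROBOT_ACTION_DIM = 512, 256, 32, 7

class LatentActionEncoder(nn.Module):
    """Compresses the visual change between an observation and its goal image."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(2 * OBS_DIM, 256), nn.ReLU(),
                                 nn.Linear(256, LATENT_ACTION_DIM))

    def forward(self, obs_emb, goal_emb):
        return self.net(torch.cat([obs_emb, goal_emb], dim=-1))

class FoundationPolicy(nn.Module):
    """High-level controller: observation + instruction -> latent action."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(OBS_DIM + TEXT_DIM, 256), nn.ReLU(),
                                 nn.Linear(256, LATENT_ACTION_DIM))

    def forward(self, obs_emb, text_emb):
        return self.net(torch.cat([obs_emb, text_emb], dim=-1))

class LowLevelPolicy(nn.Module):
    """Decodes a latent action, given the current observation, into a robot command."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(OBS_DIM + LATENT_ACTION_DIM, 256), nn.ReLU(),
                                 nn.Linear(256, ROBOT_ACTION_DIM))

    def forward(self, obs_emb, latent_action):
        return self.net(torch.cat([obs_emb, latent_action], dim=-1))

# One control step with random placeholder embeddings.
obs, goal, text = torch.randn(1, OBS_DIM), torch.randn(1, OBS_DIM), torch.randn(1, TEXT_DIM)
label = LatentActionEncoder()(obs, goal)      # latent action "label" for a video segment
latent = FoundationPolicy()(obs, text)        # high-level control at the latent-action level
robot_action = LowLevelPolicy()(obs, latent)  # executable low-level command
```

The key design point is that the foundation policy outputs latent actions rather than embodiment-specific commands, so the same high-level controller can be paired with different low-level policies.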

Our pretraining dataset comprises around 2.8M trajectories and video clips, where each trajectory contains a language instruction and a sequence of observations. The data are curated from Open-X Embodiment, Something-Something-v2, EGTEA, Epic Kitchen, and Ego4D.
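As a rough illustration of how one such trajectory might be represented, the sketch below defines a simple container type; the field names and types are assumptions for illustration, not the actual data format.

```python
# Illustrative container for one pretraining trajectory as described above;
# field names and types are assumptions, not the actual data format.
from dataclasses import dataclass
from typing import List

import numpy as np

@dataclass
class Trajectory:
    instruction: str                # natural-language task description
    observations: List[np.ndarray]  # sequence of RGB frames, each (H, W, 3)
    source: str                     # originating dataset, e.g. "Ego4D"

# A toy trajectory with three blank 64x64 frames.
traj = Trajectory(
    instruction="pick up the red block",
    observations=[np.zeros((64, 64, 3), dtype=np.uint8) for _ in range(3)],
    source="Open-X Embodiment",
)
print(traj.instruction, len(traj.observations))
```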

[Figure: IGOR framework]

Extracting Semantically Consistent Latent Actions

IGOR learns similar latent actions for image pairs with semantically similar visual changes. On the out-of-distribution RT-1 dataset, we observe that image pairs with similar latent action embeddings show similar visual changes and correspond to semantically similar sub-tasks, for example, "open the gripper", "move left", and "close the gripper". Furthermore, we observe that latent actions are shared across different tasks specified by language instructions, thereby facilitating broader generalization.

[Figure: Image-goal pairs]
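The comparison described above can be illustrated with a simple nearest-neighbor lookup in the latent action space; the embeddings below are random placeholders standing in for outputs of the latent action encoder.

```python
# Illustration of retrieving image pairs with similar latent actions via cosine
# similarity; the embeddings are random placeholders for encoder outputs.
import numpy as np

rng = np.random.default_rng(0)
database = rng.normal(size=(1000, 32))  # latent actions of 1000 candidate image pairs
query = rng.normal(size=(32,))          # latent action of a query image pair

def top_k_similar(query_vec, db, k=5):
    """Return indices and cosine scores of the k nearest latent actions."""
    q = query_vec / np.linalg.norm(query_vec)
    d = db / np.linalg.norm(db, axis=1, keepdims=True)
    scores = d @ q
    idx = np.argsort(-scores)[:k]
    return idx, scores[idx]

idx, scores = top_k_similar(query, database)
print(idx, scores)  # pairs expected to share visual changes / sub-task semantics
```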

Migrating Movements Across Different Objects

IGOR successfully "migrates" the movements of objects in one video to other videos. By applying latent actions extracted from one video, the world model generates new videos in which different objects undergo similar movements. We observe that latent actions are semantically consistent across tasks involving different objects.
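A minimal sketch of this "migration" procedure follows, with placeholder functions standing in for IGOR's learned latent action encoder and world model.

```python
# Sketch of movement "migration": latent actions from a source video are replayed
# through a world model starting from a different initial frame. Both functions
# below are placeholders standing in for IGOR's learned models.
import numpy as np

def encode_latent_action(frame, next_frame):
    # Placeholder: a real encoder compresses the visual change between two frames.
    return (next_frame - frame).mean(axis=(0, 1))

def world_model_step(frame, latent_action):
    # Placeholder: a real world model predicts the next frame given a latent action.
    return np.clip(frame + latent_action, 0.0, 255.0)

# Extract latent actions from consecutive frames of a source video.
source_video = [np.full((64, 64, 3), v, dtype=np.float32) for v in (10.0, 20.0, 30.0)]
latent_actions = [encode_latent_action(a, b) for a, b in zip(source_video, source_video[1:])]

# Replay them from a new initial frame (a different object/scene) to generate a new video.
frame = np.full((64, 64, 3), 100.0, dtype=np.float32)
generated = [frame]
for z in latent_actions:
    frame = world_model_step(frame, z)
    generated.append(frame)
print(len(generated))  # initial frame plus two predicted steps
```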
