{"id":1092564,"date":"2024-10-17T05:30:23","date_gmt":"2024-10-17T12:30:23","guid":{"rendered":"https:\/\/www.microsoft.com\/en-us\/research\/?post_type=msr-project&p=1092564"},"modified":"2024-11-04T21:27:38","modified_gmt":"2024-11-05T05:27:38","slug":"igor-image-goal-representations","status":"publish","type":"msr-project","link":"https:\/\/www.microsoft.com\/en-us\/research\/project\/igor-image-goal-representations\/","title":{"rendered":"IGOR: Image-GOal Representations"},"content":{"rendered":"\n
IGOR: Image-GOal Representations

Atomic Control Units for Foundation Models in Embodied AI

Xiaoyu Chen†,♢, Junliang Guo†, Tianyu He†, Chuheng Zhang†, Pushi Zhang†, Derek Cathera Yang, Li Zhao†, Jiang Bian†

†Microsoft Research, ♢Tsinghua University

Read Paper

We introduce IGOR, a framework that learns latent actions from Internet-scale videos, enabling cross-embodiment and cross-task generalization.


IGOR Framework

IGOR learns a unified latent action space for humans and robots by compressing the visual changes between an image and its goal state, using data from both robot and human activities. By labeling this data with latent actions, IGOR enables the learning of foundation policy and world models from Internet-scale human video data covering a diverse range of embodied AI tasks. Because the latent action space is semantically consistent, IGOR supports human-to-robot generalization. The foundation policy model acts as a high-level controller at the latent action level and is integrated with a low-level policy to achieve effective robot control.
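To make this hierarchy concrete, below is a minimal sketch of the control flow described above. The MLP modules, embedding sizes, and random placeholder inputs are illustrative assumptions, not IGOR's actual architecture.

```python
# Minimal sketch of the hierarchy described above; module architectures,
# embedding sizes, and inputs are illustrative assumptions, not IGOR's design.
import torch
import torch.nn as nn

OBS_DIM, TEXT_DIM, LATENT_ACTION_DIM, ROBOT_ACTION_DIM = 512, 256, 32, 7

class LatentActionEncoder(nn.Module):
    """Compresses the visual change between an observation and its goal image."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(2 * OBS_DIM, 256), nn.ReLU(),
                                 nn.Linear(256, LATENT_ACTION_DIM))

    def forward(self, obs_emb, goal_emb):
        return self.net(torch.cat([obs_emb, goal_emb], dim=-1))

class FoundationPolicy(nn.Module):
    """High-level controller: observation + instruction -> latent action."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(OBS_DIM + TEXT_DIM, 256), nn.ReLU(),
                                 nn.Linear(256, LATENT_ACTION_DIM))

    def forward(self, obs_emb, text_emb):
        return self.net(torch.cat([obs_emb, text_emb], dim=-1))

class LowLevelPolicy(nn.Module):
    """Decodes a latent action, given the current observation, into a robot command."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(OBS_DIM + LATENT_ACTION_DIM, 256), nn.ReLU(),
                                 nn.Linear(256, ROBOT_ACTION_DIM))

    def forward(self, obs_emb, latent_action):
        return self.net(torch.cat([obs_emb, latent_action], dim=-1))

# One control step with random placeholder embeddings.
obs, goal, text = torch.randn(1, OBS_DIM), torch.randn(1, OBS_DIM), torch.randn(1, TEXT_DIM)
label = LatentActionEncoder()(obs, goal)      # latent action "label" for a video segment
latent = FoundationPolicy()(obs, text)        # high-level control at the latent-action level
robot_action = LowLevelPolicy()(obs, latent)  # executable low-level command
```

The key design point is that the foundation policy outputs latent actions rather than embodiment-specific commands, so the same high-level controller can be paired with different low-level policies.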

Our pretraining dataset comprises around 2.8M trajectories and video clips, where each trajectory contains a language instruction and a sequence of observations. The data are curated from Open-X Embodiment, Something-Something-v2, EGTEA, Epic Kitchen, and Ego4D.
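As a rough illustration of how one such trajectory might be represented, the sketch below defines a simple container type; the field names and types are assumptions for illustration, not the actual data format.

```python
# Illustrative container for one pretraining trajectory as described above;
# field names and types are assumptions, not the actual data format.
from dataclasses import dataclass
from typing import List

import numpy as np

@dataclass
class Trajectory:
    instruction: str                # natural-language task description
    observations: List[np.ndarray]  # sequence of RGB frames, each (H, W, 3)
    source: str                     # originating dataset, e.g. "Ego4D"

# A toy trajectory with three blank 64x64 frames.
traj = Trajectory(
    instruction="pick up the red block",
    observations=[np.zeros((64, 64, 3), dtype=np.uint8) for _ in range(3)],
    source="Open-X Embodiment",
)
print(traj.instruction, len(traj.observations))
```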

[Figure: IGOR framework]

Extracting Semantically Consistent Latent Actions

IGOR learns similar latent actions for image pairs with semantically similar visual changes. On the out-of-distribution RT-1 dataset, we observe that image pairs with similar latent action embeddings show similar visual changes and correspond to semantically similar sub-tasks, for example, "open the gripper", "move left", and "close the gripper". Furthermore, we observe that latent actions are shared across different tasks specified by language instructions, thereby facilitating broader generalization.

[Figure: Image-goal pairs]
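The comparison described above can be illustrated with a simple nearest-neighbor lookup in the latent action space; the embeddings below are random placeholders standing in for outputs of the latent action encoder.

```python
# Illustration of retrieving image pairs with similar latent actions via cosine
# similarity; the embeddings are random placeholders for encoder outputs.
import numpy as np

rng = np.random.default_rng(0)
database = rng.normal(size=(1000, 32))  # latent actions of 1000 candidate image pairs
query = rng.normal(size=(32,))          # latent action of a query image pair

def top_k_similar(query_vec, db, k=5):
    """Return indices and cosine scores of the k nearest latent actions."""
    q = query_vec / np.linalg.norm(query_vec)
    d = db / np.linalg.norm(db, axis=1, keepdims=True)
    scores = d @ q
    idx = np.argsort(-scores)[:k]
    return idx, scores[idx]

idx, scores = top_k_similar(query, database)
print(idx, scores)  # pairs expected to share visual changes / sub-task semantics
```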

Migrating Movements Across Different Objects

IGOR successfully "migrates" the movements of objects in one video to other videos. By applying latent actions extracted from one video, the world model generates new videos in which different objects undergo similar movements. We observe that latent actions are semantically consistent across tasks involving different objects.
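A minimal sketch of this "migration" procedure follows, with placeholder functions standing in for IGOR's learned latent action encoder and world model.

```python
# Sketch of movement "migration": latent actions from a source video are replayed
# through a world model starting from a different initial frame. Both functions
# below are placeholders standing in for IGOR's learned models.
import numpy as np

def encode_latent_action(frame, next_frame):
    # Placeholder: a real encoder compresses the visual change between two frames.
    return (next_frame - frame).mean(axis=(0, 1))

def world_model_step(frame, latent_action):
    # Placeholder: a real world model predicts the next frame given a latent action.
    return np.clip(frame + latent_action, 0.0, 255.0)

# Extract latent actions from consecutive frames of a source video.
source_video = [np.full((64, 64, 3), v, dtype=np.float32) for v in (10.0, 20.0, 30.0)]
latent_actions = [encode_latent_action(a, b) for a, b in zip(source_video, source_video[1:])]

# Replay them from a new initial frame (a different object/scene) to generate a new video.
frame = np.full((64, 64, 3), 100.0, dtype=np.float32)
generated = [frame]
for z in latent_actions:
    frame = world_model_step(frame, z)
    generated.append(frame)
print(len(generated))  # initial frame plus two predicted steps
```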
