{"id":951717,"date":"2023-06-29T09:00:00","date_gmt":"2023-06-29T16:00:00","guid":{"rendered":"https:\/\/www.microsoft.com\/en-us\/research\/?p=951717"},"modified":"2023-06-29T09:06:24","modified_gmt":"2023-06-29T16:06:24","slug":"breaking-cross-modal-boundaries-in-multimodal-ai-introducing-codi-composable-diffusion-for-any-to-any-generation","status":"publish","type":"post","link":"https:\/\/www.microsoft.com\/en-us\/research\/blog\/breaking-cross-modal-boundaries-in-multimodal-ai-introducing-codi-composable-diffusion-for-any-to-any-generation\/","title":{"rendered":"Breaking cross-modal boundaries in multimodal AI: Introducing CoDi, composable diffusion for any-to-any generation"},"content":{"rendered":"\n
Imagine an AI model that can seamlessly generate high-quality content across text, images, video, and audio, all at once. Such a model would more accurately capture the multimodal nature of the world and human comprehension, consolidate information from a wide range of sources, and enable more immersive human-AI interactions. This could transform the way humans interact with computers in many areas, including assistive technology, custom learning tools, ambient computing, and content generation.
In a recent paper, Any-to-Any Generation via Composable Diffusion, Microsoft Azure Cognitive Service Research and UNC NLP present CoDi, a novel generative model capable of processing and simultaneously generating content across multiple modalities. CoDi can synergistically generate high-quality, coherent outputs spanning several modalities from any combination of input modalities. CoDi is the latest work from Microsoft's Project i-Code, which aims to develop integrative and composable multimodal AI. Through extensive experiments, the researchers demonstrate CoDi's remarkable capabilities.

The powerful cross-modal models that have emerged in recent years are mostly capable of generating or processing only a single modality. These models often face limitations in real-world applications where multiple modalities coexist and interact. Chaining modality-specific generative models together in a multi-step generation pipeline can be cumbersome and slow.

Moreover, independently generated unimodal streams may not be consistent or aligned when stitched together in post-processing, for example video and audio that need to stay synchronized.

To address these challenges, the researchers propose Composable Diffusion (CoDi), the first model capable of simultaneously processing and generating arbitrary combinations of modalities. CoDi employs a novel composable generation strategy that builds a shared multimodal space by bridging alignment in the diffusion process, enabling the synchronized generation of intertwined modalities, such as temporally aligned video and audio (see the illustrative sketch below).

The challenge of multimodal generative AI
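To make the idea of a shared multimodal space more concrete, the following is a minimal, hypothetical PyTorch sketch of how modality-specific encoders could be aligned into one embedding space with a contrastive objective, using text as the bridge, and how embeddings from any combination of inputs could then be composed into a single conditioning signal. The module names, dimensions, and the simple averaging step are illustrative assumptions for this sketch, not CoDi's actual implementation.

```python
# Illustrative sketch (not CoDi's code): modality-specific encoders project inputs
# into one shared embedding space, aligned to text with a contrastive objective,
# so embeddings from any input combination can be composed into one condition.
import torch
import torch.nn as nn
import torch.nn.functional as F

SHARED_DIM = 512  # assumed dimensionality of the shared multimodal space


class ModalityEncoder(nn.Module):
    """Projects one modality's features into the shared space (hypothetical)."""

    def __init__(self, input_dim: int, shared_dim: int = SHARED_DIM):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(input_dim, shared_dim),
            nn.GELU(),
            nn.Linear(shared_dim, shared_dim),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Unit-norm embeddings make cosine-style contrastive alignment easy.
        return F.normalize(self.proj(x), dim=-1)


def contrastive_bridge_loss(text_emb: torch.Tensor, other_emb: torch.Tensor,
                            temperature: float = 0.07) -> torch.Tensor:
    """InfoNCE-style loss that pulls paired (text, other-modality) embeddings
    together, so every modality lands in the same text-anchored space."""
    logits = text_emb @ other_emb.t() / temperature   # (batch, batch) similarities
    targets = torch.arange(text_emb.size(0))          # matching pairs on the diagonal
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.t(), targets)) / 2


def compose_condition(embeddings: list) -> torch.Tensor:
    """Combine aligned embeddings from any subset of input modalities into one
    conditioning vector (a simple average here, as a stand-in for a learned
    or weighted combination)."""
    return torch.stack(embeddings, dim=0).mean(dim=0)


if __name__ == "__main__":
    text_enc, audio_enc = ModalityEncoder(768), ModalityEncoder(128)
    text_feats, audio_feats = torch.randn(4, 768), torch.randn(4, 128)
    t, a = text_enc(text_feats), audio_enc(audio_feats)
    loss = contrastive_bridge_loss(t, a)   # alignment objective during training
    cond = compose_condition([t, a])       # joint condition for a diffusion model
    print(loss.item(), cond.shape)         # scalar loss, torch.Size([4, 512])
```

Anchoring the alignment to text reflects a practical observation made in the paper: paired data with text exists for most modalities, so aligning everything to text yields a common space without needing training data for every possible modality pair.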