{"id":951717,"date":"2023-06-29T09:00:00","date_gmt":"2023-06-29T16:00:00","guid":{"rendered":"https:\/\/www.microsoft.com\/en-us\/research\/?p=951717"},"modified":"2023-06-29T09:06:24","modified_gmt":"2023-06-29T16:06:24","slug":"breaking-cross-modal-boundaries-in-multimodal-ai-introducing-codi-composable-diffusion-for-any-to-any-generation","status":"publish","type":"post","link":"https:\/\/www.microsoft.com\/en-us\/research\/blog\/breaking-cross-modal-boundaries-in-multimodal-ai-introducing-codi-composable-diffusion-for-any-to-any-generation\/","title":{"rendered":"Breaking cross-modal boundaries in multimodal AI: Introducing CoDi, composable diffusion for any-to-any generation"},"content":{"rendered":"\n
Imagine an AI model that can seamlessly generate high-quality content across text, images, video, and audio, all at once. Such a model would more accurately capture the multimodal nature of the world and human comprehension, consolidate information from a wide range of sources, and enable more immersive human-AI interactions. This could transform the way humans interact with computers in many areas, including assistive technology, custom learning tools, ambient computing, and content generation.
In a recent paper, Any-to-Any Generation via Composable Diffusion, Microsoft Azure Cognitive Service Research and UNC NLP present CoDi, a novel generative model capable of processing and simultaneously generating content across multiple modalities. CoDi can synergistically generate high-quality, coherent outputs spanning several modalities from any combination of input modalities. CoDi is the latest work from Microsoft's Project i-Code, which aims to develop integrative and composable multimodal AI. Through extensive experiments, the researchers demonstrate CoDi's remarkable capabilities.

The powerful cross-modal models that have emerged in recent years are mostly capable of generating or processing only a single modality. These models often face limitations in real-world applications where multiple modalities coexist and interact. Chaining modality-specific generative models together in a multi-step generation pipeline can be cumbersome and slow.

Moreover, independently generated unimodal streams may not be consistent or aligned when stitched together in post-processing, for example video and audio that need to stay synchronized.

To address these challenges, the researchers propose Composable Diffusion (CoDi), the first model capable of simultaneously processing and generating arbitrary combinations of modalities. CoDi employs a novel composable generation strategy that builds a shared multimodal space by bridging alignment in the diffusion process, enabling the synchronized generation of intertwined modalities, such as temporally aligned video and audio (see the illustrative sketch below).

The challenge of multimodal generative AI
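To make the idea of a shared multimodal space more concrete, the following is a minimal, hypothetical PyTorch sketch of how modality-specific encoders could be aligned into one embedding space with a contrastive objective, using text as the bridge, and how embeddings from any combination of inputs could then be composed into a single conditioning signal. The module names, dimensions, and the simple averaging step are illustrative assumptions for this sketch, not CoDi's actual implementation.

```python
# Illustrative sketch (not CoDi's code): modality-specific encoders project inputs
# into one shared embedding space, aligned to text with a contrastive objective,
# so embeddings from any input combination can be composed into one condition.
import torch
import torch.nn as nn
import torch.nn.functional as F

SHARED_DIM = 512  # assumed dimensionality of the shared multimodal space


class ModalityEncoder(nn.Module):
    """Projects one modality's features into the shared space (hypothetical)."""

    def __init__(self, input_dim: int, shared_dim: int = SHARED_DIM):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(input_dim, shared_dim),
            nn.GELU(),
            nn.Linear(shared_dim, shared_dim),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Unit-norm embeddings make cosine-style contrastive alignment easy.
        return F.normalize(self.proj(x), dim=-1)


def contrastive_bridge_loss(text_emb: torch.Tensor, other_emb: torch.Tensor,
                            temperature: float = 0.07) -> torch.Tensor:
    """InfoNCE-style loss that pulls paired (text, other-modality) embeddings
    together, so every modality lands in the same text-anchored space."""
    logits = text_emb @ other_emb.t() / temperature   # (batch, batch) similarities
    targets = torch.arange(text_emb.size(0))          # matching pairs on the diagonal
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.t(), targets)) / 2


def compose_condition(embeddings: list) -> torch.Tensor:
    """Combine aligned embeddings from any subset of input modalities into one
    conditioning vector (a simple average here, as a stand-in for a learned
    or weighted combination)."""
    return torch.stack(embeddings, dim=0).mean(dim=0)


if __name__ == "__main__":
    text_enc, audio_enc = ModalityEncoder(768), ModalityEncoder(128)
    text_feats, audio_feats = torch.randn(4, 768), torch.randn(4, 128)
    t, a = text_enc(text_feats), audio_enc(audio_feats)
    loss = contrastive_bridge_loss(t, a)   # alignment objective during training
    cond = compose_condition([t, a])       # joint condition for a diffusion model
    print(loss.item(), cond.shape)         # scalar loss, torch.Size([4, 512])
```

Anchoring the alignment to text reflects a practical observation made in the paper: paired data with text exists for most modalities, so aligning everything to text yields a common space without needing training data for every possible modality pair.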