Computer Vision Group

Empowering technologies for real-world vision-based systems

News and features

Microsoft Research Blog

ACAV100M: Scaling up self-supervised audio-visual learning with automatically curated internet videos

October 28, 2021 | Yale Song

The natural association between visual observations and their corresponding sounds has exhibited powerful self-supervision signals for learning video representations, which makes the ever-growing amount of online video an attractive data source for self-supervised learning. However, online…

Microsoft Research Blog

Microsoft and NVIDIA introduce parameter-efficient multimodal transformers for video representation learning

May 17, 2021 | Yale Song

Understanding video is one of the most challenging problems in AI, and an important underlying requirement is learning multimodal representations that capture information about objects, actions, sounds, and their long-range statistical dependencies from audio-visual signals. Recently, transformers have been successful in…