News and features
Microsoft Research Blog
ACAV100M: Scaling up self-supervised audio-visual learning with automatically curated internet videos
| Yale Song
The natural association between visual observations and their corresponding sounds has exhibited powerful self-supervision signals for learning video representations, which makes the ever-growing amount of online video an attractive data source for self-supervised learning. However, online…
Microsoft Research Blog
Microsoft and NVIDIA introduce parameter-efficient multimodal transformers for video representation learning
| Yale Song
Understanding video is one of the most challenging problems in AI, and an important underlying requirement is learning multimodal representations that capture information about objects, actions, sounds, and their long-range statistical dependencies from audio-visual signals. Recently, transformers have been successful in…