Microsoft at CVPR 2024: Innovations in computer vision and AI research

Published June 17, 2024

Share this page

Microsoft is proud to sponsor the 41st annual Conference on Computer Vision and Pattern Recognition (CVPR 2024), held from June 17 to June 21. This premier conference covers a broad spectrum of topics in the field, including 3D reconstruction and modeling, action and motion analysis, video and image processing, synthetic data generation, neural networks, and many more. This year, 63 papers from Microsoft have been accepted, with six selected for oral presentations. This post highlights these contributions.

The diversity of these research projects reflects the interdisciplinary approach that Microsoft research teams have taken, from techniques that precisely recreate 3D human figures and perspectives in augmented reality (AR) to combining advanced image segmentation with synthetic data to better replicate real-world scenarios. Other projects demonstrate how researchers are combining machine learning with natural language processing and structured data, developing models that not only visualize but also interact with their environments. Collectively, these projects aim to improve machine perception and enable more accurate and responsive interactions with the world.

Oral presentations

BIOCLIP: A Vision Foundation Model for the Tree of Life

Samuel Stevens, Jiaman Wu, Matthew J Thompson, Elizabeth G. Campolongo, Chan Hee Song, David Carlyn, Li Dong, W. Dahdul, Charles Stewart, Tanya Y. Berger-Wolf, Wei-Lun Chao, Yu Su

The surge in images captured from diverse sources—from drones to smartphones—offers a rich source of biological data. To harness this potential, we introduce TreeOfLife-10M, the largest and most diverse ML-ready dataset of biology images, and BioCLIP, a foundation model intended for the biological sciences. BioCLIP, utilizing the TreeOfLife-10M’s vast array of organism images and structured knowledge, excels in fine-grained biological classification, outperforming existing models by significant margins and demonstrating strong generalizability.

EgoGen: An Egocentric Synthetic Data Generator

Gen Li, Kaifeng Zhao, Siwei Zhang, Xiaozhong Lyu, Mihai Dusmanu, Yan Zhang, Marc Pollefeys

A critical challenge in augmented reality (AR) is simulating realistic anatomical movements to guide cameras for authentic egocentric views. To overcome this, the authors developed EgoGen, a sophisticated synthetic data generator that not only improves training data accuracy for egocentric tasks but also refines the integration of motion and perception. It offers a practical solution for creating realistic egocentric training data, with the goal of serving as a useful tool for egocentric computer vision research.

Florence-2: Advancing a Unified Representation for a Variety of Vision Tasks

Bin Xiao, Haiping Wu, Weijian Xu, Xiyang Dai, Houdong Hu, Yumao Lu, Michael Zeng, Ce Liu, Lu Yuan

Florence-2 introduces a unified, prompt-based vision foundation model capable of handling a variety of tasks, from captioning to object detection and segmentation. Designed to interpret text prompts as task instructions, Florence-2 generates text outputs across a spectrum of vision and vision-language tasks. This model’s training utilizes the FLD-5B dataset, which includes 5.4 billion annotations on 126 million images, developed using an iterative strategy of automated image annotation and continual model refinement.

LISA: Reasoning Segmentation via Large Language Model

Xin Lai, Zhuotao Tian, Yukang Chen, Yanwei Li, Yuhui Yuan, Shu Liu, Jiaya Jia

This work introduces reasoning segmentation, a new segmentation task using complex query texts to generate segmentation masks. The authors also established a new benchmark, comprising over a thousand image-instruction-mask data samples, incorporating intricate reasoning and world knowledge for evaluation. Finally, the authors present Large Language Instructed Segmentation Assistant (LISA), a tool that combines the linguistic capabilities of large language models with the ability to produce segmentation masks. LISA effectively handles complex queries and shows robust zero-shot learning abilities, further enhanced by minimal fine-tuning.

MultiPly: Reconstruction of Multiple People from Monocular Video in the Wild

Zeren Jiang, Chen Guo, Manuel Kaufmann, Tianjian Jiang, Julien Valentin (opens in new tab), Otmar Hilliges, Jie Song

MultiPly is a new framework for reconstructing multiple people in 3D from single-camera videos in natural settings. This technique employs a layered neural representation for the entire scene, refined through layer-wise differentiable volume rendering. Enhanced by a hybrid instance segmentation that combines self-supervised 3D and promptable 2D techniques, it provides reliable segmentation even with close interactions. The process uses confidence-guided optimization to alternately refine human poses and shapes, achieving high-fidelity, consistent 3D models.

SceneFun3D: Fine-Grained Functionality and Affordance Understanding in 3D Scenes

Alexandros Delitzas, Ayça Takmaz, Federico Tombari, Robert Sumner, Marc Pollefeys, Francis Engelmann

Traditional 3D scene understanding methods are heavily focused on 3D sematic and instance segmentation, but the true challenge lies in interacting with functional interactive elements like handles, knobs, and buttons to achieve specific tasks. Enter SceneFun3D: a robust dataset featuring over 14,800 precise interaction annotations across 710 high-resolution real-world 3D indoor scenes. This dataset enriches scene comprehension with motion parameters and task-specific natural language descriptions, facilitating advanced research in functionality segmentation, task-driven affordance grounding, and 3D motion estimation.

Discover more about our work and contributions to CVPR 2024, including our full list of publications and sessions, on our conference webpage.