About
Bin Xiao is a Principal Research Manager in the Microsoft GenAI Group, where he leads multi-modality large language model development at Microsoft. His research interests include computer vision, deep learning, and multi-modality large language models. His representative works include Phi-3-Vision, the Florence models, and the High-Resolution Network (HRNet). More information can be found at https://leoxiaobin.github.io/.
Featured content
Phi-3-Vision
Phi-3-Vision-128K-Instruct is a lightweight, state-of-the-art open multimodal model built on datasets that include synthetic data and filtered publicly available websites, with a focus on very high-quality, reasoning-dense data in both text and vision. The model belongs to the Phi-3 model family, and this multimodal version supports a context length of 128K tokens. The model underwent a rigorous enhancement process, incorporating both supervised fine-tuning and direct preference optimization, to ensure precise instruction adherence and robust safety measures.
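A minimal usage sketch, assuming the model is loaded through the Hugging Face transformers interface under the model id microsoft/Phi-3-vision-128k-instruct; exact arguments and prompt format may differ from the official model card, so treat this as illustrative rather than authoritative.

```python
# Sketch: querying Phi-3-Vision about an image via transformers (assumed API).
from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor

model_id = "microsoft/Phi-3-vision-128k-instruct"
model = AutoModelForCausalLM.from_pretrained(
    model_id, device_map="cuda", torch_dtype="auto", trust_remote_code=True
)
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)

# "example.jpg" is a placeholder input image.
image = Image.open("example.jpg")
messages = [{"role": "user", "content": "<|image_1|>\nWhat is shown in this image?"}]
prompt = processor.tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
inputs = processor(prompt, [image], return_tensors="pt").to("cuda")

generate_ids = model.generate(
    **inputs, max_new_tokens=500, eos_token_id=processor.tokenizer.eos_token_id
)
# Strip the prompt tokens and decode only the newly generated answer.
generate_ids = generate_ids[:, inputs["input_ids"].shape[1]:]
print(processor.batch_decode(generate_ids, skip_special_tokens=True)[0])
```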
Florence-2: Advancing a Unified Representation for a Variety of Vision Tasks
We introduce Florence-2, a novel vision foundation model with a unified, prompt-based representation for a variety of computer vision and vision-language tasks. While existing large vision models excel in transfer learning, they struggle to perform a diversity of tasks with simple instructions, a capability that implies handling the complexity of varying spatial hierarchies and semantic granularities. Florence-2 was designed to take text prompts as task instructions and generate desirable results in text form, whether it be captioning, object detection, grounding, or segmentation. This multi-task learning setup demands large-scale, high-quality annotated data. To this end, we co-developed FLD-5B, which consists of 5.4 billion comprehensive visual annotations on 126 million images, using an iterative strategy of automated image annotation and model refinement. We adopted a sequence-to-sequence structure to train Florence-2 to perform versatile and comprehensive vision tasks. Extensive evaluations on numerous tasks demonstrated Florence-2 to be a strong vision foundation model contender with unprecedented zero-shot and fine-tuning capabilities.
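A minimal sketch of the prompt-based interface, assuming the transformers release under microsoft/Florence-2-large; the task-prompt tokens (for example "<OD>" for object detection) and post-processing helper are taken from that release and may differ in other versions.

```python
# Sketch: running a Florence-2 task prompt and parsing the text output (assumed API).
from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor

model_id = "microsoft/Florence-2-large"
model = AutoModelForCausalLM.from_pretrained(model_id, trust_remote_code=True)
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)

image = Image.open("example.jpg")   # placeholder input image
prompt = "<OD>"                     # object detection; other prompts select other tasks

inputs = processor(text=prompt, images=image, return_tensors="pt")
generated_ids = model.generate(
    input_ids=inputs["input_ids"],
    pixel_values=inputs["pixel_values"],
    max_new_tokens=1024,
    num_beams=3,
)
generated_text = processor.batch_decode(generated_ids, skip_special_tokens=False)[0]

# The model answers in text; the processor converts it into boxes and labels.
parsed = processor.post_process_generation(
    generated_text, task="<OD>", image_size=(image.width, image.height)
)
print(parsed)
```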
Florence: A New Foundation Model for Computer Vision
Automated visual understanding of our diverse and open world demands computer vision models that generalize well with minimal customization for specific tasks, similar to human vision. Computer vision foundation models, which are trained on diverse, large-scale datasets and can be adapted to a wide range of downstream tasks, are critical for this mission to solve real-world computer vision applications. ...
DaViT: Dual Attention Vision Transformers
In this work, we introduce Dual Attention Vision Transformers (DaViT), a simple yet effective vision transformer architecture that is able to capture global context while maintaining computational efficiency. We propose approaching the problem from an orthogonal angle: exploiting self-attention mechanisms with both “spatial tokens” and “channel tokens”. With spatial tokens, the spatial dimension defines the token scope, and the channel dimension defines the token feature dimension. With channel tokens, we have the inverse: the channel dimension defines the token scope, and the spatial dimension defines the token feature dimension. We further group tokens along the sequence direction for both spatial and channel tokens to maintain the linear complexity of the entire model. We show that these two self-attentions complement each other: (i) since each channel token contains an abstract representation of the entire image, the channel attention naturally captures global interactions and representations by taking all spatial positions into account when computing attention scores between channels; (ii) the spatial attention refines the local representations by performing fine-grained interactions across spatial locations, which in turn helps the global information modeling in channel attention. Extensive experiments show DaViT backbones achieve state-of-the-art performance on four different tasks. Specifically, DaViT-Tiny, DaViT-Small, and DaViT-Base achieve 82.8%, 84.2%, and 84.6% top-1 accuracy on ImageNet-1K without extra training data, using 28.3M, 49.7M, and 87.9M parameters, respectively. When we further scale up DaViT with 1.5B weakly supervised image-text pairs, DaViT-Giant reaches 90.4% top-1 accuracy on ImageNet-1K. Code is available at https://github.com/microsoft/DaViT.
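To make the "channel tokens" idea concrete, here is a minimal PyTorch sketch of grouped channel self-attention under assumed tensor shapes; the module name, grouping, and scaling factor are illustrative simplifications, not the official DaViT implementation (see the linked repository for that).

```python
# Sketch: channel-wise self-attention where channels act as tokens (assumed shapes).
import torch
import torch.nn as nn

class ChannelGroupAttention(nn.Module):
    def __init__(self, dim, groups=8):
        super().__init__()
        self.groups = groups                      # channel groups keep the attention matrix small
        self.qkv = nn.Linear(dim, dim * 3)
        self.proj = nn.Linear(dim, dim)

    def forward(self, x):                         # x: (B, N, C), N = H * W spatial positions
        B, N, C = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)

        def to_channel_tokens(t):
            # (B, N, C) -> (B, groups, C/groups, N): channels become tokens,
            # spatial positions become the feature dimension of each token.
            return t.reshape(B, N, self.groups, C // self.groups).permute(0, 2, 3, 1)

        q, k, v = map(to_channel_tokens, (q, k, v))
        attn = (q @ k.transpose(-2, -1)) * (N ** -0.5)   # (B, g, C/g, C/g), global by construction
        out = attn.softmax(dim=-1) @ v                   # mix channels using image-wide statistics
        out = out.permute(0, 3, 1, 2).reshape(B, N, C)   # back to (B, N, C)
        return self.proj(out)

x = torch.randn(2, 14 * 14, 64)
print(ChannelGroupAttention(64)(x).shape)                # torch.Size([2, 196, 64])
```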
Unified Contrastive Learning in Image-Text-Label Space
Visual recognition is recently learned via either supervised learning on human-annotated image-label data or language-image contrastive learning with web-crawled image-text pairs. While supervised learning may result in a more discriminative representation, language-image pretraining shows unprecedented zero-shot recognition capability, largely due to the different data sources and learning objectives. In this work, we introduce a new formulation by combining the two data sources into a common image-text-label space. In this new space, we further propose a new learning method, called Unified Contrastive Learning (UniCL), with a single learning objective to seamlessly prompt the synergy between the two types of data. Extensive experiments show that our UniCL is an effective way of learning semantically rich yet discriminative representations, universally for zero-shot, linear-probing, full fine-tuning, and transfer learning scenarios. Particularly, it attains gains of up to 9.2% and 14.5% on average on zero-shot recognition benchmarks over the language-image contrastive learning and supervised learning methods, respectively. In the linear probing setting, it also boosts the performance over the two methods by 7.3% and 3.4%, respectively. Our further study indicates that UniCL is also a good learner on pure image-label data, rivaling supervised learning methods across three image classification datasets and two types of vision backbones, ResNet and Swin Transformer. Code is available at: https://github.com/microsoft/UniCL.
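The core of the image-text-label formulation is that positives are defined by shared labels rather than only by the diagonal of the similarity matrix. The sketch below shows an assumed form of such a bidirectional contrastive loss; the function name, temperature, and normalization choices are illustrative, not the official UniCL code (see the linked repository).

```python
# Sketch: label-aware bidirectional contrastive loss in an image-text-label space (assumed form).
import torch
import torch.nn.functional as F

def unified_contrastive_loss(image_feats, text_feats, labels, temperature=0.07):
    """image_feats, text_feats: (B, D); labels: (B,) integer labels.
    Image-label data maps class names to texts with shared labels; image-text
    data assigns each caption a unique label, recovering the CLIP-style loss."""
    image_feats = F.normalize(image_feats, dim=-1)
    text_feats = F.normalize(text_feats, dim=-1)
    logits = image_feats @ text_feats.t() / temperature            # (B, B)

    # Any image-text pair sharing a label counts as a positive.
    match = (labels.unsqueeze(1) == labels.unsqueeze(0)).float()   # symmetric (B, B)
    targets = match / match.sum(dim=1, keepdim=True)               # normalize per anchor

    loss_i2t = -(targets * F.log_softmax(logits, dim=1)).sum(dim=1).mean()
    loss_t2i = -(targets * F.log_softmax(logits.t(), dim=1)).sum(dim=1).mean()
    return 0.5 * (loss_i2t + loss_t2i)

feats_i, feats_t = torch.randn(8, 128), torch.randn(8, 128)
labels = torch.tensor([0, 0, 1, 2, 2, 2, 3, 4])
print(unified_contrastive_loss(feats_i, feats_t, labels))
```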
CvT: Introducing Convolutions to Vision Transformers
This is an official implementation of CvT: Introducing Convolutions to Vision Transformers. We present a new architecture, named Convolutional vision Transformer (CvT), that improves Vision Transformer (ViT) in performance and efficiency by introducing convolutions into ViT to yield the best of both designs.
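One way convolutions enter the attention block is through a convolutional projection: tokens are reshaped back into a 2D map and q/k/v are produced with depthwise separable convolutions. The sketch below illustrates that idea under assumed shapes; the module name and layer choices are simplifications rather than the official CvT code.

```python
# Sketch: convolutional projection of tokens before attention (assumed shapes).
import torch
import torch.nn as nn

class ConvProjection(nn.Module):
    def __init__(self, dim, kernel_size=3, stride=1):
        super().__init__()
        pad = kernel_size // 2
        # Depthwise + pointwise convolution adds a local inductive bias cheaply.
        # A stride > 1 can be used for keys/values to reduce the token count.
        self.dw = nn.Conv2d(dim, dim, kernel_size, stride, pad, groups=dim)
        self.bn = nn.BatchNorm2d(dim)
        self.pw = nn.Conv2d(dim, dim, kernel_size=1)

    def forward(self, x, h, w):                   # x: (B, N, C) with N = h * w
        B, N, C = x.shape
        x = x.transpose(1, 2).reshape(B, C, h, w)  # tokens -> 2D feature map
        x = self.pw(self.bn(self.dw(x)))
        return x.flatten(2).transpose(1, 2)        # back to token sequence

x = torch.randn(2, 14 * 14, 64)
print(ConvProjection(64)(x, 14, 14).shape)         # torch.Size([2, 196, 64])
```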
HigherHRNet: Scale-Aware Representation Learning for Bottom-Up Human Pose Estimation
Bottom-up human pose estimation methods have difficulty predicting the correct pose for small persons due to challenges in scale variation. In this paper, we present HigherHRNet, a novel bottom-up human pose estimation method for learning scale-aware representations using high-resolution feature pyramids. Equipped with multi-resolution supervision for training and multi-resolution aggregation for inference, the proposed approach is able to address the scale variation challenge in bottom-up multi-person pose estimation and localize keypoints more precisely, especially for small persons.
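As a rough illustration of multi-resolution aggregation at inference time, the sketch below upsamples heatmaps from all pyramid levels to the finest resolution and averages them; this is an assumed simplification of the idea, not the official HigherHRNet inference code.

```python
# Sketch: averaging keypoint heatmaps across resolutions (assumed form).
import torch
import torch.nn.functional as F

def aggregate_heatmaps(heatmaps):
    """heatmaps: list of (B, K, H_i, W_i) tensors from coarse-to-fine heads,
    with heatmaps[0] at the highest resolution. Returns (B, K, H_0, W_0)."""
    target_size = heatmaps[0].shape[-2:]
    upsampled = [
        F.interpolate(h, size=target_size, mode="bilinear", align_corners=False)
        for h in heatmaps
    ]
    return torch.stack(upsampled, dim=0).mean(dim=0)

hm_fine, hm_coarse = torch.rand(1, 17, 128, 128), torch.rand(1, 17, 64, 64)
print(aggregate_heatmaps([hm_fine, hm_coarse]).shape)   # torch.Size([1, 17, 128, 128])
```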
High-Resolution Network: A universal neural architecture for visual recognition
Since AlexNet was invented in 2012, there has been rapid development in convolutional neural network architectures in computer vision. Representative architectures (Figure 1) include GoogLeNet (2014), VGGNet (2014),…
Deep High-Resolution Representation Learning for Human Pose Estimation
This is an official PyTorch implementation of Deep High-Resolution Representation Learning for Human Pose Estimation. ...
Simple Baselines for Human Pose Estimation and Tracking
This is an official PyTorch implementation of Simple Baselines for Human Pose Estimation and Tracking.