Zero-Shot Detection via Vision and Language Knowledge Distillation

In this talk, I will introduce our recent work on ViLD, a training method based on Vision and Language knowledge Distillation. We distill the knowledge from a pre-trained zero-shot image classification model (e.g., CLIP) into a two-stage detector (e.g., Mask R-CNN). Our method aligns the region embeddings in the detector with the text and image embeddings inferred by the pre-trained model. We use the text embeddings, obtained by feeding category names into the pre-trained text encoder, as the detection classifier. We then minimize the distance between the region embeddings and the image embeddings, obtained by feeding region proposals into the pre-trained image encoder. During inference, we add the text embeddings of novel categories to the detection classifier for zero-shot detection. We benchmark performance on the LVIS v1.0 dataset by holding out all rare categories as novel categories. ViLD obtains 16.1 mask APr with a Mask R-CNN (ResNet-50 FPN) for zero-shot detection, outperforming the supervised counterpart by 3.8. The model can transfer directly to other datasets, achieving 72.2 AP50, 36.6 AP and 11.8 AP on PASCAL VOC, COCO and Objects365, respectively.
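The two ingredients described above, a classifier built from text embeddings and a distance term that pulls region embeddings toward the teacher's image embeddings, can be illustrated with a small NumPy sketch. This is only an illustration under stated assumptions, not the implementation presented in the talk: the function names, the softmax temperature, the use of cosine similarity, and the choice of an L1 distance are all assumptions, random vectors stand in for the outputs of the pre-trained encoders and the detector, and details such as background handling are omitted.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    """Scale each row to unit length so dot products become cosine similarities."""
    norms = np.linalg.norm(x, axis=-1, keepdims=True)
    return x / np.maximum(norms, eps)


def classify_regions(region_emb, text_emb, temperature=0.01):
    """Score regions against category text embeddings (hypothetical helper).

    region_emb: (num_regions, dim) embeddings from the detector.
    text_emb:   (num_categories, dim) embeddings of the category names.
    The temperature value is an assumption made for this sketch.
    Returns per-region probabilities over the categories.
    """
    logits = l2_normalize(region_emb) @ l2_normalize(text_emb).T / temperature
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    probs = np.exp(logits)
    return probs / probs.sum(axis=1, keepdims=True)


def distillation_loss(region_emb, image_emb):
    """Mean L1 distance between normalized region and teacher image embeddings.

    L1 is an assumed choice of distance for this sketch.
    """
    diff = l2_normalize(region_emb) - l2_normalize(image_emb)
    return np.abs(diff).sum(axis=1).mean()


# Stand-in data: random vectors instead of real encoder outputs.
rng = np.random.default_rng(0)
dim = 8
base_text_emb = rng.normal(size=(3, dim))    # categories seen during training
novel_text_emb = rng.normal(size=(2, dim))   # categories added only at inference
region_emb = rng.normal(size=(4, dim))       # detector embeddings for 4 proposals
teacher_image_emb = rng.normal(size=(4, dim))  # teacher embeddings of the same crops

# Training-time quantities: classify over base categories, distill from the teacher.
train_probs = classify_regions(region_emb, base_text_emb)
loss = distillation_loss(region_emb, teacher_image_emb)

# Inference: extend the classifier by appending the novel text embeddings.
all_text_emb = np.concatenate([base_text_emb, novel_text_emb], axis=0)
infer_probs = classify_regions(region_emb, all_text_emb)
print(train_probs.shape, infer_probs.shape, float(loss))
```

Because the classifier in this sketch is just a matrix of text embeddings, supporting a new category at inference amounts to appending one more row; no detector weights change.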

Speaker Details

Yin Cui is a Research Scientist at Google. Yin’s research in learning-based computer vision focuses on label efficiency and multimodal learning. Before joining Google, he received a Ph.D. in Computer Science from Cornell University and Cornell Tech in 2019, advised by Professor Serge Belongie. Yin has co-organized COCO Visual Recognition Workshops and Fine-Grained Visual Categorization Workshops at major computer vision conferences.

Date:
Speakers: Yin Cui
Affiliation: Google

Series: Microsoft Vision+Language Summer Talk Series