{"id":851994,"date":"2022-06-17T16:20:12","date_gmt":"2022-06-17T23:20:12","guid":{"rendered":"https:\/\/www.microsoft.com\/en-us\/research\/?post_type=msr-blog-post&p=851994"},"modified":"2022-09-13T13:44:19","modified_gmt":"2022-09-13T20:44:19","slug":"object-detection-in-the-wild-via-grounded-language-image-pre-training","status":"publish","type":"msr-blog-post","link":"https:\/\/www.microsoft.com\/en-us\/research\/articles\/object-detection-in-the-wild-via-grounded-language-image-pre-training\/","title":{"rendered":"Object Detection in the Wild via Grounded Language Image Pre-training"},"content":{"rendered":"\n

Visual recognition systems are typically trained to predict a fixed set of predetermined object categories in a specific domain, which limits their usability in real-world applications. <em>How can we build a model that generalizes to various concepts and domains with minimal annotations?<\/em> While great progress has been made on coarse-grained (<em>image-level<\/em>) recognition such as CLIP, generalizable fine-grained (<em>object-level<\/em>) localization ability (e.g., object detection) remains an open challenge. Existing detection and segmentation models are \u201cgood at one task but one task only and require significant effort to adapt to a new task\u201d.<\/p>\n\n\n\n

In this blog, we introduce our recent efforts on building a generalizable localization model with language supervision (GLIP). <strong>GLIP<\/strong> and <strong>GLIPv2<\/strong> enable the unification of localization and vision-language understanding, paving the way towards a unified CV foundation model. GLIP was accepted at CVPR 2022 and selected as a Best Paper Finalist.<\/p>\n\n\n\n

GLIP (Grounded Language-Image Pre-training) is a generalizable object detection model (<em>we use object detection as the representative of localization tasks<\/em>). As illustrated in Figure 1, it is language-aware, taking a natural language prompt as instruction. It is also semantic-rich, able to detect millions of visual concepts out of the box. GLIPv2 further extends this ability to instance segmentation and grounded vision-language understanding tasks; see examples in Figure 2. GLIP introduces language into object detection and leverages self-training techniques to pre-train on scalable and semantic-rich data: 24M grounded image-caption pairs. This marks a milestone towards generalizable localization models: as shown in Figure 3, GLIP enjoys strong zero-shot and few-shot transfer ability, similar to that of CLIP\/GPT-2\/GPT-3. We also release a HuggingFace Demo. Feel free to give it a try.<\/p>\n\n\n\n

\"Figure
Figure 1: GLIP detects objects based on a text prompt. Its zero-shot performance surpasses supervised detection models on established benchmarks (COCO & LVIS) and generalizes to various downstream tasks \u2013 the Object Detection in the Wild Benchmark (ODinW), introduced in GLIP. The visualizations are from the zero-shot (not trained on any of the task data) GLIP.<\/figcaption><\/figure>\n\n\n\n
\"Figure
Figure 2: GLIPv2 extends the generalization ability of GLIP to instance\/referring segmentation (Row 1 and 2) and grounded vision-language understanding tasks, such as grounded VQA (Row 3) and grounded image captioning (Row 4).<\/figcaption><\/figure>\n\n\n\n
\"Figure
Figure 3. (Left) GLIP shows great data efficiency on 13 downstream tasks (ODinW): zero-shot GLIP rivals few-shot baselines; few-shot GLIP rivals fully supervised baselines. (Right) Prompt tuning with GLIP almost matches full fine-tuning.<\/figcaption><\/figure>\n\n\n\n

Object detection as a vision-language task<\/h3>\n\n\n\n
\"Figure
Figure 4. Architecture of GLIP.<\/figcaption><\/figure>\n\n\n\n

<strong>At the core of GLIP is the reformulation of object detection as a vision-language task:<\/strong> the model is not trained to predict objects with a multi-class classifier for specific benchmarks; rather, we reformulate object detection as phrase grounding. The model takes in an image and a text prompt \u2013 either a synthesized sentence as a concatenation of category names (for detection) or a natural language sentence (for phrase grounding); the task is to identify the correspondence between phrases in the prompt and objects (or regions) in an image.<\/p>\n\n\n\n
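To make the reformulation concrete, here is a minimal sketch in PyTorch (illustrative only, not the released GLIP code) of how a detection label space becomes a grounding prompt and how classification logits become region-to-token alignment scores; the feature tensors are random stand-ins for real encoder outputs.

```python
# Minimal sketch of detection-as-grounding (illustrative, not the released GLIP code).
import torch

def build_detection_prompt(categories):
    # For detection, the "sentence" is simply the category names joined by " . ".
    return " . ".join(categories) + " ."

def alignment_scores(region_features, token_features):
    # region_features: (num_regions, d) visual features from the image encoder
    # token_features:  (num_tokens, d)  contextual features from the language encoder
    # Classification logits are replaced by region-to-token dot products;
    # a phrase's score is aggregated over its tokens downstream.
    return region_features @ token_features.T  # (num_regions, num_tokens)

prompt = build_detection_prompt(["person", "bicycle", "hair dryer"])
print(prompt)  # person . bicycle . hair dryer .

# Random features standing in for real encoder outputs.
regions = torch.randn(100, 256)
tokens = torch.randn(len(prompt.split()), 256)
print(alignment_scores(regions, tokens).shape)  # torch.Size([100, 7])
```

With this formulation, the same scoring function handles a synthesized detection prompt and a free-form caption: only the text fed to the language encoder changes.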

We also introduce deep fusion into the model. The language features are computed with a language model, which gives the new detection (or grounding) model a dual-encoder structure. Unlike CLIP, which fuses vision and language only at the final dot-product layer, GLIP applies deep cross-modality fusion, as shown in Figure 4 (middle); we show that this fusion is crucial for learning high-quality, language-aware visual representations.<\/p>\n\n\n\n
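The sketch below shows one way such a fusion step could be implemented with bidirectional cross-attention; the layer choices and tensor shapes are illustrative and simpler than GLIP's exact design.

```python
# A rough sketch of one cross-modality fusion step (illustrative, not GLIP's exact module).
import torch
import torch.nn as nn

class CrossModalFusionLayer(nn.Module):
    def __init__(self, dim=256, heads=8):
        super().__init__()
        # Image features attend to text, and text features attend to image.
        self.img_to_txt = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.txt_to_img = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, img_feats, txt_feats):
        # img_feats: (B, num_regions, dim); txt_feats: (B, num_tokens, dim)
        img_updated, _ = self.img_to_txt(img_feats, txt_feats, txt_feats)
        txt_updated, _ = self.txt_to_img(txt_feats, img_feats, img_feats)
        # Residual connections keep each stream's original information.
        return img_feats + img_updated, txt_feats + txt_updated

fusion = CrossModalFusionLayer()
img = torch.randn(2, 100, 256)   # region features
txt = torch.randn(2, 16, 256)    # prompt token features
img, txt = fusion(img, txt)      # language-aware visual features, and vice versa
```

Stacking several such layers before the final alignment scores is what makes the visual representation language-aware, rather than fusing only at the last dot product as in CLIP.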

\"Figure
Figure 5. Grounding predictions from GLIP. GLIP can locate rare entities, phrases with attributes, and even abstract words.<\/figcaption><\/figure>\n\n\n\n

This reformulation allows us to pre-train GLIP on <em>scalable<\/em> and <em>semantic-rich<\/em> data: millions of image-caption pairs with millions of unique grounded phrases. Given a good grounding model (a teacher GLIP trained on a moderate amount of gold grounding data), we can automatically generate grounding boxes for massive image-text data and train a student GLIP model. We showcase two real examples of the generated boxes in Figure 5. Training on such semantic-rich data delivers a semantic-rich student model. In contrast, prior work on scaling detection data simply cannot predict concepts outside the teacher models’ pre-defined vocabulary.<\/p>\n\n\n\n
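The self-training loop below summarizes this teacher-student recipe in pseudocode-style Python; `teacher_glip`, `student_glip`, and their `ground` and `train_step` methods are hypothetical stand-ins rather than the released GLIP API.

```python
# Pseudocode-style sketch of self-training on web image-caption pairs.
# `teacher_glip` and `student_glip` are hypothetical objects, not the released GLIP API.

def generate_pseudo_grounding(teacher_glip, image, caption, score_threshold=0.5):
    # The teacher grounds phrases from the caption to boxes in the image.
    predictions = teacher_glip.ground(image, caption)  # [(phrase, box, score), ...]
    return [(phrase, box) for phrase, box, score in predictions if score >= score_threshold]

def self_training_epoch(student_glip, teacher_glip, image_caption_pairs):
    for image, caption in image_caption_pairs:
        pseudo_boxes = generate_pseudo_grounding(teacher_glip, image, caption)
        if pseudo_boxes:
            # The caption serves as the prompt; the teacher's boxes are (noisy) grounding targets.
            student_glip.train_step(image, prompt=caption, targets=pseudo_boxes)
```

Because the pseudo-labels come from free-form captions rather than a fixed label set, the student's vocabulary is only bounded by what people write about images.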

Flexible transfer ability<\/h3>\n\n\n\n

<strong>Zero-shot GLIP can surpass established supervised models:<\/strong> GLIP can \u201czero-shot\u201d transfer to a new detection task by simply rewriting the candidate categories into a language prompt. See Figure 1 and Figure 3 (left; data amount = 0) for examples.<\/p>\n\n\n\n

When writing the prompt, one could take the default approach of simply concatenating all the object names with \u201c . \u201d; one could also inject domain knowledge by describing rare objects with attributes and language context. In Table 1 below, we design custom prompts for 6 datasets and observe significant performance improvements without any parameter change; an illustrative example of the two prompting styles follows the table.<\/p>\n\n\n\n

\"Table
Table 1. Transfer to novel concepts by writing descriptive prompts.<\/figcaption><\/figure>\n\n\n\n
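To make the two prompting styles concrete, here is an illustrative example; the category names and descriptions below are made up for illustration and are not the exact prompts used for the datasets in Table 1.

```python
# Illustrative only: a hypothetical aquarium-style label space, not the prompts
# actually used in the paper.
categories = ["stingray", "jellyfish", "starfish"]

# Default prompt: concatenate the raw category names.
default_prompt = " . ".join(categories) + " ."

# Prompt with injected domain knowledge: describe rare or ambiguous objects
# with attributes and language context, with no change to model parameters.
descriptive_prompt = (
    "stingray, which is flat and round with a long tail . "
    "jellyfish, which is translucent with soft trailing tentacles . "
    "starfish, which has five arms ."
)
```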

<strong>Few-shot \/ full-data fine-tuning<\/strong>: GLIP serves as a strong pre-trained checkpoint for easy adaptation to various tasks. When fine-tuned on COCO, GLIP (Large) achieves 60.8 AP on COCO 2017val and 61.5 on test-dev, surpassing the current public SoTA models; on the 13 downstream tasks, a 1-shot GLIP rivals a fully supervised Dynamic Head (see Figure 3).<\/p>\n\n\n\n

<table><tbody><tr><td>COCO<\/td><td>PascalVOC<\/td><td>AerialDrone<\/td><td>Aquarium<\/td><td>Rabbits<\/td><td>EgoHands<\/td><td>Mushrooms<\/td><\/tr>
<tr><td>58.8<\/td><td>72.9\/86.7<\/td><td>23.0<\/td><td>51.8<\/td><td>72.0<\/td><td>75.8<\/td><td>88.1<\/td><\/tr>
<tr><td>Packages<\/td><td>Racoon<\/td><td>Shellfish<\/td><td>Vehicles<\/td><td>Pistols<\/td><td>Pothole<\/td><td>Thermal<\/td><\/tr>
<tr><td>75.2<\/td><td>69.5<\/td><td>73.6<\/td><td>72.1<\/td><td>73.7<\/td><td>53.5<\/td><td>81.4<\/td><\/tr><\/tbody><\/table>
Table 2. A single GLIP model performs well across all tasks.<\/figcaption><\/figure>\n\n\n\n

<strong>One model for all detection tasks through prompt tuning<\/strong>: GLIP takes a language prompt as input; thus one can change the model predictions by tuning only the prompt embeddings. This is similar to linear probing, but the key difference is that the language and visual representations in GLIP are deeply fused. In Figure 3 (right), prompt tuning on GLIP almost matches full fine-tuning, while linear probing a conventional object detector cannot. This makes deploying GLIP efficient: one GLIP model can simultaneously perform well on all downstream tasks, reducing the fine-tuning and deployment cost. See Table 2 above.<\/p>\n\n\n\n
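Below is a minimal sketch of this idea in PyTorch, assuming a generic `glip` module and illustrative hyper-parameters (not the actual GLIP training code): all pre-trained weights are frozen, and only a small set of task-specific prompt embeddings is optimized.

```python
# Minimal prompt-tuning sketch: freeze the model, train only prompt embeddings.
import torch
import torch.nn as nn

def setup_prompt_tuning(glip: nn.Module, num_prompt_tokens=16, dim=256, lr=1e-3):
    # Freeze every weight of the pre-trained model.
    for p in glip.parameters():
        p.requires_grad = False
    # The only trainable parameters are the task-specific prompt embeddings,
    # which are fed through the deep-fusion layers like ordinary text features.
    prompt_embeddings = nn.Parameter(torch.randn(num_prompt_tokens, dim) * 0.02)
    optimizer = torch.optim.AdamW([prompt_embeddings], lr=lr)
    return prompt_embeddings, optimizer

# Usage (with a real pre-trained model in place of `glip`):
# prompt_embeddings, optimizer = setup_prompt_tuning(glip)
```

Because only the tiny prompt tensor differs per task, a single frozen GLIP backbone can be shared across all downstream detection tasks at deployment time.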

GLIPv2: Unifying localization and vision-language understanding<\/h3>\n\n\n\n
\"Figure
Figure 6. GLIPv2 can perform a wide range of tasks.<\/figcaption><\/figure>\n\n\n\n

The development of a general-purpose CV foundation model has been hindered by the distinction between localization tasks (traditionally considered single-modality tasks) and vision-language (VL) understanding tasks such as visual question answering and image captioning. The reformulation technique in GLIP opens a new door: we can turn every localization task (e.g., object detection and instance segmentation) into a vision-language task. We introduce the recently upgraded GLIPv2, a unified model that handles various localization and VL understanding tasks with a single architecture; it shows mutual benefits between localization and VL understanding.<\/p>\n\n\n\n

\"Table
Table 3. GLIPv2 achieves near-SoTA performance on various localization and VL understanding tasks.<\/figcaption><\/figure>\n\n\n\n