{"id":851994,"date":"2022-06-17T16:20:12","date_gmt":"2022-06-17T23:20:12","guid":{"rendered":"https:\/\/www.microsoft.com\/en-us\/research\/?post_type=msr-blog-post&p=851994"},"modified":"2022-09-13T13:44:19","modified_gmt":"2022-09-13T20:44:19","slug":"object-detection-in-the-wild-via-grounded-language-image-pre-training","status":"publish","type":"msr-blog-post","link":"https:\/\/www.microsoft.com\/en-us\/research\/articles\/object-detection-in-the-wild-via-grounded-language-image-pre-training\/","title":{"rendered":"Object Detection in the Wild via Grounded Language Image Pre-training"},"content":{"rendered":"\n
Visual recognition systems are typically trained to predict a fixed set of predetermined object categories in a specific domain, which limits their usability in real-world applications. *How can we build a model that generalizes to various concepts and domains with minimal annotations?* While great progress has been made on coarse-grained (*image-level*) recognition such as CLIP, generalizable fine-grained (*object-level*) localization (e.g., object detection) remains an open challenge. Existing detection and segmentation models are "good at one task but one task only, and require significant effort to adapt to a new task."

In this blog, we introduce our recent work on building a generalizable localization model with language supervision: GLIP. **GLIP** and **GLIPv2** unify localization and vision-language understanding, paving the way toward a unified computer-vision foundation model. GLIP was accepted at CVPR 2022 and selected as a Best Paper Finalist.

GLIP (Grounded Language-Image Pre-training) is a generalizable object detection model (we use object detection as the representative localization task). As illustrated in Figure 1, it is language aware, taking a natural-language prompt as instruction, and semantic rich, able to detect millions of visual concepts out of the box. GLIPv2 further extends this ability to instance segmentation and grounded vision-language understanding tasks; see examples in Figure 2. GLIP introduces language into object detection and leverages self-training to pre-train on scalable, semantic-rich data: 24M grounded image-caption pairs. This marks a milestone toward generalizable localization models: as shown in Figure 3, GLIP enjoys zero-shot and few-shot transfer ability similar to that of CLIP, GPT-2, and GPT-3. We also released a HuggingFace demo; feel free to give it a try.

### Object detection as a vision-language task

**At the core of GLIP is the reformulation of object detection as a vision-language task:** the model is not trained to predict objects with a multi-class classifier for specific benchmarks; rather, we reformulate object detection as phrase grounding. The model takes in an image and a text prompt, either a synthesized sentence formed by concatenating category names (for detection) or a natural-language sentence (for phrase grounding); the task is to identify the correspondence between phrases in the prompt and objects (or regions) in the image.

We also introduce deep fusion into the model. The language features are computed with a language model, which gives the new detection (or grounding) model a dual-encoder structure. Unlike CLIP, which fuses vision and language only at the final dot-product layer, GLIP applies deep cross-modality fusion, as shown in Figure 4 (Middle); we show that this fusion is crucial for learning high-quality, language-aware visual representations.

This reformulation allows us to pre-train GLIP on *scalable* and *semantic-rich* data: millions of image-caption pairs with millions of unique grounded phrases. Given a good grounding model (a teacher GLIP trained on a moderate amount of gold grounding data), we can automatically generate grounding boxes for massive image-text pairs and train a student GLIP model.
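Below is a minimal sketch of this teacher-student self-training step, under simplifying assumptions: `teacher.ground` and `student.train_on` are hypothetical interfaces standing in for the real grounding and training code, and the real pipeline runs over millions of web image-caption pairs.

```python
# Hypothetical sketch of self-training for grounding (not the released GLIP code):
# a teacher grounding model labels phrases in web captions with boxes, and a
# student is then trained on gold grounding data plus these pseudo-grounded pairs.

def generate_pseudo_grounding(teacher, image_caption_pairs, score_threshold=0.5):
    """Return (image, caption, [(phrase, box), ...]) triples produced by the teacher."""
    pseudo_data = []
    for image, caption in image_caption_pairs:
        # Assumed interface: the teacher yields (phrase, box, confidence) triples
        # aligning phrases in the caption with regions in the image.
        predictions = teacher.ground(image, caption)
        kept = [(phrase, box) for phrase, box, score in predictions
                if score >= score_threshold]
        if kept:
            pseudo_data.append((image, caption, kept))
    return pseudo_data


def self_train(teacher, student, gold_grounding_data, web_image_caption_pairs):
    """Train the student on gold grounding data plus teacher-generated pseudo boxes."""
    pseudo_data = generate_pseudo_grounding(teacher, web_image_caption_pairs)
    student.train_on(gold_grounding_data + pseudo_data)  # hypothetical training call
    return student
```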
We showcase two real examples of the generated boxes in Figure 5. Training on such semantic-rich data delivers a semantic-rich student model. In contrast, prior work on scaling detection data simply cannot predict concepts outside the teacher model's pre-defined vocabulary.
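In tensor terms, the grounding formulation described above boils down to replacing classifier logits with alignment scores between region features and the token features of the prompt. The toy sketch below illustrates the idea with random stand-in features and a hypothetical tokenization; it is not GLIP's actual implementation.

```python
# Toy illustration: detection as region-word alignment (random stand-in features).
import torch

num_regions, dim = 100, 256
prompt = "person . bicycle . hair dryer ."       # synthesized detection prompt
tokens = prompt.split()                          # crude stand-in for a real tokenizer

region_feats = torch.randn(num_regions, dim)     # from the visual encoder, one per region
token_feats = torch.randn(len(tokens), dim)      # from the language encoder, one per token

# Alignment scores S[i, j] = <region_i, token_j> replace the usual classification logits.
alignment = region_feats @ token_feats.t()       # shape: (num_regions, num_tokens)

# A region's score for a category is aggregated over the tokens spelling that category.
category_spans = {"person": [0], "bicycle": [2], "hair dryer": [4, 5]}
category_scores = {name: alignment[:, idx].mean(dim=1) for name, idx in category_spans.items()}
best_region_for_person = category_scores["person"].argmax().item()
```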
### Flexible transfer ability

**Zero-shot GLIP can surpass established supervised models:** GLIP can "zero-shot" transfer to a new detection task by simply rewriting the candidate categories into a language prompt. See Figure 1 and Figure 3 (left; data amount = 0) for an example.

When writing the prompt, one can take the default approach of simply concatenating all the object names with " . "; one can also inject domain knowledge by describing rare objects with attributes and language context. See below, where we designed custom prompts for 6 datasets and observed significant performance improvements without any parameter change (a toy prompt-construction sketch also follows at the end of this section).

**Few-shot / full-data fine-tuning:** GLIP serves as a strong pre-trained checkpoint for easy adaptation to various tasks. When fine-tuned on COCO, GLIP (Large) achieves 60.8 AP on COCO 2017val and 61.5 AP on test-dev, surpassing the current public SoTA models; on 13 downstream tasks, a 1-shot GLIP rivals a fully supervised Dynamic Head (see Figure 3).
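To make the two prompt styles concrete, here is a toy prompt-construction helper. `build_prompt` and the example descriptions are hypothetical, for illustration only; they are not part of the released GLIP code and do not reproduce the custom prompts from the 6-dataset study.

```python
# Toy sketch: building a detection prompt from a dataset's category list.
# The default prompt concatenates category names with " . "; a custom prompt
# injects domain knowledge by describing rare categories with attributes.

def build_prompt(categories, descriptions=None):
    """Join category names (or richer descriptions, when provided) with ' . '."""
    descriptions = descriptions or {}
    phrases = [descriptions.get(name, name) for name in categories]
    return " . ".join(phrases) + " ."

# Default prompt: just the category names.
print(build_prompt(["person", "bicycle", "stingray"]))
# -> person . bicycle . stingray .

# Custom prompt: describe a rare category with attributes and language context.
print(build_prompt(
    ["person", "bicycle", "stingray"],
    descriptions={"stingray": "stingray, which is a flat fish with a long tail"},
))
# -> person . bicycle . stingray, which is a flat fish with a long tail .
```

Because the model's interface is just an image plus such a prompt, switching to a new detection task amounts to switching the prompt; the same checkpoint can then be used zero-shot or fine-tuned further in few-shot and full-data settings.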