{"id":851994,"date":"2022-06-17T16:20:12","date_gmt":"2022-06-17T23:20:12","guid":{"rendered":"https:\/\/www.microsoft.com\/en-us\/research\/?post_type=msr-blog-post&p=851994"},"modified":"2022-09-13T13:44:19","modified_gmt":"2022-09-13T20:44:19","slug":"object-detection-in-the-wild-via-grounded-language-image-pre-training","status":"publish","type":"msr-blog-post","link":"https:\/\/www.microsoft.com\/en-us\/research\/articles\/object-detection-in-the-wild-via-grounded-language-image-pre-training\/","title":{"rendered":"Object Detection in the Wild via Grounded Language Image Pre-training"},"content":{"rendered":"\n

Visual recognition systems are typically trained to predict a fixed set of predetermined object categories in a specific domain, which limits their usability in real-world applications. <em>How can we build a model that generalizes to various concepts and domains with minimal annotations?<\/em> While great progress has been made on coarse-grained (<em>image-level<\/em>) recognition such as CLIP, generalizable fine-grained (<em>object-level<\/em>) localization ability (e.g., object detection) remains an open challenge. Existing detection and segmentation models are \u201cgood at one task but one task only and require significant effort to adapt to a new task\u201d.<\/p>\n\n\n\n

In this blog, we introduce our recent efforts on building a generalizable localization model with language supervision (GLIP). <strong>GLIP<\/strong> and <strong>GLIPv2<\/strong> enable the unification of localization and vision-language understanding, paving the way towards a unified CV foundation model. GLIP was accepted at CVPR 2022 and selected as a Best Paper Finalist.<\/p>\n\n\n\n

GLIP (Grounded Language-Image Pre-training) is a generalizable object detection model (<em>we use object detection as the representative of localization tasks<\/em>). As illustrated in Figure 1, it is language-aware, taking a natural language prompt as instruction. It is also semantic-rich, able to detect millions of visual concepts out of the box. GLIPv2 further extends this ability to instance segmentation and grounded vision-language understanding tasks; see examples in Figure 2. GLIP introduces language into object detection and leverages self-training techniques to pre-train on scalable and semantic-rich data: 24M grounded image-caption pairs. This marks a milestone towards generalizable localization models: as shown in Figure 3, GLIP enjoys strong zero-shot and few-shot transfer ability, similar to that of CLIP\/GPT-2\/GPT-3. We also release a HuggingFace Demo. Feel free to give it a try.<\/p>\n\n\n\n

\"Figure
Figure 1: GLIP detects objects based on a text prompt. Its zero-shot performance surpasses supervised detection models on established benchmarks (COCO & LVIS) and generalizes to various downstream tasks \u2013 the Object Detection in the Wild Benchmark (ODinW), introduced in GLIP. The visualizations are from the zero-shot (not trained on any of the task data) GLIP.<\/figcaption><\/figure>\n\n\n\n
\"Figure
Figure 2: GLIPv2 extends the generalization ability of GLIP to instance\/referring segmentation (Row 1 and 2) and grounded vision-language understanding tasks, such as grounded VQA (Row 3) and grounded image captioning (Row 4).<\/figcaption><\/figure>\n\n\n\n
\"Figure
Figure 3. (Left) GLIP shows great data efficiency on 13 downstream tasks (ODinW): zero-shot GLIP rivals few-shot baselines; few-shot GLIP rivals fully supervised baselines. (Right) Prompt tuning with GLIP almost matches full fine-tuning.<\/figcaption><\/figure>\n\n\n\n

Object detection as a vision-language task<\/h3>\n\n\n\n
\"Figure
Figure 4. Architecture of GLIP.<\/figcaption><\/figure>\n\n\n\n

<strong>At the core of GLIP is the reformulation of object detection as a vision-language task:<\/strong> the model is not trained to predict objects with a multi-class classifier for specific benchmarks; rather, we reformulate object detection as phrase grounding. The model takes in an image and a text prompt \u2013 either a synthesized sentence as a concatenation of category names (for detection) or a natural language sentence (for phrase grounding); the task is to identify the correspondence between phrases in the prompt and objects (or regions) in an image.<\/p>\n\n\n\n
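To make the reformulation concrete, here is a minimal sketch in PyTorch (illustrative only, not the released GLIP code) of how a detection label space becomes a grounding prompt and how classification logits become region-to-token alignment scores; the feature tensors are random stand-ins for real encoder outputs.

```python
# Minimal sketch of detection-as-grounding (illustrative, not the released GLIP code).
import torch

def build_detection_prompt(categories):
    # For detection, the "sentence" is simply the category names joined by " . ".
    return " . ".join(categories) + " ."

def alignment_scores(region_features, token_features):
    # region_features: (num_regions, d) visual features from the image encoder
    # token_features:  (num_tokens, d)  contextual features from the language encoder
    # Classification logits are replaced by region-to-token dot products;
    # a phrase's score is aggregated over its tokens downstream.
    return region_features @ token_features.T  # (num_regions, num_tokens)

prompt = build_detection_prompt(["person", "bicycle", "hair dryer"])
print(prompt)  # person . bicycle . hair dryer .

# Random features standing in for real encoder outputs.
regions = torch.randn(100, 256)
tokens = torch.randn(len(prompt.split()), 256)
print(alignment_scores(regions, tokens).shape)  # torch.Size([100, 7])
```

With this formulation, the same scoring function handles a synthesized detection prompt and a free-form caption: only the text fed to the language encoder changes.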

We also introduce deep fusion into the model. The language features are computed with a language model, which gives the new detection (or grounding) model a dual-encoder structure. Unlike CLIP, which fuses vision and language only at the final dot-product layer, GLIP applies deep cross-modality fusion, as shown in Figure 4 (middle); we show that this fusion is crucial for learning high-quality, language-aware visual representations.<\/p>\n\n\n\n
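The sketch below shows one way such a fusion step could be implemented with bidirectional cross-attention; the layer choices and tensor shapes are illustrative and simpler than GLIP's exact design.

```python
# A rough sketch of one cross-modality fusion step (illustrative, not GLIP's exact module).
import torch
import torch.nn as nn

class CrossModalFusionLayer(nn.Module):
    def __init__(self, dim=256, heads=8):
        super().__init__()
        # Image features attend to text, and text features attend to image.
        self.img_to_txt = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.txt_to_img = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, img_feats, txt_feats):
        # img_feats: (B, num_regions, dim); txt_feats: (B, num_tokens, dim)
        img_updated, _ = self.img_to_txt(img_feats, txt_feats, txt_feats)
        txt_updated, _ = self.txt_to_img(txt_feats, img_feats, img_feats)
        # Residual connections keep each stream's original information.
        return img_feats + img_updated, txt_feats + txt_updated

fusion = CrossModalFusionLayer()
img = torch.randn(2, 100, 256)   # region features
txt = torch.randn(2, 16, 256)    # prompt token features
img, txt = fusion(img, txt)      # language-aware visual features, and vice versa
```

Stacking several such layers before the final alignment scores is what makes the visual representation language-aware, rather than fusing only at the last dot product as in CLIP.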

\"Figure
Figure 5. Grounding predictions from GLIP. GLIP can locate rare entities, phrases with attributes, and even abstract words.<\/figcaption><\/figure>\n\n\n\n

This reformulation allows us to pre-train GLIP on <em>scalable<\/em> and <em>semantic-rich<\/em> data: millions of image-caption pairs with millions of unique grounded phrases. Given a good grounding model (a teacher GLIP trained on a moderate amount of gold grounding data), we can automatically generate grounding boxes for massive image-text data and train a student GLIP model. We showcase two real examples of the generated boxes in Figure 5. Training on such semantic-rich data delivers a semantic-rich student model. In contrast, prior work on scaling detection data simply cannot predict concepts outside the teacher models’ pre-defined vocabulary.<\/p>\n\n\n\n
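The self-training loop below summarizes this teacher-student recipe in pseudocode-style Python; `teacher_glip`, `student_glip`, and their `ground` and `train_step` methods are hypothetical stand-ins rather than the released GLIP API.

```python
# Pseudocode-style sketch of self-training on web image-caption pairs.
# `teacher_glip` and `student_glip` are hypothetical objects, not the released GLIP API.

def generate_pseudo_grounding(teacher_glip, image, caption, score_threshold=0.5):
    # The teacher grounds phrases from the caption to boxes in the image.
    predictions = teacher_glip.ground(image, caption)  # [(phrase, box, score), ...]
    return [(phrase, box) for phrase, box, score in predictions if score >= score_threshold]

def self_training_epoch(student_glip, teacher_glip, image_caption_pairs):
    for image, caption in image_caption_pairs:
        pseudo_boxes = generate_pseudo_grounding(teacher_glip, image, caption)
        if pseudo_boxes:
            # The caption serves as the prompt; the teacher's boxes are (noisy) grounding targets.
            student_glip.train_step(image, prompt=caption, targets=pseudo_boxes)
```

Because the pseudo-labels come from free-form captions rather than a fixed label set, the student's vocabulary is only bounded by what people write about images.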

Flexible transfer ability<\/h3>\n\n\n\n

<strong>Zero-shot GLIP can surpass established supervised models:<\/strong> GLIP can \u201czero-shot\u201d transfer to a new detection task by simply rewriting the candidate categories into a language prompt. See Figure 1 and Figure 3 (left; data amount = 0) for examples.<\/p>\n\n\n\n

When writing the prompt, one could take the default approach of simply concatenating all the object names with \u201c . \u201d; one could also inject domain knowledge by describing rare objects with attributes and language context. In Table 1 below, we design custom prompts for 6 datasets and observe significant performance improvements without any parameter change; an illustrative example of the two prompting styles follows the table.<\/p>\n\n\n\n

\"Table
Table 1. Transfer to novel concepts by writing descriptive prompts.<\/figcaption><\/figure>\n\n\n\n
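To make the two prompting styles concrete, here is an illustrative example; the category names and descriptions below are made up for illustration and are not the exact prompts used for the datasets in Table 1.

```python
# Illustrative only: a hypothetical aquarium-style label space, not the prompts
# actually used in the paper.
categories = ["stingray", "jellyfish", "starfish"]

# Default prompt: concatenate the raw category names.
default_prompt = " . ".join(categories) + " ."

# Prompt with injected domain knowledge: describe rare or ambiguous objects
# with attributes and language context, with no change to model parameters.
descriptive_prompt = (
    "stingray, which is flat and round with a long tail . "
    "jellyfish, which is translucent with soft trailing tentacles . "
    "starfish, which has five arms ."
)
```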

<strong>Few-shot \/ full-data fine-tuning<\/strong>: GLIP serves as a strong pre-trained checkpoint for easy adaptation to various tasks. When fine-tuned on COCO, GLIP (Large) achieves 60.8 AP on COCO 2017val and 61.5 on test-dev, surpassing the current public SoTA models; on the 13 downstream tasks, a 1-shot GLIP rivals a fully supervised Dynamic Head (see Figure 3).<\/p>\n\n\n\n

<table><tbody><tr><td>COCO<\/td><td>PascalVOC<\/td><td>AerialDrone<\/td><td>Aquarium<\/td><td>Rabbits<\/td><td>EgoHands<\/td><td>Mushrooms<\/td><\/tr>
<tr><td>58.8<\/td><td>72.9\/86.7<\/td><td>23.0<\/td><td>51.8<\/td><td>72.0<\/td><td>75.8<\/td><td>88.1<\/td><\/tr>
<tr><td>Packages<\/td><td>Racoon<\/td><td>Shellfish<\/td><td>Vehicles<\/td><td>Pistols<\/td><td>Pothole<\/td><td>Thermal<\/td><\/tr>
<tr><td>75.2<\/td><td>69.5<\/td><td>73.6<\/td><td>72.1<\/td><td>73.7<\/td><td>53.5<\/td><td>81.4<\/td><\/tr><\/tbody><\/table>
Table 2. A single GLIP model performs well across all tasks.<\/figcaption><\/figure>\n\n\n\n

<strong>One model for all detection tasks through prompt tuning<\/strong>: GLIP takes a language prompt as input; thus one can change the model predictions by tuning only the prompt embeddings. This is similar to linear probing, but the key difference is that the language and visual representations in GLIP are deeply fused. In Figure 3 (right), prompt tuning on GLIP almost matches full fine-tuning, while linear probing a conventional object detector cannot. This makes deploying GLIP efficient: one GLIP model can simultaneously perform well on all downstream tasks, reducing the fine-tuning and deployment cost. See Table 2 above.<\/p>\n\n\n\n
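Below is a minimal sketch of this idea in PyTorch, assuming a generic `glip` module and illustrative hyper-parameters (not the actual GLIP training code): all pre-trained weights are frozen, and only a small set of task-specific prompt embeddings is optimized.

```python
# Minimal prompt-tuning sketch: freeze the model, train only prompt embeddings.
import torch
import torch.nn as nn

def setup_prompt_tuning(glip: nn.Module, num_prompt_tokens=16, dim=256, lr=1e-3):
    # Freeze every weight of the pre-trained model.
    for p in glip.parameters():
        p.requires_grad = False
    # The only trainable parameters are the task-specific prompt embeddings,
    # which are fed through the deep-fusion layers like ordinary text features.
    prompt_embeddings = nn.Parameter(torch.randn(num_prompt_tokens, dim) * 0.02)
    optimizer = torch.optim.AdamW([prompt_embeddings], lr=lr)
    return prompt_embeddings, optimizer

# Usage (with a real pre-trained model in place of `glip`):
# prompt_embeddings, optimizer = setup_prompt_tuning(glip)
```

Because only the tiny prompt tensor differs per task, a single frozen GLIP backbone can be shared across all downstream detection tasks at deployment time.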

GLIPv2: Unifying localization and vision-language understanding<\/h3>\n\n\n\n
\"Figure
Figure 6. GLIPv2 can perform a wide range of tasks.<\/figcaption><\/figure>\n\n\n\n

The development of a general-purpose CV foundation model has been hindered by the distinction between localization tasks (traditionally considered single-modality tasks) and vision-language (VL) understanding tasks such as visual question answering and image captioning. The reformulation technique in GLIP opens a new door: we can turn every localization task (e.g., object detection and instance segmentation) into a vision-language task. We introduce the recently upgraded GLIPv2, a unified model that handles various localization and VL understanding tasks with a single architecture; it shows mutual benefits between localization and VL understanding.<\/p>\n\n\n\n

\"Table
Table 3. GLIPv2 achieves near-SoTA performance on various localization and VL understanding tasks.<\/figcaption><\/figure>\n\n\n\n