{"id":1009260,"date":"2024-02-27T09:00:00","date_gmt":"2024-02-27T17:00:00","guid":{"rendered":"https:\/\/www.microsoft.com\/en-us\/research\/?p=1009260"},"modified":"2024-02-27T06:26:12","modified_gmt":"2024-02-27T14:26:12","slug":"structured-knowledge-from-llms-improves-prompt-learning-for-visual-language-models","status":"publish","type":"post","link":"https:\/\/www.microsoft.com\/en-us\/research\/blog\/structured-knowledge-from-llms-improves-prompt-learning-for-visual-language-models\/","title":{"rendered":"Structured knowledge from LLMs improves prompt learning for visual language models"},"content":{"rendered":"\n

This research paper was presented at the <\/em><\/strong>38th Annual AAAI Conference on Artificial Intelligence<\/em><\/strong> (opens in new tab)<\/span><\/a> (AAAI-24), the premier forum for advancing understanding of intelligence and its implementation in machines.<\/em><\/strong><\/p>\n\n\n\n

\"First<\/figure>\n\n\n\n

We\u2019re seeing remarkable abilities from visual language models in transforming text descriptions into images. However, creating high-quality visuals requires precise prompts that capture the relationships among the different image elements, a capability that standard prompt-learning methods lack. In our paper, \u201cLearning Hierarchical Prompt with Structured Linguistic Knowledge for Vision-Language Models<\/a>,\u201d presented at AAAI-24, we introduce a novel approach that uses large language models (LLMs) to enhance the images created by visual language models. By generating detailed graphs of image descriptions, we leverage LLMs\u2019 linguistic knowledge to produce richer images, expanding their utility in practical applications.<\/p>\n\n\n\n

\"An
Figure 1. A structured graph provides descriptions for each class name.<\/figcaption><\/figure>\n\n\n\n

Figure 1 illustrates our method for constructing a structured graph of key details for each category, or class. These graphs capture entities (objects, people, and concepts), attributes (characteristics), and the relationships between them. For example, when defining \u201cwater lily,\u201d we include entities like \u201cleaves\u201d and \u201cblooms\u201d along with their attributes, such as \u201cround\u201d and \u201cwhite,\u201d and then apply LLMs\u2019 reasoning capabilities to identify how these terms relate to one another. This is shown in Figure 2.<\/p>\n\n\n\n

\"The
Figure 2. With instructions fed into the LLM, we can receive category-related descriptions along with corresponding structured graphs.<\/figcaption><\/figure>\n\n\n\n\t
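To make this concrete, the sketch below shows one way such a structured description graph and an LLM instruction could be represented in code. The names (Entity, Relation, DescriptionGraph, build_graph_instruction) and the exact instruction wording are illustrative assumptions, not the paper\u2019s implementation.<\/p>\n\n\n\n
<pre><code>
# A minimal, illustrative sketch of a structured description graph for one
# class, mirroring the 'water lily' example above. The dataclass names and
# the instruction wording are assumptions, not the paper's exact format.
from dataclasses import dataclass, field
from typing import List

@dataclass
class Entity:
    name: str                                             # e.g., 'leaves'
    attributes: List[str] = field(default_factory=list)   # e.g., ['round']

@dataclass
class Relation:
    head: str       # source entity, e.g., 'water lily'
    relation: str   # relation phrase, e.g., 'has'
    tail: str       # target entity, e.g., 'leaves'

@dataclass
class DescriptionGraph:
    class_name: str
    entities: List[Entity]
    relations: List[Relation]

# Example graph for the class 'water lily'.
water_lily = DescriptionGraph(
    class_name='water lily',
    entities=[Entity('leaves', ['round']), Entity('blooms', ['white'])],
    relations=[
        Relation('water lily', 'has', 'leaves'),
        Relation('water lily', 'produces', 'blooms'),
    ],
)

def build_graph_instruction(class_name):
    # Illustrative LLM instruction asking for descriptions plus a graph.
    return (
        'Describe the category ' + repr(class_name) + '. List its key '
        'entities, the attributes of each entity, and the relationships '
        'between entities as (head, relation, tail) triples.'
    )
<\/code><\/pre>\n\n\n\n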

How to model structural knowledge<\/h2>\n\n\n\n

After identifying and structuring the relationships within the generated descriptions, we apply Hierarchical Prompt Tuning (HPT), a new prompt-tuning framework that organizes content hierarchically. This approach allows the visual language model to discern the different levels of information in a prompt, ranging from specific details to broader categories and overarching themes across multiple knowledge domains, as shown in Figure 3. Recognizing the connections among these elements improves the model\u2019s ability to process complex queries across various topics.<\/p>\n\n\n\n

\"The
Figure 3. HPT is based on a dual-path asymmetric network, which receives images and various types of text inputs.<\/figcaption><\/figure>\n\n\n\n
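The hierarchy can be pictured as prompt embeddings at different levels: low-level pieces drawn from entities and attributes, high-level prompts built from whole descriptions, and learnable prompts shared across classes. The sketch below reuses the illustrative DescriptionGraph from the earlier example; the helper functions, dimensions, and exact prompt composition are assumptions and may differ from the paper.<\/p>\n\n\n\n
<pre><code>
# A rough sketch of assembling multi-level prompts for one class. The level
# split (low, high, global) follows the hierarchy described above; the
# helpers and dimensions are illustrative assumptions.
import torch

EMBED_DIM = 512

def encode_phrase(text):
    # Placeholder for tokenizing and embedding a phrase; a real system would
    # use the vision-language model's pretrained token embeddings.
    return torch.randn(EMBED_DIM)

# Learnable context prompts shared across all classes (global level).
global_prompts = torch.nn.Parameter(torch.randn(4, EMBED_DIM))

def build_hierarchical_prompts(graph):
    # Low level: individual entities and attributes from the structured graph.
    low = [encode_phrase(e.name) for e in graph.entities]
    low += [encode_phrase(a) for e in graph.entities for a in e.attributes]

    # High level: a full natural-language description of the class.
    description = graph.class_name + ' with ' + ', '.join(
        a + ' ' + e.name for e in graph.entities for a in e.attributes
    )
    high = [encode_phrase(description)]

    return {
        'low': torch.stack(low),        # (num_low, EMBED_DIM)
        'high': torch.stack(high),      # (1, EMBED_DIM)
        'global': global_prompts,       # (4, EMBED_DIM)
    }
<\/code><\/pre>\n\n\n\n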

Central to this method is a novel relationship-guided attention module, designed to help the model identify and analyze the complex interconnections among elements within a graph. This module captures the interactions between entities and attributes through a cross-level self-attention mechanism. Self-attention enables the model to assess and prioritize the various parts of the input data (here, the graph) according to their relevance. \u201cCross-level\u201d self-attention extends this capability across the semantic layers within the graph, allowing the model to examine relationships at multiple levels of abstraction. As a result, the model can discern how prompts (or input commands\/questions) at these different levels relate to one another, giving it a deeper understanding of each category or concept.<\/p>\n\n\n\n
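As a rough illustration, relationship-guided attention can be thought of as standard self-attention over the prompt tokens plus an additive bias that strengthens attention between tokens connected in the graph. The module name and bias construction below are assumptions for illustration, not the paper\u2019s exact formulation.<\/p>\n\n\n\n
<pre><code>
# Simplified sketch: self-attention over prompt tokens with an additive bias
# for graph-connected positions. This illustrates the idea, not the paper's
# exact relationship-guided attention module.
import torch
import torch.nn as nn

class RelationshipGuidedAttention(nn.Module):
    def __init__(self, dim, num_heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, tokens, relation_pairs):
        # tokens: (batch, seq_len, dim) prompt embeddings from all levels.
        # relation_pairs: (i, j) index pairs that are linked in the graph.
        seq_len = tokens.size(1)
        bias = torch.zeros(seq_len, seq_len, device=tokens.device)
        for i, j in relation_pairs:
            bias[i, j] = bias[j, i] = 1.0   # boost attention between related tokens
        out, _ = self.attn(tokens, tokens, tokens, attn_mask=bias)
        return out
<\/code><\/pre>\n\n\n\n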

Our findings offer valuable insights into a more effective approach to navigating and understanding complex linguistic data, improving the model\u2019s knowledge discovery and decision-making processes. Building on these advances, we refined the traditional approach to text encoding by introducing a hierarchical prompted text encoder, shown in Figure 4. Our aim is to improve how textual information is aligned with visual data, a necessity for vision-language models that must interpret both text and visual inputs.<\/p>\n\n\n\n

\"Frameowork
Figure 4. A hierarchical-prompted text encoder learns from multi-level prompts, with a relationship-guided attention module for modeling structural knowledge.<\/figcaption><\/figure>\n\n\n\n
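Putting the pieces together, the sketch below shows how the multi-level prompts could pass through the relationship-guided attention module and a text encoder, with the resulting text feature compared against an image feature, CLIP-style. The encoder interface, pooling, and similarity scoring are illustrative assumptions in the spirit of Figure 4, not the paper\u2019s exact pipeline.<\/p>\n\n\n\n
<pre><code>
# High-level sketch of scoring one class with a hierarchical prompted text
# encoder. The encoder interface, pooling, and similarity scoring are
# illustrative assumptions.
import torch
import torch.nn.functional as F

def class_score(image_feature, prompts, rel_attention, text_encoder, relation_pairs):
    # Concatenate global, high-level, and low-level prompts into one sequence.
    tokens = torch.cat([prompts['global'], prompts['high'], prompts['low']], dim=0)
    tokens = tokens.unsqueeze(0)                 # add a batch dimension

    # Model structural knowledge among the prompt tokens.
    tokens = rel_attention(tokens, relation_pairs)

    # Encode the prompted sequence and pool it into a single text feature.
    text_feature = text_encoder(tokens).mean(dim=1).squeeze(0)

    # Score the class by cosine similarity with the image feature.
    return F.cosine_similarity(image_feature, text_feature, dim=0)
<\/code><\/pre>\n\n\n\n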

Looking ahead<\/h2>\n\n\n\n

By incorporating structured knowledge into our model training frameworks, our research lays the groundwork for more sophisticated applications. One example is enhanced image captioning, where visual language models gain the ability to describe the contents of photographs, illustrations, or any visual media with greater accuracy and depth. This improvement could significantly benefit various applications, such as assisting visually impaired users. Additionally, we envision advances in text-to-image generation, enabling visual language models to produce visual representations that are more precise, detailed, and contextually relevant based on textual descriptions.<\/p>\n\n\n\n

Looking forward, we hope our research ignites a broader interest in exploring the role of structured knowledge in improving prompt tuning for both visual and language comprehension. This exploration is expected to extend the use of these models beyond basic classification tasks\u2014where models categorize or label data\u2014towards enabling more nuanced and accurate interactions between people and AI systems. By doing so, we pave the way for AI systems to more effectively interpret the complexities of human language.<\/p>\n\n\n\n

Acknowledgements<\/h2>\n\n\n\n

Thank you to Yubin Wang for implementing the algorithm and running the experiments.<\/p>\n","protected":false},"excerpt":{"rendered":"

Using LLMs to create structured graphs of image descriptors can enhance the images generated by visual language models. Learn how structured knowledge can improve prompt tuning for both visual and language comprehension.<\/p>\n","protected":false},"author":37583,"featured_media":1009434,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"msr-url-field":"","msr-podcast-episode":"","msrModifiedDate":"","msrModifiedDateEnabled":false,"ep_exclude_from_search":false,"_classifai_error":"","footnotes":""},"categories":[1],"tags":[],"research-area":[13556],"msr-region":[],"msr-event-type":[],"msr-locale":[268875],"msr-post-option":[243984],"msr-impact-theme":[],"msr-promo-type":[],"msr-podcast-series":[],"class_list":["post-1009260","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-research-blog","msr-research-area-artificial-intelligence","msr-locale-en_us","msr-post-option-blog-homepage-featured"],"msr_event_details":{"start":"","end":"","location":""},"podcast_url":"","podcast_episode":"","msr_research_lab":[199560],"msr_impact_theme":[],"related-publications":[],"related-downloads":[],"related-videos":[],"related-academic-programs":[],"related-groups":[],"related-projects":[],"related-events":[],"related-researchers":[{"type":"user_nicename","value":"Xinyang Jiang","user_id":41802,"display_name":"Xinyang Jiang","author_link":"Xinyang Jiang<\/a>","is_active":false,"last_first":"Jiang, Xinyang","people_section":0,"alias":"xinyangjiang"},{"type":"guest","value":"yubin-wang","user_id":"1009401","display_name":"Yubin Wang","author_link":"Yubin Wang","is_active":true,"last_first":"Wang, Yubin","people_section":0,"alias":"yubin-wang"},{"type":"user_nicename","value":"Dongsheng Li","user_id":39402,"display_name":"Dongsheng Li","author_link":"Dongsheng Li<\/a>","is_active":false,"last_first":"Li, Dongsheng","people_section":0,"alias":"dongsli"},{"type":"guest","value":"cairong-zhao","user_id":"1009407","display_name":"Cairong Zhao","author_link":"Cairong Zhao<\/a>","is_active":true,"last_first":"Zhao, Cairong","people_section":0,"alias":"cairong-zhao"}],"msr_type":"Post","featured_image_thumbnail":"\"First","byline":"Xinyang Jiang<\/a>, Yubin Wang, Dongsheng Li<\/a>, and Cairong Zhao<\/a>","formattedDate":"February 27, 2024","formattedExcerpt":"Using LLMs to create structured graphs of image descriptors can enhance the images generated by visual language models. 
Learn how structured knowledge can improve prompt tuning for both visual and language comprehension.","locale":{"slug":"en_us","name":"English","native":"","english":"English"},"_links":{"self":[{"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/posts\/1009260"}],"collection":[{"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/users\/37583"}],"replies":[{"embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/comments?post=1009260"}],"version-history":[{"count":23,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/posts\/1009260\/revisions"}],"predecessor-version":[{"id":1010025,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/posts\/1009260\/revisions\/1010025"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/media\/1009434"}],"wp:attachment":[{"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/media?parent=1009260"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/categories?post=1009260"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/tags?post=1009260"},{"taxonomy":"msr-research-area","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/research-area?post=1009260"},{"taxonomy":"msr-region","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-region?post=1009260"},{"taxonomy":"msr-event-type","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-event-type?post=1009260"},{"taxonomy":"msr-locale","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-locale?post=1009260"},{"taxonomy":"msr-post-option","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-post-option?post=1009260"},{"taxonomy":"msr-impact-theme","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-impact-theme?post=1009260"},{"taxonomy":"msr-promo-type","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-promo-type?post=1009260"},{"taxonomy":"msr-podcast-series","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-podcast-series?post=1009260"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}