How to model structural knowledge

After identifying and structuring the relationships within the generated prompt descriptions, we implement Hierarchical Prompt Tuning (HPT), a new prompt-tuning framework that organizes content hierarchically. This approach allows the visual language model to discern the different levels of information in a prompt, ranging from specific details to broader categories and overarching themes across multiple knowledge domains, as shown in Figure 3. This facilitates the model's understanding of the connections among these elements, improving its ability to process complex queries across various topics.

Figure 3. HPT is based on a dual-path asymmetric network, which receives images and various types of text inputs.

Central to this method is a relationship-guided attention module, designed to help the model identify and analyze the complex interconnections among elements within a graph. The module also captures the interactions between different entities and attributes through a cross-level self-attention mechanism. Self-attention enables the model to assess and prioritize various parts of the input data (here, the graph) according to their relevance. "Cross-level" self-attention extends this capability across the semantic layers of the graph, allowing the model to examine relationships at multiple levels of abstraction. This helps the model discern how prompts (or input commands and questions) relate across these levels, giving it a deeper understanding of the underlying categories or concepts.
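To make this concrete, the sketch below shows one way cross-level self-attention over hierarchical prompts could be wired up in PyTorch, with the relationship graph injected as an additive attention bias so that linked entities and attributes attend to each other more strongly. The module name, tensor shapes, and the bias mechanism are illustrative assumptions rather than the exact HPT implementation.

```python
# A minimal sketch of relationship-guided, cross-level self-attention over
# hierarchical prompts. Shapes and the additive-bias design are assumptions
# for illustration, not the authors' exact implementation.
import torch
import torch.nn as nn


class RelationshipGuidedAttention(nn.Module):
    def __init__(self, dim: int = 512, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # Learnable scale for how strongly graph edges bias attention (assumed design).
        self.edge_weight = nn.Parameter(torch.tensor(1.0))

    def forward(self, low, high, glob, adjacency):
        """
        low:       (B, N_low, D)  entity/attribute prompt tokens (low level)
        high:      (B, N_high, D) description-level prompt tokens (high level)
        glob:      (B, N_glob, D) learnable global prompt tokens
        adjacency: (B, N_low, N_low) 1 where a relationship links two low-level
                   tokens in the knowledge graph, 0 otherwise
        """
        tokens = torch.cat([low, high, glob], dim=1)  # cross-level sequence
        B, N, _ = tokens.shape
        n_low = low.shape[1]

        # Additive attention bias: related entity/attribute pairs attend more
        # strongly to each other; prompts at other levels are left unbiased.
        bias = torch.zeros(B, N, N, device=tokens.device)
        bias[:, :n_low, :n_low] = self.edge_weight * adjacency

        # MultiheadAttention expects one float mask per head: (B * num_heads, N, N).
        bias = bias.repeat_interleave(self.attn.num_heads, dim=0)

        out, _ = self.attn(tokens, tokens, tokens, attn_mask=bias)
        return out


# Toy usage: 4 entity/attribute tokens, 2 description tokens, 1 global token.
if __name__ == "__main__":
    B, D = 2, 512
    low, high, glob = torch.randn(B, 4, D), torch.randn(B, 2, D), torch.randn(B, 1, D)
    adjacency = torch.randint(0, 2, (B, 4, 4)).float()
    module = RelationshipGuidedAttention(dim=D)
    print(module(low, high, glob, adjacency).shape)  # torch.Size([2, 7, 512])
```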
Our findings offer valuable insights into a more effective approach to navigating and understanding complex linguistic data, improving the model's knowledge discovery and decision-making processes. Building on these advances, we refined the traditional approach to text encoding by introducing a hierarchical prompted text encoder, shown in Figure 4. Our aim is to improve how textual information is aligned with visual data, a necessity for vision-language models that must interpret both text and visual inputs.

Figure 4. A hierarchical prompted text encoder learns from multi-level prompts, with a relationship-guided attention module for modeling structural knowledge.
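As a rough illustration of the encoder in Figure 4, the sketch below prepends learnable global prompts, together with the low- and high-level prompts derived from the structured descriptions, to the class-name tokens before a small transformer encoder. In the full design, the relationship-guided attention above would act on the low-level prompt tokens; a standard encoder is used here for brevity. The vocabulary size, depth, and mean-pooling are assumptions made for illustration rather than details of the published model.

```python
# A minimal sketch (not the paper's implementation) of a hierarchically
# prompted text encoder that consumes multi-level prompts.
import torch
import torch.nn as nn


class HierarchicalPromptedTextEncoder(nn.Module):
    def __init__(self, vocab_size: int = 49408, dim: int = 512,
                 depth: int = 4, num_heads: int = 8, n_global: int = 4):
        super().__init__()
        self.token_embed = nn.Embedding(vocab_size, dim)
        # Learnable global-level prompts shared across all classes (assumed init).
        self.global_prompts = nn.Parameter(torch.randn(n_global, dim) * 0.02)
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=num_heads,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)
        self.proj = nn.Linear(dim, dim)  # projection into the shared image-text space

    def forward(self, class_tokens, low_prompts, high_prompts):
        """
        class_tokens: (B, L) token ids for the class name / template
        low_prompts:  (B, N_low, D)  entity/attribute prompt embeddings
        high_prompts: (B, N_high, D) description-level prompt embeddings
        """
        B = class_tokens.shape[0]
        x = self.token_embed(class_tokens)                      # (B, L, D)
        glob = self.global_prompts.unsqueeze(0).expand(B, -1, -1)
        # Cross-level sequence: global, high-level, low-level prompts, then text.
        seq = torch.cat([glob, high_prompts, low_prompts, x], dim=1)
        hidden = self.encoder(seq)
        # Mean-pool into a single text feature to match against image features.
        return self.proj(hidden.mean(dim=1))


# Toy usage with random inputs.
if __name__ == "__main__":
    enc = HierarchicalPromptedTextEncoder()
    ids = torch.randint(0, 49408, (2, 8))
    low, high = torch.randn(2, 4, 512), torch.randn(2, 2, 512)
    print(enc(ids, low, high).shape)  # torch.Size([2, 512])
```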
Looking ahead

By incorporating structured knowledge into our model-training frameworks, our research lays the groundwork for more sophisticated applications. One example is enhanced image captioning, where visual language models gain the ability to describe the contents of photographs, illustrations, or other visual media with greater accuracy and depth. This improvement could significantly benefit applications such as assisting visually impaired users. We also envision advances in text-to-image generation, enabling visual language models to produce visual representations that are more precise, detailed, and contextually relevant to the textual descriptions they are given.

Looking forward, we hope our research ignites broader interest in exploring the role of structured knowledge in improving prompt tuning for both visual and language comprehension. We expect this exploration to extend the use of these models beyond basic classification tasks, where models categorize or label data, toward more nuanced and accurate interactions between people and AI systems. In doing so, we pave the way for AI systems that can more effectively interpret the complexities of human language.
Acknowledgements

Thank you to Yubin Wang for his contributions in implementing the algorithm and executing the experiments.