{"id":973743,"date":"2023-10-06T11:19:49","date_gmt":"2023-10-06T18:19:49","guid":{"rendered":"https:\/\/www.microsoft.com\/en-us\/research\/?post_type=msr-project&p=973743"},"modified":"2024-06-10T09:40:18","modified_gmt":"2024-06-10T16:40:18","slug":"llava-large-language-and-vision-assistant","status":"publish","type":"msr-project","link":"https:\/\/www.microsoft.com\/en-us\/research\/project\/llava-large-language-and-vision-assistant\/","title":{"rendered":"LLaVA: Large Language and Vision Assistant"},"content":{"rendered":"\n
Building Next-Gen Multimodal Foundation Models for General-Purpose Assistants
LLaVA is an open-source project that collaborates with the research community to advance the state of the art in AI. LLaVA is the first end-to-end trained large multimodal model (LMM) to achieve impressive chat capabilities in the spirit of the multimodal GPT-4. The LLaVA family continues to grow, supporting more modalities, capabilities, and applications.
LLaVA is a cost-efficient approach to building a general-purpose multimodal assistant. It is a novel end-to-end trained large multimodal model that combines a vision encoder with the Vicuna language model for general-purpose visual and language understanding, achieving impressive chat capabilities in the spirit of the multimodal GPT-4 and setting a new state-of-the-art accuracy on Science QA.
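As a rough illustration of how a LLaVA-style model is queried in practice, the sketch below pairs an image with a text prompt and lets the language model answer conditioned on both. It is a minimal sketch assuming the community Hugging Face port of LLaVA-1.5; the llava-hf/llava-1.5-7b-hf checkpoint, the USER/ASSISTANT prompt template, and the transformers processor API are assumptions, not part of this project page.

```python
# Minimal usage sketch (assumptions: the "llava-hf/llava-1.5-7b-hf" checkpoint,
# its USER/ASSISTANT prompt template, and the transformers LLaVA port).
from PIL import Image
from transformers import AutoProcessor, LlavaForConditionalGeneration

model_id = "llava-hf/llava-1.5-7b-hf"          # assumed community checkpoint
processor = AutoProcessor.from_pretrained(model_id)
model = LlavaForConditionalGeneration.from_pretrained(model_id)

# The vision encoder maps the image to visual tokens; the language model
# (Vicuna in the original LLaVA) attends to them alongside the text prompt.
image = Image.open("example.jpg")              # any local image
prompt = "USER: <image>\nWhat is shown in this image? ASSISTANT:"

inputs = processor(text=prompt, images=image, return_tensors="pt")
output_ids = model.generate(**inputs, max_new_tokens=128)
print(processor.decode(output_ids[0], skip_special_tokens=True))
```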

\"evolution\"<\/figure>\n<\/div>\n<\/div>\n<\/div>\n<\/div>\n<\/div>\n\n\n\n

Recent development