LLaVA represents a cost-efficient approach to building a general-purpose multimodal assistant. It is a novel end-to-end trained large multimodal model that combines a vision encoder with Vicuna for general-purpose visual and language understanding, achieving impressive chat capabilities in the spirit of the multimodal GPT-4 and setting a new state-of-the-art accuracy on Science QA.
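The core design is deliberately simple: a pre-trained vision encoder produces image features, a trainable projection maps them into the language model's embedding space, and the projected visual tokens are fed to Vicuna alongside the text tokens. The sketch below illustrates this wiring only; the class and parameter names (LlavaLikeModel, vision_dim, llm_dim) are illustrative rather than taken from the official codebase, and the inputs_embeds call assumes a Hugging Face-style language-model interface.

```python
# Minimal conceptual sketch of a LLaVA-style model (not the official implementation):
# frozen vision encoder -> learned projection -> language model over mixed tokens.
import torch
import torch.nn as nn

class LlavaLikeModel(nn.Module):
    def __init__(self, vision_encoder, language_model, vision_dim, llm_dim):
        super().__init__()
        self.vision_encoder = vision_encoder              # e.g. a CLIP image tower
        self.projection = nn.Linear(vision_dim, llm_dim)  # maps visual features into the LLM token space
        self.language_model = language_model              # e.g. Vicuna

    def forward(self, images, text_embeddings):
        # Encode images into patch features: (batch, num_patches, vision_dim)
        visual_features = self.vision_encoder(images)
        # Project visual features into the LLM embedding space
        visual_tokens = self.projection(visual_features)
        # Prepend visual tokens to the text token embeddings and run the LLM
        inputs_embeds = torch.cat([visual_tokens, text_embeddings], dim=1)
        return self.language_model(inputs_embeds=inputs_embeds)
```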
Recent developments
- LLaVA: The first open-source alternative to GPT-4V. [Project] [Paper] [Github] [Demo] [Data] [Model] [Scaling Note]
- LLaVA-Med: The first multimodal assistant in the healthcare domain. [Github] [Paper]
- LLaVA-Interactive: An all-in-one demo showcasing visual interaction and generation capabilities beyond language interaction alone, supported by LLaVA, SEEM, and GLIGEN.
- Multimodal Foundation Models: A 118-page survey on the evolution and trends of multimodal foundation models, along with our position: “Multimodal Foundation Models: From Specialists to General-Purpose Assistants”. It builds on our CVPR 2023 Tutorial. [Note on Large Multimodal Models] [Slides] [YouTube] [Bilibili]
- Instruction Tuning with GPT-4: The “first attempt” to use GPT-4-generated data for LLM self-instruct tuning. [Project] [Paper] [Github] [My Learnings]