K-LITE: Learning Transferable Visual Models with External Knowledge
- Sheng Shen,
- Chunyuan Li,
- Xiaowei Hu,
- Yujia Xie,
- Jianwei Yang,
- Pengchuan Zhang,
- Zhe Gan,
- Lijuan Wang,
- Lu Yuan,
- Ce Liu,
- Kurt Keutzer,
- Trevor Darrell,
- Anna Rohrbach,
- Jianfeng Gao
Oral Presentation (1%)
Recent state-of-the-art computer vision systems are trained from natural language supervision, ranging from simple object category names to descriptive captions. This free-form supervision ensures high generality and usability of the learned visual models, but it relies on extensive data-collection heuristics to cover as many visual concepts as possible. Alternatively, learning with external knowledge about images is a promising approach that leverages a much more structured source of supervision. In this paper, we propose K-LITE (Knowledge-augmented Language-Image Training and Evaluation), a simple strategy for leveraging external knowledge to build transferable visual systems. In training, it enriches entities in natural language with WordNet and Wiktionary knowledge, leading to an efficient and scalable approach to learning image representations that understand both visual concepts and their associated knowledge. In evaluation, the natural language is likewise augmented with external knowledge and then used to reference learned visual concepts (or describe new ones), enabling zero-shot and few-shot transfer of the pre-trained models. We study the performance of K-LITE on two important computer vision problems, image classification and object detection, benchmarking on 20 and 13 existing datasets, respectively. The proposed knowledge-augmented models show significant improvement in transfer learning performance over existing methods.
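For intuition, the sketch below shows one possible form of such knowledge augmentation: appending a WordNet gloss to a CLIP-style prompt for a class name. This is a minimal illustration, not the paper's released implementation; the helper name and prompt template are assumptions, and the paper additionally draws on Wiktionary, which is omitted here.

```python
# Illustrative sketch (not the authors' code): enrich a class name with
# external knowledge from WordNet, in the spirit of K-LITE's entity
# augmentation. Requires: pip install nltk; then nltk.download("wordnet").
from nltk.corpus import wordnet as wn


def knowledge_augmented_prompt(class_name: str,
                               template: str = "a photo of a {}.") -> str:
    """Build a CLIP-style prompt and append the first WordNet gloss, if any."""
    prompt = template.format(class_name)
    synsets = wn.synsets(class_name.replace(" ", "_"))
    if synsets:
        # Use the gloss (definition) of the most common sense as knowledge.
        prompt = f"{prompt} {class_name}, which is {synsets[0].definition()}."
    return prompt


print(knowledge_augmented_prompt("beagle"))
# e.g. "a photo of a beagle. beagle, which is a small short-legged
#       smooth-coated breed of hound."
```

The augmented prompt would then be encoded by the text tower in place of the plain category name, both during contrastive pre-training and at zero-shot evaluation time.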