Vision-Language Pre-training (VLP) has recently attracted rapidly growing attention from both the computer vision and NLP communities, especially with the emergence of multimodal foundation models like CLIP, DALL-E, CoCa, and Flamingo. To summarize the most recent advances at the frontier of VLP research, we organized a tutorial on this increasingly important topic at CVPR 2022, covering recent advances in (1) VLP for image-text tasks, such as visual question answering, image captioning and retrieval, and visual grounding; (2) VLP for core vision tasks, such as image classification and object detection; (3) VLP for video-text tasks, such as video question answering, captioning, and retrieval; and (4) VLP for text-to-image/video synthesis.
All our tutorial slides and video recordings can be found at our tutorial website.