About
Linjie Li is a Researcher in the Computer Vision Science group at Microsoft Cloud & AI.
Before joining Microsoft, Linjie obtained her Master’s degree in Computer Science from Purdue University in 2018. Her current research interests include Vision-and-Language Pre-training, Self-supervised Learning, and Adversarial Training.
Selected Publications
UNITER: UNiversal Image-TExt Representation Learning
Joint image-text embedding is the bedrock for most Vision-and-Language (V+L) tasks, where multimodality inputs are simultaneously processed for joint visual and textual understanding. In this paper, we introduce UNITER, a UNiversal Image-TExt Representation, learned through large-scale pre-training over four image-text…
Large-Scale Adversarial Training for Vision-and-Language Representation Learning
We present VILLA, the first known effort on large-scale adversarial training for vision-and-language (V+L) representation learning. VILLA consists of two training stages: (i) task-agnostic adversarial pre-training; followed by (ii) task-specific adversarial finetuning. Instead of adding adversarial perturbations on image pixels…
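The key idea excerpted above is that perturbations are added in the embedding space rather than to raw image pixels. A minimal sketch of one such step, assuming a gradient with respect to the embedding is available and using an illustrative L2-ball projection (the function name and epsilon value are hypothetical, not from the paper):

```python
def perturb_embedding(embedding, grad, epsilon=0.1):
    """Toy embedding-space adversarial step: move the embedding in the
    gradient direction, with the step scaled to L2 norm epsilon.
    This is an illustration of the general idea, not VILLA's exact recipe."""
    norm = sum(g * g for g in grad) ** 0.5 or 1.0
    return [e + epsilon * g / norm for e, g in zip(embedding, grad)]

# Example: a 2-d embedding with gradient (3, 4); the perturbation has norm 0.1.
perturbed = perturb_embedding([0.0, 0.0], [3.0, 4.0], epsilon=0.1)
```

In practice such perturbed embeddings are fed back through the model and an adversarial loss term is added to the training objective; the sketch only shows the perturbation itself.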
HERO: Hierarchical Encoder for Video+Language Omni-representation Pre-training
We present HERO, a Hierarchical EncodeR for Omni-representation learning, for large-scale video+language pre-training. HERO encodes multimodal inputs in a hierarchical fashion, where local textual context of a video frame is captured by a Cross-modal Transformer via multimodal fusion, and global…
Less is More: ClipBERT for Video-and-Language Learning via Sparse Sampling
The canonical approach to video-and-language learning (e.g., video question answering) dictates a neural model to learn from offline-extracted dense video features from vision models and text features from language models. These feature extractors are trained independently and usually on tasks…
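In contrast to the dense offline feature extraction described above, the sparse-sampling idea is to draw only a few short clips from each video at training time. A minimal sketch of uniform-segment clip sampling, with illustrative function and parameter names (not taken from the paper's code):

```python
import random

def sample_sparse_clips(num_frames, num_clips=4, frames_per_clip=2, seed=0):
    """Toy sparse sampling: split the video into num_clips equal segments
    and randomly pick frames_per_clip frame indices from each segment."""
    rng = random.Random(seed)
    clip_len = num_frames // num_clips
    clips = []
    for c in range(num_clips):
        start = c * clip_len
        # sorted random offsets within this segment keep temporal order
        offsets = sorted(rng.sample(range(clip_len), frames_per_clip))
        clips.append([start + o for o in offsets])
    return clips

# Example: from a 64-frame video, sample 4 clips of 2 frames each.
clips = sample_sparse_clips(64, num_clips=4, frames_per_clip=2)
```

The model then sees only these few frames per step, so the visual backbone can be trained end-to-end with the language side instead of relying on fixed, independently trained feature extractors.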