About
Dr. Lei CUI is a Principal Research Manager in the Natural Language Computing (NLC) group at Microsoft Research Asia, Beijing, China. Lei joined MSR Asia in January 2015 after receiving his Ph.D. degree from the Department of Computer Science at Harbin Institute of Technology. Lei’s research interests include Foundation Models and Multimodal AI.
Lei has been working at MSR Asia for 17 years, starting from his internship in 2007. His work has contributed to several important products and research projects. He has published 50+ research papers at top conferences in the AI and NLP areas, and he also holds 20+ U.S. patents in these areas.
Intern students I mentored:
- Shan YANG (Master student from Tsinghua University)
- Tao GE (Ph.D. student from Peking University)
- Yukun ZHAO (Master student from Shandong University)
- Chaoqun DUAN (Ph.D. student from Harbin Institute of Technology)
- Qingfu ZHU (Ph.D. student from Harbin Institute of Technology)
- Xiang DENG (Undergraduate student from University of Science and Technology of China)
- Hangbo BAO (Ph.D. student from Harbin Institute of Technology)
- Shuming MA (Master student from Peking University)
- Minghao LI (Joint Ph.D. student from Beihang University)
- Lin CHEN (Undergraduate student from Beihang University)
- Wei LIU (Master student from Beihang University)
- Tengchao LV (Master student from Peking University)
- Shengjie LUO (Undergraduate student from Beihang University)
- Yiheng XU (Undergraduate student from Harbin Institute of Technology)
- Yang XU (Ph.D. student from Harbin Institute of Technology)
- Zilong WANG (Ph.D. student from University of California, San Diego)
- Han WANG (Master student from New York University)
- Yupan HUANG (Joint Ph.D. student from Sun Yat-sen University)
- Junlong LI (Undergraduate student from Shanghai Jiao Tong University)
- Jingye CHEN (Ph.D. student from The Hong Kong University of Science and Technology)
- Yilin JIA (Undergraduate student from Shanghai Jiao Tong University and University of Michigan)
- Yuzhong ZHAO (Ph.D. student from University of Chinese Academy of Sciences)
Open Source Projects
Diffusion models have gained increasing attention for their impressive generation abilities but currently struggle with rendering accurate and coherent text. To address this issue, we introduce TextDiffuser, focusing on generating images with visually appealing text that is coherent with backgrounds. TextDiffuser consists of two stages: first, a Transformer model generates the layout of keywords extracted from text prompts, and then diffusion models generate images conditioned on the text prompt and the generated layout. Additionally, we contribute the first large-scale text image dataset with OCR annotations, MARIO-10M, containing 10 million image-text pairs with text recognition, detection, and character-level segmentation annotations. We further collect the MARIO-Eval benchmark to serve as a comprehensive tool for evaluating text rendering quality. Through experiments and user studies, we show that TextDiffuser is flexible and controllable in creating high-quality text images using text prompts alone or together with text template images, and can conduct text inpainting to reconstruct incomplete images with text. The code, model, and dataset will be available at https://aka.ms/textdiffuser.
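Below is a small, runnable toy sketch of the two-stage control flow described above: extract the keywords to be drawn, lay them out on the canvas, and hand the layout to the diffusion stage. Both stages here are trivial stand-ins for illustration only (no real layout Transformer or diffusion model, and the helper names are hypothetical); the actual code and checkpoints are released at https://aka.ms/textdiffuser.

```python
import re

def extract_keywords(prompt):
    # TextDiffuser renders the words the user wants drawn; as a toy heuristic
    # we treat quoted spans of the prompt as those keywords.
    return re.findall(r"'([^']+)'", prompt)

def generate_keyword_layout(keywords, width=512, height=512):
    # Stage 1 stand-in: stack one bounding box per keyword down the canvas.
    # The real first stage uses a Transformer to predict character-level layouts.
    line_h = height // (len(keywords) + 1)
    return [
        {"text": w, "box": (32, (i + 1) * line_h, 32 + 24 * len(w), (i + 1) * line_h + 48)}
        for i, w in enumerate(keywords)
    ]

prompt = "a poster for a concert titled 'Rock Night'"
layout = generate_keyword_layout(extract_keywords(prompt))
print(layout)
# Stage 2 (a diffusion model conditioned on the prompt plus this layout) would
# consume `layout` to render the final image.
```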
Multimodal pre-training with text, layout, and image has made significant progress for Visually-rich Document Understanding (VrDU), especially for fixed-layout documents such as scanned document images. However, there are still a large number of digital documents where the layout information is not fixed and needs to be rendered interactively and dynamically for visualization, making existing layout-based pre-training approaches difficult to apply. In this paper, we propose MarkupLM for document understanding tasks where markup languages such as HTML/XML serve as the backbone, and text and markup information are jointly pre-trained. Experimental results show that the pre-trained MarkupLM significantly outperforms existing strong baseline models on several document understanding tasks.
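As a minimal usage sketch, assuming the released microsoft/markuplm-base checkpoint and a recent Hugging Face Transformers version with MarkupLM support (the HTML snippet is made up), the processor extracts text nodes and their XPaths from the markup and the model encodes them jointly:

```python
from transformers import MarkupLMProcessor, MarkupLMModel

processor = MarkupLMProcessor.from_pretrained("microsoft/markuplm-base")
model = MarkupLMModel.from_pretrained("microsoft/markuplm-base")

html_string = "<html><body><h1>Invoice</h1><p>Total: $42.00</p></body></html>"

# The processor parses the HTML, pairs each text node with its XPath, and
# tokenizes the text together with the xpath tag/subscript sequences.
encoding = processor(html_string, return_tensors="pt")
outputs = model(**encoding)

print(outputs.last_hidden_state.shape)  # (batch_size, sequence_length, hidden_size)
```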
DiT is a self-supervised pre-trained Document Image Transformer model that uses large-scale unlabeled text images for Document AI tasks, which is essential since no supervised counterparts exist due to the lack of human-labeled document images. We leverage DiT as the backbone network in a variety of vision-based Document AI tasks, including document image classification, document layout analysis, and table detection.
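A minimal inference sketch, assuming the publicly released DiT checkpoint fine-tuned on RVL-CDIP for document image classification and a recent Transformers version (the image path is a placeholder):

```python
import torch
from PIL import Image
from transformers import AutoImageProcessor, AutoModelForImageClassification

checkpoint = "microsoft/dit-base-finetuned-rvlcdip"
processor = AutoImageProcessor.from_pretrained(checkpoint)
model = AutoModelForImageClassification.from_pretrained(checkpoint)

image = Image.open("scanned_page.png").convert("RGB")  # placeholder path
inputs = processor(images=image, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits

predicted = logits.argmax(-1).item()
print(model.config.id2label[predicted])  # one of the 16 RVL-CDIP classes, e.g. "invoice"
```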
Pre-training of text and layout has proved effective in a variety of visually-rich document understanding tasks due to its effective model architecture and the advantage of large-scale unlabeled scanned/digital-born documents. In this paper, we present LayoutLMv2 by pre-training text, layout and image in a multi-modal framework, where new model architectures and pre-training tasks are leveraged. Specifically, LayoutLMv2 not only uses the existing masked visual-language modeling task but also the new text-image alignment and text-image matching tasks in the pre-training stage, where cross-modality interaction is better learned. Meanwhile, it also integrates a spatial-aware self-attention mechanism into the Transformer architecture, so that the model can fully understand the relative positional relationship among different text blocks. Experiment results show that LayoutLMv2 outperforms strong baselines and achieves new state-of-the-art results on a wide variety of downstream visually-rich document understanding tasks, including FUNSD (0.7895 -> 0.8420), CORD (0.9493 -> 0.9601), SROIE (0.9524 -> 0.9781), Kleister-NDA (0.834 -> 0.852), RVL-CDIP (0.9443 -> 0.9564), and DocVQA (0.7295 -> 0.8672).
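A minimal inference sketch with Hugging Face Transformers, assuming the microsoft/layoutlmv2-base-uncased checkpoint; the processor runs OCR via pytesseract and the visual backbone requires detectron2, and the label count (7, for FUNSD-style BIO tags) and the image path below are assumptions for illustration:

```python
import torch
from PIL import Image
from transformers import LayoutLMv2Processor, LayoutLMv2ForTokenClassification

processor = LayoutLMv2Processor.from_pretrained("microsoft/layoutlmv2-base-uncased")
model = LayoutLMv2ForTokenClassification.from_pretrained(
    "microsoft/layoutlmv2-base-uncased", num_labels=7  # e.g. FUNSD BIO labels (assumption)
)  # the classification head is freshly initialized and would still need fine-tuning

image = Image.open("form.png").convert("RGB")  # placeholder path
# The processor OCRs the image and packs words, normalized boxes, and pixels together.
encoding = processor(image, return_tensors="pt")

with torch.no_grad():
    logits = model(**encoding).logits  # one label score vector per token

predictions = logits.argmax(-1)
```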
Pre-training techniques have been verified successfully in a variety of NLP tasks in recent years. Despite the widespread use of pre-training models for NLP applications, they focus almost exclusively on text-level manipulation, while neglecting the layout and style information that is vital for document image understanding. In this paper, we propose LayoutLM to jointly model the interaction between text and layout information across scanned document images, which is beneficial for a great number of real-world document image understanding tasks such as information extraction from scanned documents. Furthermore, we also leverage image features to incorporate the visual information of words into LayoutLM. To the best of our knowledge, this is the first time that text and layout are jointly learned in a single framework for document-level pre-training. It achieves new state-of-the-art results on several downstream tasks, including form understanding (from 70.72 to 79.27), receipt understanding (from 94.02 to 95.24) and document image classification (from 93.07 to 94.42).
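A minimal sketch of feeding text plus layout to the original LayoutLM with Transformers, assuming the microsoft/layoutlm-base-uncased checkpoint; the OCR words, bounding boxes (normalized to a 0-1000 grid), and label count are made up for illustration:

```python
import torch
from transformers import LayoutLMTokenizer, LayoutLMForTokenClassification

tokenizer = LayoutLMTokenizer.from_pretrained("microsoft/layoutlm-base-uncased")
model = LayoutLMForTokenClassification.from_pretrained(
    "microsoft/layoutlm-base-uncased", num_labels=7  # e.g. FUNSD BIO labels (assumption)
)

words = ["Invoice", "No.", "12345"]  # OCR output (made up)
boxes = [[48, 84, 156, 108], [160, 84, 196, 108], [200, 84, 264, 108]]  # 0-1000 scale

# Repeat each word-level box for its word pieces, and add boxes for [CLS]/[SEP].
token_boxes = []
for word, box in zip(words, boxes):
    token_boxes.extend([box] * len(tokenizer.tokenize(word)))
token_boxes = [[0, 0, 0, 0]] + token_boxes + [[1000, 1000, 1000, 1000]]

encoding = tokenizer(" ".join(words), return_tensors="pt")
bbox = torch.tensor([token_boxes])

with torch.no_grad():
    logits = model(input_ids=encoding["input_ids"],
                   attention_mask=encoding["attention_mask"],
                   bbox=bbox).logits

predictions = logits.argmax(-1)  # one predicted label per word piece
```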
We present TableBank, a new image-based table detection and recognition dataset built with novel weak supervision from Word and LaTeX documents on the internet. Existing research on image-based table detection and recognition usually fine-tunes pre-trained models on out-of-domain data with a few thousand human-labeled examples, which is difficult to generalize to real-world applications. With TableBank, which contains 417K high-quality labeled tables, we build several strong baselines using state-of-the-art models with deep neural networks. We make TableBank publicly available and hope it will empower more deep learning approaches to the table detection and recognition task.
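A minimal sketch of iterating over TableBank-style detection annotations, assuming they are distributed as COCO-format JSON (the file name is a placeholder; check the TableBank repository for the actual release layout):

```python
import json

with open("tablebank_word_detection.json") as f:  # placeholder path
    coco = json.load(f)

file_names = {img["id"]: img["file_name"] for img in coco["images"]}

for ann in coco["annotations"][:5]:
    x, y, w, h = ann["bbox"]  # COCO boxes are [x, y, width, height]
    print(f"{file_names[ann['image_id']]}: table at ({x:.0f}, {y:.0f}), size {w:.0f}x{h:.0f}")
```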