{"id":640743,"date":"2021-08-10T04:05:12","date_gmt":"2021-08-10T11:05:12","guid":{"rendered":"https:\/\/www.microsoft.com\/en-us\/research\/?post_type=msr-project&p=640743"},"modified":"2024-08-11T21:01:24","modified_gmt":"2024-08-12T04:01:24","slug":"document-ai","status":"publish","type":"msr-project","link":"https:\/\/www.microsoft.com\/en-us\/research\/project\/document-ai\/","title":{"rendered":"Document AI (Intelligent Document Processing)"},"content":{"rendered":"
\n\t
\n\t\t
\n\t\t\t\t\t<\/div>\n\t\t\n\t\t
\n\t\t\t\n\t\t\t
\n\t\t\t\t\n\t\t\t\t
\n\t\t\t\t\t\n\t\t\t\t\t
\n\t\t\t\t\t\t
\n\t\t\t\t\t\t\t\n\t\t\t\t\t\t\t\n\n

Document AI (Intelligent Document Processing)

Document AI, or Document Intelligence, is a new research topic that refers to techniques for automatically reading, understanding, and analyzing business documents. Understanding business documents is an incredibly challenging task due to the diversity of layouts and formats, the poor quality of scanned document images, and the complexity of template structures.

Starting in 2019, we released two benchmark datasets, TableBank and DocBank, for table detection and recognition as well as page object detection in documents. More recently, we released two additional benchmarks: ReadingBank for the reading order detection task, and XFUND for the multilingual form understanding task, which contains forms in seven languages.

In addition to the benchmark datasets, we also proposed the multimodal Document Foundation Model, including the pre-trained LayoutLM model family for Document AI, which has been widely adopted by first- and third-party products and applications in Azure AI, such as Form Recognizer. The LayoutLM/LayoutXLM model family has been applied to a wide range of Document AI applications, including table detection, page object detection, LayoutReader for reading order detection, form/receipt/invoice understanding, complex document understanding, document image classification, and document VQA, while achieving state-of-the-art performance across these benchmarks.
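As a concrete illustration, the sketch below shows how a LayoutLM-style model can be loaded for a token-classification task such as form understanding through the Hugging Face transformers library. The label set, words, and bounding boxes are hypothetical placeholders, not a specific fine-tuned model or dataset.

```python
# Minimal sketch: LayoutLM for token classification (e.g., form field labeling).
# The label list and inputs below are illustrative placeholders.
import torch
from transformers import LayoutLMTokenizerFast, LayoutLMForTokenClassification

labels = ["O", "B-QUESTION", "I-QUESTION", "B-ANSWER", "I-ANSWER"]
tokenizer = LayoutLMTokenizerFast.from_pretrained("microsoft/layoutlm-base-uncased")
model = LayoutLMForTokenClassification.from_pretrained(
    "microsoft/layoutlm-base-uncased", num_labels=len(labels)
)

# Words and their bounding boxes (normalized to a 0-1000 coordinate space),
# typically produced by an OCR engine.
words = ["Invoice", "Number:", "12345"]
boxes = [[60, 50, 150, 70], [160, 50, 250, 70], [260, 50, 330, 70]]

encoding = tokenizer(words, is_split_into_words=True, return_tensors="pt")
# Expand word-level boxes to token level; special tokens get a dummy box.
word_ids = encoding.word_ids()
token_boxes = [[0, 0, 0, 0] if i is None else boxes[i] for i in word_ids]
encoding["bbox"] = torch.tensor([token_boxes])

with torch.no_grad():
    logits = model(**encoding).logits  # (1, seq_len, num_labels)
predictions = logits.argmax(-1)
```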

Moreover, MarkupLM is proposed to jointly pre-train text and markup language in a single framework for markup-based visually rich document understanding (VrDU) tasks. Distinct from fixed-layout documents, markup-based documents provide another viewpoint for document representation learning through markup structures, because 2D position information and document image information cannot be used straightforwardly during pre-training. Instead, MarkupLM takes advantage of the tree-based markup structures to model the relationships among different units within the document.
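To give a sense of how markup structure enters the model, here is a minimal sketch using the Hugging Face transformers implementation of MarkupLM; the HTML snippet is an illustrative example rather than data from a real benchmark.

```python
# Minimal sketch: encoding an HTML document with MarkupLM.
import torch
from transformers import MarkupLMProcessor, MarkupLMModel

processor = MarkupLMProcessor.from_pretrained("microsoft/markuplm-base")
model = MarkupLMModel.from_pretrained("microsoft/markuplm-base")

html = "<html><body><h1>Invoice</h1><p>Total due: $120.00</p></body></html>"

# The processor parses the markup tree and produces, alongside input_ids,
# XPath tag/subscript sequences encoding each token's position in the tree.
encoding = processor(html, return_tensors="pt")

with torch.no_grad():
    outputs = model(**encoding)
print(outputs.last_hidden_state.shape)  # (1, seq_len, hidden_size)
```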

Recently, we presented our latest research on OCR, namely TrOCR, a Transformer-based OCR model with a pre-trained image Transformer and a pre-trained text Transformer. TrOCR is convolution-free and can be easily adapted for multilingual text recognition as well as cloud/edge deployment.
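A minimal usage sketch, assuming the Hugging Face transformers implementation of TrOCR and a single text-line image (the file path is a placeholder):

```python
# Minimal sketch: recognizing text in a single-line image with TrOCR.
from PIL import Image
from transformers import TrOCRProcessor, VisionEncoderDecoderModel

processor = TrOCRProcessor.from_pretrained("microsoft/trocr-base-printed")
model = VisionEncoderDecoderModel.from_pretrained("microsoft/trocr-base-printed")

# Placeholder path; TrOCR expects an image of a single text line.
image = Image.open("text_line.png").convert("RGB")

pixel_values = processor(images=image, return_tensors="pt").pixel_values
generated_ids = model.generate(pixel_values)
text = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(text)
```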

Image Transformers have recently achieved considerable progress in natural image understanding, using either supervised (ViT, DeiT, etc.) or self-supervised (BEiT, MAE, etc.) pre-training techniques. We propose DiT, a self-supervised pre-trained Document Image Transformer model that uses large-scale unlabeled text images for Document AI tasks, which is essential since no supervised counterparts exist due to the lack of large-scale human-labeled document images. We leverage DiT as the backbone network in a variety of vision-based Document AI tasks, including document image classification, document layout analysis, and table detection, where significant improvements and new state-of-the-art results have been achieved.
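For example, a DiT backbone fine-tuned for document image classification can be used via the Hugging Face transformers library roughly as follows; the checkpoint name refers to the RVL-CDIP fine-tuned model released with DiT, and the image path is a placeholder.

```python
# Minimal sketch: document image classification with a fine-tuned DiT model.
import torch
from PIL import Image
from transformers import AutoImageProcessor, AutoModelForImageClassification

processor = AutoImageProcessor.from_pretrained("microsoft/dit-base-finetuned-rvlcdip")
model = AutoModelForImageClassification.from_pretrained("microsoft/dit-base-finetuned-rvlcdip")

image = Image.open("document_page.png").convert("RGB")  # placeholder path
inputs = processor(images=image, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits
predicted_class = model.config.id2label[logits.argmax(-1).item()]
print(predicted_class)  # e.g., "invoice", "letter", "form", ...
```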

LayoutLMv3 is a multimodal pre-trained Transformer for Document AI with unified text and image masking. Additionally, it is pre-trained with a word-patch alignment objective to learn cross-modal alignment by predicting whether the corresponding image patch of a text word is masked. The simple unified architecture and training objectives make LayoutLMv3 a general-purpose pre-trained model for both text-centric (e.g., form/receipt understanding) and image-centric (e.g., document layout analysis, table detection) Document AI tasks.
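A minimal sketch of encoding a page with LayoutLMv3 via the Hugging Face transformers implementation, using pre-OCR'd words and boxes; the image, words, and boxes are illustrative placeholders, with boxes normalized to a 0-1000 coordinate space.

```python
# Minimal sketch: encoding an OCR'd page with LayoutLMv3 (text + layout + image).
import torch
from PIL import Image
from transformers import LayoutLMv3Processor, LayoutLMv3Model

# apply_ocr=False because we supply words and boxes ourselves.
processor = LayoutLMv3Processor.from_pretrained("microsoft/layoutlmv3-base", apply_ocr=False)
model = LayoutLMv3Model.from_pretrained("microsoft/layoutlmv3-base")

image = Image.open("page.png").convert("RGB")           # placeholder image
words = ["Total", "Amount", "$120.00"]                  # placeholder OCR output
boxes = [[100, 200, 180, 220], [190, 200, 280, 220], [290, 200, 370, 220]]

encoding = processor(image, words, boxes=boxes, return_tensors="pt")

with torch.no_grad():
    outputs = model(**encoding)
print(outputs.last_hidden_state.shape)  # text tokens plus image patch tokens
```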

XDoc is a unified pre-trained model that handles different document formats in a single model. For parameter efficiency, XDoc shares backbone parameters across formats, such as the word embedding layer and the Transformer layers, while lightweight adaptive layers are introduced to capture the distinctions between formats. Experimental results demonstrate that with only 36.7% of the parameters, XDoc achieves comparable or even better performance on a variety of downstream tasks compared with the individual pre-trained models, which is cost-effective for real-world deployment.
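The following is an illustrative PyTorch sketch of the parameter-sharing idea described above, namely a shared Transformer backbone combined with small format-specific adapter layers. It is not the actual XDoc code; all class and layer names are hypothetical.

```python
# Illustrative sketch (not the actual XDoc implementation): a shared backbone
# with lightweight per-format adapters. All names here are hypothetical.
import torch
import torch.nn as nn

class SharedBackboneWithAdapters(nn.Module):
    def __init__(self, vocab_size=30522, hidden=768, layers=12,
                 formats=("word", "html", "pdf")):
        super().__init__()
        # Shared across formats: word embeddings and Transformer encoder.
        self.embed = nn.Embedding(vocab_size, hidden)
        encoder_layer = nn.TransformerEncoderLayer(d_model=hidden, nhead=12, batch_first=True)
        self.encoder = nn.TransformerEncoder(encoder_layer, num_layers=layers)
        # Lightweight, format-specific adapters (small bottleneck per format).
        self.adapters = nn.ModuleDict({
            f: nn.Sequential(nn.Linear(hidden, 128), nn.GELU(), nn.Linear(128, hidden))
            for f in formats
        })

    def forward(self, input_ids, doc_format):
        x = self.embed(input_ids)
        x = x + self.adapters[doc_format](x)  # inject format-specific signal
        return self.encoder(x)

model = SharedBackboneWithAdapters()
tokens = torch.randint(0, 30522, (1, 16))
hidden_states = model(tokens, doc_format="html")
```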

The LayoutLM model family has become a foundation model of Document AI for many first-party and third-party applications. Meanwhile, LayoutLM, LayoutLMv2, LayoutXLM, LayoutLMv3, TrOCR, DiT, and MarkupLM are now part of Hugging Face!

Contact: Lei Cui, Furu Wei


Project Repository:

Model: https://github.com/microsoft/unilm

Data: https://github.com/doc-analysis

Related Products:

Form Recognizer
