{"id":335339,"date":"2016-12-12T13:29:23","date_gmt":"2016-12-12T21:29:23","guid":{"rendered":"https:\/\/www.microsoft.com\/en-us\/research\/?post_type=msr-research-item&p=335339"},"modified":"2021-09-24T10:55:03","modified_gmt":"2021-09-24T17:55:03","slug":"flashextract-framework-data-extraction-examples","status":"publish","type":"msr-research-item","link":"https:\/\/www.microsoft.com\/en-us\/research\/publication\/flashextract-framework-data-extraction-examples\/","title":{"rendered":"FlashExtract: a framework for data extraction by examples"},"content":{"rendered":"

Various document types that combine model and view (e.g., text files, webpages, spreadsheets) make it easy to organize (possibly hierarchical) data, but make it difficult to extract raw data for any further manipulation or querying. We present a general framework, FlashExtract, to extract relevant data from semi-structured documents using examples. It includes (a) an interaction model that allows end-users to give examples to extract various fields and to relate them in a hierarchical organization using structure and sequence constructs, and (b) an inductive synthesis algorithm that synthesizes the intended program from a few examples in any underlying domain-specific language for data extraction built using our specified algebra of a few core operators (map, filter, merge, and pair). We describe instantiations of our framework for three different domains: text files, webpages, and spreadsheets. On our benchmark comprising 75 documents, FlashExtract is able to extract the intended data using an average of 2.36 examples in 0.84 seconds per field.<\/p>\n","protected":false},"excerpt":{"rendered":"

Various document types that combine model and view (e.g., text files, webpages, spreadsheets) make it easy to organize (possibly hierarchical) data, but make it difficult to extract raw data for any further manipulation or querying. We present a general framework FlashExtract to extract relevant data from semi-structured documents using examples. It includes: (a) an interaction […]<\/p>\n","protected":false},"featured_media":0,"template":"","meta":{"msr-url-field":"","msr-podcast-episode":"","msrModifiedDate":"","msrModifiedDateEnabled":false,"ep_exclude_from_search":false,"_classifai_error":"","footnotes":""},"msr-content-type":[3],"msr-research-highlight":[],"research-area":[13561],"msr-publication-type":[193716],"msr-product-type":[],"msr-focus-area":[],"msr-platform":[],"msr-download-source":[],"msr-locale":[268875],"msr-post-option":[],"msr-field-of-study":[246691,254287,248920,260119,260116,248503,260113,253456,253678,247558,250582],"msr-conference":[260122],"msr-journal":[],"msr-impact-theme":[],"msr-pillar":[],"class_list":["post-335339","msr-research-item","type-msr-research-item","status-publish","hentry","msr-research-area-algorithms","msr-locale-en_us","msr-field-of-study-computer-science","msr-field-of-study-data-extraction","msr-field-of-study-data-mining","msr-field-of-study-hierarchical-organization","msr-field-of-study-inductive-synthesis","msr-field-of-study-information-retrieval","msr-field-of-study-merge-version-control","msr-field-of-study-program-synthesis","msr-field-of-study-programming-by-example","msr-field-of-study-raw-data","msr-field-of-study-web-page"],"msr_publishername":"ACM","msr_edition":"","msr_affiliation":"","msr_published_date":"2014-6-8","msr_host":"","msr_duration":"","msr_version":"","msr_speaker":"","msr_other_contributors":"","msr_booktitle":"","msr_pages_string":"","msr_chapter":"","msr_isbn":"","msr_journal":"","msr_volume":"","msr_number":"","msr_editors":"","msr_series":"","msr_issue":"","msr_organization":"","msr_how_p
ublished":"","msr_notes":"","msr_highlight_text":"","msr_release_tracker_id":"","msr_original_fields_of_study":"","msr_download_urls":"","msr_external_url":"","msr_secondary_video_url":"","msr_longbiography":"","msr_microsoftintellectualproperty":1,"msr_main_download":"335342","msr_publicationurl":"","msr_doi":"","msr_publication_uploader":[{"type":"doi","viewUrl":"false","id":"false","title":"10.1145\/2594291.2594333","label_id":"243106","label":0},{"type":"url","viewUrl":"false","id":"false","title":"http:\/\/www-cs-students.stanford.edu\/~adityagp\/courses\/cs598\/papers\/flash-extract.pdf","label_id":"243132","label":0},{"type":"url","viewUrl":"false","id":"false","title":"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2016\/12\/pldi14-flashextract.pdf","label_id":"243132","label":0},{"type":"url","viewUrl":"false","id":"false","title":"https:\/\/dblp.uni-trier.de\/db\/conf\/pldi\/pldi2014.html#LeG14","label_id":"243109","label":0},{"type":"url","viewUrl":"false","id":"false","title":"https:\/\/dl.acm.org\/doi\/10.1145\/2594291.2594333","label_id":"243109","label":0},{"type":"url","viewUrl":"false","id":"false","title":"https:\/\/research.microsoft.com\/en-us\/um\/people\/sumitg\/pubs\/pldi14-flashextract.pdf","label_id":"243109","label":0},{"type":"url","viewUrl":"false","id":"false","title":"https:\/\/www.microsoft.com\/en-us\/research\/publication\/flashextract-framework-data-extraction-examples\/","label_id":"243109","label":0}],"msr_related_uploader":[{"type":"url","viewUrl":"false","id":"false","title":"https:\/\/www.microsoft.com\/en-us\/research\/publication\/flashextract-framework-data-extraction-examples\/","label_id":"243118","label":0}],"msr_attachments":[{"id":335345,"url":"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2016\/12\/pldi14-flashextract.pptx"},{"id":335342,"url":"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2016\/12\/pldi14-flashextract.pdf"}],"msr-author-ordering":[{"type":"
user_nicename","value":"Vu Le","user_id":39174,"rest_url":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/microsoft-research\/v1\/researchers?person=Vu Le"},{"type":"user_nicename","value":"Sumit Gulwani","user_id":33755,"rest_url":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/microsoft-research\/v1\/researchers?person=Sumit Gulwani"}],"msr_impact_theme":[],"msr_research_lab":[],"msr_event":[],"msr_group":[663303],"msr_project":[817234,654579,665739,360452,361034],"publication":[],"video":[],"download":[],"msr_publication_type":"inproceedings","related_content":{"projects":[{"ID":817234,"post_title":"PROSE-Powered Data Ingestion (a.k.a. Table Extraction)","post_name":"prose-powered-data-ingestion","post_type":"msr-project","post_date":"2022-01-31 17:52:00","post_modified":"2022-04-19 13:39:04","post_status":"publish","permalink":"https:\/\/www.microsoft.com\/en-us\/research\/project\/prose-powered-data-ingestion\/","post_excerpt":"\u200bWhen the COVID-19 pandemic was in its early stages, several agencies published infection and mortality data for different geographical regions in the public domain. This data appeared in web pages, CSV files, JSON files, and more. There was plenty of useful data out there, but before one could use this data to generate models and visualizations, one had to ingest the data into a tabular data frame and clean it. 
The task of extracting tables from the varied…","_links":{"self":[{"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-project\/817234"}]}},{"ID":654579,"post_title":"PROSE Framework","post_name":"prose-framework","post_type":"msr-project","post_date":"2020-07-22 11:23:08","post_modified":"2022-07-22 10:31:53","post_status":"publish","permalink":"https:\/\/www.microsoft.com\/en-us\/research\/project\/prose-framework\/","post_excerpt":"Microsoft PROSE SDK is a framework of technologies for programming by examples: automatic generation of programs from input-output examples.","_links":{"self":[{"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-project\/654579"}]}},{"ID":665739,"post_title":"PROSE - Text Extraction (FlashExtract)","post_name":"prose-text-extraction","post_type":"msr-project","post_date":"2020-07-20 09:38:42","post_modified":"2022-04-19 15:08:04","post_status":"publish","permalink":"https:\/\/www.microsoft.com\/en-us\/research\/project\/prose-text-extraction\/","post_excerpt":"Extraction.Text extracts data from semi-structured text files using examples.","_links":{"self":[{"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-project\/665739"}]}},{"ID":360452,"post_title":"FlashExtract","post_name":"flashextract","post_type":"msr-project","post_date":"2017-02-01 14:54:31","post_modified":"2017-06-12 09:36:39","post_status":"publish","permalink":"https:\/\/www.microsoft.com\/en-us\/research\/project\/flashextract\/","post_excerpt":"(Programming-by-example APIs for extracting structured data from text\/log files by examples) The FlashExtract technology (published as a PLDI 2014 paper [pdf |ppt slides |Video |Video 2]) ships as features in Powershell and Azure OMS (Operations Management Suite). FlashExtract powers the ConvertFrom-String cmdlet in Powershell. Here's a Microsoft blogpost that explains this cmdlet. 
Here are some videos prepared by others to demonstrate this capability: Example-driven parsing Doug Finke's UI Here are some blogposts Powershell PowerShell 5.0…","_links":{"self":[{"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-project\/360452"}]}},{"ID":361034,"post_title":"Program Synthesis","post_name":"program-synthesis","post_type":"msr-project","post_date":"2017-02-02 13:37:27","post_modified":"2020-01-24 08:27:25","post_status":"publish","permalink":"https:\/\/www.microsoft.com\/en-us\/research\/project\/program-synthesis\/","post_excerpt":"Introduction Program synthesis is the task of automatically discovering an executable piece of code given user intent expressed using various forms of constraints such as input-output examples, demonstrations, natural language, etc. Program synthesis has direct applications for various classes of users in the technology pyramid: (100s of millions of) End Users (people who have access to a computational device but are not expert programmers): Helping them to create small snippets of code for performing 
repetitive…","_links":{"self":[{"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-project\/361034"}]}}]},"_links":{"self":[{"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-research-item\/335339"}],"collection":[{"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-research-item"}],"about":[{"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/types\/msr-research-item"}],"version-history":[{"count":3,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-research-item\/335339\/revisions"}],"predecessor-version":[{"id":776005,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-research-item\/335339\/revisions\/776005"}],"wp:attachment":[{"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/media?parent=335339"}],"wp:term":[{"taxonomy":"msr-content-type","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-content-type?post=335339"},{"taxonomy":"msr-research-highlight","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-research-highlight?post=335339"},{"taxonomy":"msr-research-area","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/research-area?post=335339"},{"taxonomy":"msr-publication-type","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-publication-type?post=335339"},{"taxonomy":"msr-product-type","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-product-type?post=335339"},{"taxonomy":"msr-focus-area","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-focus-area?post=335339"},{"taxonomy":"msr-platform","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-platform?post=335339"},{"taxonomy":"msr-download-source","embeddable":true,"
href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-download-source?post=335339"},{"taxonomy":"msr-locale","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-locale?post=335339"},{"taxonomy":"msr-post-option","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-post-option?post=335339"},{"taxonomy":"msr-field-of-study","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-field-of-study?post=335339"},{"taxonomy":"msr-conference","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-conference?post=335339"},{"taxonomy":"msr-journal","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-journal?post=335339"},{"taxonomy":"msr-impact-theme","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-impact-theme?post=335339"},{"taxonomy":"msr-pillar","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-pillar?post=335339"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}