{"id":334229,"date":"2017-01-01T11:25:17","date_gmt":"2017-01-01T19:25:17","guid":{"rendered":"https:\/\/www.microsoft.com\/en-us\/research\/?post_type=msr-research-item&#038;p=334229"},"modified":"2018-10-16T22:27:17","modified_gmt":"2018-10-17T05:27:17","slug":"automated-data-extraction-using-predictive-program-synthesis","status":"publish","type":"msr-research-item","link":"https:\/\/www.microsoft.com\/en-us\/research\/publication\/automated-data-extraction-using-predictive-program-synthesis\/","title":{"rendered":"Automated Data Extraction using Predictive Program Synthesis"},"content":{"rendered":"<p>In recent years there has been rising interest in the use of programming-by-example techniques to assist users in data manipulation tasks. Such techniques rely on an explicit input-output examples specification from the user to automatically synthesize programs. However, in a wide range of data extraction tasks it is easy for a human observer to predict the desired extraction by just observing the input data itself. Such predictive intelligence has not yet been explored in program synthesis research, and is what we address in this work. We describe a predictive program synthesis algorithm that infers programs in a general form of extraction DSLs (domain specific languages) given input-only examples. We describe concrete instantiations of such DSLs and the synthesis algorithm in the two practical application domains of text extraction and web extraction, and present an evaluation of our technique on a range of extraction tasks encountered in practice.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>In recent years there has been rising interest in the use of programming-by-example techniques to assist users in data manipulation tasks. Such techniques rely on an explicit input-output examples specification from the user to automatically synthesize programs. However, in a wide range of data extraction tasks it is easy for a human observer to predict [&hellip;]<\/p>\n","protected":false},"featured_media":0,"template":"","meta":{"msr-url-field":"","msr-podcast-episode":"","msrModifiedDate":"","msrModifiedDateEnabled":false,"ep_exclude_from_search":false,"_classifai_error":"","footnotes":""},"msr-content-type":[3],"msr-research-highlight":[],"research-area":[13556],"msr-publication-type":[193716],"msr-product-type":[],"msr-focus-area":[],"msr-platform":[],"msr-download-source":[],"msr-locale":[268875],"msr-post-option":[],"msr-field-of-study":[],"msr-conference":[],"msr-journal":[],"msr-impact-theme":[],"msr-pillar":[],"class_list":["post-334229","msr-research-item","type-msr-research-item","status-publish","hentry","msr-research-area-artificial-intelligence","msr-locale-en_us"],"msr_publishername":"Association for the Advancement of Artificial Intelligence","msr_edition":"AAAI 2017","msr_affiliation":"","msr_published_date":"2017-02-05","msr_host":"","msr_duration":"","msr_version":"","msr_speaker":"","msr_other_contributors":"","msr_booktitle":"","msr_pages_string":"","msr_chapter":"","msr_isbn":"","msr_journal":"","msr_volume":"","msr_number":"","msr_editors":"","msr_series":"","msr_issue":"","msr_organization":"","msr_how_published":"","msr_notes":"","msr_highlight_text":"","msr_release_tracker_id":"","msr_original_fields_of_study":"","msr_download_urls":"","msr_external_url":"","msr_secondary_video_url":"","msr_longbiography":"","msr_microsoftintellectualproperty":1,"msr_main_download":"334232","msr_publicationurl":"","msr_doi":"","msr_publication_uploader":[{"type":"file","title":"Automated Data Extraction using Predictive Program Synthesis","viewUrl":"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2016\/12\/aaai17.pdf","id":334232,"label_id":0}],"msr_related_uploader":"","msr_attachments":[],"msr-author-ordering":[{"type":"user_nicename","value":"Mohammad Raza","user_id":32997,"rest_url":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/microsoft-research\/v1\/researchers?person=Mohammad Raza"},{"type":"user_nicename","value":"Sumit Gulwani","user_id":33755,"rest_url":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/microsoft-research\/v1\/researchers?person=Sumit Gulwani"}],"msr_impact_theme":[],"msr_research_lab":[],"msr_event":[],"msr_group":[663303],"msr_project":[817234,654579,672951],"publication":[],"video":[],"download":[],"msr_publication_type":"inproceedings","related_content":{"projects":[{"ID":817234,"post_title":"PROSE-Powered Data Ingestion (a.k.a. Table Extraction)","post_name":"prose-powered-data-ingestion","post_type":"msr-project","post_date":"2022-01-31 17:52:00","post_modified":"2022-04-19 13:39:04","post_status":"publish","permalink":"https:\/\/www.microsoft.com\/en-us\/research\/project\/prose-powered-data-ingestion\/","post_excerpt":"\u200bWhen the COVID-19 pandemic was in its early stages, several agencies published infection and mortality data for different geographical regions in the public domain.&nbsp;This data appeared in web pages, CSV files, JSON files, and more.&nbsp;There was plenty of useful data out there, but before one could use this data to generate models and visualizations, one had to ingest the data into a tabular data frame and clean it.&nbsp;The task of extracting tables from the varied&hellip;","_links":{"self":[{"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-project\/817234"}]}},{"ID":654579,"post_title":"PROSE Framework","post_name":"prose-framework","post_type":"msr-project","post_date":"2020-07-22 11:23:08","post_modified":"2022-07-22 10:31:53","post_status":"publish","permalink":"https:\/\/www.microsoft.com\/en-us\/research\/project\/prose-framework\/","post_excerpt":"Microsoft PROSE SDK is a framework of technologies for programming by examples: automatic generation of programs from input-output examples.","_links":{"self":[{"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-project\/654579"}]}},{"ID":672951,"post_title":"Predictive Program Synthesis","post_name":"predictive-program-synthesis","post_type":"msr-project","post_date":"2020-07-07 20:17:25","post_modified":"2022-04-19 15:47:55","post_status":"publish","permalink":"https:\/\/www.microsoft.com\/en-us\/research\/project\/predictive-program-synthesis\/","post_excerpt":"Program synthesis technologies help users to easily automate tasks that would otherwise require significant manual effort or programming skills. For instance, programming-by-example or natural language programming approaches allow the user to express intent by giving examples or natural language descriptions of the task, from which the system can synthesize a program in a formal programming language to complete the task. In this project, we are exploring the novel notion of predictive program synthesis, which is&hellip;","_links":{"self":[{"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-project\/672951"}]}}]},"_links":{"self":[{"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-research-item\/334229","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-research-item"}],"about":[{"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/types\/msr-research-item"}],"version-history":[{"count":1,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-research-item\/334229\/revisions"}],"predecessor-version":[{"id":441450,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-research-item\/334229\/revisions\/441450"}],"wp:attachment":[{"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/media?parent=334229"}],"wp:term":[{"taxonomy":"msr-content-type","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-content-type?post=334229"},{"taxonomy":"msr-research-highlight","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-research-highlight?post=334229"},{"taxonomy":"msr-research-area","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/research-area?post=334229"},{"taxonomy":"msr-publication-type","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-publication-type?post=334229"},{"taxonomy":"msr-product-type","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-product-type?post=334229"},{"taxonomy":"msr-focus-area","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-focus-area?post=334229"},{"taxonomy":"msr-platform","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-platform?post=334229"},{"taxonomy":"msr-download-source","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-download-source?post=334229"},{"taxonomy":"msr-locale","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-locale?post=334229"},{"taxonomy":"msr-post-option","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-post-option?post=334229"},{"taxonomy":"msr-field-of-study","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-field-of-study?post=334229"},{"taxonomy":"msr-conference","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-conference?post=334229"},{"taxonomy":"msr-journal","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-journal?post=334229"},{"taxonomy":"msr-impact-theme","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-impact-theme?post=334229"},{"taxonomy":"msr-pillar","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-pillar?post=334229"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}