{"id":445557,"date":"2017-12-01T00:00:40","date_gmt":"2017-12-01T08:00:40","guid":{"rendered":"https:\/\/www.microsoft.com\/en-us\/research\/?post_type=msr-research-item&p=445557"},"modified":"2018-10-16T20:06:30","modified_gmt":"2018-10-17T03:06:30","slug":"webrelate-integrating-web-data-with-spreadsheets-using-examples","status":"publish","type":"msr-research-item","link":"https:\/\/www.microsoft.com\/en-us\/research\/publication\/webrelate-integrating-web-data-with-spreadsheets-using-examples\/","title":{"rendered":"WebRelate: Integrating Web Data with Spreadsheets using Examples"},"content":{"rendered":"

Data integration between web sources and relational data is a key challenge faced by data scientists and spreadsheet users. There are two main challenges in programmatically joining web data with relational data. First, most websites do not expose a direct interface to obtain tabular data, so the user needs to formulate logic to get to different webpages for each input row in the relational table. Second, after reaching the desired webpage, the user needs to write complex scripts to extract the relevant data, and the extraction is often conditioned on the input data. Since many data scientists and end users come from diverse backgrounds, writing such complex, regular-expression-based scripts to perform data integration tasks is unfortunately often beyond their programming expertise.<\/p>\n

We present WebRelate, a system that allows users to join semi-structured web data with relational data in spreadsheets using input-output examples. WebRelate decomposes the web data integration task into two sub-tasks: i) URL learning and ii) input-dependent web extraction. We introduce a novel synthesis paradigm called \u201cOutput-constrained Programming By Examples\u201d, which allows us to use the finite set of possible outputs for the new inputs to efficiently constrain the search in the synthesis algorithm. We instantiate this paradigm for the two sub-tasks in WebRelate. The first sub-task generates the URLs for the webpages containing the desired data for all rows in the relational table. WebRelate achieves this by learning a string transformation program using a few example URLs. The second sub-task uses examples of desired data to be extracted from the corresponding webpages and learns a program to extract the data for the other rows. We design expressive domain-specific languages (DSLs) for URL generation and web data extraction, and present efficient synthesis algorithms for learning programs in these DSLs from a few input-output examples. We evaluate WebRelate on 88 real-world web data integration tasks taken from online help forums and the Excel product team, and show that WebRelate can learn the desired programs within a few seconds using only 1 example for the majority of the tasks.<\/p>\n","protected":false},"excerpt":{"rendered":"

Data integration between web sources and relational data is a key challenge faced by data scientists and spreadsheet users. There are two main challenges in programmatically joining web data with relational data. First, most websites do not expose a direct interface to obtain tabular data, so the user needs to formulate logic to get to different webpages […]<\/p>\n","protected":false},"featured_media":0,"template":"","meta":{"msr-url-field":"","msr-podcast-episode":"","msrModifiedDate":"","msrModifiedDateEnabled":false,"ep_exclude_from_search":false,"footnotes":""},"msr-content-type":[3],"msr-research-highlight":[],"research-area":[13560],"msr-publication-type":[193716],"msr-product-type":[],"msr-focus-area":[],"msr-platform":[],"msr-download-source":[],"msr-locale":[268875],"msr-field-of-study":[],"msr-conference":[],"msr-journal":[],"msr-impact-theme":[],"msr-pillar":[],"class_list":["post-445557","msr-research-item","type-msr-research-item","status-publish","hentry","msr-research-area-programming-languages-software-engineering","msr-locale-en_us"],"msr_publishername":"ACM","msr_edition":"Principles of Programming Languages 2018 (POPL 
2018)","msr_affiliation":"","msr_published_date":"2018-01-01","msr_host":"","msr_duration":"","msr_version":"","msr_speaker":"","msr_other_contributors":"","msr_booktitle":"","msr_pages_string":"","msr_chapter":"","msr_isbn":"","msr_journal":"","msr_volume":"","msr_number":"","msr_editors":"","msr_series":"","msr_issue":"","msr_organization":"","msr_how_published":"","msr_notes":"","msr_highlight_text":"","msr_release_tracker_id":"","msr_original_fields_of_study":"","msr_download_urls":"","msr_external_url":"","msr_secondary_video_url":"","msr_longbiography":"","msr_microsoftintellectualproperty":1,"msr_main_download":"445560","msr_publicationurl":"","msr_doi":"https:\/\/doi.org\/10.1145\/3158090","msr_publication_uploader":[{"type":"file","title":"webrelate_popl18","viewUrl":"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2017\/12\/webrelate_popl18.pdf","id":445560,"label_id":0},{"type":"doi","title":"https:\/\/doi.org\/10.1145\/3158090","viewUrl":false,"id":false,"label_id":0}],"msr_related_uploader":"","msr_attachments":[],"msr-author-ordering":[{"type":"text","value":"Jeevana Priya 
Inala","user_id":0,"rest_url":false},{"type":"user_nicename","value":"risin","user_id":33413,"rest_url":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/microsoft-research\/v1\/researchers?person=risin"}],"msr_impact_theme":[],"msr_research_lab":[],"msr_event":[444528],"msr_group":[],"msr_project":[],"publication":[],"video":[],"download":[],"msr_publication_type":"inproceedings","_links":{"self":[{"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-research-item\/445557"}],"collection":[{"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-research-item"}],"about":[{"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/types\/msr-research-item"}],"version-history":[{"count":2,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-research-item\/445557\/revisions"}],"predecessor-version":[{"id":445566,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-research-item\/445557\/revisions\/445566"}],"wp:attachment":[{"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/media?parent=445557"}],"wp:term":[{"taxonomy":"msr-content-type","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-content-type?post=445557"},{"taxonomy":"msr-research-highlight","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-research-highlight?post=445557"},{"taxonomy":"msr-research-area","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/research-area?post=445557"},{"taxonomy":"msr-publication-type","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-publication-type?post=445557"},{"taxonomy":"msr-product-type","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-product-type?post=445557"},{"taxonomy":"msr-focus-area","embeddable":true,"href":"https:\/\/www.microsoft.com\
/en-us\/research\/wp-json\/wp\/v2\/msr-focus-area?post=445557"},{"taxonomy":"msr-platform","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-platform?post=445557"},{"taxonomy":"msr-download-source","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-download-source?post=445557"},{"taxonomy":"msr-locale","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-locale?post=445557"},{"taxonomy":"msr-field-of-study","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-field-of-study?post=445557"},{"taxonomy":"msr-conference","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-conference?post=445557"},{"taxonomy":"msr-journal","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-journal?post=445557"},{"taxonomy":"msr-impact-theme","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-impact-theme?post=445557"},{"taxonomy":"msr-pillar","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-pillar?post=445557"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}