{"id":675270,"date":"2020-06-01T09:00:06","date_gmt":"2020-06-01T16:00:06","guid":{"rendered":"https:\/\/www.microsoft.com\/en-us\/research\/?post_type=msr-blog-post&p=675270"},"modified":"2020-08-06T16:41:02","modified_gmt":"2020-08-06T23:41:02","slug":"getting-data-on-the-table-prose-powered-data-extraction","status":"publish","type":"msr-blog-post","link":"https:\/\/www.microsoft.com\/en-us\/research\/articles\/getting-data-on-the-table-prose-powered-data-extraction\/","title":{"rendered":"Getting data on the table: PROSE-powered Data Extraction"},"content":{"rendered":"

When the COVID-19 pandemic was in its early stages, several agencies published infection and mortality data for different geographical regions in the public domain.\u00a0 This data appeared in web pages, CSV files, JSON files, and more.\u00a0 There was plenty of useful data out there, but before one could use this data to generate models and visualizations, one had to ingest the data into a tabular data frame and clean it.\u00a0 The task of extracting tables from the varied data sources is often the price one has to pay before reaping the benefit of insights gained from downstream data analysis.<\/p>\n

\u200bCan we ease the pain in ingesting data? The PROSE team has built a SDK that provides an intelligent “read file” library call.\u00a0 This is envisioned as a one-stop shop for all data ingestion needs. The underlying technology, based on program synthesis, has been developed over a time period of about 6 years.\u00a0 The early investment in its research and development continues to pay dividends even today.\u00a0 The “data extraction from text” technology within PROSE has surfaced in a variety of products already: PowerShell’s ConvertFrom-String, Import Flat File Wizard in SSMS, and importing data from files in Power Query.<\/p>\n

Any product that works on data imported from a file can potentially use PROSE’s data extraction technology.\u00a0 However, every product brings its own requirements on what information it can provide and consume and what user interaction model it can support.\u00a0 Consequently, the PROSE “read file” library supports a very permissive interface: it is flexible in what it accepts as input, and it provides detailed output. The minimal input is the file contents. In this case, the table extraction happens completely predictively.\u00a0 However, users can provide more. For example, users can provide information about the file, schema for the data, examples of the rows\/columns in the expected output, and choice of delimiter.\u00a0 On the output, the PROSE “file reader” provides not only the output table, but also the parameters that were used to successfully parse the file into a table.\u00a0 It also provides code. A product can choose to just use some part of the output—just the code, the output table, or only the learnt parameters. The diversity in PROSE-enabled data extraction experience ranges from text-based command-line interfaces in PowerShell’s ConvertFrom-String, to UI forms in the Import Flat File Wizard in SSMS, and to a rich UI that shows and explains the generated code in Power Query.<\/p>\n

\"Import<\/p>\n

Power Query’s Import Text by Examples feature in action.<\/strong><\/p>\n

\u200bSome of our recent and upcoming efforts include a Python backend which generates readable Python code for extracting data from text files,\u00a0 PROSE-enabled predictive data import in VS Model Builder (available in VS 16.6 preview), and interactive data import in Azure Notebooks.\u00a0 They are all powered by the same underlying core technology.<\/p>\n

Indeed, the recent preview release of PROSE technology inside PowerQuery Text Connector had already helped users like Reid Haves (MVP) to easily ingest and transform complex data, a feature which he describes as “incredible.” PROSE technology continues to play an essential role in supporting our users and making them more productive—whenever, wherever, and however they work.<\/p>\n","protected":false},"excerpt":{"rendered":"

When the COVID-19 pandemic was in its early stages, several agencies published infection and mortality data for different geographical regions in the public domain.\u00a0 This data appeared in web pages, CSV files, JSON files, and more.\u00a0 There was plenty of useful data out there, but before one could use this data to generate models and […]<\/p>\n","protected":false},"author":38127,"featured_media":0,"template":"","meta":{"msr-url-field":"","msr-podcast-episode":"","msrModifiedDate":"","msrModifiedDateEnabled":false,"ep_exclude_from_search":false,"msr-content-parent":663303,"footnotes":""},"research-area":[],"msr-locale":[268875],"class_list":["post-675270","msr-blog-post","type-msr-blog-post","status-publish","hentry","msr-locale-en_us"],"msr_assoc_parent":{"id":663303,"type":"group"},"_links":{"self":[{"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-blog-post\/675270"}],"collection":[{"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-blog-post"}],"about":[{"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/types\/msr-blog-post"}],"author":[{"embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/users\/38127"}],"version-history":[{"count":4,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-blog-post\/675270\/revisions"}],"predecessor-version":[{"id":683364,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-blog-post\/675270\/revisions\/683364"}],"wp:attachment":[{"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/media?parent=675270"}],"wp:term":[{"taxonomy":"msr-research-area","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/research-area?post=675270"},{"taxonomy":"msr-locale","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-locale?post=675270"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}