When the COVID-19 pandemic was in its early stages, several agencies published infection and mortality data for different geographical regions in the public domain. This data appeared in web pages, CSV files, JSON files, and more. There was plenty of useful data out there, but before one could use this data to generate models and visualizations, one had to ingest the data into a tabular data frame and clean it. The task of extracting tables from the varied data sources is often the price one has to pay before reaping the benefit of insights gained from downstream data analysis.
Can we ease the pain in ingesting data? The PROSE team has built a SDK that provides an intelligent read file library call. This is envisioned as a one-stop shop for all data ingestion needs. The underlying technology, based on program synthesis, has been developed over a time period of about 6 years. The early investment in its research and development continues to pay dividends even today. The data extraction from text technology within PROSE has surfaced in a variety of products already: PowerShell’s ConvertFrom-String, Import Flat File Wizard in SSMS, and importing data from files in Power Query.
Any product that works on data imported from a file can potentially use PROSE’s data extraction technology. However, every product brings its own requirements on what information it can provide and consume and what user interaction model it can support. Consequently, the PROSE read file library supports a very permissive interface: it is flexible in what it accepts as input, and it provides detailed output. The minimal input is the file contents. In this case, the table extraction happens completely predictively. However, users can provide more. For example, users can provide information about the file, schema for the data, examples of the rows/columns in the expected output, and choice of delimiter. On the output, the PROSE file reader provides not only the output table, but also the parameters that were used to successfully parse the file into a table. It also provides code. A product can choose to just use some part of the output—just the code, the output table, or only the learnt parameters. The diversity in PROSE-enabled data extraction experience ranges from text-based command-line interfaces in PowerShell’s ConvertFrom-String, to UI forms in the Import Flat File Wizard in SSMS, and to a rich UI that shows and explains the generated code in Power Query.
Some of our recent and upcoming efforts include a Python backend which generates readable Python code for extracting data from text files, PROSE-enabled predictive data import in VS Model Builder (available in VS 16.6 preview), and interactive data import in Azure Notebooks. They are all powered by the same underlying core technology.
Indeed, the recent preview release of PROSE technology inside PowerQuery Text Connector had already helped users like Reid Haves (MVP) to easily ingest and transform complex data, a feature which he describes as “incredible”. PROSE technology continues to play an essential role in supporting our users and making them more productive—whenever, wherever, and however they work.
Supported Data Formats
- Web [Predictive, By-Example, Hybrid]
- Text [FlashExtract, Unravel]
- JSON
- Spreadsheet
Interaction Modalities
- Read Blog post Getting data on the table: PROSE-powered Data Extraction