{"id":667026,"date":"2020-07-20T09:37:26","date_gmt":"2020-07-20T16:37:26","guid":{"rendered":"https:\/\/www.microsoft.com\/en-us\/research\/?post_type=msr-project&p=667026"},"modified":"2022-04-19T15:08:30","modified_gmt":"2022-04-19T22:08:30","slug":"prose-json-extraction","status":"publish","type":"msr-project","link":"https:\/\/www.microsoft.com\/en-us\/research\/project\/prose-json-extraction\/","title":{"rendered":"PROSE – Json Extraction"},"content":{"rendered":"

PROSE<\/b> can automatically flatten a JSON file into a table. It supports extracting Newline Delimited JSON<\/span><\/a> and truncated JSON.<\/p>\n

Code generation is currently supported for Python, PySpark, and M.<\/p>\n

This feature ships in Power Query\/Power BI; see Power Query JSON connector | Microsoft Docs<\/span><\/a><\/p>\n","protected":false},"excerpt":{"rendered":"

Extraction.Json automatically extracts tabular data from JSON files. It supports extracting Newline Delimited JSON and truncated JSON.<\/p>\n","protected":false},"featured_media":674232,"template":"","meta":{"msr-url-field":"","msr-podcast-episode":"","msrModifiedDate":"","msrModifiedDateEnabled":false,"ep_exclude_from_search":false,"footnotes":""},"research-area":[13556,13554,13560],"msr-locale":[268875],"msr-impact-theme":[],"msr-pillar":[],"class_list":["post-667026","msr-project","type-msr-project","status-publish","has-post-thumbnail","hentry","msr-research-area-artificial-intelligence","msr-research-area-human-computer-interaction","msr-research-area-programming-languages-software-engineering","msr-locale-en_us","msr-archive-status-active"],"msr_project_start":"","related-publications":[],"related-downloads":[],"related-videos":[],"related-groups":[],"related-events":[],"related-opportunities":[],"related-posts":[],"related-articles":[],"tab-content":[{"id":0,"name":"Usage","content":"The main entry point is the Session<\/code> class\u2019s Learn()<\/code> method, which returns a Program<\/code> object. The Program<\/code>\u2019s key method is Run()<\/code>, which executes the program on an input JSON document to obtain the extracted output. Each program also has a Schema<\/code> property that defines the structure of the extracted data.\r\n\r\nOther important methods are Serialize()<\/code> and Deserialize()<\/code>, which serialize and deserialize a Program<\/code> object.\r\n\r\nTo use Extraction.Json, reference the following assemblies:\r\n\r\nMicrosoft.ProgramSynthesis.Extraction.Json.dll<\/code>, Microsoft.ProgramSynthesis.Extraction.Json.Learner.dll<\/code>, and Microsoft.ProgramSynthesis.Extraction.Json.Semantics.dll<\/code>.\r\n\r\nThe Sample Project<\/a> illustrates our API usage.\r\n

Basic Usage<\/h2>\r\nBy default, Extraction.Json<\/strong> learns a join<\/em> program in which inner arrays are joined with other fields. As a result, an outer object in the input JSON can be flattened into several rows in the output table.\r\n\r\nThe snippet below illustrates a learning session that generates such a program from the input jsonText<\/code>:\r\n\r\n[code autolinks=\"true\" highlight=\"\"]\r\nstring jsonText = ... \r\nvar session = new Session();\r\nsession.Inputs.Add(jsonText);\r\nProgram program = session.Learn();\r\n[\/code]\r\n\r\nClients may add a NoJoinInnerArrays<\/code> constraint to the session to learn non-join<\/code> programs, as illustrated in the following snippet:\r\n\r\n[code autolinks=\"true\" highlight=\"\"]\r\nvar noJoinSession = new Session();\r\n\/\/ Add the input to the new session, not to the earlier one\r\nnoJoinSession.Inputs.Add(jsonText);\r\nnoJoinSession.Constraints.Add(new NoJoinInnerArrays());\r\nProgram noJoinProgram = noJoinSession.Learn();\r\n[\/code]\r\n\r\n

Serializing\/Deserializing a Program<\/h2>\r\nThe Extraction.Json.Program.Serialize()<\/code> method serializes the learned program to a string. The Extraction.Json.Loader.Instance.Load()<\/code> method deserializes the program text to a program.\r\n\r\n[code autolinks=\"true\" highlight=\"\"]\r\n\/\/ program was learned previously\r\nstring progText = program.Serialize();\r\nProgram loadProg = Loader.Instance.Load(progText);\r\n[\/code]\r\n\r\n
<\/div>\r\n

Executing a Program<\/h2>\r\nGiven an input JSON document, a program can generate a hierarchical tree or a flattened table. If the program is a join program, the table is flattened using either outer join<\/em> (default) or inner join<\/em> semantics.\r\n
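\r\nThe difference between the two join semantics can be illustrated with a minimal, self-contained C# sketch. It assumes the usual relational meaning of the terms: an outer join keeps an outer object whose inner array is empty and leaves the array column empty, while an inner join drops that object. The sketch does not call Extraction.Json and is not its implementation; JoinSemanticsSketch, Flatten, and the sample data are hypothetical names chosen for illustration.\r\n\r\n[code autolinks=\"true\" highlight=\"\"]\r\nusing System;\r\n\r\n\/\/ Illustration only: this does not use or reproduce Extraction.Json.\r\nstatic class JoinSemanticsSketch\r\n{\r\n    \/\/ Prints one row per element of the inner array 'phones'.\r\n    \/\/ When the array is empty, outer join still prints one row with an\r\n    \/\/ empty array column, while inner join prints nothing.\r\n    static void Flatten(string name, string[] phones, bool outerJoin)\r\n    {\r\n        if (phones.Length == 0)\r\n        {\r\n            if (outerJoin) Console.WriteLine(name + \" | (null)\");\r\n            return;\r\n        }\r\n        foreach (string phone in phones)\r\n        {\r\n            Console.WriteLine(name + \" | \" + phone);\r\n        }\r\n    }\r\n\r\n    static void Main()\r\n    {\r\n        \/\/ Hypothetical data: Ann has two phones, Bob has none.\r\n        string[] annPhones = { \"111\", \"222\" };\r\n        string[] bobPhones = { };\r\n\r\n        Console.WriteLine(\"Outer join:\");\r\n        Flatten(\"Ann\", annPhones, true);\r\n        Flatten(\"Bob\", bobPhones, true);   \/\/ Bob keeps one row\r\n\r\n        Console.WriteLine(\"Inner join:\");\r\n        Flatten(\"Ann\", annPhones, false);\r\n        Flatten(\"Bob\", bobPhones, false);  \/\/ Bob is dropped\r\n    }\r\n}\r\n[\/code]\r\n\r\nUnder these assumptions, the outer-join pass lists Bob once with an empty phone column, and the inner-join pass omits him.\r\n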

Generating a Tree<\/h3>\r\nUse the Run() method to obtain a hierarchical tree of the input document.\r\n\r\n[code autolinks=\"true\" highlight=\"\"]\r\n\/\/ program was learned previously\r\nITreeOutput tree = program.Run(jsonText);\r\n[\/code]\r\n\r\n

Generating a Table<\/h3>\r\nSupply the desired join semantics to the RunTable()<\/code> method as follows:\r\n\r\n[code autolinks=\"true\" highlight=\"\"]\r\n\/\/ program was learned previously\r\n\r\nIEnumerable outerJoinTable = program.RunTable(jsonText, TreeToTableSemantics.OuterJoin);\r\n\r\nIEnumerable innerJoinTable = program.RunTable(jsonText, TreeToTableSemantics.InnerJoin);\r\n[\/code]\r\n"}],"slides":[],"related-researchers":[{"type":"user_nicename","display_name":"Vu Le","user_id":39174,"people_section":"Section name 0","alias":"levu"}],"msr_research_lab":[],"msr_impact_theme":[],"_links":{"self":[{"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-project\/667026"}],"collection":[{"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-project"}],"about":[{"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/types\/msr-project"}],"version-history":[{"count":34,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-project\/667026\/revisions"}],"predecessor-version":[{"id":786772,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-project\/667026\/revisions\/786772"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/media\/674232"}],"wp:attachment":[{"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/media?parent=667026"}],"wp:term":[{"taxonomy":"msr-research-area","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/research-area?post=667026"},{"taxonomy":"msr-locale","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-locale?post=667026"},{"taxonomy":"msr-impact-theme","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-impact-theme?post=667026"},{"taxonomy":"msr-pillar","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v
2\/msr-pillar?post=667026"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}