{"id":665793,"date":"2020-07-20T09:43:58","date_gmt":"2020-07-20T16:43:58","guid":{"rendered":"https:\/\/www.microsoft.com\/en-us\/research\/?post_type=msr-project&p=665793"},"modified":"2022-04-19T15:08:51","modified_gmt":"2022-04-19T22:08:51","slug":"prose-text-splitting","status":"publish","type":"msr-project","link":"https:\/\/www.microsoft.com\/en-us\/research\/project\/prose-text-splitting\/","title":{"rendered":"PROSE – Text Splitting"},"content":{"rendered":"

Split.Text<\/strong> is a system for splitting data in plain text format, where there may be multiple fields that need to be separated into different columns. The Usage (opens in new tab)<\/span><\/a> page and the Split.Text<\/code> sample project (opens in new tab)<\/span><\/a> show examples of how to use the Split.Text API. The Split.Text system supports purely predictive as well as interactive techniques to learn programs for splitting textual data.<\/p>\n

Predictive Splitting<\/h2>\n

The predictive learning technique attempts to infer a program given only the input data and no other constraints from the user (such as output examples). It analyses the properties of the input data to infer the most regular pattern of fields and delimiters that have good alignment with one another. For instance, if we are given the following input data without any output examples:<\/p>\n\n\n\n\n\n\n\n\n\n\n
Input<\/th>\n<\/tr>\n<\/thead>\n
PE5 Leonard Robledo (Australia)<\/td>\n<\/tr>\n
U109 Adam Jay Lucas (New Zealand)<\/td>\n<\/tr>\n
R342 Carrie Dodson (United States)<\/td>\n<\/tr>\n
TS51 Naomi Cole (Canada)<\/td>\n<\/tr>\n
Y722 Owen Murphy (United States)<\/td>\n<\/tr>\n
UP335 Zoe Erin Rees (GB)<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n

 <\/p>\n

Split.Text will predictively generate a program to perform the following three-column splitting:<\/p>\n\n\n\n\n\n\n\n\n\n\n
Split Column 1<\/th>\nSplit Column 2<\/th>\nSplit Column 3<\/th>\n<\/tr>\n<\/thead>\n
PE5<\/td>\nLeonard Robledo<\/td>\nAustralia<\/td>\n<\/tr>\n
U109<\/td>\nAdam Jay Lucas<\/td>\nNew Zealand<\/td>\n<\/tr>\n
R342<\/td>\nCarrie Dodson<\/td>\nUnited States<\/td>\n<\/tr>\n
TS51<\/td>\nNaomi Cole<\/td>\nCanada<\/td>\n<\/tr>\n
Y722<\/td>\nOwen Murphy<\/td>\nUnited States<\/td>\n<\/tr>\n
UP335<\/td>\nZoe Erin Rees<\/td>\nGB<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n
<\/div>\n

In this case it identifies the space as well as the open\/close brackets as probable delimiters given the pattern in the inputs. However, not every occurrence of the space character is a delimiter, as there is a varying number of spaces inside the person names (some including middle names) and the countries as well. Hence we cannot simply split by all spaces. The Split.Text DSL and learning algorithm handle such scenarios by analyzing the patterns within the inferred data fields as well as supporting contextual delimiters<\/em>, which look at data patterns around occurrences of possible delimiting substrings. More information about the DSL and learning techniques can be found in our recent publication on predictive program synthesis. (opens in new tab)<\/span><\/a><\/p>\n

Interactive Splitting<\/h2>\n

The predictive inference of Split.Text can handle many common practical scenarios for text splitting. However, in many cases different users may have different preferences for the kind of splitting they want, especially with respect to how they want to split a particular field into subfields. For example, in the above scenario, one user may want to separate the first names into a separate column while another may prefer to have just the last name in its own column. Split.Text supports such scenarios with interactive features that permit the user to provide various constraints on the program that will be learnt.<\/p>\n

The most powerful constraint is to provide examples of the desired splitting on some inputs. For instance, if the user wants first names to be split into a separate column, she may provide the following examples on the first two inputs:<\/p>\n\n\n\n\n\n\n
Input<\/th>\nSplit Column 1<\/th>\nSplit Column 2<\/th>\nSplit Column 3<\/th>\nSplit Column 4<\/th>\n<\/tr>\n<\/thead>\n
PE5 Leonard Robledo (Australia)<\/td>\nPE5<\/td>\nLeonard<\/td>\nRobledo<\/td>\nAustralia<\/td>\n<\/tr>\n
U109 Adam Jay Lucas (New Zealand)<\/td>\nU109<\/td>\nAdam<\/td>\nJay Lucas<\/td>\nNew Zealand<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n
<\/div>\n

The system will then learn a program that can perform the same splitting on the rest of the data:<\/p>\n\n\n\n\n\n\n\n\n\n\n
Input<\/th>\nSplit Column 1<\/th>\nSplit Column 2<\/th>\nSplit Column 3<\/th>\nSplit Column 4<\/th>\n<\/tr>\n<\/thead>\n
PE5 Leonard Robledo (Australia)<\/td>\nPE5<\/td>\nLeonard<\/td>\nRobledo<\/td>\nAustralia<\/td>\n<\/tr>\n
U109 Adam Jay Lucas (New Zealand)<\/td>\nU109<\/td>\nAdam<\/td>\nJay Lucas<\/td>\nNew Zealand<\/td>\n<\/tr>\n
R342 Carrie Dodson (United States)<\/td>\nR342<\/td>\nCarrie<\/td>\nDodson<\/td>\nUnited States<\/td>\n<\/tr>\n
TS51 Naomi Cole (Canada)<\/td>\nTS51<\/td>\nNaomi<\/td>\nCole<\/td>\nCanada<\/td>\n<\/tr>\n
Y722 Owen Murphy (United States)<\/td>\nY722<\/td>\nOwen<\/td>\nMurphy<\/td>\nUnited States<\/td>\n<\/tr>\n
UP335 Zoe Erin Rees (GB)<\/td>\nUP335<\/td>\nZoe<\/td>\nErin Rees<\/td>\nGB<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n
<\/div>\n

If another user wants last names to be in a separate column, then he can similarly provide the corresponding examples to achieve that splitting:<\/p>\n\n\n\n\n\n\n
Input<\/th>\nSplit Column 1<\/th>\nSplit Column 2<\/th>\nSplit Column 3<\/th>\nSplit Column 4<\/th>\n<\/tr>\n<\/thead>\n
PE5 Leonard Robledo (Australia)<\/td>\nPE5<\/td>\nLeonard<\/td>\nRobledo<\/td>\nAustralia<\/td>\n<\/tr>\n
U109 Adam Jay Lucas (New Zealand)<\/td>\nU109<\/td>\nAdam Jay<\/td>\nLucas<\/td>\nNew Zealand<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n

 <\/p>\n

The system will then learn a program that can perform the same splitting on the rest of the data:<\/p>\n\n\n\n\n\n\n\n\n\n\n
Input<\/th>\nSplit Column 1<\/th>\nSplit Column 2<\/th>\nSplit Column 3<\/th>\nSplit Column 4<\/th>\n<\/tr>\n<\/thead>\n
PE5 Leonard Robledo (Australia)<\/td>\nPE5<\/td>\nLeonard<\/td>\nRobledo<\/td>\nAustralia<\/td>\n<\/tr>\n
U109 Adam Jay Lucas (New Zealand)<\/td>\nU109<\/td>\nAdam Jay<\/td>\nLucas<\/td>\nNew Zealand<\/td>\n<\/tr>\n
R342 Carrie Dodson (United States)<\/td>\nR342<\/td>\nCarrie<\/td>\nDodson<\/td>\nUnited States<\/td>\n<\/tr>\n
TS51 Naomi Cole (Canada)<\/td>\nTS51<\/td>\nNaomi<\/td>\nCole<\/td>\nCanada<\/td>\n<\/tr>\n
Y722 Owen Murphy (United States)<\/td>\nY722<\/td>\nOwen<\/td>\nMurphy<\/td>\nUnited States<\/td>\n<\/tr>\n
UP335 Zoe Erin Rees (GB)<\/td>\nUP335<\/td>\nZoe Erin<\/td>\nRees<\/td>\nGB<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n

 <\/p>\n

As well as the ability to provide examples, Split.Text supports various other constraints, such as whether the user wants to keep the delimiters in separate columns or not.<\/p>\n","protected":false},"excerpt":{"rendered":"

Split.Text is a system for splitting data in plain text format, where there may be multiple fields that need to be separated into different columns.<\/p>\n","protected":false},"featured_media":674232,"template":"","meta":{"msr-url-field":"","msr-podcast-episode":"","msrModifiedDate":"","msrModifiedDateEnabled":false,"ep_exclude_from_search":false,"footnotes":""},"research-area":[13556,13554,13560],"msr-locale":[268875],"msr-impact-theme":[],"msr-pillar":[],"class_list":["post-665793","msr-project","type-msr-project","status-publish","has-post-thumbnail","hentry","msr-research-area-artificial-intelligence","msr-research-area-human-computer-interaction","msr-research-area-programming-languages-software-engineering","msr-locale-en_us","msr-archive-status-active"],"msr_project_start":"","related-publications":[],"related-downloads":[],"related-videos":[],"related-groups":[],"related-events":[],"related-opportunities":[],"related-posts":[],"related-articles":[],"tab-content":[{"id":0,"name":"Usage","content":"The Split.Text APIs are accessed through the SplitSession<\/code> class. The user can create a new SplitSession<\/code> object, add input data and various constraints to the session, and then call the Learn()<\/code> method to obtain a SplitProgram<\/code>. This is the program that is learnt from the given input data and constraints. 
The SplitProgram<\/code>\u2019s key method is the Run()<\/code> method, which executes the program to perform a split on any given text input.\r\n\r\nTo use Split.Text, one needs to reference Microsoft.ProgramSynthesis.Split.Text.dll<\/code>, Microsoft.ProgramSynthesis.Split.Text.Semantics.dll<\/code>,\r\nMicrosoft.ProgramSynthesis.Split.Text.Learning.dll<\/code>, Microsoft.ProgramSynthesis.Extraction.Text.Semantics.dll<\/code>, and Microsoft.ProgramSynthesis.Extraction.Text.Learning.dll<\/code>.\r\n\r\nThe complete code for the scenarios described in this walk-through is available in the Sample Project<\/a>, which illustrates our API usage.\r\n

Initializing the session<\/h2>\r\nThe user can create a new Split session and add the input data as follows:\r\n\r\n[code autolinks=\"true\" highlight=\"\"]\r\n\/\/ create a new ProseSplit session\r\nvar splitSession = new SplitSession();\r\n\r\n\/\/ add the input rows to the session\r\n\/\/ each input is a StringRegion object containing the text to be split\r\nvar inputs = new List&lt;StringRegion&gt; {\r\n SplitSession.CreateStringRegion(\"PE5 Leonard Robledo (Australia)\"),\r\n SplitSession.CreateStringRegion(\"U109 Adam Jay Lucas (New Zealand)\"),\r\n SplitSession.CreateStringRegion(\"R342 Carrie Dodson (United States)\")\r\n};\r\nsplitSession.Inputs.Add(inputs);[\/code]\r\n\r\nEach row of text in the input data is added as a StringRegion<\/code> object created from the text content in that row. If we want, we can also add constraints to the session to specify basic properties of the desired splitting, such as whether to include the delimiters in the resulting split. If we do not want delimiters in the output, we can specify this with a constraint as follows:\r\n\r\n[code autolinks=\"true\" highlight=\"\"]\r\nsplitSession.Constraints.Add(new IncludeDelimitersInOutput(false));\r\n[\/code]\r\n\r\nWe can clear any constraints provided in the session at any time by calling the splitSession.RemoveAllConstraints()<\/code> method.\r\n

Learning a new split program<\/h2>\r\nSplit.Text<\/strong> can learn a program using only the provided input data in a purely predictive fashion, without any examples or other output constraints. This can be done by simply calling the Learn()<\/code> function after adding the inputs.\r\n\r\n[code autolinks=\"true\" highlight=\"\"]\r\n\/\/ call the learn function to learn a splitting program from the given input examples\r\nSplitProgram program = splitSession.Learn();\r\n\r\n\/\/ check if the program is null (no program could be learnt from the given inputs)\r\nif (program == null)\r\n{\r\n Console.WriteLine(\"No program learned.\");\r\n return;\r\n}[\/code]\r\n\r\n
<\/div>\r\n

Serializing\/Deserializing a program<\/h2>\r\nThe SplitProgram.Serialize()<\/code> method serializes the learned program to a string. The SplitProgramLoader.Instance.Load()<\/code> method deserializes the program text to a program.\r\n\r\n[code autolinks=\"true\" highlight=\"\"]\r\n\/\/ serialize the learnt program and then deserialize\r\nstring progText = program.Serialize();\r\nprogram = SplitProgramLoader.Instance.Load(progText);[\/code]\r\n\r\n
<\/div>\r\n
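As a hedged sketch of how these two calls might be used to keep a learnt program between runs, the serialized text can be written to a file and read back with the standard .NET <code>File<\/code> helpers. The file name below is hypothetical and not part of the sample project:\r\n\r\n[code autolinks=\"true\" highlight=\"\"]\r\n\/\/ assumes: using System.IO; and the learnt program from the snippets above\r\n\/\/ \"split-program.txt\" is a hypothetical path chosen for illustration\r\nFile.WriteAllText(\"split-program.txt\", program.Serialize());\r\n\r\n\/\/ later, possibly in another process: reload the program without learning it again\r\nSplitProgram reloaded = SplitProgramLoader.Instance.Load(File.ReadAllText(\"split-program.txt\"));[\/code]\r\n\r\n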

Executing the learnt program<\/h2>\r\nThe learnt split program can be executed on any input StringRegion<\/code> to produce an array of SplitCell<\/code>s. For example, we can execute the learnt program on each of the inputs as follows:\r\n\r\n[code autolinks=\"true\" highlight=\"\"]\r\nSplitCell[][] splitResult =\r\ninputs.Select(input => program.Run(input)).ToArray();[\/code]\r\n\r\nEach SplitCell<\/code> object represents information about a single split cell. Its CellValue<\/code> field is the sub-region of the input that this split cell represents, and the IsDelimiter<\/code> flag indicates whether this split cell is a field or delimiter value. The learnt program can be executed independently of the Session<\/code> object on any new input text, and not just the inputs that have been entered into the session.\r\n\r\nExecuting the predictively learnt program on the three inputs given above, and having specified that delimiters should not be included in the output, we get the following splitting:\r\n\r\n\r\n\r\n\r\n\r\n
PE5<\/td>\r\nLeonard Robledo<\/td>\r\nAustralia<\/td>\r\n<\/tr>\r\n
U109<\/td>\r\nAdam Jay Lucas<\/td>\r\nNew Zealand<\/td>\r\n<\/tr>\r\n
R342<\/td>\r\nCarrie Dodson<\/td>\r\nUnited States<\/td>\r\n<\/tr>\r\n<\/tbody>\r\n<\/table>\r\n
<\/div>\r\n
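As a rough sketch (not taken from the sample project), the cells returned by <code>Run()<\/code> could be printed one row per input, keeping only the field cells. It assumes that <code>CellValue<\/code> exposes its text through the same <code>Value<\/code> property used on the input <code>StringRegion<\/code>s, and it guards against cells or cell values being null, which may or may not occur in practice:\r\n\r\n[code autolinks=\"true\" highlight=\"\"]\r\n\/\/ assumes: using System; using System.Linq; and the splitResult array from above\r\nforeach (SplitCell[] row in splitResult)\r\n{\r\n \/\/ keep only field cells; IsDelimiter marks cells that hold a delimiter\r\n var fields = row\r\n .Where(cell => cell?.IsDelimiter == false)\r\n .Select(cell => cell.CellValue?.Value ?? string.Empty);\r\n\r\n \/\/ print the fields of this row separated by a vertical bar\r\n Console.WriteLine(string.Join(\" | \", fields));\r\n}[\/code]\r\n\r\n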

Providing examples constraints<\/h2>\r\nIf the user desires a different split, then they can provide examples constraints<\/em> to specify what kind of split they would like. For instance, if the user wants to separate the first name into a different split cell, then they can provide examples on some of the input rows as follows:\r\n\r\n[code autolinks=\"true\" highlight=\"\"]\r\nsplitSession.Constraints.Add(new NthExampleConstraint(inputs[0].Value, 0, \"PE5\"));\r\nsplitSession.Constraints.Add(new NthExampleConstraint(inputs[0].Value, 1, \"Leonard\"));\r\nsplitSession.Constraints.Add(new NthExampleConstraint(inputs[0].Value, 2, \"Robledo\"));\r\nsplitSession.Constraints.Add(new NthExampleConstraint(inputs[0].Value, 3, \"Australia\"));\r\nsplitSession.Constraints.Add(new NthExampleConstraint(inputs[1].Value, 0, \"U109\"));\r\nsplitSession.Constraints.Add(new NthExampleConstraint(inputs[1].Value, 1, \"Adam\"));\r\nsplitSession.Constraints.Add(new NthExampleConstraint(inputs[1].Value, 2, \"Jay Lucas\"));\r\nsplitSession.Constraints.Add(new NthExampleConstraint(inputs[1].Value, 3, \"New Zealand\"));[\/code]\r\n\r\nEach NthExampleConstraint<\/code> takes three parameters: the input text on which the program will execute (the entire string), the index of the output split cell for which this example is being given, and the text value desired in that split cell. The examples constraints given above describe each of the four split cells that are desired for the first two inputs that have been given in this session. After calling Learn()<\/code> with these constraints, we obtain a program that produces the following output splitting on the three inputs given in this session:\r\n\r\n\r\n\r\n\r\n\r\n
PE5<\/td>\r\nLeonard<\/td>\r\nRobledo<\/td>\r\nAustralia<\/td>\r\n<\/tr>\r\n
U109<\/td>\r\nAdam<\/td>\r\nJay Lucas<\/td>\r\nNew Zealand<\/td>\r\n<\/tr>\r\n
R342<\/td>\r\nCarrie<\/td>\r\nDodson<\/td>\r\nUnited States<\/td>\r\n<\/tr>\r\n<\/tbody>\r\n<\/table>\r\n "}],"slides":[],"related-researchers":[{"type":"user_nicename","display_name":"Vu Le","user_id":39174,"people_section":"Section name 0","alias":"levu"}],"msr_research_lab":[],"msr_impact_theme":[],"_links":{"self":[{"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-project\/665793"}],"collection":[{"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-project"}],"about":[{"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/types\/msr-project"}],"version-history":[{"count":13,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-project\/665793\/revisions"}],"predecessor-version":[{"id":836638,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-project\/665793\/revisions\/836638"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/media\/674232"}],"wp:attachment":[{"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/media?parent=665793"}],"wp:term":[{"taxonomy":"msr-research-area","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/research-area?post=665793"},{"taxonomy":"msr-locale","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-locale?post=665793"},{"taxonomy":"msr-impact-theme","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-impact-theme?post=665793"},{"taxonomy":"msr-pillar","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-pillar?post=665793"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}