Split.Text is a system for splitting data in plain text format, where there may be multiple fields that need to be separated into different columns. The Usage (opens in new tab) page and the Split.Text
sample project (opens in new tab) show examples of how to use the Split.Text API. The Split.Text system supports purely predictive as well as interactive techniques to learn programs for splitting textual data.
Predictive Splitting
The predictive learning technique attempts to infer a program given only the input data and no other constraints from the user (such as output examples). It analyses the properties of the input data to infer the most regular pattern of fields and delimiters that have good alignment with one another. For instance, if we are given the following input data without any output examples:
Input |
---|
PE5 Leonard Robledo (Australia) |
U109 Adam Jay Lucas (New Zealand) |
R342 Carrie Dodson (United States) |
TS51 Naomi Cole (Canada) |
Y722 Owen Murphy (United States) |
UP335 Zoe Erin Rees (GB) |
Split.Text will predictively generate a program to perform the following three-column splitting:
Split Column 1 | Split Column 2 | Split Column 3 |
---|---|---|
PE5 | Leonard Robledo | Australia |
U109 | Adam Jay Lucas | New Zealand |
R342 | Carrie Dodson | United States |
TS51 | Naomi Cole | Canada |
Y722 | Owen Murphy | United States |
UP335 | Zoe Erin Rees | GB |
In this case it determines the space as well as open/close brackets as probable delimiters given the pattern in the inputs. However, not all occurrences of the space character is a delimiter, as there are varying number of spaces inside the person names (some including middle names) and countries as well. Hence we cannot simply split by all spaces. The Split.Text DSL and learning algorithm handles such scenarios by analyzing the patterns within the inferred data fields as well as supporting contextual delimiters, which look at data patterns around occurrences of possible delimiting substrings. More information about the DSL and learning techniques can be found in our recent publication on predictive program synthesis. (opens in new tab)
Interactive Splitting
The predictive inference of Split.Text can handle many common practical scenarios for text splitting. However, in many cases different users may have different preferences for the kind of splitting they want, especially with respect to how they want to split a particular field into subfields. For example, in the above scenario, one user may want to separate the first names into a separate column while another may prefer to have just the last name in its own column. Split.Text supports such scenarios with interactive features that permit the user to provide various constraints on the program that will be learnt.
The most powerful constraint is to provide examples of the desired splitting on some inputs. For instance, if the user wants first names to be split into a separate column, she may provide the following examples on the first two inputs:
Input | Split Column 1 | Split Column 2 | Split Column 3 | Split Column 4 |
---|---|---|---|---|
PE5 Leonard Robledo (Australia) | PE5 | Leonard | Robledo | Australia |
U109 Adam Jay Lucas (New Zealand) | U109 | Adam | Jay Lucas | New Zealand |
The system will then learn a program that can perform the same splitting on the rest of the data:
Input | Split Column 1 | Split Column 2 | Split Column 3 | Split Column 4 |
---|---|---|---|---|
PE5 Leonard Robledo (Australia) | PE5 | Leonard | Robledo | Australia |
U109 Adam Jay Lucas (New Zealand) | U109 | Adam | Jay Lucas | New Zealand |
R342 Carrie Dodson (United States) | R342 | Carrie | Dodson | United States |
TS51 Naomi Cole (Canada) | TS51 | Naomi | Cole | Canada |
Y722 Owen Murphy (United States) | Y722 | Owen | Murphy | United States |
UP335 Zoe Erin Rees (GB) | UP335 | Zoe | Erin Rees | GB |
If another user wants last names to be in a separate column, then he can similarly provide the corresponding examples to achieve that splitting:
Input | Split Column 1 | Split Column 2 | Split Column 3 | Split Column 4 |
---|---|---|---|---|
PE5 Leonard Robledo (Australia) | PE5 | Leonard | Robledo | Australia |
U109 Adam Jay Lucas (New Zealand) | U109 | Adam Jay | Lucas | New Zealand |
The system will then learn a program that can perform the same splitting on the rest of the data:
Input | Split Column 1 | Split Column 2 | Split Column 3 | Split Column 4 |
---|---|---|---|---|
PE5 Leonard Robledo (Australia) | PE5 | Leonard | Robledo | Australia |
U109 Adam Jay Lucas (New Zealand) | U109 | Adam Jay | Lucas | New Zealand |
R342 Carrie Dodson (United States) | R342 | Carrie | Dodson | United States |
TS51 Naomi Cole (Canada) | TS51 | Naomi | Cole | Canada |
Y722 Owen Murphy (United States) | Y722 | Owen | Murphy | United States |
UP335 Zoe Erin Rees (GB) | UP335 | Zoe Erin | Rees | GB |
As well as the ability to provide examples, Split.Text supports various other constraints, such as whether the user wants to keep the delimiters in separate columns or not.