{"id":666933,"date":"2020-07-20T09:39:52","date_gmt":"2020-07-20T16:39:52","guid":{"rendered":"https:\/\/www.microsoft.com\/en-us\/research\/?post_type=msr-project&p=666933"},"modified":"2022-04-19T15:08:19","modified_gmt":"2022-04-19T22:08:19","slug":"prose-text-transformation","status":"publish","type":"msr-project","link":"https:\/\/www.microsoft.com\/en-us\/research\/project\/prose-text-transformation\/","title":{"rendered":"PROSE – Text Transformation"},"content":{"rendered":"

Transformation.Text<\/code><\/strong> is a system that performs string transformations by example, allowing many string-manipulation tasks to be performed automatically. Transformation.Text<\/code> is based on the same research as the FlashFill feature in Excel (opens in new tab)<\/span><\/a>, but extends it with semantic transformations involving dates and numbers and, as part of PROSE, with support for interactivity. The Usage (opens in new tab)<\/span><\/a> page and the
\n
Transformation.Text<\/code> sample project (opens in new tab)<\/span><\/a> show examples of how to use the Transformation.Text API.<\/p>\n

Example Transformation<\/h2>\n

Given an example like:<\/p>\n\n\n\n\n\n
Input1<\/th>\nInput2<\/th>\nExample output<\/th>\n<\/tr>\n<\/thead>\n
Greta<\/td>\nHermansson<\/td>\nHermansson, G.<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n

Transformation.Text will generate a program to perform the same transformation given any other first name, last name pair:<\/p>\n\n\n\n\n\n\n\n
Input1<\/th>\nInput2<\/th>\nProgram output<\/th>\n<\/tr>\n<\/thead>\n
Kettil<\/td>\nHansson<\/td>\nHansson, K.<\/td>\n<\/tr>\n
Etelka<\/td>\nBala<\/td>\nBala, E.<\/td>\n<\/tr>\n
\u2026<\/td>\n\u2026<\/td>\n\u2026<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n

 <\/p>\n","protected":false},"excerpt":{"rendered":"

Transformation.Text is a system that performs string transformations using examples allowing for many tasks involving strings to be performed automatically.<\/p>\n","protected":false},"featured_media":674232,"template":"","meta":{"msr-url-field":"","msr-podcast-episode":"","msrModifiedDate":"","msrModifiedDateEnabled":false,"ep_exclude_from_search":false,"footnotes":""},"research-area":[13556,13554,13560],"msr-locale":[268875],"msr-impact-theme":[],"msr-pillar":[],"class_list":["post-666933","msr-project","type-msr-project","status-publish","has-post-thumbnail","hentry","msr-research-area-artificial-intelligence","msr-research-area-human-computer-interaction","msr-research-area-programming-languages-software-engineering","msr-locale-en_us","msr-archive-status-active"],"msr_project_start":"","related-publications":[777094],"related-downloads":[],"related-videos":[],"related-groups":[],"related-events":[],"related-opportunities":[],"related-posts":[],"related-articles":[],"tab-content":[{"id":0,"name":"Usage","content":"The Transformation.Text<\/code> API is accessed through the Transformation.Text.Session<\/code> class. The primary methods are Constraints.Add()<\/code>, which adds examples (or other constraints) to a session, and Learn()<\/code>, which learns a Transformation.Text<\/code> program consistent with those examples. To use Transformation.Text<\/code>, you need assembly references to Microsoft.ProgramSynthesis.Transformation.Text.dll<\/code>,\r\nMicrosoft.ProgramSynthesis.Transformation.Text.Language.dll<\/code>, and\r\nMicrosoft.ProgramSynthesis.Transformation.Text.Semantics.dll<\/code>.\r\n

Basic usage<\/h2>\r\n\r\n[code autolinks=\"true\" highlight=\"\"]\r\nSession session = new Session();\r\nIEnumerable&lt;Example&gt; examples = new[]\r\n{\r\n new Example(new InputRow(\"Greta Hermansson\"), \"Hermansson, G.\")\r\n};\r\nsession.Constraints.Add(examples);\r\nProgram program = session.Learn();\r\nobject output = program.Run(new InputRow(\"Kettil Hansson\")); \/\/ output is \"Hansson, K.\"[\/code]\r\n\r\nThe examples are given as an IEnumerable<Example><\/code> with the input and the correct output. The input to\r\nTransformation.Text<\/code> is a row of a table of data which may include data from multiple columns. The InputRow<\/code> type lets you give a row as just a list of string<\/code>s without naming the columns. To get more control, implement Transformation.Text<\/code>'s IRow<\/code> interface.\r\n

One example with multiple strings<\/h3>\r\n\r\n[code autolinks=\"true\" highlight=\"\"]\r\nvar session = new Session();\r\nsession.Constraints.Add(new Example(new InputRow(\"Greta\", \"Hermansson\"), \"Hermansson, G.\"));\r\nProgram program = session.Learn();\r\nstring output = program.Run(new InputRow(\"Kettil\", \"Hansson\")) as string; \/\/ output is \"Hansson, K.\"[\/code]\r\n\r\n(While the API types the output of running a Transformation.Text<\/code> program as object<\/code>, the output type will always be string<\/code> (or null<\/code>) in the current version. The cast to string<\/code> is done using as string<\/code> to acknowledge that future versions of Transformation.Text<\/code> may support other return types.)\r\n

Multiple examples<\/h3>\r\nTransformation.Text<\/code> can be given multiple examples in order to generate a program that will generalize over differently formatted inputs. In this example, we give Transformation.Text<\/code> the same phone number in two different formats to normalize, and it is then able to normalize a phone number in a third, similar format.\r\n\r\n[code autolinks=\"true\" highlight=\"\"]\r\nvar session = new Session();\r\nvar examples = new[]\r\n{\r\n new Example(new InputRow(\"212-555-0183\"), \"212-555-0183\"),\r\n new Example(new InputRow(\"(212) 555 0183\"), \"212-555-0183\")\r\n};\r\nsession.Constraints.Add(examples);\r\nProgram program = session.Learn();\r\nstring output = program.Run(new InputRow(\"425 311 1234\")) as string; \/\/ output is \"425-311-1234\"[\/code]\r\n\r\nIf your input data is in multiple formats, you will likely have to provide more than one example. A common workflow is\r\nto have the user give a small number of examples, inspect the output (possibly on inputs suggested by the significant inputs feature), and provide additional examples if they discover an undesired result. The code for that workflow might look something like this:\r\n\r\n[code autolinks=\"true\" highlight=\"\"]\r\nvar session = new Session();\r\nsession.Constraints.Add(new Example(new InputRow(\"212-555-0183\"), \"212-555-0183\"));\r\nProgram program = session.Learn();\r\n\/\/ ... check program and find it does not work as desired.\r\nsession.Constraints.Add(new Example(new InputRow(\"(212) 555 0183\"), \"212-555-0183\"));\r\nprogram = session.Learn();\r\nstring output = program.Run(new InputRow(\"425 311 1234\")) as string; \/\/ output is \"425-311-1234\"[\/code]\r\n\r\n
<\/div>\r\n

Inputs without known outputs<\/h2>\r\nMost likely, when learning a program, you will have some idea of other inputs you intend to run the program on in the future. Transformation.Text<\/code> can take those inputs and use them to help decide which program to return.\r\n\r\n[code autolinks=\"true\" highlight=\"\"]\r\nvar session = new Session();\r\nsession.Inputs.Add(new InputRow(\"04\/02\/1962\"),\r\n new InputRow(\"27\/08\/1998\"));\r\nsession.Constraints.Add(new Example(new InputRow(\"02\/04\/1953\"), \"1953-04-02\"));\r\nProgram program = session.Learn();\r\nstring output = program.Run(new InputRow(\"31\/01\/1983\")) as string; \/\/ output is \"1983-01-31\"[\/code]\r\n\r\n
<\/div>\r\n

Learning multiple programs<\/h2>\r\nThere are usually a large number of programs consistent with any given set of examples. Transformation.Text<\/code> has a ranking scheme which it uses to return the most likely program for the examples it has seen, but in some cases this may not be the desired program.\r\n\r\nLearnTopK()<\/code> has a parameter k<\/code> which specifies how many programs it should try to learn; it returns the top k<\/code>\u00a0ranked programs (or programs with the top k<\/code> ranks if there are ties).\r\n\r\n[code autolinks=\"true\" highlight=\"\"]\r\nvar session = new Session();\r\nsession.Constraints.Add(new Example(new InputRow(\"Greta Hermansson\"), \"Hermansson\"));\r\n\/\/ Learn top 10 programs instead of just the single top program.\r\nIReadOnlyList&lt;Program&gt; programs = session.LearnTopK(k: 10);\r\nforeach (Program program in programs)\r\n{\r\n Console.WriteLine(program.Run(new InputRow(\"Kettil Hansson Smith\"))); \/\/ note that this input has a middle name\r\n}[\/code]\r\n\r\nThe first several programs output \u201cSmith\u201d, but after that one outputs \u201cHansson Smith\u201d. This could be used to ask the user which they meant or to do automated reranking of the top results based on some logic other than\r\nTransformation.Text<\/code>'s internal ranking system.\r\n\r\nTo specifically get the top distinct outputs, without needing to directly access the programs, use\r\nComputeTopKOutputsAsync()<\/code>:\r\n\r\n[code autolinks=\"true\" highlight=\"\"]\r\nvar session = new Session();\r\nsession.Constraints.Add(new Example(new InputRow(\"Greta Hermansson\"), \"Hermansson\"));\r\nIReadOnlyList&lt;object&gt; outputs = await session.ComputeTopKOutputsAsync(new InputRow(\"Kettil Hansson Smith\"), k: 10);\r\nforeach (object output in outputs)\r\n{\r\n Console.WriteLine(output);\r\n}[\/code]\r\n\r\n
<\/div>\r\n

Serializing programs<\/h2>\r\nSometimes you will want to learn a program in one session and run it on other data in a future session, or transfer learned programs between computers. To support this, PROSE can serialize programs:\r\n\r\n[code autolinks=\"true\" highlight=\"\"]\r\nvar session = new Session();\r\nsession.Constraints.Add(new Example(new InputRow(\"Kettil Hansson\"), \"Hansson, K.\"));\r\nProgram program = session.Learn();\r\n\/\/ Programs can be serialized using .Serialize().\r\nstring serializedProgram = program.Serialize();\r\n\/\/ Serialized programs can be loaded by another application that uses the Transformation.Text API, via .Load():\r\nProgram parsedProgram = Loader.Instance.Load(serializedProgram);\r\n\/\/ The program can then be run on new inputs:\r\nConsole.WriteLine(parsedProgram.Run(new InputRow(\"Etelka Bala\"))); \/\/ outputs \"Bala, E.\"[\/code]\r\n\r\n
<\/div>\r\n

API<\/h2>\r\n

Learning Transformation.Text<\/code> programs<\/h3>\r\nTo start, construct an empty Session<\/code>, which encapsulates learning a program for a single task, often refined over the course of multiple learning calls.\r\n\r\nThe collection of all known inputs should be provided using .Inputs.Add()<\/code>. Transformation.Text<\/code> can make good use of around one hundred inputs; providing over a thousand may cause performance issues for some operations, although the system will attempt to work on only a randomly selected sample when possible if too many inputs are provided. If you provide only a subset of the inputs, it should be representative of the inputs the program will be run on. The inputs provided can be accessed using .Inputs<\/code> and removed using .RemoveInputs()<\/code> or .RemoveAllInputs()<\/code>.\r\n\r\nThe main input to the learning procedure is a set of constraints<\/strong>, primarily examples, which are provided using .Constraints.Add()<\/code>. The following are common constraints used with Transformation.Text<\/code>:\r\n
    \r\n \t
  • Example<\/code><\/strong> (or Example<IRow, object><\/code>). The most common constraint. Asserts what the output should be for a specific input.<\/li>\r\n \t
  • DoesNotEqual<IRow, object><\/code><\/strong>. The opposite: for a specific input, gives a specific disallowed output.<\/li>\r\n \t
  • ColumnPriority<\/code><\/strong>. Used to specify which columns of the input to use. Useful if the IRow<\/code> implementation exposes many columns but only a few columns should be used by the program.<\/li>\r\n \t
  • OutputIs<IRow, object><\/code><\/strong>. Constrains the output to be of a specific semantic kind. Note that the .NET type of the output will still be string<\/code>; support for other .NET types in the output is expected in the future. The supported types for this constraint are NumberType<\/code>, PartialDateTimeType<\/code>, and FormattedPartialDateTimeType.<\/li>\r\n \t
  • See the Transformation.Text.Constraints<\/code> namespace for other constraints.<\/li>\r\n<\/ul>\r\nSession<\/code> has three different methods for learning (plus \u201cAsync<\/code>\u201d variants):\r\n
      \r\n \t
    • Learn()<\/code>\/LearnAsync()<\/code> returns the single top-ranked program as a Program<\/code>.<\/li>\r\n \t
    • LearnTopK()<\/code>\/LearnTopKAsync()<\/code> takes an integer k<\/code> and returns the top-k<\/code> ranked programs as an\r\nIReadOnlyList<Program><\/code>.<\/li>\r\n \t
    • LearnAll()<\/code>\/LearnAllAsync()<\/code> learns all programs consistent with the examples, giving the result compactly as a ProgramSet<\/code> (wrapped in an IProgramSetWrapper<\/code>).<\/li>\r\n<\/ul>\r\nTo run a Program<\/code>, use its Run()<\/code> method:\r\n\r\n[code autolinks=\"true\" highlight=\"\"]\r\npublic object Run(IRow input)\r\n[\/code]\r\n\r\nIf performance of running a single program on many inputs is an issue, then implementing the IIndexableRow<\/code> interface and using the Run(IIndexableRow)<\/code> variant may help."}],"slides":[],"related-researchers":[{"type":"user_nicename","display_name":"Vu Le","user_id":39174,"people_section":"Section name 0","alias":"levu"},{"type":"user_nicename","display_name":"Daniel Perelman","user_id":39453,"people_section":"Section name 0","alias":"danpere"},{"type":"user_nicename","display_name":"Clint Simon","user_id":40801,"people_section":"Section name 0","alias":"clsimon"},{"type":"user_nicename","display_name":"Ashish Tiwari","user_id":39171,"people_section":"Section name 
0","alias":"astiwar"}],"msr_research_lab":[],"msr_impact_theme":[],"_links":{"self":[{"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-project\/666933"}],"collection":[{"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-project"}],"about":[{"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/types\/msr-project"}],"version-history":[{"count":14,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-project\/666933\/revisions"}],"predecessor-version":[{"id":836656,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-project\/666933\/revisions\/836656"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/media\/674232"}],"wp:attachment":[{"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/media?parent=666933"}],"wp:term":[{"taxonomy":"msr-research-area","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/research-area?post=666933"},{"taxonomy":"msr-locale","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-locale?post=666933"},{"taxonomy":"msr-impact-theme","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-impact-theme?post=666933"},{"taxonomy":"msr-pillar","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-pillar?post=666933"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}