{"id":654579,"date":"2020-07-22T11:23:08","date_gmt":"2020-07-22T18:23:08","guid":{"rendered":"https:\/\/www.microsoft.com\/en-us\/research\/?post_type=msr-project&p=654579"},"modified":"2022-07-22T10:31:53","modified_gmt":"2022-07-22T17:31:53","slug":"prose-framework","status":"publish","type":"msr-project","link":"https:\/\/www.microsoft.com\/en-us\/research\/project\/prose-framework\/","title":{"rendered":"PROSE Framework"},"content":{"rendered":"
Microsoft PROSE is a framework of technologies for programming by examples: automatic generation of programs from input-output examples.<\/p>\n
Given a domain-specific language (DSL) and input-output examples for the desired program\u2019s behavior, PROSE synthesizes a ranked set of DSL programs that are consistent with the examples.<\/p>\n
\t\t\t
NuGet package for automatically synthesizing programs from examples.<\/p>
\t<\/div>\n\t \t
A step-by-step walkthrough of the process of building a DSL in PROSE and enabling program synthesis for it.<\/p>
\t<\/div>\n\t \t
If you want to apply an existing PROSE DSL, check out its documentation and our samples<\/a>.<\/p> \t<\/div>\n\t<\/p>\t\t\t<\/div>\n\t\t<\/div>\n\t\t Read the release notes for the SDK.<\/p> \t<\/div>\n\t \t For questions,\u00a0email us<\/a>\u00a0or\u00a0open an issue on GitHub<\/a>.<\/p> \t<\/div>\n\t\t Get datasets for some DSLs in our SDK.<\/p> \t<\/div>\n\t<\/p>\t\t\t<\/div>\n\t\t<\/div>\n\t\t<\/p>\n","protected":false},"excerpt":{"rendered":" Microsoft PROSE SDK is a framework of technologies for programming by examples: automatic generation of programs from input-output examples.<\/p>\n","protected":false},"featured_media":674232,"template":"","meta":{"msr-url-field":"","msr-podcast-episode":"","msrModifiedDate":"","msrModifiedDateEnabled":false,"ep_exclude_from_search":false,"_classifai_error":"","footnotes":""},"research-area":[13545,13560],"msr-locale":[268875],"msr-impact-theme":[],"msr-pillar":[],"class_list":["post-654579","msr-project","type-msr-project","status-publish","has-post-thumbnail","hentry","msr-research-area-human-language-technologies","msr-research-area-programming-languages-software-engineering","msr-locale-en_us","msr-archive-status-active"],"msr_project_start":"","related-publications":[641625,564657,335339,335864,453858],"related-downloads":[],"related-videos":[],"related-groups":[],"related-events":[],"related-opportunities":[],"related-posts":[],"related-articles":[],"tab-content":[{"id":0,"name":"Tutorial","content":"
\n\t\t\tTutorial<\/h2>\r\nThe core component of the PROSE SDK is its program synthesis framework for custom domain-specific languages (DSLs). It allows you to define a DSL that describes a typical space of tasks in your application domain, and automatically provides parsing, execution, and synthesis technologies for this DSL. Text transformation<\/a> and text extraction<\/a> DSLs are programming-by-example technologies that have been developed on top of the PROSE core synthesis framework.\r\n\r\nThis is a step-by-step tutorial on using the PROSE framework to create new DSLs<\/em>. To use an existing DSL in your programming-by-examples application, please check out its individual documentation link on the left.\r\n
DSL Structure<\/h3>\r\nA DSL consists of four main components:\r\n
\r\n \t
\r\n\r\nSee sections on: Syntax<\/strong> | Semantics<\/a> | Witness functions<\/a> | Features<\/a>\r\n\r\n
\r\n\r\nSyntax<\/h4>\r\nOur example DSL describes a space of programs that extract a substring from a given string. They can do it in two possible ways \u2013 either extract a substring based on absolute position indices, or based on matches of regular expressions.\r\n\r\nHere\u2019s the first version of the grammar of our DSL:\r\n\r\n[code autolinks=\"true\" highlight=\"\"]\r\n#reference 'file:SubstringExtraction.dll';\r\nusing semantics SubstringExtraction.Semantics;\r\nlanguage SubstringExtraction;\r\n\r\n@input string x;\r\n\r\n\/\/ Extract a substring from 'x' between positions 'posPair'\r\n@start string out := Substring(x, posPair);\r\nTuple <int?, int?> posPair := PositionPair(pos, pos)\r\n = Pair(pos, pos);\r\nint? pos := \/\/ A position at index 'k' (from the left if k >= 0, or from the right if k < 0)\r\n AbsolutePosition(x, k)\r\n\t\/\/ A position where two regexes 'positionBoundaries' match to left and to the right,\r\n\t\/\/ respectively, and it is the 'k'-th such position\r\n\t| RegexPosition(x, positionBoundaries, k);\r\nTuple <Regex, Regex> positionBoundaries := RegexPair(r, r)\r\n = Pair(r, r);\r\n\t\r\nRegex r;\r\nint k;\r\n[\/code]\r\n\r\nHere are some possible extraction programs contained in the
SubstringExtraction<\/code> DSL:\r\n
\r\n \t
\"a) Bread ($2.00)\"<\/code>). This problem can be addressed by a language extension.<\/li>\r\n<\/ul>\r\nIn general, a DSL consists of symbols<\/strong> and operators<\/strong> upon these symbols. In a context-free grammar, a DSL is represented as a set of rules<\/em>, where each symbol on the left-hand side is bound to a set of possible operators on the right-hand side that represent this symbol. Every operator must be pure<\/strong> \u2013 it should not have observable side effects. In other words, PROSE DSLs are functional \u2013 they operate upon immutable states.\r\n\r\nA program<\/strong> in a DSL transforms input<\/strong> data into output<\/strong> data. One terminal symbol in a DSL should be marked as
@input<\/code> \u2013 this is the input variable to all programs in this DSL. One nonterminal symbol in a DSL should be marked as
@start<\/code> \u2013 this is the root symbol for all programs in the DSL.\r\n\r\nA program is represented as an abstract syntax tree (AST)<\/strong> of the DSL operators. Each node in this tree (similarly, each DSL operator) has some invocation semantics. More formally, each AST node has a method
Invoke<\/code>. It takes as input a state<\/strong> σ and returns some output.\r\n\r\nA state is a mapping from DSL variables to their values. Initially, the topmost AST node invoked on a state with a single variable binding for the DSL\u2019s input variable<\/em>.\r\n\r\nHere\u2019s how you can parse and execute a program in our
SubstringExtraction<\/code> DSL:\r\n\r\n[code autolinks=\"true\" highlight=\"\"]\r\nusing Microsoft.ProgramSynthesis;\r\nusing Microsoft.ProgramSynthesis.AST;\r\nusing Microsoft.ProgramSynthesis.Compiler;\r\n\r\n\/\/ Parse the DSL grammar above, saved in a .grammar file\r\nvar grammar = DSLCompiler.ParseGrammarFromFile(\"SubstringExtraction.grammar\").Value;\r\n\/\/ Parse a program in this grammar. PROSE supports two serialization formats:\r\n\/\/ \"human-readable\" expression format, used in this tutorial, and machine-readable XML.\r\nvar ast =\r\n ProgramNode.Parse(\"Substring(x, PositionPair(AbsolutePosition(x, 0), AbsolutePosition(x, 5)))\",\r\n grammar, ASTSerializationFormat.HumanReadable);\r\n\/\/ Create an input state to the program. It contains one binding: a variable 'x' (DSL input)\r\n\/\/ is bound to the string \"PROSE Rocks\".\r\nvar input = State.Create(grammar.InputSymbol, \"PROSE Rocks\");\r\n\/\/ Execute the program on the input state.\r\nvar output = (string) ast.Invoke(input);\r\nAssert(output == \"PROSE\");\r\n[\/code]\r\n\r\nA
ParseGrammarFromFile<\/code> invocation returns a
Result<Grammar><\/code>, which either holds a valid parsed grammar in its
Value<\/code> property, or a collection of errors\/warnings in its
Diagnostics<\/code> property. You can quickly examine all diagnostics by calling
result.TraceDiagnostics()<\/code>.\r\n\r\nAt this moment grammar parsing will fail since we haven\u2019t defined any execution semantics for the operators in our DSL, only its syntax. Let\u2019s fix that.\r\n
\r\n\r\nSemantics<\/h4>\r\nAn executable semantics<\/em> for an operator\u00a0F<\/strong> in PROSE is a .NET function that matches the signature of F<\/strong>. You need to provide it for every operator in your DSL that is not imported from the standard library of PROSE or another language. In our example, such operators are
Substring<\/code>,
AbsolutePosition<\/code>, and
RegexPosition<\/code>.\r\n\r\nAll semantics functions should be defined as static methods. Static classes where PROSE searches for such functions (called semantics holders<\/em>) are indicated in the grammar with a
using semantics<\/code> declaration. A DSL may contain multiple
using semantics<\/code> declarations, but each operator should correspond to exactly one semantics function with the same name and signature in one of semantics holders.\r\n\r\n[code autolinks=\"true\" highlight=\"\"]\r\nusing System.Collections.Generic;\r\nusing System.Linq;\r\nusing System.Text.RegularExpressions;\r\n\r\nstatic class Semantics\r\n{\r\n static string Substring(string x, Tuple<int?, int?> posPair)\r\n {\r\n if (posPair.Item1 == null || posPair.Item2 == null)\r\n return null;\r\n int start = posPair.Item1.Value;\r\n int end = posPair.Item2.Value;\r\n if (start < 0 || start >= x.Length ||\r\n end < 0 || end >= x.Length || end < start)\r\n return null;\r\n return x.Substring(start, end - start);\r\n }\r\n\r\n static int? AbsolutePosition(string x, int k)\r\n {\r\n if (k > x.Length || k < -x.Length - 1)\r\n return null;\r\n return k >= 0 ? k : (x.Length + k + 1);\r\n }\r\n\r\n static int? RegexPosition(string x, Tuple<Regex, Regex> regexPair, int occurrence)\r\n {\r\n if (regexPair.Item1 == null || regexPair.Item2 == null)\r\n return null;\r\n Regex left = regexPair.Item1;\r\n Regex right = regexPair.Item2;\r\n var rightMatches = right.Matches(x).Cast<Match>().ToDictionary(m => m.Index);\r\n var matchPositions = new List<int>();\r\n foreach (Match m in left.Matches(x))\r\n {\r\n if (rightMatches.ContainsKey(m.Index + m.Length))\r\n matchPositions.Add(m.Index + m.Length);\r\n }\r\n if (occurrence >= matchPositions.Count ||\r\n occurrence < -matchPositions.Count)\r\n return null;\r\n return occurrence >= 0\r\n ? matchPositions[occurrence]\r\n : matchPositions[matchPositions.Count + occurrence];\r\n }\r\n}\r\n[\/code]\r\n\r\nThis code illustrates several important points that you should keep in mind when designing a DSL:\r\n
\r\n \t
null<\/code> instead. In PROSE,
null<\/code> is a valid value with a meaning \u201ccomputation failed\u201d.<\/li>\r\n \t
x<\/code> is a parameter for both
AbsolutePosition<\/code> and
RegexPosition<\/code>.<\/li>\r\n<\/ul>\r\n
Note:<\/strong> The
dslc<\/code> grammar compiler uses reflection to find definitions of external components of a grammar, such as semantics functions. It searches over the assemblies specified with
reference<\/code> statements in the grammar. Those assemblies must be built and present at given locations when you execute
dslc<\/code> (in a command-line or API form). If you build your semantics functions and your grammar definition in the same solution, make sure to separate them into different projects and make the grammar project depend on the semantics project, so that the latter one is built first.\r\n\r\nIf you generate your project template using our Yeoman generator<\/a>, its projects will be pre-arranged in this fashion automatically.<\/blockquote>\r\nSyntax and semantics above constitute a minimal DSL definition. They are sufficient for our little parsing\/execution sample to work. Let\u2019s proceed now to synthesizing programs in this DSL.\r\n\r\n
\r\n\r\nSynthesis<\/h4>\r\nPROSE comes with a default synthesis strategy which we call deductive backpropagation<\/a>. It also enables researchers in the field of synthesis to develop their own strategies on top of its common API. However, in this tutorial we explain how to use backpropagation for synthesis of programs in our
SubstringExtraction<\/code> DSL.\r\n\r\nProgram synthesis starts with a specification: what do we want from a desired program? In PROSE, specifications are inductive<\/em>: they specify an output of a desired program on some input state, or, more generally, some property of this output.\r\n\r\nIn this tutorial, we start with the simplest form of such a spec \u2013
ExampleSpec<\/code>, an input-output example. Given a spec, we invoke a learning session on it, generating a set of programs in the DSL that are consistent with the input-output examples in the spec.\r\n\r\n[code autolinks=\"true\" highlight=\"\"]\r\nusing Microsoft.ProgramSynthesis.Specifications;\r\nusing Microsoft.ProgramSynthesis.Learning;\r\n\r\nvar input = State.Create(grammar.InputSymbol, \"PROSE Rocks\");\r\nstring desiredOutput = \"PROSE\";\r\nvar spec = new ExampleSpec(new Dictionary<State, object> { [input] = desiredOutput });\r\nvar engine = new SynthesisEngine(grammar);\r\nProgramSet learned = engine.LearnGrammar(spec);\r\nAssert(learned?.Size > 0);\r\n[\/code]\r\n\r\nAt this moment the learning result will be empty. The reason for this is that PROSE does not have information about your DSL to perform any kind of reasoning over it. For instance, terminal symbols
k<\/code> and
r<\/code> should be replaced with literal
int<\/code> or
Regex<\/code> constants, respectively, in each generated program. However, they are seemingly unbounded: any integer or regular expression in the world could possibly be present in a desired program, thus our search space is effectively infinite.\r\n\r\nWhat about the specification, then?\r\nAn input-output example that we provided drastically restricts the search space size. For instance, the input string
\"PROSE Rocks\"<\/code> is 11 characters long, hence any absolute position extraction logic
AbsolutePosition(x, k)<\/code> with\u00a0k<\/strong> > 12 or\u00a0k<\/strong> < -11 cannot be consistent with the spec. What we just did here was backwards reasoning over the DSL structure: we deduced a constraint on
k<\/code> in a desired program from a constraint on the entire program.\r\n\r\nTo do that, we essentially inverted the semantics of
AbsolutePosition<\/code><\/em>, deducing its inputs (or their properties) given the output. In PROSE, such a procedure is called a witness function<\/em>, and it is a surprisingly simple way to specify immensely powerful hints for the learning process.\r\n