{"id":171415,"date":"2014-10-03T10:32:59","date_gmt":"2014-10-03T10:32:59","guid":{"rendered":"https:\/\/www.microsoft.com\/en-us\/research\/project\/nlpwin\/"},"modified":"2019-08-19T10:47:14","modified_gmt":"2019-08-19T17:47:14","slug":"nlpwin","status":"publish","type":"msr-project","link":"https:\/\/www.microsoft.com\/en-us\/research\/project\/nlpwin\/","title":{"rendered":"NLPwin"},"content":{"rendered":"
* on behalf of everyone who contributed to the development of NLPwin
NLPwin is a software project at Microsoft Research that aims to provide Natural Language Processing tools for Windows (hence, NLPwin). The project was started in 1991, just as Microsoft inaugurated the Microsoft Research group; while active development of NLPwin continued through 2002, it is still being updated regularly, primarily in service of Machine Translation.
NLPwin was, and is still, being used in a number of Microsoft products, among them the Index Server (1992-93), the Word Grammar Checker (parsing every sentence to logical form since 1996), the English Query feature for SQL Server (1998-2000), the natural language query interface for Encarta (1999, 2000), Intellishrink (2000), and, of course, Bing Translator.

Since we knew that we were developing NLPwin in part to support a grammar checker, the NLPwin grammar is designed to be broad-coverage (i.e., not domain-specific) and robust, in particular robust to grammar errors. While most grammars are learned from data annotated on the PennTreeBank, it is interesting to consider that such grammars may not be able to parse ungrammatical or fragmented input, since they have no training data for such input. The NLPwin grammar produces a parse for any input, and if no spanning parse can be assigned, it creates a "fitted" parse, combining the largest constituents that it was able to construct.

The NLP rainbow: we envisioned that with ever more sophisticated analysis capabilities, it would be possible to create a wide variety of applications. As you can see below, the generation component was not well developed, and we postulated NL applications for generation much as one hopes for a pot of gold at the end of the rainbow. Our first MT models transferred at the semantic level (papers through 2002), while today our MT transfers primarily at the syntactic level, using a mixture of syntax-based and phrase-based models.

Figure 1: The NLP rainbow (1991), our original vision for the NLP components needed and the applications possible.

The architecture follows a pipeline approach, as shown in Figure 2, where each component provides an additional layer of analysis/annotation of the input data. We designed the system to be relatively knowledge-poor in the beginning, while making use of richer and richer data sources as the need for more semantic information increased; one goal of this architecture was to preserve ambiguity until either we needed to resolve it or the data resources existed to allow its resolution. Thus, the syntactic analysis proceeds in two steps: the syntactic sketch (which today might be described as a packed forest) and the syntactic portrait, where we "unpack" the forest and construct a constituent level of analysis which is syntactic but also semantically valid. The constituency tree continues to be refined even during Logical Form processing, as more global information can be brought to bear.

Figure 2: The NLPwin components and a schematic of their output representation.
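To make the layered-annotation idea concrete, here is a minimal sketch of such a pipeline in Python, including the "fitted" fallback when no spanning parse can be found. All names (Analysis, toy_spanning_parse, and so on) are hypothetical; this is not the NLPwin implementation, which is a hand-authored rule system, only an illustration of the architecture described above.

```python
# Illustrative sketch of a layered-annotation pipeline in the spirit of Figure 2.
# Every name here is hypothetical; the real NLPwin components are rule-based.

from dataclasses import dataclass, field

@dataclass
class Analysis:
    text: str
    layers: dict = field(default_factory=dict)   # component name -> annotation layer

def morphology(a: Analysis) -> None:
    # Toy tokenizer standing in for morphological analysis.
    a.layers["tokens"] = a.text.split()

def toy_spanning_parse(tokens):
    # Pretend we can only span inputs that end with a period.
    return ("DECL", tokens) if tokens and tokens[-1].endswith(".") else None

def sketch(a: Analysis) -> None:
    # Stand-in for the syntactic sketch (roughly, a packed forest).
    tokens = a.layers["tokens"]
    parse = toy_spanning_parse(tokens)
    if parse is None:
        # "Fitted" parse: no spanning analysis was found, so combine the largest
        # constituents that could be built; every input receives some analysis.
        parse = ("FITTED", [("FRAG", [t]) for t in tokens])
    a.layers["sketch"] = parse

def portrait(a: Analysis) -> None:
    # Unpack the forest into one constituent analysis that is syntactically
    # and semantically valid (here simply passed through).
    a.layers["portrait"] = a.layers["sketch"]

def logical_form(a: Analysis) -> None:
    # Placeholder for the Logical Form graph computed from the portrait.
    a.layers["logical_form"] = {"nodes": [], "arcs": []}

def analyze(text: str) -> Analysis:
    a = Analysis(text)
    for stage in (morphology, sketch, portrait, logical_form):
        stage(a)
    return a

print(analyze("African elephants have large tusks.").layers["sketch"])  # spanning parse
print(analyze("large tusks ivory").layers["sketch"])                    # fitted fallback
```

The point of the sketch is only the shape of the control flow: each stage reads the layers added by earlier stages and writes its own, and robustness comes from the fallback rather than from requiring a spanning parse.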
A few points are worth making about the parser (a term which loosely combines the morphology, sketch, and portrait modules). First, the parser is comprised of human-authored rules. This will cause incredulity among those who are only familiar with machine-learned parsers trained on the PennTreeBank. It should be kept in mind that the NLPwin parser was constructed before the first parser was trained on the PennTreeBank, that the parser had to be fast (to support the grammar checker), and that rule writing was the norm for pre-PennTreeBank grammars. Furthermore, the grammarian tasked with writing rules was supported by a sophisticated array of NLP developer tools (created by George Heidorn), much as a programmer is now supported by Visual Studio: grammar rules could be run to and from specific points in the code, variables could be changed interactively for exploration, and, most importantly, the developer environment supported running a suite of test files, with interfaces for the grammarian to update the target files with improved parses. Second, the lead grammarian, Karen Jensen, broke with the implicit tradition in which the constituent structure is implied by the application of the parsing rules[1]. Jensen observed that binary rules are required to handle even common language phenomena such as free word order and adverbial and prepositional phrase placement. Thus, in NLPwin, we use binary rules in an augmented phrase structure grammar (APSG) formalism, computing the phrase structure as part of the actions of the rules: the rules remain binary, but the computed nodes can take an unbounded number of modifiers, as illustrated in Figure 3.

Figure 3: The derivation tree displays the history of rule application, while the computed tree provides a useful visualization of phrase structure.

Another important aspect of NLPwin is that it is the record structure, not the trees, that is the fundamental output of the analysis component (shown in Figure 4). Trees are merely a convenient form of display, using only 5 of the many attributes that make up the representation of the analysis: premodifiers (PRMODS), HEAD, postmodifiers (PSMODS), segment type (SEGTYPE), and the string value. Here is the record, a collection of attributes and values, for the node DECL1:

Figure 4: The record structure of any constituent is the heart of the NLPwin analysis.

Once the basic shape of the constituency tree has been determined, it is possible to compute the Logical Form. The goal of Logical Form is twofold: to compute the predicate-argument structure for each clause ("who did what to whom, when, where, and how?") and to normalize differing syntactic realizations of what can be considered the same "meaning". In so doing, concepts that are possibly distant in the sentence and in the constituent structure can be brought together, in large part because the Logical Form is represented as a graph, where linear order is no longer primary. The Logical Form is a directed, labeled graph: arcs are labeled with those relations that are defined to be semantic, while surface words that convey only syntactic information are represented not as nodes in the graph but as annotations on the nodes, preserving their syntactic information (not shown in the graph representation below). Consider the following Logical Form:

Figure 5: A Logical Form example.

The Logical Form graph in Figure 5 represents the direct connection between "elephants" and "have", which is interrupted by a relative clause in the surface syntax. Moreover, in analyzing the relative clause, Logical Form has performed two operations: it normalizes the passive construction and it assigns the referent of the relative pronoun "which". Other operations commonly performed by Logical Form include (but are not limited to) resolving unbounded dependencies, functional control, the indirect object paraphrase, and the assignment of modifiers.
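For a rough picture of what such a graph contains, consider a sentence along the lines of the Figure 5 example, e.g. "African elephants, which have been hunted for their ivory tusks, have large tusks" (a reconstruction; the original example is shown only in the figure). The sketch below uses illustrative relation labels and node names, not NLPwin's actual inventory, to show the long-distance subject link, the normalized passive, and the resolved relative pronoun.

```python
# Hedged sketch of a Logical Form as a directed, labeled graph. Relation labels
# (Dsub, Dobj) and node names are illustrative, not NLPwin's actual inventory.

logical_form = {
    "nodes": ["have1", "elephant1", "tusk1", "hunt1"],
    "arcs": [
        # Direct connection between "elephants" and "have", even though a
        # relative clause interrupts them in the surface string.
        ("have1", "Dsub", "elephant1"),
        ("have1", "Dobj", "tusk1"),
        # The passive "which have been hunted" is normalized: "elephant1" is the
        # deep object of "hunt1", with the relative pronoun "which" resolved to
        # its referent instead of appearing as a node of its own.
        ("hunt1", "Dobj", "elephant1"),
    ],
    # Surface words carrying only syntactic information (auxiliaries, etc.) are
    # kept as annotations on nodes rather than as nodes (feature names invented).
    "annotations": {"hunt1": {"Passive": True}},
}

def arcs_from(graph, node):
    """All (relation, target) pairs leaving a node."""
    return [(rel, tgt) for src, rel, tgt in graph["arcs"] if src == node]

print(arcs_from(logical_form, "have1"))  # [('Dsub', 'elephant1'), ('Dobj', 'tusk1')]
```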
Figure 5 also demonstrates some of the shortcomings of Logical Form: 1) should "have" be a concept node in this graph, or should it be interpreted as an arc labeled Part between "elephant" and "tusk"? More generally, what should the inventory of relation labels be, and how should that inventory be determined? And 2) should we infer from this sentence only that "African elephants have been hunted" and that "African elephants have large tusks", or can we infer that "elephants have been hunted" and that they happen to be "African elephants"? We postponed deciding this question of scoping until discourse processing[2], where such questions can be addressed; Logical Form does not represent the ambiguity in scoping.

During development of the NLPwin pipeline (see Figure 2), we considered that there would be a separate component determining word senses following the syntactic analysis of the input. This component was meant to select and/or collate lexical information from multiple dictionaries to represent and expand the lexical meaning of each content word. This view of Word Sense Disambiguation (WSD) was in contrast to the then-nascent interest in WSD in the academic community, which formulated the WSD task as selecting one sense from a fixed inventory of word senses as being correct. Our primary objection to this formulation was that any fixed inventory will necessarily be insufficient as the foundation for a broad-coverage grammar (see Dolan, Vanderwende and Richardson, 2000). For similar reasons, we elected to abandon the pursuit of assigning word senses in NLPwin as well. Today, the field has made great strides in exploring a more flexible notion of lexical meaning with the advent of vector-space representations, which it would be promising to combine with the output of this parser.

While we did not view Word Sense Disambiguation as a separate task, we did design our parser and subsequent components to make use of ever richer lexical information. The sketch grammar relies on the subcategorization frames and other syntactic-semantic codes available from two dictionaries: the Longman Dictionary of Contemporary English (LDOCE) and the American Heritage Dictionary, 3rd edition, for which Microsoft acquired the digital rights. LDOCE in particular provides rich lexical information that facilitates the construction of Logical Form[3]. Such codes, rich as they are, do not support the full semantic processing that is necessary when, for example, determining the correct attachment of prepositional phrases or resolving nominal co-reference. The question was: is it possible to acquire such semantic knowledge automatically, in order to support a broad-coverage parser?
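As a purely hypothetical illustration of how subcategorization codes can constrain parsing, the sketch below records a few invented verb frames and checks whether a proposed complement is licensed before a rule builds the corresponding constituent. The frame names and lexicon entries are made up for this example; they are not the actual LDOCE or NLPwin codes.

```python
# Hypothetical illustration only: invented frame names and entries, not the
# actual LDOCE grammar codes or the NLPwin lexicon format.

LEXICON = {
    "give":  {"frames": {"NP_NP", "NP_PP_to"}},   # ditransitive, or NP plus "to"-PP
    "sleep": {"frames": {"intrans"}},
    "hunt":  {"frames": {"NP", "intrans"}},
}

def licenses(verb: str, frame: str) -> bool:
    """Would a rule attaching this complement pattern to `verb` be licensed?"""
    entry = LEXICON.get(verb)
    return entry is not None and frame in entry["frames"]

# A rule proposing a direct object for "sleep" can be rejected early,
# while "hunt" + NP is allowed.
print(licenses("sleep", "NP"))  # False
print(licenses("hunt", "NP"))   # True
```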
In the early to mid-90s, there was considerable interest in mining dictionaries and other reference works for semantic information broadly speaking. For this reason, we envisioned that where lexical information was not sufficient to support the decisions that needed to be made in the Portrait component, we would acquire such information from machine-readable reference works.

At the time, few broad-coverage parsers were available, so the main thrust was to develop string patterns (regexes) that could be used to identify specific types of semantic information; Hearst (1992) describes the use of such patterns for the acquisition of Hypernymy (is-a terms). Alshawi (1989) parses dictionary definitions using a grammar especially designed for that dictionary ("Longmanese"). We had two concerns about this approach: first, as the need for greater recall increases, writing and refining string patterns becomes more and more complex, in the limit approaching the complexity of full grammar writing and so straying far from the straightforward string patterns one started with; and second, when extracting semantic relations beyond Hypernymy, we found string patterns to be insufficient (see Montemagni and Vanderwende 1992).

Instead, we proposed to parse the dictionary text using the linguistic components already developed, Sketch, Portrait, and Logical Form, ensuring access to robust parsing, in order to bootstrap the acquisition of the semantic knowledge needed to improve the Portrait. This bootstrapping is possible because some linguistic expressions are unambiguous, and so, at each iteration, we can extract from unambiguous text to improve the parsing of ambiguous text (see Vanderwende 1995).

As each definition in the dictionary and on-line encyclopedia was processed and the semantic information was stored for access by Portrait, a picture emerged from connecting all of the graph fragments. When viewed as a database rather than as a look-up table (which is how people use dictionaries), the graph fragments are connected and interesting paths/inferences emerge. To enrich the data further, we then took the step of viewing each graph fragment from the perspective of each content node. Imagine looking at the graph as a mobile and picking it up at each of the objects in turn: the nodes under that object remain the same, but the nodes above it become inverted (illustrated in Figure 6). For example, for the definition of elephant, "an animal with ivory tusks", MindNet stores not only the graph fragment "elephant PART (tusk MATR ivory)" but also "tusk PART-OF elephant" and "ivory MATR-OF tusk"[4].

Figure 6: Logical Form and its inversions.
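Operationally, the inversion illustrated in Figure 6 can be sketched as follows: represent the stored fragment for the elephant definition as relation triples and generate, for every triple, the view from its target node with the label suffixed "-OF". This is an illustrative Python snippet, not the MindNet implementation.

```python
# Illustrative sketch of MindNet-style inversion (not the actual implementation):
# each stored arc is also made available from the perspective of its target node.

# Graph fragment for "elephant": "an animal with ivory tusks"
fragment = [
    ("elephant", "PART", "tusk"),
    ("tusk", "MATR", "ivory"),
]

def invert(arcs):
    """Return the inverted view of each arc: (target, REL-OF, source)."""
    return [(tgt, rel + "-OF", src) for src, rel, tgt in arcs]

for arc in fragment + invert(fragment):
    print(arc)
# ('elephant', 'PART', 'tusk')
# ('tusk', 'MATR', 'ivory')
# ('tusk', 'PART-OF', 'elephant')
# ('ivory', 'MATR-OF', 'tusk')
```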
We called this collection of intersecting graphs MindNet. Figure 7 reflects the picture we saw for the word "bird" when looking at all of the pieces of information that were automatically gleaned from dictionary text:

Figure 7: A fragment of the NLPwin MindNet, centered on the word "bird".

For a person using only the dictionary, it would be very difficult to construct a list of all the different types of birds, all of the parts of a bird, all of the places where a bird may be found, or the types of actions that a bird may perform. But by converting the dictionary to a database and inverting all the semantic relations as shown in Figure 6, MindNet provides rich semantic information for any concept that occurs in text, especially because it is produced by automated methods using a broad-coverage grammar, a grammar that parses fragments as well as it parses complete grammatical input.
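To close with a hedged sketch of the "database, not look-up table" point: once fragments from many definitions are aggregated and every relation is also indexed from its target node, everything known about a concept such as "bird" can be gathered with a single lookup. The triples below are invented in the spirit of Figure 7; they are not actual MindNet contents.

```python
# Hedged sketch of querying an aggregated, inverted MindNet-like store.
# The triples are invented examples, not actual MindNet data.

from collections import defaultdict

TRIPLES = [
    ("robin", "HYP", "bird"),      # a robin is a kind of bird
    ("penguin", "HYP", "bird"),
    ("bird", "PART", "wing"),
    ("bird", "PART", "feather"),
    ("bird", "LOCN", "nest"),
]

# Index every triple, and its inversion, by word, so the collection behaves like
# a database centered on each concept rather than a list of definitions.
INDEX = defaultdict(list)
for src, rel, tgt in TRIPLES:
    INDEX[src].append((rel, tgt))
    INDEX[tgt].append((rel + "-OF", src))

def about(word):
    """Everything the store records about `word`, from either direction."""
    return INDEX[word]

print(about("bird"))
# [('HYP-OF', 'robin'), ('HYP-OF', 'penguin'),
#  ('PART', 'wing'), ('PART', 'feather'), ('LOCN', 'nest')]
```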