{"id":801424,"date":"2021-12-09T14:02:58","date_gmt":"2021-12-09T22:02:58","guid":{"rendered":"https:\/\/www.microsoft.com\/en-us\/research\/?post_type=msr-blog-post&p=801424"},"modified":"2021-12-09T14:58:06","modified_gmt":"2021-12-09T22:58:06","slug":"designing-a-framework-for-conversational-interfaces","status":"publish","type":"msr-blog-post","link":"https:\/\/www.microsoft.com\/en-us\/research\/articles\/designing-a-framework-for-conversational-interfaces\/","title":{"rendered":"Designing a Framework for Conversational Interfaces"},"content":{"rendered":"

*This is a guest post from our close partners, Semantic Machines.*

**By Zachary Tellman**

Conversational interfaces are an idea that is forever on the cusp of transforming the world. The potential is undeniable: everyone has innate, untapped conversational expertise. We could do away with the nested menus required by visual interfaces; anything the user can name is immediately at hand. We could turn natural language into a declarative scripting language, and operating systems into an IDE.

Reality, however, has not lived up to this potential. Most people’s use of the conversational agents in their phones and smart devices is limited to reminders and timers, if they use them at all. At Semantic Machines, however, we are creating a framework for conversational interfaces that we hope will unlock some of this potential. It’s currently powering a conversational interface in Outlook Mobile, with other products soon to follow.

To accomplish this, our framework combines some of the latest advances in machine learning with concepts and approaches dating back to the earliest days of artificial intelligence research. To understand why, we’ll first need to look back fifty years to one of the first — and still one of the most successful — conversational agents ever created.

## A Detour Through History

In 1972, Terry Winograd published a paper entitled *Understanding Natural Language*, which described a software project he had worked on in the late 1960s. It allowed users to direct a virtual robot arm, named SHRDLU, to interact with a world consisting of a table, a box, and a few blocks of varying shapes and colors. Users could carry on a conversation with SHRDLU, asking questions and giving instructions:

> – Pick up a big red block.
> OK.
>
> – Find a block which is taller than the one you are holding and put it into the box.
> BY “IT”, I ASSUME YOU MEAN THE BLOCK WHICH IS TALLER THAN THE ONE I AM HOLDING.
> OK.
>
> – What does the box contain?
> THE BLUE PYRAMID AND THE BLUE BLOCK.
>
> – How many blocks are not in the box?
> FOUR OF THEM.
>
> – Is at least one of them narrower than the one which I told you to pick up?
> YES, THE RED CUBE.
>
> – Is it supported?
> YES, BY THE TABLE.
>
> – Can the table pick up blocks?
> NO.
>
> – Can a pyramid be supported by a block?
> YES.
>
> – Can a pyramid support a pyramid?
> I DON’T KNOW.
>
> – Stack up two pyramids.
> (trying) I CAN’T.

Winograd’s project represents a pivotal point in the history of AI research. Earlier efforts were significantly more ambitious; Herbert Simon and Allen Newell’s “General Problem Solver,” introduced in 1958, was presented not just as a method for achieving human-like behavior, but also as a descriptive model for human cognition. As became the norm for early AI research, Simon and Newell reduced the problem to one of search. Given an initial state and a desired end state, the Solver would search through all possible sequences of actions until it found one that led to that end state. Since the branching factor of the search tree would be very high — you can, in most situations, do almost anything — the Solver would need to use heuristics (from the Greek *heureka*, as in “I’ve found it!”) to determine which actions were likely to be useful in a given situation.

Having described the engine for thought, all that remained was “knowledge engineering”: creating a repository of possible actions and relevant heuristics for all aspects of human life. This, unfortunately, proved harder than expected. As various knowledge engineering projects stalled, researchers focused on problem solving within “microworlds”: virtual environments where the state was easily represented, and the possible actions easily enumerated. Winograd’s microworld was the greatest ever created; SHRDLU’s mastery of its environment, and the subset of the English language that could be used to describe it, was self-evident.

Still, it wasn’t clear how to turn a microworld into something more useful; the boundaries of SHRDLU’s environment were relied upon at every level of its implementation. Hubert Dreyfus, a professor of philosophy and leading critic of early AI research, characterized these projects as “ad hoc solutions [for] cleverly chosen problems, which give the illusion of complex intellectual activity.” Ultimately, Dreyfus was proven right; every attempt to generalize or stitch together these projects failed.

What came next is a familiar story: funding for research dried up in the mid-1970s, marking the beginning of the AI Winter. After some failed attempts in the 1980s to commercialize past research by selling so-called “expert systems,” the field lay dormant for decades before the resurgence of the statistical techniques generally referred to as “machine learning.”

Generally, this era in AI research is seen as a historical curiosity: a group of researchers made wildly optimistic predictions about what they could achieve, and failed. What could they possibly have to teach us? Surely it’s better to look forward to the bleeding edge of research than back at these abandoned microworlds.

We must acknowledge, however, the astonishing sophistication of Winograd’s SHRDLU when compared to modern conversational agents. These agents operate on a model called “slots and intents,” which is effectively Mad-Libs in reverse. Given some text from the user (the **utterance**), the system identifies the corresponding template (the **intent**), and then extracts out pieces of the utterance (the **slots**). These pieces are then fed into a function which performs the task associated with the intent.

If, for example, we had a function `order_pizza(size, toppings)`, a slots-and-intents framework can easily provide a mapping between “order me a medium pizza with pepperoni and mushrooms” and `order_pizza("medium", ["pepperoni", "mushrooms"])`. It allows us to separate linguistic concerns from the actual business logic required to order a pizza.
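To make the model concrete, here is a minimal sketch of such a framework in Scala. Everything in it is invented for illustration; in particular, real systems use trained models for intent classification and slot extraction, where the hard-coded `parse` below is only a stand-in:

```scala
// A minimal slots-and-intents sketch. Real frameworks use trained models
// for intent classification and slot extraction; this stand-in only
// illustrates the shape of the mapping.
case class Intent(name: String, slots: Map[String, String])

object SlotsAndIntents {
  // The business logic, cleanly separated from linguistic concerns.
  def orderPizza(size: String, toppings: List[String]): Unit =
    println(s"Ordering a $size pizza with ${toppings.mkString(" and ")}")

  // Stand-in for the NLU layer: classify the utterance, fill the slots.
  def parse(utterance: String): Intent =
    Intent("order_pizza", Map(
      "size"     -> "medium",
      "toppings" -> "pepperoni,mushrooms"))

  // Every intent maps onto exactly one function.
  def dispatch(intent: Intent): Unit = intent.name match {
    case "order_pizza" =>
      orderPizza(intent.slots("size"), intent.slots("toppings").split(',').toList)
    case other =>
      println(s"Sorry, I don't understand '$other'")
  }

  def main(args: Array[String]): Unit =
    dispatch(parse("order me a medium pizza with pepperoni and mushrooms"))
}
```

But consider the second utterance from the conversation with SHRDLU: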

> Find a block which is taller than the one you are holding and put it into the box.

This utterance is difficult to model as an intent for a number of reasons. It describes two actions, but since every intent maps onto a single function, we’d have to define a compound function `find_block_and_put_into_box(...)` and define similar functions for any other compound action we’d want to support. But even that’s not enough; if we simply call `find_block_and_put_into_box("taller than the one you are holding")`, we’re letting linguistic concerns bleed into the business logic. At most, we’d want the business logic to be interpreting individual words like “taller,” “narrower,” and so on, but that would require an even more specific function:

```
find_block_which_is_X_than_held_block_and_put_in_box("taller")
```

The problem is that natural language is compositional, while slots-and-intents frameworks are not. Rather than defining a set of primitives (“find a block,” “taller than,” “held block,” etc.) that can be freely combined, the developer must enumerate each configuration of these primitives they wish to support. In practice, this leads to conversational agents that are narrowly focused and easily confused.
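The difference is easy to see in code. Below is a hedged sketch, with all names invented for illustration, of what composable primitives might look like: compound requests fall out of ordinary function composition, with no need to enumerate a `find_block_and_put_into_box`-style intent for every combination:

```scala
// Hypothetical composable primitives for a blocks world (illustrative only).
case class Block(height: Double, held: Boolean = false)

object BlocksWorld {
  val blocks = List(Block(1.0, held = true), Block(2.0), Block(3.0))

  // Primitives: each corresponds to a single natural-language concept.
  def heldBlock: Block = blocks.find(_.held).get  // assumes one block is held
  def tallerThan(a: Block, b: Block): Boolean = a.height > b.height
  def findBlock(pred: Block => Boolean): Option[Block] = blocks.find(pred)
  def putInBox(b: Block): Unit = println(s"Putting block (h=${b.height}) in box")

  // "Find a block which is taller than the one you are holding and put it
  // into the box" is just a composition of those primitives:
  def main(args: Array[String]): Unit =
    findBlock(b => tallerThan(b, heldBlock)).foreach(putInBox)
}
```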

Winograd’s SHRDLU, despite its limitations, was far more flexible. At Semantic Machines we are building a dialogue system that will preserve that flexibility, while avoiding most of the limitations. This post will explain, at a high level, how we’ve accomplished that feat. If you find this problem space or our approach interesting, you should consider working with us.

## Plans

In our dialogue system, utterances are translated into small programs, which for historical reasons are called **plans**. Given the problematic utterance:

> Find a block which is taller than the one you are holding and put it into the box.

Our planning model, which is a Transformer-based encoder-decoder neural network, will return something like this:

```
find_block((b: Block) => taller_than(b, held_block()))
put_in_box(the[Block]())
```

This is rendered in Express, an in-house language which is syntactically modeled after Scala. Notice that each symbol in the plan corresponds almost one-to-one with a part of the utterance, down to a special `the()` function which resolves what “it” refers to. This is because we only want the planning model to *translate* the utterance, not *interpret* it.
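Express itself is internal to Semantic Machines, but because it is modeled on Scala, the plan above reads as ordinary Scala against a library of roughly this shape. These stubs are guesses at the shape of the target for illustration, not the real API:

```scala
// Illustrative signatures for the functions appearing in the plan above.
// Express is an in-house language; these Scala stubs only suggest shape.
trait Block { def height: Double }

object PlanLibrary {
  def held_block(): Block = ???                        // block the arm is holding
  def taller_than(a: Block, b: Block): Boolean = ???   // comparison primitive
  def find_block(pred: Block => Boolean): Block = ???  // search the world state
  def put_in_box(b: Block): Unit = ???

  // Resolves a reference ("it", "that") against the conversational context;
  // the type parameter constrains what kind of value may be resolved.
  def the[T](): T = ???

  // The plan emitted for the utterance, verbatim:
  def plan(): Unit = {
    find_block((b: Block) => taller_than(b, held_block()))
    put_in_box(the[Block]())
  }
}
```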

The reason for this focus on translation isn’t immediately obvious; to most experienced developers, a function like `taller_than` would seem like an unnecessary layer of indirection. Why not just inline it?

```
find_block((b: Block) => b.height > held_block().height)
```

This indirection, however, is valuable. In a normal codebase, function names aren’t exposed; we can assign them any meaning we like, so long as it makes sense to other people on our team. Conversely, these functions are an interface between our system and the user, and so their meaning is defined by the user’s intent. Over time, that meaning is almost certain to become more nuanced. We may, for instance, realize that when people say “taller than,” they mean *noticeably* taller:

```
def taller_than(a: Block, b: Block) = (a.height - b.height) > HEIGHT_EPSILON
```

If we’ve maintained our layer of indirection, this is an easy one-line change to our function definition, and the training dataset for the planning model remains unchanged. If we’ve inlined the function, however, we have to carefully migrate our training dataset; we only want to update `a.height > b.height` where it corresponds to “taller than” in the utterance.

By focusing on translation, we keep our training data timeless, allowing our dataset to monotonically grow even as we tinker with semantics. By matching each natural language concept to a function, we keep our semantics explicit and consistent. This approach, however, assumes the meaning is largely context-independent. Our planning model is constrained by the language’s type system, so if the utterance doesn’t mention blocks it won’t use block-related functions, but otherwise we assume that “taller than” can always be translated into `taller_than`.

This, of course, is untrue for anaphora like “it,” “that,” or “them”; their meaning depends entirely on what was said earlier in the conversation. In our system, all such references are translated into a call to `the()`. This is possible because the Express runtime retains the full execution, including all intermediate results, of every plan in the current conversation. This data, stored as a dataflow graph, represents our conversational context: things which we’ve already discussed, and may want to reference later. Certain special functions, such as `the()`, can query that graph, searching for the expression which is being referenced.
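To give a feel for the mechanism, here is a deliberately simplified sketch: prior results are kept in a most-recent-first log (the real runtime stores a full dataflow graph with provenance), and `the` resolves to the most recent value of the requested type. Everything here is invented for illustration:

```scala
import scala.reflect.ClassTag

// Simplified stand-in for conversational context. The real runtime keeps a
// dataflow graph of every plan's intermediate results; this flat log only
// illustrates type-directed reference resolution.
case class Block(height: Double)

class Context {
  private var history: List[Any] = Nil

  // Record a plan's result, most recent first.
  def record[T](value: T): T = {
    history = value :: history
    value
  }

  // `the[T]` finds the most recently mentioned value of type T.
  def the[T](implicit ct: ClassTag[T]): Option[T] =
    history.collectFirst { case t: T => t }
}

object ContextDemo {
  def main(args: Array[String]): Unit = {
    val ctx = new Context
    ctx.record(Block(2.0))   // an earlier result
    ctx.record("pyramid")    // a more recent result
    assert(ctx.the[Block].contains(Block(2.0)))
    assert(ctx.the[String].contains("pyramid"))
  }
}
```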

In SHRDLU, these anaphora were resolved during its parse phase, which transformed utterances into its own version of a plan. Resolution, however, is not always determined by the grammatical structure of the utterance; sometimes we need to understand its semantics. Consider these two commands: