Microsoft Search, Assistant and Intelligence Articles http://approjects.co.za/?big=en-us/research/

PyMarlin: A lightweight library that improves deep learning training agility http://approjects.co.za/?big=en-us/research/articles/pymarlin-a-lightweight-library-that-improves-deep-learning-training-agility/ Thu, 16 Dec 2021 22:47:39 +0000

By Amin Saied, Ananth Rao, Ashwin Srinivasan, Damien Jose, Eduardo Gonzalez, Han Yu, Jon Sleep, Krishan Subudhi, Shruti Gullapuram

PyMarlin is a lightweight PyTorch extension library for agile experimentation. It was designed to simplify the end-to-end deep learning experimentation lifecycle, agnostic of the compute environment. In July 2021, the PyMarlin team open-sourced their internal model training library to all PyTorch users. PyMarlin abstracts away the boilerplate code for scaling, logging, and argument parsing that is crucial for training deep learning-based models, and can be thought of as a high-level abstraction over PyTorch. We have created a five-minute “Getting Started” module for anyone interested in trying out PyMarlin. Today we’ll look at how PyMarlin works, how it supports extensibility, and the next steps needed to advance its functionality further.

How the typical deep learning training lifecycle works


Figure 1: Typical deep learning training steps

These three steps (and their sub-steps) are the backbone of any typical deep learning model training lifecycle. But this process also involves writing a lot of code and testing. Since scientists and researchers focus mostly on the model training part, they generally write other components without following any design pattern. This makes the training code difficult to extend.

For example, let’s say a researcher has written code for text summarization, including all the code necessary for scaling and logging. A fellow researcher wants to try out a new optimizer. Another colleague wants to experiment with new evaluation metrics and loss functions. And yet another scientist wants to use the same recipe but on different data. In this case, all the stakeholders make separate copies of the code and make their own modifications. But then, suppose the original researcher changes the encoder and decoder architecture and arrives at a better model. The other stakeholders may all have to change their ML code. What a waste of everyone’s time!

Speeding up training with Distributed Data Parallel (DDP) and mixed precision can also introduce bugs. For example, when using multiple GPUs and multiple nodes, the batch size per GPU must be reduced to maintain the same global batch size, which can involve manual, error-prone calculation of the minibatch size or the number of gradient accumulation steps. During the validation step, the outputs from multiple GPUs need to be gathered to calculate evaluation metrics accurately. Adding an optimization such as disabling all-reduce during gradient accumulation can speed up training further. And in mixed precision training using PyTorch’s native amp module, gradients must be unscaled before they can be clipped.
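
As an illustration of the kind of boilerplate involved, here is a rough sketch (not PyMarlin code) of mixed precision training with gradient accumulation using PyTorch’s native amp module, where gradients are unscaled before clipping. The tiny model, optimizer, and synthetic data are placeholders used only to make the snippet self-contained.

import torch
from torch import nn
from torch.nn.utils import clip_grad_norm_

model = nn.Linear(16, 2).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = nn.CrossEntropyLoss()
scaler = torch.cuda.amp.GradScaler()
accumulation_steps = 4  # e.g., global_batch_size / (per_gpu_batch_size * num_gpus)

# Synthetic data stands in for a real DataLoader.
dataloader = [(torch.randn(8, 16).cuda(), torch.randint(0, 2, (8,)).cuda()) for _ in range(16)]

for step, (inputs, labels) in enumerate(dataloader):
    with torch.cuda.amp.autocast():
        loss = loss_fn(model(inputs), labels) / accumulation_steps
    scaler.scale(loss).backward()
    if (step + 1) % accumulation_steps == 0:
        scaler.unscale_(optimizer)                # gradients must be unscaled before clipping
        clip_grad_norm_(model.parameters(), 1.0)  # clip in the true gradient scale
        scaler.step(optimizer)
        scaler.update()
        optimizer.zero_grad()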

There are many open source libraries that provide functionality similar to PyMarlin. In fact, some of them have extra features that can come in quite handy: the Hugging Face Trainer, for example, supports other logging frameworks like wandb, but it is not model agnostic. PyMarlin, however, offers unique benefits. We focused on keeping the code simple and easily readable. PyMarlin is not designed to be a black box: power users will be able to understand PyMarlin’s code and extend it as necessary.

PyMarlin at a glance

A brief look at the architecture[1]

PyMarlin has four core components: DataProcessor and DataInterface, Module Interface, Trainer Backend, and Trainer. First, we’ll look at the DataProcessor and DataInterface. Their role is to decouple data processing and dataset building from model training. The DataProcessor processes and optionally analyzes data. Users can have multiple data processors, which can be chained together. DataInterface has abstract methods that the ModuleInterface calls to obtain train and validation datasets during training.

The Module Interface is where scientists and researchers write their training code. This module can be thought of as the implementation of the training recipe. Module Interface inherits from nn.Module and hence can be treated like any PyTorch module.

The Trainer Backend is responsible for training/validating the Module Interface for one entire epoch. PyMarlin offers various useful backend implementations, such as SingleProcess, SingleProcessAmp, and DDPTrainerBackend.

Finally, the Trainer serves as the bridge between the Trainer Backend and Module Interface: it takes care of device management, rank fetching, checkpointing, and reloading. It also handles the calculations for mini batch size, gradient accumulation, and the number of remaining epochs, initializes stats writers like TensorBoard, and restarts training from a previous state.


Figure 2: Steps to follow while writing code using PyMarlin

A PyMarlin Deep Dive

Beyond its four core components, PyMarlin has additional features to assist coders. We’ll first explore the core components in greater depth, then look at the supporting features.

  1. DataProcessor and DataInterface
    The DataProcessor modules aim to support most large-scale preprocessing requirements. A DataProcessor can be seen as a single step of processing, such as reading files. Multiple DataProcessors can be used sequentially, each covering one preprocessing step. Once the business logic is added in the process() function, built-in multiprocessing support can be easily leveraged. The business logic in the DataProcessor’s process function can be invoked on a single compute target, locally, or even as a distributed job across nodes. It comes with built-in support for AML (Azure Machine Learning). It also allows for selective preprocessing: for example, with a large dataset you could decide how many parts it should be split into and choose which part to process at a time on a single node.

    This example covers pre-processing raw Wikipedia data. In that example, which splits sentences for 27 Wikipedia raw text files, we see the following time savings:

    • Single node, without multi-processing: 2.5 hours
    • Single node, with multi-processing: 20 minutes
    • Multi-node (4), with multi process: 13 minutes
  2. ModuleInterface
    This interface contains the model architecture in the form of a PyTorch nn.Module, together with optimizers and schedulers, train and validation step recipes, and any callbacks. Scientists need to implement the abstract functions to create a training recipe, and this recipe can be further extended, too. In general, ModuleInterface takes a DataInterface instance as input. DataInterface is called upon to return the datasets which ModuleInterface uses to create DataLoaders. The forward function is overridden and replaced with two functions — train_step and val_step — to differentiate training and validation loop code. ModuleInterface also inherits from CallBackInterface. Users can optionally override callbacks like on_end_val_epoch() to calculate metrics. We created an example ModuleInterface for further reference.
  3. TrainerBackend
    In PyMarlin, we’ve made distributed (DDP specifically) training, as well as FP16 training, easy by implementing them as backend trainers. You can use them by setting the trainer backend string (this can also be done similarly by setting trainer.backend in the YAML config) as follows:

    trainer = Trainer(module_interface=MyModuleInterface(...), backend="ddp-amp")
    trainer.train()
    

    Behind the scenes, many things are happening, such as loss scaling for FP16 and distributed output collection that would otherwise require dozens of lines of boilerplate code copy-pasted from scenario to scenario. Each trainer backend file is separated for modularity. Having multiple trainer backends makes the code extendable and clutter-free. Currently we support the following backends:

    1. SingleProcess (sp)
      To train on a single CPU or GPU.
    2. SingleProcessAmp (sp-amp)
      To train on a single GPU with mixed precision. Recommended for V100 or A100 GPUs.
    3. SingleProcessApexAmp (sp-amp-apex)
      Same as SingleProcessAmp, but uses the NVIDIA Apex library instead of native amp.
    4. DDPTrainerBackend
      A decorator that can convert any of the other backends to work in a distributed data parallel setting.
      backend = DDPTrainerBackend(SingleProcessAmp())

    More information can be found in the documentation.

  4. Trainer
    The Trainer is responsible for coordinating the model definition (ModuleInterface) and the TrainerBackend, connecting the high-level model recipe with the backend on which it will be trained. The Trainer can scale to multiple processes: it automatically fetches ranks (from torch.distributed.launch or Azure ML MPI environment variables) to scale PyMarlin training from one GPU to multiple GPUs, and passes them to the backend as shown in Figure 3. Other frameworks can also be integrated easily. For example, if users are spawning multiple processes using a custom script or a framework other than Azure ML, they can write a function to fetch the ranks, create an instance of DistributedTrainingArguments, and pass it as a TrainerArgument.

    The Trainer also handles moving the ModuleInterface to the right device: after fetching the ranks, PyMarlin moves the ModuleInterface to the local_rank GPU. Inputs to ModuleInterface’s train_step and val_step are not moved to the device; the user is responsible for moving them. Users can extend and modify the `to` function to change the device movement behavior; because `to` is a torch.nn.Module function, proper care must be taken before overriding it. Model parallelism is not supported out of the box but can be achieved by writing custom code for rank fetching and model movement.

    Trainer lifecycle

    Figure 3: Trainer lifecycle

     

  5. Stats and Loggers
    We have implemented a wrapper on TensorBoard’s SummaryWriter for logging stats, which makes it easy to save TensorBoard events and visualize them later to track the progress of your training experiment. We also have Azure ML and stdout writers for writing stats to the logs. Users can create their own writers and pass them to the trainer. Currently PyMarlin supports three writers out of the box: StdOut, Tensorboard, and AML.
  6. Checkpointer
    Checkpointing is made simple with a built-in checkpoint utility module. The default implementation saves the state of the ModuleInterface, TrainerBackend, and Trainer at every epoch, and loads any available model checkpoint at the start of training. Users can control the save and load directories via arguments, customize the frequency of checkpointing, and easily resume training after any number of steps, because optimizers, schedulers, and additional information are stored as part of the default checkpoint. In line with the goal of offering flexibility and extensibility, users can also implement their own checkpointers by extending the abstract Checkpointer class for custom checkpointing logic. As shown in Figure 4, users can save any states or variables from the three core classes mentioned by overriding the get_state() and update_state() methods of each class, which are called by the checkpointer at each save() and load() respectively. For an example of how to implement a custom checkpointer, please visit our documentation site.

    Checkpointing design

    Figure 4: Checkpointing design

     

  7. AML integration
    AML integration works out of the box using the Azure ML (MPI) launcher. The Trainer fetches ranks from the MPI environment variables, which are set when Azure ML is used to spawn the nodes and processes. Use the right backend (DDP, DDP-AMP) to ensure distributed training on AML compute.
  8. Yaml parser
    It’s hard to keep track of the many hyperparameters in deep learning experiments. To ease this, we offer a custom arguments parser that allows you to maintain a YAML file containing all parameters. Values that need to be overridden during experimentation can also be passed via the command line. For example, if this is your YAML file with the following config:

    trainer:
        backend: "sp"
        train_batch_size: 32
        val_batch_size: 16
        epochs: 1 # Total epochs to run.
    

    You can modify this while running the script via command line as:

    python Myscript.py --trainer.backend "sp-amp"

    The simplest example of PyMarlin in action is CIFAR image classification, for which we have a Colab notebook. We recommend following along there: CIFAR.ipynb – Colaboratory (google.com). The notebook goes through the major steps of the workflow for creating a PyMarlin scenario (a minimal end-to-end sketch follows this list). These are:

    • Data preprocessing and analysis (through implementing DataProcessor and DataInterface)
    • Defining the model, training setup (optimizers, LR scheduler), and validation metrics through ModuleInterface
    • Starting training by initializing instances of DataInterface, ModuleInterface, and Trainer, then executing Trainer.train()
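
To make these steps concrete, here is a heavily abridged sketch of how the pieces might be wired together. It is illustrative only: the import paths beyond pymarlin.core.data_interface, the method names on DataInterface and ModuleInterface (get_train_dataset, get_optimizers_schedulers, and so on), and the Trainer signature are assumptions that may not match the actual PyMarlin API exactly; the Colab notebook and documentation show the real interfaces.

import torch
from torch.utils.data import TensorDataset
from pymarlin.core import data_interface, module_interface, trainer as trainer_module

class MyDataInterface(data_interface.DataInterface):
    # Hands the datasets produced by a DataProcessor to the training recipe.
    def __init__(self, datasets):
        super().__init__()
        self.datasets = datasets

    def get_train_dataset(self):  # method name assumed for illustration
        return self.datasets['Train']

    def get_val_dataset(self):    # method name assumed for illustration
        return self.datasets['Test']

class MyModuleInterface(module_interface.ModuleInterface):
    # The training recipe: model, optimizers/schedulers, and train/val steps.
    def __init__(self, data):
        super().__init__()
        self.data = data
        self.net = torch.nn.Linear(784, 10)  # placeholder model

    def get_optimizers_schedulers(self, *args, **kwargs):  # name assumed for illustration
        return [torch.optim.Adam(self.net.parameters())], []

    def train_step(self, global_step, batch, device):
        inputs, labels = batch
        return torch.nn.functional.cross_entropy(self.net(inputs), labels)

    def val_step(self, global_step, batch, device):
        inputs, labels = batch
        return self.net(inputs), labels

# Stand-in for the output of a DataProcessor (step 1 above).
datasets = {
    'Train': TensorDataset(torch.randn(64, 784), torch.randint(0, 10, (64,))),
    'Test': TensorDataset(torch.randn(16, 784), torch.randint(0, 10, (16,))),
}

data = MyDataInterface(datasets)
module = MyModuleInterface(data)
trainer = trainer_module.Trainer(module_interface=module, backend="sp")
trainer.train()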

Extensibility

The biggest benefit of PyMarlin is its extensibility.

You can change the dataset but keep the training recipe the same (for example, CIFAR to MNIST). To reuse an image classification model for another task, you can keep your ModuleInterface almost the same and just implement a new data interface. Switching to MNIST from our CIFAR example is simple with the torchvision package, and the changes from the CIFAR example are minimal:

import numpy as np
import matplotlib.pyplot as plt
import torchvision
import torchvision.transforms as transforms

from pymarlin.core import data_interface

class MNISTDataProcessor(data_interface.DataProcessor):
    def process(self):
        '''
        Downloads and caches the MNIST data.
        Normalizes the data and creates torch datasets
        '''
        transform = transforms.Compose(
            [transforms.ToTensor(),
             transforms.Normalize((0.5,), (0.5,))])  # MNIST images have a single channel
        trainset = torchvision.datasets.MNIST(root='./data', train=True,
                                        download=True, transform=transform)
        testset = torchvision.datasets.MNIST(root='./data', train=False,
                                       download=True, transform=transform)
        self.datasets = {'Train': trainset, 'Test': testset}
        return self.datasets

    def analyze(self):
        '''
        Displays size of train and test data sets.
        Prints a few images and their labels from the train dataset
        '''
        datasets = self.datasets
        print(f'train data size = {len(datasets["Train"])}')
        print(f'val data size = {len(datasets["Test"])}')
        print('Examples')
        random_indices = np.random.choice(range(len(datasets['Train'])), 5, False)
        sample_images = [datasets['Train'][i][0] for i in random_indices]
        sample_labels = [datasets['Train'][i][1] for i in random_indices]
        self._imshow(torchvision.utils.make_grid(sample_images))
        classes = [str(digit) for digit in range(10)]
        print('| '.join('%5s' % classes[sample_labels[j]] for j in range(len(sample_labels))))

    def _imshow(self, img):
        img = img / 2 + 0.5     # unnormalize
        npimg = img.numpy()
        plt.figure(figsize=(10, 5))
        plt.imshow(np.transpose(npimg, (1, 2, 0)))  # height x width x channels
        plt.show()


You can also change the model architecture but keep the data the same. Using the CIFAR example above, this would still be incredibly simple with torchvision, assuming we’d keep the optimizer and everything else the same:

from torchvision.models import resnet18

def __init__(self, data_interface):
    super().__init__()  # always initialize superclass first
    self.data_interface = data_interface
    self.net = resnet18(pretrained=True)

All PyMarlin code is modular and extendable. We encourage the open source community to contribute new additions to the library. The BART CNN/DailyMail Summarization example shows the process for creating a new trainer backend for ONNXRuntimeTraining: PyMarlin/ORT_README.md at main · microsoft/PyMarlin (github.com).

Future Roadmap

The PyMarlin team plans to support more trainer backends that will enable users who want to train large models through DeepSpeed model parallelism. Differential privacy training is another feature we plan to support. We are always looking for ways to further expand and improve PyMarlin, and we welcome contributions from the community! You can find our external contribution guidelines here.

PyMarlin team – we’re hiring

We are a group of applied scientists and engineers (Amin Saied, Ananth Rao, Ashwin Srinivasan, Damien Jose, Eduardo Gonzalez, Han Yu, Jon Sleep, Krishan Subudhi, Shruti Gullapuram, Alejandro Stevenson-Duran, Manash Goswami) from Microsoft Office and Azure who are enthusiastic about running extensible and scalable code and working on large-scale language model pretraining and finetuning for enterprise scenarios. If this type of work interests you, the PyMarlin team in MSAI is hiring both scientists and engineers! Please visit our careers page.


1. More information can be found in our documentation.

Designing a Framework for Conversational Interfaces http://approjects.co.za/?big=en-us/research/articles/designing-a-framework-for-conversational-interfaces/ Thu, 09 Dec 2021 22:02:58 +0000

This is a guest post from our close partners, Semantic Machines.

By Zachary Tellman

Conversational interfaces are an idea that is forever on the cusp of transforming the world. The potential is undeniable: everyone has innate, untapped conversational expertise. We could do away with the nested menus required by visual interfaces; anything the user can name is immediately at hand. We could turn natural language into a declarative scripting language, and operating systems into an IDE.

Reality, however, has not lived up to this potential. Most people’s use of the conversational agents in their phones and smart devices is limited to reminders and timers, if they use them at all. At Semantic Machines, however, we are creating a framework for conversational interfaces that we hope will unlock some of this potential. It’s currently powering a conversational interface in Outlook Mobile, with other products soon to follow.

To accomplish this, our framework combines some of the latest advances in machine learning with concepts and approaches dating back to the earliest days of artificial intelligence research. To understand why, we’ll first need to look back fifty years to one of the first — and still one of the most successful — conversational agents ever created.

A Detour Through History

In 1972, Terry Winograd published a paper entitled Understanding Natural Language which described a software project he had worked on in the late 1960s. It allowed users to direct a virtual robot arm, named SHRDLU, to interact with a world consisting of a table, a box, and a few blocks of varying shapes and colors. Users could carry on a conversation with SHRDLU, asking questions and giving instructions:

– Pick up a big red block.
OK.

– Find a block which is taller than the one you are holding and put it into the box.
BY “IT”, I ASSUME YOU MEAN THE BLOCK WHICH IS TALLER THAN THE ONE I AM HOLDING.
OK.

– What does the box contain?
THE BLUE PYRAMID AND THE BLUE BLOCK.

– How many blocks are not in the box?
FOUR OF THEM.

– Is at least one of them narrower than the one which I told you to pick up?
YES, THE RED CUBE.

– Is it supported?
YES, BY THE TABLE.

– Can the table pick up blocks?
NO.

– Can a pyramid be supported by a block?
YES.

– Can a pyramid support a pyramid?
I DON’T KNOW.

– Stack up two pyramids.
(trying) I CAN’T.

Winograd’s project represents a pivotal point in the history of AI research. Earlier efforts were significantly more ambitious; Herbert Simon and Alan Newell’s “General Problem Solver,” introduced in 1958, was presented not just as a method for achieving human-like behavior, but also as a descriptive model for human cognition. As became the norm for early AI research, Simon and Newell reduced the problem to one of search. Given an initial state and a desired end state, the Solver would search through all possible sequences of actions until it found one that led to that end state. Since the branching factor of the search tree would be very high — you can, in most situations, do almost anything — the Solver would need to use heuristics (from the Greek heureka, as in “I’ve found it!”) to determine which actions were likely to be useful in a given situation.

Having described the engine for thought, all that remained was “knowledge engineering:” creating a repository of possible actions and relevant heuristics for all aspects of human life. This, unfortunately, proved harder than expected. As various knowledge engineering projects stalled, researchers focused on problem solving within “microworlds:” virtual environments where the state was easily represented, and the possible actions easily enumerated. Winograd’s microworld was the greatest ever created; SHRDLU’s mastery of its environment, and the subset of the English language that could be used to describe it, was self-evident.

Still, it wasn’t clear how to turn a microworld into something more useful; the boundaries of SHRDLU’s environment were relied upon at every level of its implementation. Hubert Dreyfus, a professor of philosophy and leading critic of early AI research, characterized these projects as “ad hoc solutions [for] cleverly chosen problems, which give the illusion of complex intellectual activity.” Ultimately, Dreyfus was proven right; every attempt to generalize or stitch together these projects failed.

What came next is a familiar story: funding for research dried up in the mid-1970s, marking the beginning of the AI Winter. After some failed attempts in the 1980s to commercialize past research by selling so-called “expert systems,” the field lay dormant for decades before the resurgence of the statistical techniques generally referred to as “machine learning.”

Generally, this era in AI research is seen as a historical curiosity; a group of researchers made wildly optimistic predictions about what they could achieve and failed. What could they possibly have to teach us? Surely it’s better to look forward to the bleeding edge of research than back at these abandoned microworlds.

We must acknowledge, however, the astonishing sophistication of Winograd’s SHRDLU when compared to modern conversational agents. These agents operate on a model called “slots and intents”, which is effectively Mad-Libs in reverse. Given some text from the user (the utterance), the system identifies the corresponding template (the intent), and then extracts out pieces of the utterance (the slots). These pieces are then fed into a function which performs the task associated with the intent.

If, for example, we had a function order_pizza(size, toppings), a slots-and-intents framework can easily provide a mapping between “order me a medium pizza with pepperoni and mushrooms” and order_pizza("medium", ["pepperoni", "mushrooms"]). It allows us to separate linguistic concerns from the actual business logic required to order a pizza. But consider the second utterance from the conversation with SHRDLU:

Find a block which is taller than the one you are holding and put it into the box.

This utterance is difficult to model as an intent for a number of reasons. It describes two actions, but since every intent maps onto a single function, we’d have to define a compound function find_block_and_put_into_box(...) and define similar functions for any other compound action we’d want to support. But even that’s not enough; if we simply call find_block_and_put_into_box("taller than the one you are holding"), we’re letting linguistic concerns bleed into the business logic. At most, we’d want the business logic to be interpreting individual words like “taller,” “narrower,” and so on, but that would require an even more specific function:

find_block_which_is_X_than_held_block_and_put_in_box("taller")

The problem is that natural language is compositional, while slots-and-intents frameworks are not. Rather than defining a set of primitives (“find a block,” “taller than,” “held block,” etc.) that can be freely combined, the developer must enumerate each configuration of these primitives they wish to support. In practice, this leads to conversational agents that are narrowly focused and easily confused.

Winograd’s SHRDLU, despite its limitations, was far more flexible. At Semantic Machines we are building a dialogue system that will preserve that flexibility, while avoiding most of the limitations. This post will explain, at a high level, how we’ve accomplished that feat. If you find this problem space or our approach interesting, you should consider working with us.

Plans

In our dialogue system, utterances are translated into small programs, which for historical reasons are called plans. Given the problematic utterance:

Find a block which is taller than the one you are holding and put it into the box.

Our planning model, which is a Transformer-based encoder-decoder neural network, will return something like this:

find_block((b: Block) => taller_than(b, held_block()))
put_in_box(the[Block]())

This is rendered in Express, an in-house language which is syntactically modeled after Scala. Notice that each symbol in the plan corresponds almost one-to-one with a part of the utterance, down to a special the() function which resolves what “it” refers to. This is because we only want the planning model to translate the utterance, not interpret it.

The reason for this isn’t immediately obvious; to most experienced developers, a function like taller_than would seem like an unnecessary layer of indirection. Why not just inline it?

find_block((b: Block) => b.height > held_block().height)

This indirection, however, is valuable. In a normal codebase, function names aren’t exposed; we can assign them any meaning we like, so long as it makes sense to other people on our team. Conversely, these functions are an interface between our system and the user, and so their meaning is defined by the user’s intent. Over time, that meaning is almost certain to become more nuanced. We may, for instance, realize that when people say “taller than,” they mean noticeably taller:

def taller_than(a: Block, b: Block) = (a.height - b.height) > HEIGHT_EPSILON

If we’ve maintained our layer of indirection, this is an easy one-line change to our function definition, and the training dataset for the planning model remains unchanged. If we’ve inlined the function, however, we have to carefully migrate our training dataset; we only want to update a.height > b.height where it corresponds to “taller than” in the utterance.

By focusing on translation, we keep our training data timeless, allowing our dataset to monotonically grow even as we tinker with semantics. By matching each natural language concept to a function, we keep our semantics explicit and consistent. This approach, however, assumes the meaning is largely context-independent. Our planning model is constrained by the language’s type system, so if the utterance doesn’t mention blocks it won’t use block-related functions, but otherwise we assume that “taller than” can always be translated into taller_than.

This, of course, is untrue for anaphoric references like “it,” “that,” or “them;” their meaning depends entirely on what was said earlier in the conversation. In our system, all such references are translated into a call to the(). This is possible because the Express runtime retains the full execution, including all intermediate results, of every plan in the current conversation. This data, stored as a dataflow graph, represents our conversational context: things which we’ve already discussed, and may want to reference later. Certain special functions, such as the(), can query that graph, searching for the expression which is being referenced.

In SHRDLU, these references were resolved during its parse phase, which transformed utterances into its own version of a plan. Resolution, however, is not always determined by the grammatical structure of the utterance; sometimes we need to understand its semantics. Consider these two commands:

  • Put the red block beneath the green block, and the pyramid on top of it
  • Put the red block above the green block, and the pyramid on top of it

Common sense tells us that the pyramid should go on whichever block is above the other. To act on this common sense, SHRDLU had to abandon any meaningful separation of syntactic and semantic analysis, which explains, in part, why it was so hard to extend. In our system, resolution is driven by an entirely separate model, which uses syntactic heuristics where possible and domain-specific semantics where necessary. For most developers, however, it suffices to know that “it” and “that” translate into the().

Constraints

Notice that in the above plan we pass find_block a predicate with the criteria for the block we wish to find:

find_block((b: Block) => taller_than(b, held_block()))

This is because the user hasn’t told us which block they want, they only provided the criteria for finding it. This is called an intensional description, as opposed to an extensional description which specifies the actual entity or entities. In practice, every entity we reference in conversation is referenced intensionally; a reference to “Alice” would be translated into:

the[Person](p => p.name ~= "Alice")

where ~= means “similar to”. When executed, the() will try to find a person named Alice somewhere in the conversational history, but there’s no guarantee one exists. The user may assume that, given who they are, the system can figure out who they mean. Perhaps there’s a particular Alice that they work with, or someone in their family is named Alice. In either case, the user clearly thinks they’ve given us enough information, so we have to figure out what makes sense in the given context.

If the() fails to find a match in the conversational context, it will call a resolver function associated with the Person datatype. But how should a Person resolver, given a user-provided predicate, actually work? We can’t simply scan over a list of all the possible people and apply our predicate as a filter; that dataset lives elsewhere and is unlikely to be easily accessed. Because of both practical and privacy concerns, it will almost certainly be exposed via a service with access controls and a limited API.

Our resolver, then, must translate the predicate into one or more queries to backend services which provide information about people. To do that, we must stop thinking of it as a predicate and start thinking of it as a constraint.

Many developers have likely heard of SAT solvers, which given constraints on one or more boolean values will try to find satisfying assignments. Given a && !b, it will return a == true, b == false. Given a && !a, it will tell us that the constraint is unsatisfiable. Since a variety of problems can be mapped into this representation, SAT solvers are widely used. This capability is generalized by SMT solvers, which can solve more complex constraints on a wider variety of datatypes.
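
To make this concrete, here is a small illustration using the Z3 SMT solver’s Python bindings. Z3 is not the solver used in the system described here; the example only shows what “finding a satisfying assignment” means for the a && !b and a && !a constraints above.

from z3 import Bools, Solver, And, Not

a, b = Bools('a b')

s = Solver()
s.add(And(a, Not(b)))   # the constraint a && !b
print(s.check())        # sat
print(s.model())        # a satisfying assignment, e.g. [a = True, b = False]

s2 = Solver()
s2.add(And(a, Not(a)))  # the constraint a && !a
print(s2.check())       # unsat: no satisfying assignment exists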

Neither kind of solver, however, has a way to specify “the value must correspond to an entity in a backend service.” Even if they did, we probably wouldn’t want to use it; we don’t want the solver to fire off dozens of queries similar to “Alice” to the backend service while searching through possible values. Only the domain developer building atop our dialogue system understands the capabilities and costs of their backend services. The query API for a service, for instance, might offer its own “similar to” operator. Their similarity metric, however, probably won’t reflect that some people use “Misha” and “Mikhail” interchangeably. The domain developer will have to maintain a balance between preserving the user’s intent and minimizing the number of requests they make per utterance.

Since we can’t fully interpret the constraint for the domain developers, we must provide them their own tools for interpretation. Domain functions which, like resolvers, interpret constraints are called controllers. In the current version of our system, controllers are typically written in TypeScript, since that language is likely to be a familiar and expressive way to write complex domain logic. Within the controller, predicates are transformed into constraint zippers, which allow them to traverse, query, and transform constraints on complex datatypes. For each field and sub-field, domain developers can ask various questions: are there lower or upper bounds? What is an example of a satisfying value? Is that the only satisfying value? Does this value satisfy the constraint?

This last question is crucial, because we won’t always be able to encode the entire constraint in our query to the backend service. The set of results we get back may be too broad, and therefore must be post-filtered using the constraint. Conversely, operators which correspond to query operators in the service’s API, like ~=, can be configured as abstract named properties. Upon navigating to Person.name, we can look for an abstract property of ~=, and examine its argument’s zipper to construct our query.

Early AI researchers envisioned a world where knowledge had a singular representation and a singular repository. Instead, we live in a world where data, and the ability to interpret it, is fragmented and diffuse. As a result, our constraint solver must be unusually extensible, allowing developers to compose it with their own systems and domain expertise.

Revision

A major challenge in interpreting a user’s intent is everything they leave unsaid. Stripped of any context, much of what we say is ambiguous. To interpret “I’m headed to the bank,” we need to know whether the speaker is near a river. In linguistics, the study of how context confers meaning is called pragmatics. Our dialogue system, then, needs to provide tools for developers to easily specify domain-specific pragmatics.

For example, if in Outlook Mobile a user says, “reschedule my meeting with Alice to next week,” we can reasonably assume they mean an upcoming meeting, because almost everything we do in our calendar focuses on upcoming events. If we believed this was always true, we could simply take every user intension about an event and further constrain it to start in the future:

def add_pragmatics(predicate: Event => Boolean): Event => Boolean = {
  e => predicate(e) && e.start > now()
}

But what if the user wants to reschedule a past meeting that was cancelled? If we apply the above function to “reschedule yesterday’s meeting with Alice to next week,” the event will be constrained to both be yesterday and in the future; the constraint will be unsatisfiable. We can’t, then, simply mix our default assumptions into whatever the user provides; we have to allow them to be selectively overridden, just like any other default value. Fortunately, we have a solution which is general across all domains:

def add_pragmatics(predicate: Event => Boolean): Event => Boolean = {
  revise(
    e => e.start > now(), 
    predicate,
  )
}

In our system, revise is a powerful operator that, given two constraints a and b, will discard the parts of a which keep b from being meaningful, and conjoin the rest onto b. Consider a query for “yesterday’s meeting”, where we revise some basic pragmatics with the user’s intension:

revise(
  e => e.start > now() && e.attendees.contains(me()), 
  e => e.start.date == yesterday(),
)

Our default assumptions are that the event being referenced starts in the future and will be attended by the user. The first clause of those defaults, however, contradicts the user’s intension. The result of our revision, then, will consist of the second default clause and the user’s intension:

e => e.start.date == yesterday() && e.attendees.contains(me())

Simply looking for contradictions, however, isn’t enough. Consider a query for all the events since the year began:

revise(
  e => e.start > now() && e.attendees.contains(me()), 
  e => e.start > beginning_of_year()
)  

In this case, the user’s intension isn’t contradicted by our default assumptions, but it is implied by them. If an event starts in the future, it necessarily occurs after the year began. If we don’t drop e.start > now(), we will effectively ignore what the user said.

Since both contradiction and implication are concerned with intrinsic properties of a datatype (as opposed to extrinsic properties like “this corresponds to an entity in a backend service”), our system can handle the revision process on its own. Developers can simply focus on defining the appropriate pragmatics for their domain.

The existence of a revision operator, combined with the fact that users speak intensionally, also means that we can give users the ability to tweak and build upon what they’ve already said.

Consider the utterance “cancel my meeting with Alice.” If the user and Alice work on the same team, it’s likely they have more than one upcoming meeting together. We can guess at which one they mean, but before actually cancelling the meeting we will show them a description of the event and ask for confirmation.

Typically, confirmation involves giving the user a choice between “OK” and “cancel;” either we did exactly what they wanted, or they need to start over. Revision, however, means we don’t need to start over. If the user follows up “cancel my meeting with Alice” with “I meant the one-on-one,” we’ll revise the first intension with the second, and look for a one-on-one with Alice.

This is enormously freeing for the user, because it means they don’t need to fit everything they want into a single, monolithic utterance. This is akin to the difference between batch and interactive computing; users can try things, see what happens, and quickly build upon their successes.

This is also enormously freeing for the developer, because it means they can afford to get things wrong. We provide the best tools we can to help developers interpret the user’s intent, but the cost of misinterpretation is small. In the worst case, the user will be forced to provide incrementally more information.

Final Thoughts

Wherever possible, business logic should be described by code rather than training data. This keeps our system’s behavior principled, predictable, and easy to change. Our approach to conversational interfaces allows them to be built much like any other application, using familiar tools, conventions, and processes, while still being able to take advantage of cutting-edge machine learning techniques.

When revisiting ideas from this earlier era of research, however, we must be careful; used wholesale, they’re likely to send us down the same path as the people who first proposed them. Sometimes, as with plans, we have to make minor modifications. Sometimes, as with constraints, we have to acknowledge complexities that weren’t even imagined by early researchers. Sometimes, as with revision, we have to create something entirely novel.

Doing this well requires a team with a wide variety of interests and expertise. In addition to people with expertise in computational linguistics, we’re also looking to hire people with backgrounds in programming language runtimes, constraint solvers, and SDK design. If this sounds like you, and everything described above sounds like something you’d want to work on, let us know.

When does text prediction benefit from additional context? An exploration of contextual signals for chat and email messages http://approjects.co.za/?big=en-us/research/articles/when-does-text-prediction-benefit-from-additional-context-an-exploration-of-contextual-signals-for-chat-and-email-messages/ Wed, 26 May 2021 00:47:27 +0000

By Stojan Trajanovski, Chad Atalla, Kunho Kim, Vipul Agarwal, Milad Shokouhi, and Chris Quirk

Email and chat communication tools are increasingly important for completing daily professional and personal tasks, and with the recent pandemic and shift to remote work, this usage has surged. The number of daily active users of Microsoft Teams, the largest business communication and chat platform, has increased from 20 million before the pandemic in 2019 to more than 115 million in October 2020 and 145 million in April 2021. Email, meanwhile, continues to be the crucial driver for formal communication, with ever-increasing usage. Providing real-time suggestions for word or phrase auto-completions is known as text prediction. The efficiency of these communications is enhanced by suggesting highly accurate text predictions with low latency. Text prediction services have been deployed across popular communication tools and platforms such as Microsoft Outlook Text Predictions and Gmail Smart Compose [1].

Modern text prediction algorithms are based on large language models and generally rely on the prefix of a message (the characters typed up to the cursor position) to create predictions. We study to what extent additional contextual signals improve text predictions in chat and email messages in two of the largest commercial communication platforms, Microsoft Teams and Outlook. We examine several signals accompanying the main message: composition time, subject, and previous messages (see the list below).

Contextual signals and their details:

  • Composition time: a contextual signal that can provide added value for text prediction, enabling suggestions with relevant date-time words like “weekend” or “tonight”.
  • Subject: message subjects often contain the purpose or summarized information of a message. In the email scenario, we use the subject as context; in the chat scenario, we use the chat window name as a proxy for the subject.
  • Previous email: previous messages can provide valuable background information that influences the text of the current message being composed. In the email case, we create pairs of messages and replies.
  • Previous chat messages: prior-message contextualization for the chat scenario is much more complex, since chat conversations typically consist of many small messages sent in quick succession.

We combine and encode these signals with the message body into a single “contextualized” string for the language model, using special tokens to separate the signals (Figure 1).


Figure 1. Context extraction and encoding.

We segment chat histories by message blocks and time windows. A series of uninterrupted messages sent by one sender is considered as a single message block. Messages sent within the past N minutes are within a time window, which enforces recency as a proxy for relevance. We define three previous message context aggregation modes in the chat scenario (visualized in Figure 2), mimicking prior email context:

  • Ignore-Blocks: chat messages from the current sender, in the past N minutes (e.g., 2, 5, 10 minutes), ignoring any message block boundaries.
  • Respect-Blocks: chat messages from the current sender, in the past N minutes, confined to the most recent message block.
  • Both-Senders: chat messages from both senders, in the past N minutes. When the sender turn changes, strings are separated by a space or a special token.

Figure 2. Aggregating a 5 min prior chat window in various context modes.
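
The snippet below sketches, in simplified form, how the signals above might be combined into a single contextualized input and how a prior-message time window could be aggregated. The separator token names, the formatting of the time signal, and the exact layout are illustrative assumptions rather than the encoding used in the production models.

from datetime import datetime, timedelta

# Illustrative separator tokens; the real special tokens may differ.
TIME_TOK, SUBJECT_TOK, CONTEXT_TOK, BODY_TOK = "<time>", "<subject>", "<context>", "<body>"

def build_contextualized_input(messages, current_sender, subject, compose_time, prefix,
                               window_minutes=5, both_senders=True):
    """Concatenate contextual signals and the typed prefix into one model input string.

    messages: list of (sender, text, timestamp) tuples, oldest first.
    """
    cutoff = compose_time - timedelta(minutes=window_minutes)
    context = [text for sender, text, ts in messages
               if ts >= cutoff and (both_senders or sender == current_sender)]
    parts = [TIME_TOK, compose_time.strftime("%A %H:%M"),
             SUBJECT_TOK, subject,
             CONTEXT_TOK, " ".join(context),
             BODY_TOK, prefix]
    return " ".join(parts)

# Example: a 5-minute Both-Senders window while "bob" is typing "see you".
now = datetime(2021, 5, 26, 9, 30)
history = [("alice", "are we still on for lunch?", now - timedelta(minutes=3)),
           ("bob", "yes, how about noon?", now - timedelta(seconds=40))]
print(build_contextualized_input(history, "bob", "Lunch", now, "see you"))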

For example, the 2-minute Both-Senders mode and the 5-minute Ignore-Blocks mode aggregate a similar amount of context: 2.5 chat messages on average, with 56-59% of chat messages having at least one prior message as context. Given the email and chat message length statistics in Figure 3, we expect chat messages to be about 10x shorter than emails. In a statistical analysis of chat message lengths (see Figure 3, blue box) we find that the mean number of tokens is 9.15 and the median is 6. For email (see Figure 3, green box), the mean number of tokens is 94 and the median is 53. So, we limit chat histories to 20 messages, which is roughly equivalent to an email-reply pair in length.

“1 formal email ≈ 10 informal chat messages”


Figure 3. Box-plot statistics for messages aggregation in Teams and Outlook.

For the ethical considerations in how we process data, including multiple privacy precautions in accordance with the General Data Protection Regulation (GDPR) and “fair block-listing” of denigrative, offensive, controversial, sensitive, and stereotype-prone words and phrases, check out our NAACL paper [2].

Results. Previous message contextualization leads to significant gains for chat messages from Microsoft Teams when an appropriate message aggregation strategy is used. By using a 5-minute time window and messages from both senders, we see a 9.4% relative increase in the match rate1 and an 18.6% relative gain in estimated characters accepted. This 5-minute window of prior messages from both senders outperforms the corresponding 2- and 10-minute window configurations. Chat messages are often short and can lack context about a train of thought; the appropriate number of previous messages can thus bring the necessary semantics to the model to provide a correct prediction. Benefits are comparatively insignificant for subject and compose time as contextual signals in chat messages.

In the email scenario, based on Microsoft Outlook, we find that time as a contextual signal yields the largest boost, with a 2% relative increase in the match rate, while subject only helps in conjunction with time, and prior messages yield no improvement. We conclude that the different characteristics of chat and email messages impede domain transfer: the best contextual text prediction models are custom trained for each scenario, using the most impactful subset of contextual signals. Future work involves exploring different encodings for contextual signals, such as utilizing hierarchical RNNs to better capture context, or using more advanced architectures such as transformers, generative models, or GPT-3.

References

[1] M. X. Chen, B. N. Lee, G. Bansal, Y. Cao, S. Zhang, J. Lu, J. Tsay, Y. Wang, A. M. Dai, Z. Chen et al. (2019) Gmail Smart Compose: Real-time assisted writing, In Proc. of the 25th ACM SIGKDD Intl. Conf. on Knowledge Discovery & Data Mining, pp. 2287–2295.

[2] S. Trajanovski, C. Atalla, K. Kim, V. Agarwal, V. Shokouhi, and C. Quirk (2021) When does text prediction benefit from additional context? An exploration of contextual signals for chat and email messages, In Proc. of NAACL-HLT (Annual Conf. of the North American Chapter of the Association for Computational Linguistics – Industry track papers).


1 The ratio of the number of matched suggestions to the total number of generated suggestions.

Assistive AI Makes Replying Easier http://approjects.co.za/?big=en-us/research/articles/assistive-ai-makes-replying-easier-2/ Tue, 19 May 2020 15:05:19 +0000

Microsoft’s mission is to empower every person and organization to achieve more, so we are constantly looking for opportunities to simplify workflows and save people time and effort. Sending replies to email or chat messages is a common activity, and people spend a considerable amount of time on it. By harnessing the power of AI, we are helping people reply faster by intelligently suggesting replies that can be used to respond to messages with a simple click or tap on the device. For email messages, people can then edit the response or hit the ‘Send’ button, while for chat messages the reply is sent immediately. This difference in behavior reflects the fact that in chat, people often break their reply into a few adjacent chat snippets, so immediately sending the reply offers the quickest workflow. We have also expanded the feature to multiple international languages like Spanish, Portuguese, French, and Italian, and plan to roll out to several new languages and markets in the next year. This feature currently saves users hundreds of millions of keystrokes each month.

This feature is powered by a deep neural network trained on hundreds of millions of messages (emails or chats) and their replies, called Message-Reply (MR) pairs. Since all the training data is user content, the models are trained in an eyes-off fashion by leveraging an experimentation stack built on top of Office 365 and Azure technologies that is fully compliant with user privacy and enterprise contractual obligations. The platform offers complete data security and GDPR compliance for customer data. For evaluation, we use a variant of ROUGE to compare the model predictions with the ground-truth reply and assign a score to each prediction. Using this automated metric allows us to run the evaluation in an eyes-off fashion. In addition to the eyes-off evaluation, we also do qualitative evaluations on public and personal emails to better understand the model predictions and to improve the quality of suggestions.
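
As a rough illustration of this style of overlap-based evaluation (a toy unigram ROUGE-1 F1, not the exact variant used in production), a predicted reply can be scored against the ground-truth reply as follows.

from collections import Counter

def rouge1_f1(prediction: str, reference: str) -> float:
    """Unigram-overlap F1 between a predicted reply and the ground-truth reply."""
    pred_counts = Counter(prediction.lower().split())
    ref_counts = Counter(reference.lower().split())
    overlap = sum((pred_counts & ref_counts).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(pred_counts.values())
    recall = overlap / sum(ref_counts.values())
    return 2 * precision * recall / (precision + recall)

print(rouge1_f1("sounds good to me", "sounds good to me thanks"))  # ~0.89: high overlap, imperfect recall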

Our system models the problem of suggesting responses to messages as an information retrieval task: given a message, suggestions are selected from a fixed list of responses called the response set. The messages and responses are encoded with parallel networks, and the system is trained to match them for real Message-Response pairs using a symmetric loss function. We use Transformer encoder networks for encoding the messages and responses. Since the response-side encodings are pre-computed, we use a larger number of layers (12) there compared to the message side (6). The entire model is initialized using the Microsoft Turing model for natural language representation (NLR), a large-scale model pioneered by Microsoft, and then fine-tuned for the task of matching messages and responses. As we train large models on millions of MR pairs, we leverage optimization breakthroughs like ZeRO to fit larger batch sizes in memory and obtain impressive gains in training speed and model performance. Overall, Suggested Replies is a great example of AI at Scale powering next-generation AI experiences.

Suggested Replies Architecture
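
Below is a highly simplified sketch of this dual-encoder matching setup with a symmetric loss. The production model uses Turing NLR-initialized Transformer encoders (6 layers on the message side, 12 on the response side); the bag-of-embeddings encoders and the in-batch-negatives formulation here are stand-in assumptions used only to illustrate the shape of the training objective.

import torch
import torch.nn as nn
import torch.nn.functional as F

class DualEncoder(nn.Module):
    """Encode messages and responses with parallel encoders and score all pairs."""
    def __init__(self, vocab_size=30000, dim=256):
        super().__init__()
        # Placeholders for the 6-layer message encoder and 12-layer response encoder.
        self.message_encoder = nn.EmbeddingBag(vocab_size, dim)
        self.response_encoder = nn.EmbeddingBag(vocab_size, dim)

    def forward(self, message_tokens, response_tokens):
        m = F.normalize(self.message_encoder(message_tokens), dim=-1)
        r = F.normalize(self.response_encoder(response_tokens), dim=-1)
        return m @ r.t()  # similarity matrix: messages x responses

def symmetric_loss(sim):
    """Cross-entropy in both directions: each message should match its own response and vice versa."""
    targets = torch.arange(sim.size(0))
    return F.cross_entropy(sim, targets) + F.cross_entropy(sim.t(), targets)

# Toy usage with random token ids standing in for tokenized MR pairs.
model = DualEncoder()
messages = torch.randint(0, 30000, (8, 20))
responses = torch.randint(0, 30000, (8, 10))
loss = symmetric_loss(model(messages, responses))
loss.backward()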

Generating an appropriate set of possible responses is a critical step in training our Suggested Replies models. Our response set generation algorithms are built to ensure strict privacy protections. First, among the hundreds of millions of replies present in the datasets, we only consider short, popular responses that are syntactically void of any personally identifiable information. Next, we employ state-of-the-art differentially private algorithms to further narrow the set of responses to those that can be exposed while adhering to rigorous privacy requirements[1]. Finally, the chosen snippets of text are brought outside the compliance boundary and further curated by humans to ensure that the content is generic, grammatical, and fair (i.e., does not include offensive, inappropriate, or biased statements).

We are committed to honoring your trust and continuing to improve our system in a privacy-preserving and compliant manner. We are actively working on making the suggestions align with each user’s unique writing style and on making the feature available across multiple languages worldwide. Stay tuned for more, and in the meantime please let us know if you have any feedback; we always love to hear from you.


[1] Currently we use differentially private algorithms with privacy parameters (ε = 4, δ < 10^-7).

Better Document Previews using the Microsoft Turing Model for Natural Language Representations http://approjects.co.za/?big=en-us/research/articles/better-document-previews-using-the-microsoft-turing-model-for-natural-language-representations/ Tue, 19 May 2020 15:03:32 +0000


Figure 1: Inside Look preview for a document in Microsoft SharePoint

By Rahul Jha and Payal Bajaj

Knowledge workers spend close to 20% of their time searching for and gathering information. When using document management systems such as Microsoft OneDrive and SharePoint, people find themselves looking at directories full of documents. Interacting with such a list of documents can be time-consuming without a mechanism for previewing them.

The Inside Look feature in OneDrive and SharePoint helps people get to relevant documents quickly by providing a short summary of each document as a preview (Figure 1). These summaries give people a glimpse of the document content and help them decide quickly whether a document will be useful for their information need, without opening the document.

Document summaries are distinguished as being either indicative or informative. Indicative summaries alert a person to the topic of the document and help them decide whether they should open it fully. Informative summaries, on the other hand, attempt to convey all the important points of a document so the person doesn’t need to open the original document [1]. Summaries can also be either extractive or abstractive: extractive summaries only use original text from the document, while abstractive summaries can contain new text not originally in the document. We’ve formulated our solution of generating document previews for Inside Look as indicative, extractive summarization.

Our initial summarization models identified important sentences by using a set of cues based on the structure of the document, formatting information, and bag-of-words signals. We started with a rule-based model, followed by a ranking model [2] trained using data annotated with our hierarchical annotation methodology, Artemis. As shown in Figure 2, judges can use Artemis to build summaries in a bottom-up manner, starting by creating paragraph-level summaries and successively building higher-level summaries by reusing sentences selected at earlier levels. We’ve described our annotation process in detail in this paper [3].

For the next iteration of our model, we turned to semantic, transformer-based language representations based on the Microsoft Turing model for natural language representation (NLR). We experimented with a contextual transformer model based on BertSum [4] for scoring the sentences for Inside Look summaries. This model creates a representation for each sentence in the document based on its surrounding sentences, which is then used to assign a score to each sentence. This score is used as an additional feature in our ranking model, along with cues based on structure and formatting. Using the Microsoft Turing NLR-based large-scale models, we saw a 36% improvement in our offline relevance metrics.
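
The sketch below shows the general shape of such a contextual sentence scorer, in the spirit of BertSum: a [CLS] token is inserted before every sentence, and the contextual vector at each [CLS] position is scored. This is a simplification, not the production model (which is distilled and combined with structural and formatting cues), and the Hugging Face model name is just a placeholder.

import torch.nn as nn
from transformers import AutoModel, AutoTokenizer

class SentenceScorer(nn.Module):
    """Score each sentence of a document using its contextual [CLS] representation."""
    def __init__(self, model_name="bert-base-uncased"):  # placeholder encoder
        super().__init__()
        self.tokenizer = AutoTokenizer.from_pretrained(model_name)
        self.encoder = AutoModel.from_pretrained(model_name)
        self.score = nn.Linear(self.encoder.config.hidden_size, 1)

    def forward(self, sentences):
        # One [CLS] per sentence so each sentence gets its own contextual summary vector.
        text = " ".join(f"[CLS] {s} [SEP]" for s in sentences)
        inputs = self.tokenizer(text, return_tensors="pt", add_special_tokens=False,
                                truncation=True, max_length=512)
        hidden = self.encoder(**inputs).last_hidden_state[0]  # (seq_len, hidden)
        cls_positions = (inputs["input_ids"][0] == self.tokenizer.cls_token_id).nonzero(as_tuple=True)[0]
        return self.score(hidden[cls_positions]).squeeze(-1)  # one score per sentence

scorer = SentenceScorer()
print(scorer(["Analysis Services provides data models.", "This article explains how to set it up.", "Disclaimer: subject to change."]))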


Figure 2: Our hierarchical annotation methodology, Artemis, described in more detail in [3]

After several rounds of model distillation and performance optimization, we obtained a production-ready model with the desired relevance and performance characteristics. To help make the final ship decision, we conducted side-by-side evaluations in which judges were shown summaries from our previous model and from the Turing NLR-based model in random order and asked which summary they preferred. In a side-by-side experiment conducted on more than 150 documents with 5 judges per document, we found that judges preferred the Turing NLR-based summaries 63% of the time, while they preferred summaries from our previous model only 22% of the time (there was no preference in 15% of the cases). These results were statistically significant. Based on all our evaluations, we shipped the Turing NLR-based models for the Inside Look feature early in 2020.

Qualitatively, summaries from the new model tend to be shorter, more focused, and more readable. For example, the figure below shows two summaries for the same technical document. The one on the left was generated by our earlier model, while the one on right is generated by our Turing NLR-based model. The summary from the earlier model starts with some disclaimer information and then moves to specific details about the topic in question without introducing the topic. The summary from the Turing NLR-based model introduces the main topic (Analysis Services) in the first sentence and provides the goal of the article in the second sentence. It is much more readable than the summary from our earlier model.

[Example 1: side-by-side comparison of the summary from the earlier model (left) and the Turing NLR-based model (right)]

Similarly, the figure below shows summaries generated by the two models for a government document. The summary from our earlier model jumps quickly between topics and mentions VAWA without defining it. The Turing NLR-based summary is more focused and first defines the main topic, VAWA, along with its acronym expansion. It then provides some elaboration on the topic.

[Example 2: side-by-side comparison of the summary from the earlier model (left) and the Turing NLR-based model (right)]

We are excited to see how the Inside Look feature helps people save time and get to the information they need quickly. Going forward, we plan to expand our use of AI at Scale to continue improving summary quality for Inside Look.

References

[1] Kan et al. (2002) Using the Annotated Bibliography as a Resource for Indicative Summarization

[2] Burges (2010) From RankNet to LambdaRank to LambdaMART: An Overview

[3] Jha et al. (2020) Artemis: A Novel Annotation Methodology for Single Document Indicative Summarization

[4] Liu and Lapata (2019) Text Summarization with Pretrained Encoders
