Check Your Facts and Try Again: Improving Large Language Models with External Knowledge and Automated Feedback
Wed, 08 Mar 2023 | http://approjects.co.za/?big=en-us/research/articles/check-your-facts-and-try-again-improving-large-language-models-with-external-knowledge-and-automated-feedback/

Large language models (LLMs), such as ChatGPT, are able to generate human-like, fluent responses for many downstream tasks, e.g., task-oriented dialog and question answering. However, applying LLMs to real-world, mission-critical applications remains challenging mainly due to their tendency to generate hallucinations and their inability to use external knowledge.

This blog introduces our work on LLM-Augmenter, a system that addresses these very issues by augmenting a black-box LLM with a set of plug-and-play modules: Our system makes the LLM generate responses grounded in external knowledge, e.g., knowledge stored in task-specific databases. It also iteratively revises LLM prompts to improve model responses using feedback generated by utility functions, e.g., the factuality score of an LLM-generated response.

We validate the effectiveness of LLM-Augmenter on two types of tasks: information-seeking dialog and open-domain Wiki question answering (Wiki QA). Our experiments show that, across all tasks, LLM-Augmenter significantly improves ChatGPT’s groundedness in external knowledge without sacrificing the humanness of its generated responses. For example, on the customer service dialog task, human evaluation shows that LLM-Augmenter improves ChatGPT by 32.3% in usefulness and 12.9% in humanness (measuring the fluency and informativeness of model responses). The Wiki QA task is extremely challenging for ChatGPT in that answering these questions often requires multi-hop reasoning to piece together information in various modalities scattered across different documents. Our results show that although the closed-book ChatGPT performs poorly and often hallucinates, LLM-Augmenter substantially improves the factuality score of the answers (+10% in F1) by grounding ChatGPT’s responses in consolidated external knowledge and automated feedback.

We describe this work in more detail in our paper (opens in new tab), and we make its code available on GitHub (opens in new tab).

Overview

LLM-Augmenter improves LLMs with external knowledge and automated feedback using plug-and-play (PnP) modules, as illustrated in the following example:

LLM-Augmenter improves a fixed LLM (such as ChatGPT) by (1) consolidating evidence from external knowledge so the LLM can generate responses grounded in that evidence, and (2) revising the LLM’s candidate responses using automated feedback.

Given a user query (e.g., regarding a 2013 Los Angeles Galaxy player transfer), LLM-Augmenter first retrieves evidence from external knowledge (e.g., the Web or task-specific datasets). If necessary, it further consolidates the evidence by linking the retrieved raw evidence with related context (e.g., information about the entity “2013 Los Angeles Galaxy”) and performs reasoning to form evidence chains (e.g., the table-passage chain in the figure). Then, LLM-Augmenter queries a fixed LLM (i.e., ChatGPT in our work) using a prompt that contains the consolidated evidence, so that ChatGPT generates a candidate response grounded in external knowledge. LLM-Augmenter then verifies the candidate response, e.g., by checking whether it hallucinates evidence. If the response fails verification, LLM-Augmenter generates a feedback message (e.g., about the team “C.S.D. Municipal”). The message is used to revise the prompt and query ChatGPT again. The process iterates until a candidate response passes verification and is sent to the user.
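To make this loop concrete, here is a minimal Python sketch of the iterate-until-verified process. It is our own schematic, not the released implementation: the callables for retrieval, consolidation, prompting, and scoring are hypothetical stand-ins for the modules described below, and the acceptance threshold is an illustrative parameter.

```python
from typing import Callable, Optional, Tuple

def llm_augmenter_respond(
    query: str,
    dialog_state: dict,
    llm: Callable[[str], str],                          # e.g., a ChatGPT API call
    retrieve: Callable[[str, dict], list],              # Knowledge Consolidator: retrieval
    consolidate: Callable[[list, dict], str],           # Knowledge Consolidator: evidence chains
    build_prompt: Callable[..., str],                   # Prompt Engine
    utility: Callable[[str, str], Tuple[float, str]],   # Utility module (score + feedback text)
    threshold: float = 0.5,                             # hypothetical acceptance threshold
    max_iters: int = 3,
) -> str:
    """Minimal sketch of the LLM-Augmenter loop; all callables are stand-ins."""
    evidence = consolidate(retrieve(query, dialog_state), dialog_state)
    feedback: Optional[str] = None
    candidate = ""
    for _ in range(max_iters):
        prompt = build_prompt(query, dialog_state, evidence, feedback)
        candidate = llm(prompt)
        score, feedback = utility(candidate, evidence)
        if score >= threshold:   # verification passed: send the response to the user
            break                # otherwise: revise the prompt using the feedback
    return candidate
```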

Architecture

The architecture of LLM-Augmenter is illustrated in the following figure:

The LLM-Augmenter architecture, showing how its plug-and-play modules interact with the LLM and the user’s environment.

LLM-Augmenter consists of a set of PnP modules (i.e., Working Memory, Policy, Action Executor, and Utility) to improve a fixed LLM (e.g., ChatGPT) with external knowledge and automated feedback to mitigate generation problems such as hallucination. We formulate human-system conversation as a Markov Decision Process (MDP) that leverages the following PnP modules:

  • Working Memory: This module tracks the dialog state, which captures all essential information in the conversation so far.
  • Action Executor: This module performs the action selected by the Policy module. It is composed of two components, the Knowledge Consolidator and the Prompt Engine. The Knowledge Consolidator augments the LLM with the capability of grounding its responses in external knowledge, mitigating hallucination when completing tasks such as answering questions about the latest news or booking a table at a restaurant. The Prompt Engine generates the prompt used to query the LLM.
  • Utility: Given a candidate response, this module generates a utility score and corresponding feedback using a set of task-specific utility functions (e.g., KF1; a minimal sketch of such a score follows this list).
  • Policy: This module selects the next system action that leads to the best expected reward. These actions include (1) acquiring evidence from external knowledge, (2) calling the LLM to generate a candidate response, and (3) sending a response to the user once it passes verification by the Utility module.
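As an example of a utility function, a Knowledge F1 (KF1)-style score measures the token overlap between a candidate response and the consolidated evidence. The sketch below is illustrative only; the tokenization and normalization used in the paper may differ.

```python
from collections import Counter

def knowledge_f1(response: str, evidence: str) -> float:
    """Token-overlap F1 between a candidate response and its grounding evidence (illustrative)."""
    resp_tokens = response.lower().split()
    evid_tokens = evidence.lower().split()
    if not resp_tokens or not evid_tokens:
        return 0.0
    # count tokens appearing in both the response and the evidence
    overlap = sum((Counter(resp_tokens) & Counter(evid_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(resp_tokens)
    recall = overlap / len(evid_tokens)
    return 2 * precision * recall / (precision + recall)
```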

The policy can be implemented using manually crafted rules or trained on human-system interactions. In our work, we implement a trainable policy as a neural network model, and we optimize it using REINFORCE. The details of our approach and of these PnP modules are provided in the paper.
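For intuition, a REINFORCE update for such a policy looks roughly like the sketch below. The network size, discount factor, and reward definition here are generic placeholders rather than the paper's configuration.

```python
import torch
import torch.nn as nn

class ActionPolicy(nn.Module):
    """Tiny policy network: maps a dialog-state encoding to a distribution over actions."""
    def __init__(self, state_dim: int = 128, num_actions: int = 3):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(state_dim, 64), nn.ReLU(),
                                 nn.Linear(64, num_actions))

    def forward(self, state: torch.Tensor) -> torch.distributions.Categorical:
        return torch.distributions.Categorical(logits=self.net(state))

def reinforce_update(policy: ActionPolicy, optimizer: torch.optim.Optimizer, episode: list):
    """One REINFORCE step; `episode` is a list of (state, action, reward) tuples."""
    returns, g = [], 0.0
    for _, _, r in reversed(episode):       # discounted return-to-go (gamma = 0.99)
        g = r + 0.99 * g
        returns.append(g)
    returns.reverse()
    loss = 0.0
    for (state, action, _), g in zip(episode, returns):
        loss = loss - policy(state).log_prob(action) * g   # policy-gradient objective
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```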

Results

Our paper provides extensive experiments on three tasks, but in this blog we focus on the customer support task. We compare ChatGPT with and without LLM-Augmenter on a total of about 1,000 randomly selected examples from the customer service dataset, which are used for human evaluation. We observe a strong preference for LLM-Augmenter over ChatGPT alone in terms of both usefulness and humanness. The result is consistent with the automatic evaluation results provided in the paper.

LLM-Augmenter significantly outperforms ChatGPT in terms of both Usefulness and Humanness.

Examples

The following figure shows real examples comparing ChatGPT with LLM-Augmenter:

LLM-Augmenter examples.

The above table provides sample responses contrasting LLM-Augmenter with ChatGPT. First, we can see that ChatGPT fails to provide a response grounded in knowledge specific to the user, e.g., a local Indian restaurant. In the second part of the table, we show LLM-Augmenter’s Working Memory, which highlights the richer information retrieved from external knowledge to help the underlying LLM (again ChatGPT) generate more contentful responses. The first LLM response received by LLM-Augmenter is unfortunately not satisfactory, as the quality and specificity of LLM generation can be unpredictable. In this case, the Utility module determined that the first response did not meet its criteria (i.e., KF1 above a given threshold) and issued feedback to the LLM module (i.e., “response is inconsistent with the knowledge”). The second response received by LLM-Augmenter is much more satisfactory according to the utility function and is therefore sent to the user.

Acknowledgments

This research was conducted by Baolin Peng, Michel Galley, Pengcheng He, Hao Cheng, Yujia Xie, Yu Hu, Qiuyuan Huang, Lars Liden, Zhou Yu, Weizhu Chen, and Jianfeng Gao at Microsoft Research. We also thank Saleema Amershi, Ahmed Awadallah, Nguyen Bach, Paul Bennett, Chris Brockett, Weixin Cai, Dhivya Eswaran, Adam Fourney, Hsiao-Wuen Hon, Chunyuan Li, Ricky Loynd, Hoifung Poon, Corby Rosset, Bin Yu, Sheng Zhang, and members of the Microsoft Research Deep Learning group for valuable discussions and comments.

FocalNets: Focus Eyes with Focal Modulation
Wed, 02 Nov 2022 | http://approjects.co.za/?big=en-us/research/articles/focalnets-focusing-the-eyes-with-focal-modulation/

Human eyes have a dynamic focusing system that adjusts the focal regions in order to see the surroundings at all distances. When we look far away, up close, and back again, our eyes change focus rapidly to allow us to perceive things finely and coarsely. In computer vision (CV), it remains an open question how to build a neural network that can mimic this behavior and efficiently focus on visual inputs at various granularities for different tasks.

In the past few years, Transformers (opens in new tab) and Vision Transformers (opens in new tab) have led to unprecedented AI breakthroughs in NLP and vision, respectively. For vision in particular, what makes Transformers stand out is arguably the self-attention (SA) mechanism, which enables each query token to adaptively gather information from the others. It learns dependencies across different visual tokens, which induces better generalization than the canonical convolution layer with its static kernels. In the visual world, the input signal is often continuous and comes with arbitrary granularity and scope. Nevertheless, SA is typically used to model a fixed set of predetermined tokens at a specific scope and granularity, and the interactions among individual tokens are usually dense and heavy, which limits its usability in understanding the complicated visual world.

In this blog, we introduce our recent efforts on building neural networks with focal modulation, leading to the new architecture family: FocalNets (opens in new tab). The highlight moments include:

  • FocalNet achieves a new state of the art (SoTA) on the most challenging vision task, COCO object detection (opens in new tab), with 3x smaller model size and training data size. This marks a milestone: it is the first attention-free model in the past two years to surpass all Transformer models on the leaderboard.
  • FocalNet exhibits an intriguing, interpretable learning behavior. It can discover and segment objects in an image or a video, which Transformers can hardly do. As the following example shows, the modulation focus maps gradually change from the early and middle stages to the final stage of perception in an intuitively interpretable way. This suggests FocalNet is capable of different levels of image understanding.

(Left) Comparison with SoTA on COCO object detection; circle size indicates the model size. (Right) Modulation focus maps at the early, middle, and final stages of visual perception with our FocalNet.

We also released the paper on arXiv (opens in new tab), the PyTorch codebase on the project GitHub page (opens in new tab), and a HuggingFace demo (opens in new tab). Feel free to give it a try.

Eye focusing with Focal Modulation Networks

At the core of Focal Modulation Networks (FocalNets) is the focal modulation mechanism: a lightweight element-wise multiplication that serves as the focusing operator, allowing the model to see or interact with the input through the proposed modulator. As depicted below, the modulator is computed with a two-step focal aggregation procedure: focal contextualization, which extracts contexts from local to global ranges at different levels of granularity, and gated aggregation, which condenses the context features at all granularity levels into the modulator.

Illustration of the focal modulation process and the constructed FocalNet.
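To make the two steps concrete, the following is a simplified PyTorch sketch of a focal modulation layer. The kernel sizes, number of focal levels, and activation choices here are illustrative simplifications; please refer to the official codebase for the exact layer.

```python
import torch
import torch.nn as nn

class FocalModulationSketch(nn.Module):
    """Simplified sketch of a focal modulation layer (not the official implementation)."""
    def __init__(self, dim: int, focal_levels: int = 3, base_kernel: int = 3):
        super().__init__()
        self.focal_levels = focal_levels
        # project the input into a query, a context, and per-level gates
        self.f = nn.Linear(dim, 2 * dim + (focal_levels + 1))
        # focal contextualization: depth-wise convs with growing receptive fields
        self.layers = nn.ModuleList([
            nn.Sequential(
                nn.Conv2d(dim, dim, kernel_size=base_kernel + 2 * k,
                          padding=(base_kernel + 2 * k) // 2, groups=dim),
                nn.GELU(),
            )
            for k in range(focal_levels)
        ])
        self.h = nn.Conv2d(dim, dim, kernel_size=1)  # modulator projection
        self.proj = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:   # x: (B, H, W, C)
        B, H, W, C = x.shape
        q, ctx, gates = torch.split(self.f(x), (C, C, self.focal_levels + 1), dim=-1)
        ctx = ctx.permute(0, 3, 1, 2)                      # (B, C, H, W)
        gates = gates.permute(0, 3, 1, 2)                  # (B, L+1, H, W)
        modulator = 0
        for k, layer in enumerate(self.layers):            # gated aggregation over levels
            ctx = layer(ctx)
            modulator = modulator + ctx * gates[:, k:k + 1]
        ctx_global = ctx.mean(dim=(2, 3), keepdim=True)    # global context as the last level
        modulator = modulator + ctx_global * gates[:, self.focal_levels:]
        modulator = self.h(modulator).permute(0, 2, 3, 1)  # back to (B, H, W, C)
        return self.proj(q * modulator)                    # element-wise modulation
```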

Focal Modulation vs Self-Attention

Similar goals, but different focusing processes. Focal modulation and self-attention are two different ways to enable AI models to selectively focus on certain parts of their input. Self-attention starts with interaction and then aggregation, whereas focal modulation starts with aggregation and then interaction, which significantly eases the process with much lighter-weight operations.
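The difference in ordering can be seen in a few lines of tensor code. The snippet below is a schematic comparison only (single head, no projections or normalization beyond the bare minimum, and the focal context collapsed to one global level), not either model's actual layer.

```python
import torch

N, d = 196, 64                                   # number of tokens, channel dimension
X = torch.randn(N, d)
Wq, Wk, Wv = (torch.randn(d, d) for _ in range(3))

# Self-attention: pairwise interaction first (an O(N^2) attention matrix), then aggregation.
attn = torch.softmax((X @ Wq) @ (X @ Wk).T / d ** 0.5, dim=-1)
sa_out = attn @ (X @ Wv)

# Focal modulation: aggregate context first (collapsed here to a single global level
# for brevity), then a lightweight element-wise interaction with each query.
modulator = (X @ Wv).mean(dim=0, keepdim=True)   # context aggregation, O(N)
fm_out = (X @ Wq) * modulator                    # element-wise modulation
```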


Animations contrasting the focal modulation process and the self-attention process on an example image (a close-up of a dog).

Modulation Map vs Attention Map. Both methods learn to focus, but the selected focus areas are quite different. After standard supervised training of FocalNet and Vision Transformers (ViT) (opens in new tab) on ImageNet, we visualize FocalNet’s modulation maps and ViT’s attention maps. We observe that our focal modulation automatically learns an interpretable representation and separates the main object from the background clutter. It learns to segment objects without any form of dedicated dense pixel-level supervision, and the selected focus areas are coherent with the human-generated annotations for the image classification task. In contrast, the focus areas selected by the attention maps in ViT are less meaningful and may highlight spuriously correlated regions.

From top to bottom: original image, modulation map, and attention map (images are from the ImageNet-1K validation set).
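Maps like these can be collected with a forward hook. The sketch below assumes a FocalNet-style model whose modulation layers store their modulator tensor during the forward pass; the class name, attribute, and shapes are hypothetical stand-ins and should be adapted to the actual codebase.

```python
import torch

def collect_modulation_maps(model: torch.nn.Module, image: torch.Tensor,
                            layer_name: str = "FocalModulation") -> list:
    """Collect per-layer modulator magnitudes for one image.

    Assumes each modulation layer exposes `self.modulator` during forward();
    the class name and attribute here are assumptions, not the official API.
    """
    maps, handles = [], []

    def hook(module, inputs, output):
        mod = module.modulator.detach()        # assumed shape: (B, C, H, W)
        maps.append(mod.norm(dim=1)[0].cpu())  # channel magnitude -> (H, W) heat map

    for m in model.modules():
        if type(m).__name__ == layer_name:
            handles.append(m.register_forward_hook(hook))
    with torch.no_grad():
        model(image.unsqueeze(0))              # image: preprocessed (3, H, W) tensor
    for h in handles:
        h.remove()
    return maps                                # upsample and overlay on the image to plot
```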


When visualizing the modulation maps in the network for videos, we see that they correspond to coherent semantic regions of the moving objects.

Modulation maps on videos: a flying mallard, a goat, a surfer, and a paraglider.

I am excited about our new way of enabling AI to focus on the right parts of the input through focal modulation.

Johannes Gehrke, Technical Fellow, Lab Director of Research at Redmond, and CTO and Head of Machine Learning for the Intelligent Communications and Conversations Cloud (IC3)

Dense Prediction Tasks with High-Resolution Images

FocalNet is compared against established vision backbone networks, including Vision Transformers (ViT) (opens in new tab), Swin Transformers (opens in new tab), and ConvNeXt (opens in new tab), on different CV tasks, including ImageNet classification (opens in new tab), zero-shot classification on 20 datasets in ICinW (opens in new tab), and dense prediction tasks such as object detection (opens in new tab) and segmentation (opens in new tab). FocalNet consistently outperforms the others. The attention-free design of focal modulation particularly benefits dense visual prediction tasks with high-resolution image inputs, as it allows the model to see a wider scope at different granularities while avoiding the heavy burden of token-to-token interaction. Importantly, it achieves a new SoTA of 64.3 (test-dev) / 64.2 (minival) on COCO object detection, outperforming the prior best Swin-v2 Giant and BEIT-3 models with 3x smaller model/data sizes.

FocalNet consistently shows superior performance on a wide set of computer vision problems.

Glad to continue to push on this state-of-the-art computer vision innovation to delight our worldwide Azure Cognitive Services customers.

Xuedong Huang, Microsoft Technical Fellow and Chief Technology Officer of Azure AI

From the Broader View of Cognitive Science and Neuroscience

FocalNets mimic human vision. In humans, attention is critical to our ability to focus on specific aspects of environmental stimuli while filtering out other irrelevant information. By definition, visual attention plays a key role in isolating the foreground from the background. Not surprisingly, an algorithm mimicking attention is critical for object recognition in computer vision. Visual attention can be roughly classified into two large categories: feature attention vs. spatial attention (e.g., Hayden and Gallant, Neuron 2005 (opens in new tab); Bichot et al., Neuron 2015 (opens in new tab)). Spatial attention directs the movement of the eyes to specific locations and is therefore closely linked to the gaze control system. Existing self-attention (SA) networks appear more in line with the spatial attention mechanism of the brain. However, in many cases, we do not know where the object is located or where to focus, but we know it has distinct features. Feature-based attention therefore operates across the visual field and is not closely connected to the eye movement system. Its goal is to construct and maintain an internal representation of the target. Furthermore, in natural human vision, spatial attention and feature attention work together. Importantly, while most studies of visual attention focus on the cortex, it is also well recognized that the pulvinar nucleus of the thalamus interacts with the cortex and plays a critical role in selective attention. Patients with lesions of the pulvinar nucleus have difficulties filtering out distractors during attention tasks (Snow et al., PNAS 2009 (opens in new tab)).

The new algorithm FocalNet appears to better mimic the feature attention system, and hence it performs better at segmenting objects from the background. This superb ability of FocalNet could be mimicking the dynamic interactions between the pulvinar and the cortex.

Fan Wang (opens in new tab), Professor of Brain and Cognitive Sciences, Massachusetts Institute of Technology

Focal modulation shares some structural similarities with interneurons (opens in new tab) in the nervous system. (1) One example is the spinal cord: painful information is transmitted to the spinal cord, where projection neurons are a minority; most neurons in the dorsal horn are interneurons that process and integrate information and control whether or not painful information is transmitted to higher centers. (2) In motor control, there is the top-down command and there is the final motor-neuron output, but for efficient motor control there are also “modules” formed by premotor interneurons that can generate stereotypical patterns such as rhythms and sequences. It makes sense to have interneuron “modules” specialize in certain processes, so that top-down control can simply orchestrate these modules. (3) In the somatosensory (body sensory) system, while itch and pain are two distinct sensations, the peripheral sensory neurons that detect “itchy” or “painful” stimuli are not so distinct; many of these sensory neurons express receptors for both itch-inducing and pain-inducing stimuli. The interneurons in the spinal cord play a key role in processing this ambiguous incoming information and separating it into the subsequent “itch” vs. “pain” pathways.

A new building block for the next-generation AI models

With FocalNets, the AI research community can build new computer vision systems for high-resolution visual inputs more efficiently. We hope that our experiments will show the community the potential of FocalNets and encourage further adoption of focal modulation.

Acknowledgment: This research was conducted by Jianwei Yang (opens in new tab), Chunyuan Li (opens in new tab), Xiyang Dai (opens in new tab), Lu Yuan (opens in new tab), and Jianfeng Gao (opens in new tab). The connections to human vision and neuroscience are drawn by Fan Wang (opens in new tab) and Jinghao Lu (opens in new tab) from MIT. Additional thanks go to the Microsoft Research Horizontal AI Team and the Microsoft Alexander Multi-modal team for providing computing resources for large-scale training. We would like to thank the DINO team from IDEA, including Lei Zhang (opens in new tab), Hao Zhang (opens in new tab), Feng Li (opens in new tab), and Shilong Liu (opens in new tab), for helpful discussions and detailed instructions on using DINO for object detection. We would like to thank Aishwarya Kamath (opens in new tab) from NYU for sharing the Object365v2 dataset, and Lingchen Meng for helping convert contrastive denoising into regular denoising in DINO.

ECCV Workshop on “Computer Vision in the Wild”
Thu, 08 Sep 2022 | http://approjects.co.za/?big=en-us/research/articles/eccv-workshop-on-computer-vision-in-the-wild/

Please join the Workshop & Challenge on “Computer Vision in the Wild” (opens in new tab) at #ECCV2022.

Website: https://computer-vision-in-the-wild.github.io/eccv-2022/ (opens in new tab)

Workshop: The research community has recently witnessed a trend toward building transferable visual models that can effortlessly adapt to a wide range of downstream computer vision (CV) and multimodal (MM) tasks. We are organizing this “Computer Vision in the Wild” workshop to gather the academic and industry communities to work on CV problems in real-world scenarios, focusing on the challenges of open-set/open-domain visual recognition and efficient task-level transfer. Since there are no established benchmarks to measure the progress of “CV in the Wild”, we developed new benchmarks for image classification and object detection to measure the task-level transfer ability of various models/methods over diverse real-world datasets, in terms of both prediction accuracy and adaptation efficiency.

Challenge: This workshop will also host two challenges based on the ELEVATER benchmarks (opens in new tab), a platform with 20 image classification and 35 object detection public datasets for evaluating language-image models on task-level visual transfer, measuring both sample efficiency (#training samples) and parameter efficiency (#trainable parameters). The two challenges cover image classification in the wild and object detection in the wild, respectively.

Call for papers and participation: solving problems of open-set recognition and task-level visual transfer.

Through this collaborative community effort, we aim to evaluate the best vision foundation models and their adaptation methods, which will serve as references for future large vision model development.

Opportunities of fulltime visiting researcher at MSR Deep Learning team
Tue, 25 May 2021 | http://approjects.co.za/?big=en-us/research/articles/opportunities-of-fulltime-visiting-researcher-at-msr-deep-learning-team/

The MSR Deep Learning team works on broad topics centered around deep learning and has a sub-team specifically focusing on vision-language multimodal intelligence. We are running a visiting researcher program to facilitate MSR-university collaborations. We are looking for university faculty (and incoming faculty) to work with us as full-time visiting researchers at MSR. We will collaborate on research topics related to computer vision and vision-language multimodal intelligence.

Qualification: university faculty (and incoming faculty)
Work title: visiting researcher (as an MSR full-time employee, with competitive compensation)
Program length: 3 months to 1 year
Responsibilities: lead or co-lead research projects related to computer vision and vision-language multimodal intelligence

You are welcome to contact us (penzhan@microsoft.com, chunyuan.li@microsoft.com, jianwei.yang@microsoft.com) for more details about the program.
