AFMR: Model Advancement - Microsoft Research

A Human-Aligned Automated Evaluation Framework for Natural Language Generation via Large Language Models

IIT Kharagpur: Pawan Goyal (PI)

The research proposes a novel automated evaluation framework for natural language generation via large language models. The framework aims to overcome the limitations of existing evaluation metrics that fail to fully capture the nuances of LLM-generated content. It also seeks to mitigate biases such as positional, length, and self-enhancement biases that are often present in such models. The framework will undergo rigorous testing across diverse tasks, including summarization and mathematical reasoning. The research also aims to explore the performance of various LLMs and develop new metrics for evaluating their outputs.

Accelerating research cooperation with modern foundation models in FAIR data repositories

Harvard University: Stefano Maria Iacus (PI)

The proposal aims to explore the capabilities of Foundation Models to facilitate the discovery of research and research data. The aim is to enrich metadata and build semantic knowledge graphs to accelerate knowledge acquisition. OpenAI models and fine-tuned open source models such as Llama-2, Falcon or MPT will be utilized.

Advancing Foundation Models Towards Physical AI: Bridging the Gap Between Natural Language and Wireless Sensing

University of California, Riverside: Amr Magdy (PI)

This project will investigate how to generate linguistic summaries in natural language from real-time millimeter-wave and sub-terahertz radar data. Beamforming and range/speed tracking capabilities of radar systems will be leveraged to enable human and environmental context perception. Radar datasets will be pre-processed, and useful features will be utilized as input to large-scale language models. The ultimate objective of this project is to enable the fusion of wireless sensing with natural language, revealing concealed patterns within unstructured radar signatures. Radars have already been used as sensing modality for vital signs detection and human behavior recognition. However, existing research on radar-based vital signs detection and human activity monitoring focuses on traditional machine learning models tailored to the application use case, which may require considerable expertise from the end user and cannot adapt to the constant changes in realistic scenarios. To tackle these challenges, large language models will be utilized to assist radar signal analysis. In this study, commercially available radar platforms will be employed to obtain data associated with typical human activities. Algorithms for signal processing will be developed to combine radar data with Generative Pre-Trained Transformer models, and insights into the strengths and weaknesses of existing foundation models will be garnered.

Aligning Foundation Language Models to Human Goals: An Ensemble Approach

Carnegie Mellon University: Carlee Joe-Wong (PI)

This proposal presents a novel approach for improving alignment of Foundation Language Models (LMs) to human goals. The focus is on creating an ensemble of LMs incorporating cost constraints, human feedback, and strategic utilization of different LMs. The team plans to employ online learning mechanisms, particularly reinforcement learning, to optimize this process. The approach will be validated using various datasets.

Bidirectional Learning of Large Language Models (LLMs) through Discussions 📝

Tokyo Institute of Technology: Naoaki Okazaki (PI)

Naoaki Okazaki proposes research to examine whether large language models (LLMs) can benefit from a more dynamic, interactive, and bidirectional learning process that mirrors human cognition. This approach intends to improve how LLMs understand and generate language, addressing limitations observed in current LLMs’ performances in complex reasoning tasks. The research introduces a paradigm of using real-time, adaptive discussions between two LLMs for training, with one LLM serving as the ‘learner’ while the other as a ‘discussion partner’.

Related papers:

Bridging the Data Gap Between Large Language Models and Human Children 📝

Stanford University: Michael Frank (PI)

The proposal aims to bridge the gap between large language models (LLMs) and human learning. The project plans to enhance the efficiency and interpretability of LLMs, lowering their data and model size requirements, and shedding light on human cognitive models and efficient language acquisition capabilities. The research will mainly focus on improving the quality of data used for training LLMs, enhancing evaluation benchmarks for comprehensive language understanding and employing innovative techniques to bring the capabilities of LLMs closer to those of human children.

Related papers:

Bringing Order to Language Models

Seoul National University: Seung-won Hwang (PI)

Study how large language models (LLMs) may interleave interaction with external environments, such as search engines, more effectively, toward reducing widely known LLM weaknesses such as hallucinations in answering questions. However, this requires teaching LLMs, how to interact with environments, and which external skills can be leveraged from them. Existing work can be categorized by human-in-the-loop adaptation, requiring heavy human feedbacks (HF). Our distinction is mining “the rationale for how to perform such goal better” from the environment itself, to replace expensive HF.

Broadening large language models’ impact with small language models

Carnegie Mellon University: Graham Neubig (PI)

The project aims to extend the use cases for large language models (LLMs) by developing smaller, deployable models. The team plans to improve their Prompt2Model framework with advancements in automatic data wrangling, multilingual distillation, and better dataset generation algorithms. This research could lead to significant advancements in the application and reach of LLMs.

Building Foundation Models for Efficient Finetuning or Zero-Shot Learning of Sequential Decision-Making 📝

University of Maryland: Furong Huang (PI)

This proposal focuses on the development of foundation models for sequence decision-making, with online and offline stages. The offline stage involves exposure to tasks, datasets, and domains for wide-ranging understanding, while the online stage fine-tunes the pretrained representations to specific tasks. The result is a foundation model that is beneficial for a diverse range of decision-making scenarios.

Related papers:

Characterizing the Memory Space of Large Language Models in Processing Relational Structures in Textual Inputs

Cornell University: Jon Kleinberg (PI)

The proposal aims to study the capability of Large Language Models (LLMs) in processing and remembering relational structures within texts. It sets to understand this through various research questions centered around LLMs ability to extract these structures, and investigate memory compression patterns for potential parallels with human cognitive processes.

ChatGPT-Symbolic Models for Reasoning, Planning & More

IIT Delhi: Mausam (PI)

This project proposes to study the potential of Large Language Models (LLMs) in solving complex reasoning and planning problems. The PI’s team plans to develop hybrid models that combine LLMs with traditional AI approaches, and study how their performance varies with increasing problem complexity. The underlying hypothesis to be tested is whether existing reasoning/planning models and LLMs can complement each other’s strengths and weaknesses. Proposed use-cases include solving Sudoku problems or other puzzles, planning problems, knowledge-graph answering, and task-oriented dialogue.

Computational Intelligence in Active Space Selection: A Frontier Hypothesis for Multireference Simulations

University of Chicago: Laura Gagliardi (PI)

The proposal aims to improve the computational efficiency of multireference material simulations by interfacing Large Language Models (LLMs) with existing active space techniques. The proposed methods are expected to automate the efficient partitioning of Hilbert spaces by fine tuning foundational models to specialize in automated active space selection in Multiconfigurational Quantum Chemistry. We will investigate the capabilities of AI agents in performing electronic structure simulations and benchmark their performance based on model parameters and the level of fine-tuning. The study will investigate the strengths and weaknesses of LLMs in performing high-fidelity electronic structure calculations, aim to posit use cases, and suggest avenues for improvement in the domain of LLM-enhanced multireference electronic structure simulations.

Cost-efficient Next-generation LLM Serving System

University of California Berkeley: Alvin Cheung (PI)

The primary objectives of this project are twofold. First, we aim to significantly accelerate the inference speed of Large Language Models (LLMs), thereby facilitating faster model serving and more efficient real-time interactions. The second goal is to substantially reduce the computational cost associated with LLM inference. By achieving this, we can make these advanced models more accessible and scalable, allowing for broader deployment across various industries and applications. Both objectives are integral to enhancing the utility and performance of Large Language Models, and thereby contribute to making them a more viable and effective solution for a wide range of challenges.

Democratizing Robust Data Analysis for Causal Inference with LLM-Based Interactive Optimization

University of Southern California: Angela Zhou (PI)

The proposal aims to democratize robust data analysis for causal inference using LLM-based code generation in natural language for interactive optimization. Through development of novel methods for in-context learning and structuring LLM outputs, it aims to transform scientific data interpretation and experimental data synthesis.

Driving Scene and Behavior Generation with LLM-based Reasoning 📝

University of California Berkeley: Wei Zhan (PI)

The proposal aims to combine 3D scene understanding with large language models to realize 3D scene generation, a pivotal aspect of simulation technology. The team is set to employ large language models and pre-trained behavior/diffusion models to generate high-fidelity 3D scenes and social behavior. The approach can find wide applications in autonomous driving, robotics, and prompt learning.

Related paper:

WOMD-Reasoning: A Large-Scale Language Dataset for Interaction and Driving Intentions Reasoning (opens in new tab)

Effective and Transparent Collaboration Among LLM Agents 📝

École Polytechnique Fédérale de Lausanne: Robert West (PI)

This project aims to address the challenges and benefits of collaboration among large language model (LLM) agents. We focus on the effectiveness, transparency, and safety aspects of LLM collaboration, with the goal of identifying principles for designing effective and transparent collaborations.

Related paper:

Evaluating Language Model Agency through Negotiations (opens in new tab)

Efficient, Controllable, and Scalable Language Models through Retrieval 📝

University of Washington: Hanna Hajishirzi (PI)

The proposal presents Training and inference methods for retrieval-based large language models (LLMs) addressing issues like parameter inefficiency, lack of control, and factual incorrectness. The goal is to design LMs that are more controllable, efficient, factual and generalized towards multiple domains.

Related paper:

Self-RAG: Learning to Retrieve, Generate, and Critique through Self-Reflection (opens in new tab)

Eliciting Syntactic Reasoning to Overcome Surface Heuristics in Large Language Models 📝

New York University: Tal Linzen (PI)

Large language models (LLMs) are argued to be able to learn to perform a task from a handful of examples given “in context”, without weight updates. How robust is in-context learning to distribution shifts between the examples and the target instances of the task? We address this question via evaluations of LLMs’ syntactic generalization and inductive biases, which can reveal whether models rely on linguistically principled structural features or unreliable surface features (such as word positions). We also investigate whether out-of-distribution generalization is affected by—or can be improved by—chain-of-thought prompting, where the model is provided with a sequence of intermediate computation steps that illustrate how the task ought to be performed. Highlights: Models pre-trained on code (like Codex) perform exceptionally well—better than models with an order of magnitude more parameters! Their high performance also correlates well with whether chain-of-thought prompting can improve performance further. We conclude that in-context learning and chain-of-thought prompting may be more robust in models trained on code.

Related paper:

In-context Learning Generalizes, But Not Always Robustly: The Case of Syntax (opens in new tab)

Enabling Large Language Models to Reason about Time-Series 📝

University of Washington: Tim Althoff (PI)

The project aims to enable large language models (LLMs) to reason about time-series data. This involves interactive reasoning, data description and summarization, and statistical query assessment. The work involves the generation of large-scale multi-domain synthetic pairs of time-series and related user queries, and the development of time-series encoders.

Related paper:

Correcting misinformation on social media with a large language model (opens in new tab)

Enhancing Faithfulness of Large Language Models with Open Knowledge Graph

University of Wisconsin-Madison: Junjie Hu (PI)

The proposal targets at enhancing the faithfulness of large language models (LLMs) through the use of structured Chain-of-Thought (CoT) technique. Existing CoT methods often create unfaithful information without grounding to actual world evidence, the proposed project seeks to resolve this by utilizing structured and unstructured data to enhance the truthfulness of LLMs in open-domain multi-hop question-answering.

Enhancing LLMs’ Logical Reasoning Capabilities through Hybrid Reasoning 📝

University of Leeds: Anthony Cohn (PI)

We aim to improve the logical reasoning capabilities of large language models (LLMs) by enhancing semantic parsing, integrating symbolic reasoning techniques, providing customized training, and conducting evaluation. The focus will be on developing a logic reasoning module and integrating it with LLMs, fine-tuning LLMs on logic-based data, and evaluating the performance of the enhanced AI system on complex first-order logic reasoning tasks.

Related papers:

Evaluating the Commonsense Reasoning Capabilities of LLMs

University of Leeds: Anthony Cohn (PI)

We will work to evaluate and improve the commonsense reasoning abilities of large language models (LLMs) through a variety of approaches. This will involve building new benchmarks to evaluate specific aspects of commonsense reasoning and the development of a new dialectical approach that uses multi-turn conversations to test a system’s understanding and consistency. We will focus on assessing comprehension of extended initial texts, making commonsense entailments, using synthetic data for inference, and testing the robustness of responses to variations in queries.

Foundation models and Interactive theorem provers for Mathematical reasoning

Indian Institute of Science: Siddhartha Gadgil (PI)

The proposed research aims to integrate Foundation Models with Interactive Theorem Provers to improve reliability in mathematical reasoning. The project will focus on two primary aspects: proof automation and an approach referred to as autoformalization. The expected outcome is a reasoning system that can be utilized by individuals with limited or no programming language and AI knowledge. Both prompt engineering and fine-tuning techniques will be applied for this project.

GoriLLA: Training LLMs to Retrieve and Invoke APIs

University of California Berkeley: Joseph Gonzalez (PI)

The proposal focuses on training Large Language Models (LLMs) to use tools to take actions on the external world, particularly through discovering and invoking public APIs. The goal is to enable LLMs to understand brief natural language descriptions and autonomously compose and execute the required API calls. The project aims to overcome challenges such as the vast and changing space of APIs, LLM hallucination, and the comprehension and composition of multiple API calls together.

Hallucination Mitigation in Foundation Models and Improved Automatic Chain of Thought Prompting 📝

KAIST: Steven Euijong Whang (PI)

Hallucination in language models refers to generating nonsensical or unfaithful content. Foundation models are especially vulnerable to information leakage through hallucination, posing serious privacy risks for their training data. An emerging line of research to combat hallucination adopts Chain of Thought (CoT) prompting, which offers additional reasoning steps by manually constructed prompts [Wei et al., NIPS’22]. Very recently, automated CoT prompt generation (Auto-CoT) has been proposed to eliminate the need for human effort [Zhang et al., ICLR’23]. While most of works focuses on arithmetic and commonsense reasoning tasks, it is essential to explore with various NLP tasks to better understand the hallucination in different contexts. Hence, our research aims to (1) develop strategies to enhance Auto-CoT’s performance using GPT-3.5 and GPT-4, and (2) extend our investigation to various NLP tasks that can also have privacy issues.

Related paper:

ERBench: An Entity-Relationship based Automatically Verifiable Hallucination Benchmark for Large Language Models (opens in new tab)

Improving Inductive Reasoning in Language Models via Hypothesis Search

Stanford University: Noah Goodman (PI)

The proposal focuses on enhancing the inductive reasoning capabilities of LLMs by generating explicit inductive hypotheses. This is achieved by translating these into Python programs that can be validated against examples and can also be generalized to novel inputs. Initial experiments display improved accuracies and the aim is to further extend these results with the aid of additional compute resources.

Interpretable Multi-Document Summarization Through Large Language Model Probing

University of Michigan, Ann Arbor: Lu Wang (PI)

We aim to analyze the intermediate reasoning capabilities of large language models (LLMs) for long multi-document inputs, particularly when applied to the task of multi-document summarization (MDS). Recent work has shown that summarization-specific pre-trained language models (PLMs) exhibit poor multi-document reasoning and synthesis capabilities, motivating the need for improved modeling of the MDS task. Unfortunately, these limitations have significant implications for sensitive real-world applications requiring holistic understanding of inputs from heterogeneous sources. To this end, we will study whether state-of-the-art LLMs are capable of multi-document reasoning and information synthesis, and investigate whether prompt-based probing of LLMs can yield insights into their decision-making processes for the MDS task. Specifically, we seek to elicit chain-of-thought explanations within promptable LLMs for the purpose of interpreting and improving multi-document summarization approaches.

Large Language Model based Safe, and Aligned Autonomous Systems: An Integrated Software Architecture and Pilot Implementation

Texas A&M University Corpus Christi: Chandra Sekharan (PI)

Recent advances in foundational large language models (LLMs) are already positively impacting application domains such as healthcare, finance, and education. The next evolutionary leap in LLMs lies in their integration with physical systems. This evolution is both exciting and challenging: it simplifies the application layer’s design, development, and deployment while posing the challenge of ensuring alignment and safety. This research introduces a novel 3-layer software architecture, designed to foster safe and aligned autonomous systems. It incorporates a hybridized fine-tuning/RAG code generator to initiate intelligent missions for systems to operate autonomously and safely. A key aspect of this project is an extensive Pilot implementation, which leverages cloud-based LLM API providers (Azure’s OpenAI), Vector Databases, ArduPilot open-source software, and commercial hardware for drones. We will conduct tests and missions at the Lone Star UAS Center of Excellence and Innovation, affiliated with Texas A&M University-Corpus Christi. This center, one of seven FAA-authorized sites in the U.S., provides a regulatory-compliant and safe environment for achieving the project’s goals. The research will be impactful in reducing the complexity of launching missions to help improve the environmental resilience of the Gulf of Mexico region, while engaging under-represented students in research on LLM technologies.

Leveraging Foundation Models to Study Causal, Social, and Moral Reasoning 📝

Stanford University: Tobias Gerstenberg (PI)

The proposal from Stanford’s Causality in Cognition Lab outlines research plans to use Foundation Models fory studying causal reasoning, moral judgments, and social cognition. The proposal outlines methods to automate the construction of causal models, understand the development of category representations, simulate the cultural transmission of scientific knowledge and align these models with human social and moral reasoning.

Related papers:

Leveraging Large Language Models in Causal Inference and Learning

IIT Hyderabad: Vineeth N Balasubramanian (PI)

This proposal aims to increase the robustness, interpretability, and generalizability of deep learning models using causal principles. The study intends to leverage Large Language Models (LLMs) for robust causal inference and causal grounding of representations learned by deep learning models.

Leveraging LLMs for Tackling the Unsolved Abductive Reasoning Task 📝

Gwangju Institute of Science And Technology: Sundong Kim (PI)

The project attempts to leverage LLMs to tackle the unsolved Abstraction Reasoning Corpus (ARC) task and to utilize LLMs for data augmentation purposes. The primary aim is to derive a logical pattern for each task, similar to how a human would do, aided by ARC dataset and LLM APIs.

Related paper:

Reasoning Abilities of Large Language Models: In-Depth Analysis on the Abstraction and Reasoning Corpus (opens in new tab)

Plugging-In Large Private Relational Databases with LLMs for Seamless Conversational Querying and Exploration

IIT Bombay: Sunita Sarawagi (PI)

Our goal is to tackle three challenges in integrating private relational databases with foundation models for conversational querying and exploration. These include: (1) Efficiently and accurately retrieving the relevant subset of a database schema for a query. (2) Benchmarking the performance of LLMs under ambiguity and generating follow-up questions for clarification. (3) On the fly adaptation of Text-to-SQL generation on complex schema subgraphs.

Public Opinion Simulation for AI Alignment

University of Chicago: James Evans (PI)

LLMs have the ability to accurately reproduce systems of beliefs and attitudes corresponding to different segments of the public. By leveraging this capability, this project seeks to implement “simulated public opinion” analyses as an LLM guardrail. Before an AI automated task is initiated, the LLM will simulate public responses to the proposed actions, generating a diverse set of responses corresponding to a heterogeneous stakeholder population which could be local, national, or global. By this mechanism, actions that would be deemed widely unacceptable by the relevant public can be identified and halted before proceeding. This guardrail may prove to be an important step toward aligning AI systems with the diverse set of values and interests held by the publics affected by algorithmic decision making.

Quantifying and Improving Out-of-Distribution Performance of LLMs on Reasoning Tasks

University of Pennsylvania: Surbhi Goel (PI)

This proposal targets the issue of poor out-of-distribution (OoD) performance in large language models on algorithmic reasoning tasks which serve as elementary building blocks for more intricate reasoning problems. The overarching goal is the development of a rigorous understanding of OoD failures and new algorithmic strategies for their mitigation using synthetic setups as a testbed.

SmoothQuant: Accurate and Efficient Post-Training Quantization for Large Language Models 📝

MIT: Song Han (PI)

Large language models (LLMs) show excellent performance but are compute- and memory-intensive. Quantization can reduce memory and accelerate inference. However, existing methods cannot maintain accuracy and hardware efficiency at the same time. We propose SmoothQuant, a training-free, accuracy-preserving, and general- purpose post-training quantization (PTQ) solution to enable 8-bit weight, 8-bit activation (W8A8) quantization for LLMs. Based on the fact that weights are easy to quantize while activations are not, SmoothQuant smooths the activation outliers by offline migrating the quantization difficulty from activations to weights with a mathematically equivalent transformation. SmoothQuant enables an INT8 quantization of both weights and activations for all the matrix multiplications in LLMs, including OPT, BLOOM, GLM, MT-NLG, and LLaMA family. We demonstrate up to 1.56× speedup and 2× memory reduction for LLMs with negligible loss in accuracy. SmoothQuant enables serving 530B LLM within a single node. Our work offers a turn-key solution that reduces hardware costs and democratizes LLMs.

Related paper:

In the Name of Fairness: Assessing the Bias in Clinical Record De-identification (opens in new tab)

Software Development Using Large Language Models

University of California, Irvine: Brian Demsky (PI)

This project aims to explore the use of LLMs in software development, their limitations, and the potential development of automated tools to improve the efficiency of using LLM for software development. The areas of focus include LLM’s handling of novel programming tasks, ways around the token limit, ensuring code correctness, their ability to help with debugging, and their effectiveness at modifying existing code. Our approach entails developing a set of programming tasks, understanding effective approaches at manually using LLMs for various tasks, and gradually transitioning from manual queries to automated processes.

Systematic LLM Compression with Dense-and-Sparse Quantization 📝

University of California Berkeley: Amir Gholami (PI)

The proposal focuses on improving the inference of Large Language Models (LLMs), particularly Transformer Models, by developing a novel LLM serving framework and two research thrusts: Systematic LLM Quantization and Pruning, and Speculative Decoding with Big Little Hierarchical methods. Through these approaches, the project aims to tackle the growing computational and memory bandwidth requirements of these models, thus enhancing their deployment in applications such as virtual assistants, chat bots and machine translation.

Related papers:

TABLE2JSON: Zero-Shot Table Extraction with Instruction-Tuned Language Models 📝

Georgia Institute of Technology: Alan Ritter (PI)

The project will evaluate the effectiveness of INSTRUCTE, an open-domain method for extracting structured records from tables in scientific literature. A new dataset called ARXIVTABLES, consisting of 3,792 annotated cells across 122 tables from 25 machine learning papers, will be developed for evaluating the proposed task. INSTRUCTE will also be extended to the leaderboard extraction task by linking the extracted data with predefined leaderboards. The contributions of the project include defining the new TABLE2JSON task, introducing the INSTRUCTE prompting method, and constructing the ARXIVTABLES dataset.

Related papers:

Teaching Large Language Models to Cite and Comprehend Long-Text 📝

Princeton University: Danqi Chen (PI)

Interested in understanding the limitations of these powerful LLMs, as well as developing effective methods to improve their capabilities. Our team is studying the limitations of large language models (LLMs) like GPT-3, ChatGPT, and GPT-4, aiming to enhance their capabilities. We are working on two sub-projects: one involves creating a benchmark and modeling methods to improve LLMs’ ability to cite references accurately, while the other focuses on teaching LLMs to excel at long-text reading compression and summarization tasks, which they currently struggle with.

Related paper:

Enabling Large Language Models to Generate Text with Citations (opens in new tab)

Theorem Proving with Large Language Models

Stanford University: Tengyu Ma (PI)

This project aims to enhance the reasoning capacity of Large Language Models (LLMs), enabling them to effectively perform theorem proving and mathematical question answering. Using proofs in formal languages like Lean, Isabelle, and Coq, the models will be fine-tuned using Reinforcement Learning from Human Feedback. This process will be extended to natural language as well. New LLM architectures will also be explored to optimize reasoning capabilities.

Towards Continual Learning and Adaptation of Foundation Models

Polytechnique Montréal: Sarath Chandar (PI)

The proposal aims to develop continuous learning strategies for foundation models, focusing on addressing data distribution shifts, designing effective optimization methods, and aligning models with evolving human preferences. The project focuses on reducing wasteful retraining processes and increasing the time effectiveness of these models.

Using Cognitive Science to Deepen our Understanding of Foundation Models 📝

Princeton University: Thomas Griffiths (PI)

This proposal aims to apply cognitive science methodologies to foundation models, specifically large language models (LLMs), to gain insights valuable to both social sciences and foundation model research. By relating LLMs behavior to human decision-making processes, the project aims to both observe and mitigate current LLMs weaknesses.

Related papers:

Model Advancement

AFMR Goal: Align AI with shared human goals, values, and preferences via research on models