About
I am a Senior Researcher at Microsoft Research AI in the Societal Resilience group. My current focus is statistical techniques for advancing LLM control and robustness. The complexity of large language models (LLMs) and multimodal models makes them challenging to predict and control, especially when scaled for real-world applications. Fortunately, statistical theory has much to say about how to analyze the properties of complex systems in ways that enable prediction and control. My work bridges rigorous statistical methods with practical engineering solutions, providing a framework for analyzing, predicting, and enhancing LLM behavior so that models meet expectations reliably and consistently.
Current Areas of Focus
1. Experimental Design for Engineering GraphRAG Enhancements
GraphRAG is a graph-based approach to retrieval augmented generation. I lead efforts to improve the accessibility and scalability of GraphRAG by making it less costly to run without sacrificing the quality of the results. This largely involves leveraging fine-tuned small language models like Phi-3 to handle components of GraphRAG’s answer-generation workflow. These enhancements depend on many interconnected variables, including models of different sizes, fine-tuning, prompts, and hyperparameters. To that end, I’m leading our team’s effort to build an experimental design platform that enables our engineers to run experiments that optimize these variables.
Learn more about GraphRAG enhancements.
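As a rough illustration of the kind of experiment the platform supports, the sketch below enumerates a full factorial design over a few hypothetical configuration variables and scores each configuration. The factor names and the `run_pipeline` scoring hook are placeholders I've made up for illustration, not part of the GraphRAG codebase.

```python
from itertools import product

# Hypothetical factors to vary; not the actual GraphRAG configuration schema.
factors = {
    "model": ["phi-3-mini", "phi-3-small"],
    "prompt_variant": ["baseline", "condensed"],
    "community_level": [1, 2],
}

def run_pipeline(config: dict) -> float:
    """Placeholder hook: run the answer-generation workflow with `config`
    and return a quality score (e.g., a mean grader rating)."""
    raise NotImplementedError

def full_factorial(factors: dict):
    """Yield every combination of factor levels as a config dict."""
    names = list(factors)
    for levels in product(*(factors[name] for name in names)):
        yield dict(zip(names, levels))

if __name__ == "__main__":
    for config in full_factorial(factors):
        try:
            score = run_pipeline(config)
        except NotImplementedError:
            score = float("nan")  # stays NaN until wired to a real pipeline
        print(config, score)
```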
2. Statistical Significance in Adversarial Attacks
A growing trend in adversarial research is to use LLMs to red team other LLMs. In this paradigm, an attacker LLM generates “attack” prompts passed to a target LLM. The goal of the attacker is to cause the target to misbehave, such as by generating text that would be harmful in deployment settings. When these attacks succeed, it is difficult to know whether the success occurred because of the substance of the attack or because of other random and non-reproducible conditions.
This is a problem of statistical significance; it is unclear whether the target LLM’s response to the attacker’s prompts differs significantly from its baseline behavior. My goal with this work is to develop theoretically rigorous tests of statistical significance for these adversarial attacks. For example, in the MedFuzz project, developed in collaboration with clinical experts, I developed an adversarial algorithm to surface model misbehavior in answering medical questions, and created a nonparametric test to provide statistical validation of these successful attacks. This ensures that observed medical question-answering vulnerabilities in the targeted LLMs are genuine rather than coincidental, enhancing the model’s robustness in clinical settings.
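As a minimal sketch of what such a nonparametric test can look like (illustrative only, not the exact test developed for MedFuzz), the example below runs a permutation test on per-prompt error indicators, asking whether the error rate under attack exceeds the baseline error rate by more than random relabeling would explain.

```python
import numpy as np

def permutation_test(attack_errors, baseline_errors, n_perm=10_000, seed=0):
    """attack_errors, baseline_errors: arrays of 0/1 error indicators per prompt."""
    rng = np.random.default_rng(seed)
    attack_errors = np.asarray(attack_errors, dtype=float)
    baseline_errors = np.asarray(baseline_errors, dtype=float)
    observed = attack_errors.mean() - baseline_errors.mean()

    pooled = np.concatenate([attack_errors, baseline_errors])
    n_attack = len(attack_errors)
    count = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        diff = pooled[:n_attack].mean() - pooled[n_attack:].mean()
        count += diff >= observed
    # One-sided p-value: how often a random relabeling looks as extreme as observed.
    return (count + 1) / (n_perm + 1)

# Made-up example data: 1 = target answered incorrectly, 0 = answered correctly.
p = permutation_test(attack_errors=[1, 1, 0, 1, 1, 0, 1, 1],
                     baseline_errors=[0, 0, 1, 0, 0, 0, 1, 0])
print(f"one-sided p-value: {p:.4f}")
```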
3. Causal Reasoning in Multimodal Models
I use causal inference theory to evaluate and control causal reasoning with multimodal models. This research seeks to identify which causal inductive biases are needed for identifiable causal inference with these models. When these inductive biases are lacking in a given context, I look for ways to inject them through the architecture, bespoke learning objectives, fine-tuning, and the prompt.
See our seminal paper on causal reasoning in LLMs – Causal Reasoning and Large Language Models: Opening a New Frontier for Causality
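As a toy illustration of injecting a causal inductive bias through the prompt (a hypothetical sketch, not the protocol from the paper), the example below compares a plain pairwise causal-direction query against one that instructs the model to reason about the underlying mechanism. `query_llm` is a placeholder for whatever chat-completion client is in use.

```python
def query_llm(prompt: str) -> str:
    """Placeholder: send `prompt` to an LLM and return its text response."""
    raise NotImplementedError

PLAIN_TEMPLATE = (
    "Which cause-and-effect relationship is more likely?\n"
    "A. {x} causes {y}\n"
    "B. {y} causes {x}\n"
    "Answer with A or B."
)

# Adds an instruction intended to act as a causal inductive bias.
MECHANISM_TEMPLATE = (
    "Consider the physical mechanism linking these variables before answering.\n"
    + PLAIN_TEMPLATE
)

def pairwise_direction(x: str, y: str, with_bias: bool = True) -> str:
    template = MECHANISM_TEMPLATE if with_bias else PLAIN_TEMPLATE
    return query_llm(template.format(x=x, y=y))

# Example usage (requires wiring query_llm to a real model):
# print(pairwise_direction("altitude of a city", "average temperature"))
```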
4. Partial Pooling for LLM Benchmarks
It is common practice to evaluate LLMs by measuring their accuracy on benchmarks composed of individual vignettes used to prompt the LLM. But these basic accuracy statistics don’t reveal what kinds of errors the LLM is making, and it is difficult to identify common themes in the errors, since benchmark items tend to overlap across many different themes. This research addresses the problem by combining LLM-based clustering methods with Bayesian hierarchical modeling’s ability to partially pool information across different groups of data.
See our recent work Walk the Talk, where we leverage this technique to quantify unfaithfulness – how LLM explanations differ from the model’s true answer-generation process.
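The sketch below shows the partial-pooling idea in isolation, assuming benchmark items have already been grouped into thematic clusters by an LLM-based clustering step. The counts are made-up placeholders and the model is a generic hierarchical beta-binomial in PyMC, not the exact model from the paper.

```python
import numpy as np
import pymc as pm

# Per-cluster counts: number of benchmark items and number answered incorrectly.
n_items = np.array([40, 25, 60, 15])
n_errors = np.array([8, 10, 9, 6])

with pm.Model() as benchmark_model:
    # Population-level distribution of cluster error rates (partial pooling).
    mu = pm.Beta("mu", 2, 2)
    kappa = pm.HalfNormal("kappa", 50)
    # Cluster-level error rates drawn from the population distribution.
    theta = pm.Beta("theta", alpha=mu * kappa, beta=(1 - mu) * kappa,
                    shape=len(n_items))
    pm.Binomial("errors", n=n_items, p=theta, observed=n_errors)
    idata = pm.sample(1000, tune=1000, chains=2, random_seed=0)

# Cluster estimates are shrunk toward the population mean, stabilizing
# error-rate estimates for sparsely populated clusters.
print(idata.posterior["theta"].mean(dim=("chain", "draw")).values)
```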
Other Work
In a research collaboration with Madeleine Daepp, we study how generative AI is changing political messaging in democracies. This work was featured in The Economist.
I am the author of a book called Causal AI, the result of a passion project to unite graphical causal inference with deep generative AI. Here are some testimonials:
Causal AI is a timely and comprehensive resource in meeting growing demands for AI systems that generate and understand causal narratives about our world. Robert Ness has done a fantastic job guiding the reader, step by step, from the mathematical principles of causal science to real life applications, integrating ideas from reinforcement learning, generative AI and counterfactual logic. The code examples in this book are state-of-the-art, and will help readers ramp up quickly with many case studies and motivating applications. It’s exciting to have a learning resource to recommend that tightly integrates so many powerful ideas while remaining accessible and practical.
~ Judea Pearl, UCLA, author of Book of Why and Causality
There are many theoretical books on causality (by Pearl, Robins, et al), and a few practical books (eg by Kolnar), but this book by Robert Ness combines the best of both worlds. He clearly explains the principles and assumptions behind topics such as counterfactual queries, while also giving examples of how to implement these ideas using various Python toolboxes (such as PgmPy, DoWhy, and Pyro). He goes beyond standard statistical textbooks by discussing topics of interest to people in machine learning, such as using deep neural networks, Bayesian causal models, and connections with RL and LLMs.
~ Kevin Murphy, Google AI, author of Probabilistic Machine Learning
AI, causal inference, and Bayesian modeling are at the forefront of modern data science, and this book expertly combines all three. The book covers a scalable workflow from basics to advanced applications, including a thorough treatment of Bayesian approaches to causal inference. Robert weaves ideas from statistics, machine learning, and causal inference into an accessible guide, with numerous business-relevant examples from tech, retail, and marketing. This text is invaluable for developing robust and explainable AI systems grounded in causal thinking, with clear applications to real-world business problems.
~ Thomas Wiecki, Founder of PyMC Labs, Core Developer for PyMC
Integrating causality into AI crucially breaks through the ‘black box’ barrier to interpretability, resulting in models that are more robust and capable of reasoning, explaining, and adapting. Robert Osazuwa Ness demystifies causal AI with a code-first, hands-on approach, using accessible tools like PyTorch and DoWhy, and breaking down complex concepts into implementable, digestible steps. Causal AI serves as both a textbook and a reference, and is written in an engaging, conversational style that clarifies concepts and welcomes newcomers, while spanning advanced content for seasoned users. As data scientists, machine learning researchers, and tech innovators, we will benefit by being able to move beyond correlation, harness domain knowledge, and build intelligent systems grounded in causality. I’m excited to share this book with students and colleagues, and watch it shift the conversation from mere prediction to the power of reasoning in AI.
~ Karen Sachs, Founder of Next Generation Analytics and Aeon Bio
I received my Ph.D. in statistics from Purdue University, where my dissertation focused on Bayesian active learning for causal discovery. I’m a Johns Hopkins SAIS alumnus and a graduate of the Hopkins-Nanjing Center.