Collab AI Research Articles (http://approjects.co.za/?big=en-us/research/)

Actions Speak Louder Than Prompts: Rethinking How LLMs Reason Over Graph Data
http://approjects.co.za/?big=en-us/research/articles/actions-speak-louder-than-prompts-rethinking-how-llms-reason-over-graph-data/
Tue, 03 Mar 2026

The post Actions Speak Louder Than Prompts: Rethinking How LLMs Reason Over Graph Data appeared first on Microsoft Research.

By Ben Finkelshtein (University of Oxford), Silviu Cucerzan, Sujay Kumar Jauhar, and Ryen W. White (Microsoft)

Think about the last time you opened a shared document at work. Behind that simple action lies a complex network of relationships: the colleagues who edited the file before you, the team site on which it is stored, the related documents those collaborators have touched, and the organizational structure connecting all of it. Such collaborative platforms are built on graphs – rich networks of people, content, and activity. A fundamental challenge in making them intelligent is understanding what each node in that graph represents. Should this document be flagged as sensitive? Which files should surface in a colleague’s feed? Does this sharing pattern look anomalous? 

These are all instances of node classification: given an entity embedded in a network of relationships, the goal is to assign it a meaningful label. It’s a problem that extends far beyond human collaboration to applications such as fraud detection in financial networks, product categorization in e‑commerce, and road traffic congestion prediction. And it’s a problem where large language models (LLMs) are increasingly being applied. 

The appeal of using LLMs is clear. Graph neural networks (GNNs), the traditional tool for this task, must be trained per dataset, don’t transfer across domains, and struggle with the rich textual information that real-world nodes often carry – lengthy document content, detailed product descriptions, user profiles. By contrast, LLMs offer a compelling alternative with their broad world knowledge and flexible reasoning capabilities. Yet despite a surge of interest, the field has lacked a principled understanding of how LLMs should interact with graph data, when different approaches work best, and why.

Our new study, “Actions Speak Louder than Prompts,” which will appear as an oral presentation at the upcoming ICLR 2026 conference, aims to fill that gap. We conducted one of the largest controlled evaluations of LLMs for graph inference to date, spanning 14 datasets across four domains, multiple structural regimes, and a range of model sizes and capabilities. The result is a set of practical, actionable insights for people building systems that combine language models with structured data – whether in collaborative platforms, social networks, e-commerce, or beyond. 

It’s not just what you ask; it’s how you let the model work 

When most people think about applying LLMs to a problem, they think about prompting – crafting the right instructions and feeding the relevant information directly into the model’s context window. This is indeed the most common approach in the LLM-for-graphs literature: serialize a node’s neighborhood into text, describe the labels, and ask the model to classify. 

However, prompting is only one way an LLM can interact with a graph. To paint a more complete picture of LLM interaction paradigms, we systematically compared three fundamentally different strategies: 

  • Prompting, where the graph neighborhood is serialized into text and presented to the model in a single shot. 
  • GraphTool, a ReAct-style approach where the model iteratively queries the graph through a fixed set of tools by retrieving neighbors, reading features, or checking labels one step at a time. 
  • Graph-as-Code, where the model writes and executes short programs against a structured API, composing arbitrary queries over the graph’s features, structure, and labels. 
[Figure: diagram of the progression from prompting to tool use to code generation.]
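To make the three strategies concrete, here is a minimal Python sketch. All class and function names are illustrative stand-ins, not the paper’s actual interface: the same toy graph is exposed as a serialized prompt, as GraphTool-style one-step primitives, and as an API that a model-written program can query directly.

```python
class Graph:
    """Toy graph: adjacency lists, per-node text features, known labels (None = unknown)."""
    def __init__(self, edges, features, labels):
        self.adj = edges
        self.features = features
        self.labels = labels

    # GraphTool-style primitives: one retrieval step per call.
    def get_neighbors(self, node):
        return self.adj.get(node, [])

    def get_feature(self, node):
        return self.features.get(node, "")

    def get_label(self, node):
        return self.labels.get(node)

def serialize_for_prompt(g, node):
    """Prompting: flatten the whole 1-hop neighborhood into a single text block."""
    lines = [f"Target node {node}: {g.get_feature(node)}"]
    for nb in g.get_neighbors(node):
        lines.append(f"Neighbor {nb} (label={g.get_label(nb)}): {g.get_feature(nb)}")
    return "\n".join(lines)

def run_generated_program(g, node, program):
    """Graph-as-Code: execute a model-written program against the same API."""
    scope = {"g": g, "node": node}
    exec(program, scope)  # the generated code is expected to set scope["prediction"]
    return scope["prediction"]

g = Graph(
    edges={0: [1, 2], 1: [0], 2: [0]},
    features={0: "shared budget doc", 1: "Q3 budget figures", 2: "offsite planning notes"},
    labels={0: None, 1: "finance", 2: "events"},
)

print(serialize_for_prompt(g, 0))

# One program an LLM might emit: majority vote over labeled neighbors.
code = (
    "from collections import Counter\n"
    "votes = [g.get_label(n) for n in g.get_neighbors(node)]\n"
    "votes = [v for v in votes if v is not None]\n"
    "prediction = Counter(votes).most_common(1)[0][0] if votes else None\n"
)
print(run_generated_program(g, 0, code))
```

The prompting path hands the model a fixed snapshot; the code path lets the model decide which of the same primitives to call, and in what order.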

The progression from prompting to tool use to code generation represents a spectrum of increasing agency, from passively consuming information to actively deciding what to look at and how to process it. Our core finding is that this agency matters. As models are given more flexibility in how they interact with the graph, classification accuracy consistently improves. 

Letting LLMs write code over graphs 

The standout performer across our evaluation was Graph-as-Code. Rather than constraining the model to a fixed set of retrieval actions or requiring all information to be packed into a prompt, this approach lets the LLM compose targeted programs by combining structural queries, feature lookups, and label checks in whatever way it deems most useful for the node at hand. You can see these results in the table below, where performance across long-text homophilic datasets highlights the gap between Prompting and Graph-as-Code, especially on high-degree graphs like wiki-cs.

[Table: dataset characteristics and results on long-text homophilic datasets.]

This advantage is especially pronounced in settings that mirror real-world complexity. Consider a collaborative platform where content nodes carry lengthy document text and are densely connected through sharing, co-authorship, and organizational links, or an e-commerce network where product nodes have detailed descriptions and hundreds of connections. A prompting approach quickly hits the LLM’s context window limit because there is simply too much text from too many neighbors to fit. Graph-as-Code sidesteps this issue entirely: the model selectively retrieves only the information it needs, keeping its context focused and efficient. 
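The context-budget argument can be made concrete with a small sketch (the node counts and text lengths are illustrative, not taken from the paper): on a high-degree node, a prompt that inlines every neighbor’s text grows with degree, while a generated program can aggregate neighbor labels without ever loading the text.

```python
from collections import Counter

# Dense toy node: 500 neighbors, each carrying a long text feature.
neighbors = list(range(500))
features = {n: f"full document body for node {n} ... " * 40 for n in neighbors}
labels = {n: ("report" if n % 2 == 0 else "memo") for n in neighbors}

# Prompting must inline every neighbor's text, so context grows with degree.
prompt = "\n".join(f"Neighbor {n}: {features[n]}" for n in neighbors)
print(f"prompt length: {len(prompt):,} characters")  # far beyond a typical context budget

# A Graph-as-Code program can aggregate labels without loading any text at all.
label_counts = Counter(labels[n] for n in neighbors)
prediction = label_counts.most_common(1)[0][0]
print("prediction:", prediction)
```

The prompt here runs to hundreds of thousands of characters, while the code path touches only the label dictionary.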

In practice, the most valuable real-world graph applications tend to involve exactly this kind of dense, text-rich network. Collaborative content graphs, recommendation systems, fraud detection networks, and social platforms aren’t small, sparse toy problems, but rather large-scale networks where nodes carry rich information and have many connections. For practitioners building intelligent features over these graphs, our findings suggest that investing in code-generation interfaces for LLMs may yield substantially better outcomes than refining prompts. 

Challenging conventional wisdom on graph structure 

A common conclusion in the LLM-for-graphs literature, repeated across many publications, is that these models struggle on heterophilic graphs – networks where connected nodes tend to have different labels rather than similar ones. The intuition is straightforward: if an LLM relies on neighborhood cues to classify a node, and those cues are misleading (because neighbors belong to different classes), performance should suffer. 

In collaborative platforms, people frequently work across organizational boundaries – an engineer collaborates with a designer, a finance team shares documents with marketing. The resulting graphs don’t have the neat clustering that homophily assumes. The same is true of networks of web-page links, interdisciplinary research, and many social networks. 

Our results tell a different story. Across four heterophilic datasets, all three LLM interaction strategies performed well, consistently outperforming classical baselines like label propagation. These findings challenge the assumption that LLMs are inherently limited to homophilic settings and suggest they can extract useful signal from node features and non-local patterns, rather than relying solely on neighborhood voting. 

[Table: dataset characteristics and results on the heterophilic datasets.]

This broadens the applicability of LLM-based graph reasoning to the messy, cross-cutting networks that real-world systems operate on. 
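For reference, the label-propagation baseline mentioned above can be sketched in a few lines. This is a generic textbook version, not the paper’s exact configuration: each unlabeled node repeatedly takes the majority label of its neighbors, which works only when homophily holds.

```python
from collections import Counter

def label_propagation(adj, seed_labels, iters=10):
    """Iteratively assign each unlabeled node the majority label of its neighbors."""
    labels = dict(seed_labels)
    for _ in range(iters):
        updated = dict(labels)
        for node, nbrs in adj.items():
            if node in seed_labels:
                continue  # seed labels stay fixed
            votes = Counter(labels[n] for n in nbrs if n in labels)
            if votes:
                updated[node] = votes.most_common(1)[0][0]
        labels = updated
    return labels

# Homophilic toy graph (two clusters): propagation recovers both communities.
adj = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2, 4, 5], 4: [3, 5], 5: [3, 4]}
print(label_propagation(adj, {0: "A", 5: "B"}))

# Heterophilic toy graph (a path whose true labels alternate): neighbor voting
# misleads, and the single seed label floods the whole chain.
bipartite = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
print(label_propagation(bipartite, {0: "A"}))  # predicts "A" everywhere
```

The second example is exactly the failure mode that makes heterophilic graphs hard for neighborhood-voting methods, and that the LLM strategies in our study avoid by drawing on node features and non-local patterns.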

Understanding what LLMs rely on 

Beyond overall accuracy, we wanted to understand how these models use different types of information. Do they lean on textual features? Graph structure? Known labels? And does this change depending on the interaction strategy? 

To answer this, we ran a series of controlled ablations – systematically removing edges, truncating text features, and deleting labels – and tracked how accuracy responded. The results, visualized as 2D heatmaps, revealed a striking contrast. 

Prompting degrades predictably: remove edges or labels, and accuracy drops along both axes. The model needs both structure and labels to function, and it has no way to compensate when either is degraded. 

Graph-as-Code, by contrast, displays remarkable adaptability. On homophilic datasets where structure is informative, it relies on edges. On heterophilic datasets where features matter more, it shifts to text. When labels are removed but features and structure remain, it is barely affected. Performance only suffers when multiple sources of information are simultaneously degraded. 

[Figure: ablation heatmaps of accuracy as edges, text features, and labels are degraded.]

This adaptive behavior is a key property of the code-generation paradigm. Because the model can compose arbitrary queries, it naturally gravitates toward whichever signal is most informative for the task at hand – a kind of emergent robustness that doesn’t need to be explicitly engineered. For systems operating over real-world data, where information is often incomplete or noisy, this resilience is especially valuable. 
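The ablation protocol described above can be sketched generically (the fractions, character budget, and seed here are illustrative defaults, not the paper’s settings): degrade each information source independently, then re-evaluate on the degraded graph.

```python
import random

def ablate(adj, features, labels, edge_frac=0.5, text_chars=20, label_frac=0.5, seed=0):
    """Degrade edges, text features, and labels independently, as in an ablation grid."""
    rng = random.Random(seed)
    # Remove a fraction of edges (kept symmetric for an undirected graph).
    kept = {(u, v) for u in adj for v in adj[u] if u < v and rng.random() > edge_frac}
    new_adj = {u: [] for u in adj}
    for u, v in kept:
        new_adj[u].append(v)
        new_adj[v].append(u)
    # Truncate every text feature to a character budget.
    new_features = {n: t[:text_chars] for n, t in features.items()}
    # Hide a fraction of the known labels.
    new_labels = {n: l for n, l in labels.items() if rng.random() > label_frac}
    return new_adj, new_features, new_labels

adj = {0: [1, 2], 1: [0, 2], 2: [0, 1]}
features = {n: "a long textual description of this node" for n in adj}
labels = {0: "A", 1: "B", 2: "A"}
a2, f2, l2 = ablate(adj, features, labels)
print(sum(len(v) for v in a2.values()) // 2, "edges kept;",
      len(f2[0]), "chars per feature;", len(l2), "labels kept")
```

Sweeping the degradation fractions over a grid and re-scoring each cell produces heatmaps of the kind described above.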

Design principles for LLM-graph systems 

Our study yields several practical guidelines for building systems that combine LLMs with graph-structured data: 

Match the interaction mode to the graph’s characteristics. For small, sparse graphs with short text features, prompting may suffice. But as graphs grow denser, features grow longer, or the application demands robustness, code-generation approaches like Graph-as-Code offer clear advantages. 

Don’t rule out LLMs for heterophilic graphs. Prior assumptions about LLM limitations in low-homophily settings appear to be an artifact of studying only the prompting paradigm. With the right interaction strategy, LLMs are effective across structural regimes, including the cross-cutting, boundary-spanning networks common in collaborative and organizational settings. 

Think beyond prompt engineering. In graph applications, how the model accesses information matters at least as much as what instructions it receives. Investing in richer interaction interfaces – tool use, code execution, structured APIs – can unlock performance that no amount of prompt tuning will achieve. 

These principles reflect a broader shift in how we think about LLMs: not as static question-answering systems, but as agents that can plan, explore, and compose actions to solve complex reasoning tasks. Graphs, with their rich relational structure and diverse information types, are a natural proving ground for this agentic paradigm. 

Looking ahead 

As LLMs continue to grow in capability, the advantages of agentic interaction modes are likely to compound. Our results already show that larger models and reasoning-enabled variants consistently improve performance across all interaction strategies. But critically, the gap between prompting and code generation persists at every model scale, suggesting that interaction design is a complementary axis of improvement to model scaling. 

For teams building intelligent features on collaborative platforms, knowledge graphs, or any system where entities are connected by rich relationships, this work offers a clear message: the way you let an LLM engage with your data can matter as much as the model itself. As the ecosystems of people, content, and activity that power modern productivity tools continue to grow in scale and complexity, principled approaches to LLM-graph interaction will only become more important. 

The title of our paper captures the core insight: when it comes to LLMs and graphs, actions truly do speak louder than prompts. 

Learn More

Read the full paper: Actions Speak Louder than Prompts: A Large-Scale Study of LLMs for Graph Inference 

Experiential Reinforcement Learning
http://approjects.co.za/?big=en-us/research/articles/experiential-reinforcement-learning/
Fri, 20 Feb 2026

The post Experiential Reinforcement Learning appeared first on Microsoft Research.

By Taiwei Shi, Sihao Chen, Longqi Yang, Jaime Teevan

Reinforcement Learning is at the core of building and improving frontier AI models and products. Yet most state-of-the-art RL methods learn primarily from outcomes: a scalar reward signal that says whether an attempt worked, not why it failed. When an agent writes code that doesn’t compile, for example, it may only receive a 0/1 score (“failed” vs. “worked”). Lacking an explanation or a concrete path to correction, the agent must try again, often thousands of times, until incremental parameter updates eventually produce a successful solution. This is particularly problematic for collaborative scenarios that are social, long-horizon, and hard to score.

Humans don’t learn that way. When you get better at collaborating, for example, it’s rarely by seeing success or failure alone; you talk through what went wrong, share context, and adjust together. Teamwork improves through reflection, not just outcomes. Today’s AI agents largely lack this reflective loop. Experiential Reinforcement Learning (ERL) asks: what if an agent could pause, reflect on its mistakes, and use those insights to improve?

[Diagram: overview of the Experiential Reinforcement Learning loop.]

The core idea: learning through experience

Rather than relying on imitation or blind retries, ERL teaches an agent to turn feedback into structured behavioral revision. The method follows five steps:

  • Make an initial attempt: Given a task, the model produces a first response and receives feedback from the environment (including text feedback and a scalar reward).
  • Get an evaluation: The environment assesses the attempt and returns information about what happened and what should change.
  • Reflect on what went wrong: When performance is suboptimal, the model generates a structured reflection describing how to improve. The reflection is conditioned on the task, the initial attempt, the feedback, and cross-episode memory of previously successful reflections.
  • Try again using that insight: Guided by its reflection, the model produces a revised second attempt and receives new feedback and reward.
  • Internalize what works: Through selective supervised distillation, effective second attempts are absorbed into the base policy. Over time, the model learns to produce improved behavior directly from the original input, without requiring reflection at deployment time.
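The five steps above can be sketched as a single training episode. Everything here is a toy stand-in, not the paper’s implementation: the environment, model, reward threshold, and helper names are hypothetical, chosen only to show the control flow of attempt, reflect, retry, and internalize.

```python
class ToyEnv:
    """Toy environment: rewards answers that contain the word 'tested'."""
    def evaluate(self, task, answer):
        ok = "tested" in answer
        return ("looks good" if ok else "you forgot to add tests"), float(ok)

class ToyModel:
    """Deterministic stand-in: fails at first, succeeds when given a reflection."""
    def generate(self, task, reflection=None):
        return f"{task} solved and tested" if reflection else f"{task} solved"

    def reflect(self, task, attempt, feedback, memory):
        return f"next time, address: {feedback}"

def erl_episode(model, env, task, memory, distill_buffer, threshold=1.0):
    """One ERL episode: attempt, evaluate, reflect, retry, and queue distillation."""
    # 1-2. Initial attempt and environment feedback (text + scalar reward).
    attempt = model.generate(task)
    feedback, reward = env.evaluate(task, attempt)
    if reward >= threshold:
        return attempt, reward  # success on the first try: nothing to revise

    # 3. Structured reflection, conditioned on the task, the attempt, the
    # feedback, and cross-episode memory of previously successful reflections.
    reflection = model.reflect(task, attempt, feedback, memory)

    # 4. Revised attempt guided by the reflection.
    revised = model.generate(task, reflection=reflection)
    feedback2, reward2 = env.evaluate(task, revised)

    # 5. Internalize: keep improved attempts for supervised distillation into the
    # base policy, so no reflection step is needed at deployment time.
    if reward2 > reward:
        memory.append(reflection)
        distill_buffer.append((task, revised))
    return revised, reward2

memory, distill_buffer = [], []
answer, reward = erl_episode(ToyModel(), ToyEnv(), "fix the bug", memory, distill_buffer)
print(answer, reward)  # the revised attempt earns full reward and is queued for distillation
```

In a real system, the distillation buffer would periodically be used for supervised fine-tuning of the base policy, closing the loop between reflection and internalization.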

How ERL mirrors how humans learn from experience

The underlying idea isn’t new; it’s new in this context. In the 1980s, education researcher David Kolb argued that people learn most effectively by cycling through experience and reflection: you have a concrete experience, reflect on what happened, form a revised understanding, and then try again. That cycle (experience, reflect, conceptualize, experiment) helps explain why one student learns from a failed exam while another simply retakes it. ERL can be seen as a computational version of Kolb’s cycle: the first attempt is the concrete experience; the reflection is the reflective observation; the revised second attempt puts a new conceptualization into practice. Finally, the internalization step, where successful corrections are distilled back into the policy, mirrors how people eventually stop needing to consciously work through the cycle because the lesson becomes automatic.

[Figure: Kolb’s experiential learning cycle mapped onto the stages of ERL.]

Results

Across agentic reasoning and tool-use tasks, ERL consistently outperforms standard RL. The largest gains appear in settings with minimal upfront instruction, environments where the agent must infer the “rules of the game” through interaction. In these open-ended regimes, reflection and revision become a primary driver of learning, and ERL is most valuable precisely where outcome-only RL tends to struggle.

[Charts: ERL versus standard RL across agentic reasoning and tool-use tasks.]

Looking ahead: learning through interaction in human-AI collaboration

Experience-driven learning could become a core primitive for future intelligent systems, shifting AI from optimizing outcomes to accumulating understanding through interaction.

The real promise of ERL points to a future where AI learns to collaborate with people. Human collaboration isn’t a fixed environment with a clean reward signal; it’s fluid, social, and deeply contextual. A good collaborator reads the room, adapts to a partner’s working style, recovers gracefully from misunderstandings, and builds a shared history of what works.

Today’s AI agents don’t do much of that; they often treat each interaction as if it’s the first. With ERL, an agent could reflect on why a conversation went sideways, revise its approach, and internalize the lesson for next time. Over time, it might learn that one user prefers concise answers, while another values detailed reasoning, and it could adapt accordingly. In effect, the agent’s way of working with you could become more personalized and reliable, like a trusted colleague.

ERL offers a concrete mechanism, not just a vision, for how AI might get there: not by hard-coding social rules, but by learning them the way people do, through experience.

Learn More

Read the paper: “Experiential Reinforcement Learning”

From One to Many
http://approjects.co.za/?big=en-us/research/articles/from-one-to-many/
Mon, 09 Feb 2026

The post From One to Many appeared first on Microsoft Research.

By Jaime Teevan, Chief Scientist & Technical Fellow

In recent years we’ve all lived through the transition to cloud computing, a sudden shift to remote work, and now the rapid rise of AI. Each individually has felt like a seismic event, but in reality they are all just chapters in one ongoing story: the digital evolution of collaboration. The latest chapter, AI, has so far focused on boosting individual productivity. The result has been real gains—fewer emails, faster drafts—but also a creeping cost. When content is generated without deep engagement, the burden shifts downstream. The next frontier isn’t simply faster solo work; it’s using AI to unlock the full potential of human collaboration.

From prompts to purpose. AI can be used to help teams work better together. People have millennia of experience collaborating, but collaboration breaks down at boundaries—across time zones, languages, and scale. AI can bridge these gaps, but only if we shift from optimizing individual prompts to aligning on shared purpose. That means designing organizational systems and work practices that support joint goal-setting, distributed grounding, and collective evaluation. It’s not enough for one person to prompt well; the team must co-construct the context that guides the model.

From documents to dialogue. After decades of creating and sharing knowledge in the form of documents, knowledge work is now becoming conversational. Instead of authoring artifacts from scratch, teams now co-create through interaction—brainstorms, chats, meetings—and AI turns those into persistent memory. This shift demands new representations, not just documents, but dialogic artifacts that reflect the social process of their creation, along with new knowledge systems that can reason across them. We need to be able to track evolving intent, synthesize across modalities, and preserve epistemic provenance.

From solo to social. A key challenge to making the above possible is that today’s models have been designed as fundamentally single-user. Making them collaborative requires teaching them to have social intelligence: the ability to model turn-taking, resolve conflicting inputs, and adapt to group norms. This is a frontier in model training, evaluation, and interaction design. We’re investing in new data, signals, and architectures to support multi-party alignment and shared agency.

Collaboration is the constant. The heart of work has always been working together. Tools evolve—from email to cloud docs to AI agents—but our need to connect, align, and build together endures. The history of technology is a story of removing barriers to collaboration: distance, delay, language, information overload. AI is the next chapter in that story. But until it reshapes how we brainstorm, solve problems, and share understanding as teams, we’re only scratching the surface of its potential. That’s the vision we’re pursuing: AI that helps us work better, together. This site shares our breakthroughs, our failures (because science is about learning what doesn’t work), and the questions we’re excited to explore next.

Welcome to Collab AI—we’re glad to have you in the conversation.
