Using LLMs for safe low-level programming | Microsoft Research Forum

Presented by Aseem Rastogi and Pantazis Deligiannis at Microsoft Research Forum, Episode 5
Aseem Rastogi, Principal Researcher, and Pantazis Deligiannis, Principal Research Engineer, from Microsoft Research FoSSE (Future of Scalable Software Engineering), discuss technical results from ICSE 2025 on using large language models (LLMs) for safe low-level programming. The results demonstrate how LLMs can infer machine-checkable memory safety invariants in legacy C code and assist in fixing compilation errors in Rust codebases.
Transcript: Lightning Talk
LLMs for safe low-level programming
Aseem Rastogi, Principal Researcher, Microsoft Research FoSSE (Future of Scalable Software Engineering)
Pantazis Deligiannis, Principal Research Engineer, Microsoft Research FoSSE (Future of Scalable Software Engineering)
This talk covers two technical results from ICSE 2025 on using large language models (LLMs) for safe low-level programming. The results demonstrate LLMs inferring machine-checkable memory safety invariants in legacy C code and how LLMs assist in fixing compilation errors in Rust codebases.
Microsoft Research Forum, February 25, 2025
FRIEDERIKE NIEDTNER, Principal Technical Research Program Manager, Microsoft Research AI Frontiers: The following talk combines two projects that both harness LLMs' capabilities to understand and produce code. Both aim to help developers tackle the difficulties of safe low-level programming: one aims to ensure memory safety in legacy C code; the other presents RustAssistant, a tool that helps developers automatically fix compilation errors in Rust.
ASEEM RASTOGI: Hi, my name is Aseem Rastogi, and I’m a researcher in the Future of Scalable Software Engineering organization in Microsoft Research. I’m going to talk to you about our paper, “LLM Assistance for Memory Safety.” This paper will be presented at the 47th International Conference on Software Engineering in May later this year.
The lack of memory safety in low-level languages like C and C++ is one of the leading causes of software security vulnerabilities. For instance, a study by Microsoft estimated that 70% of the security bugs that Microsoft fixes and assigns a CVE every year are due to memory safety issues. Researchers have proposed safe dialects of C, for example, Checked C, that—with the help of additional source-level annotations—provide memory safety guarantees with low performance overheads. However, the cost of adding these annotations and the code restructuring required to enable them becomes a bottleneck in the adoption of these tools. In general, application of formal verification to real software faces the same challenge.
In our paper, we explore the use of pretrained large language models to help with the task of code restructuring and inferring source annotations required to adopt Checked C. Let’s consider an example that takes an array of integers as input and sums the first n elements. To reason about the memory safety of this function, Checked C requires an annotation on p. One such annotation is as shown here. This tells the compiler that p is an array with at least n elements, which is enough to ensure the safety of memory accesses in this function. It also helps impose an explicit obligation on the callers of this function that they must pass an appropriately sized array to it.
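A minimal sketch of what such a function and annotation might look like in Checked C is shown below; the function name and exact code are illustrative rather than taken verbatim from the talk:

    // Illustrative sketch of a Checked C bounds annotation.
    // "count(n)" tells the compiler that p points to at least n integers,
    // which is enough to prove the accesses p[i] below are in bounds.
    int sum(_Array_ptr<int> p : count(n), int n) {
        int total = 0;
        for (int i = 0; i < n; i++) {
            total += p[i];  // checked: i < n and p has bounds count(n)
        }
        return total;
    }

Callers of sum must in turn be able to show that the array they pass has at least n elements, which is the explicit obligation mentioned above.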
Our goal is to infer such annotations with the help of LLMs. For this problem, LLMs seem like a perfect match. First, it is hard to encode reasoning about real-world code and complex code patterns in symbolic tools; LLMs, on the other hand, have demonstrated tremendous code comprehension and reasoning capabilities, similar to what programmers have, even for real-world code. Second, LLM hallucinations might lead to incorrect annotations, but they cannot compromise memory safety: once the annotations are added to the code, the Checked C compiler guarantees memory safety even when the annotations are incorrect. This way, we get the best of both worlds!
However, working with LLMs for whole-program transformations in large codebases presents another challenge. We need to break the task into smaller subtasks that fit into LLM prompts while adding relevant symbolic context to each prompt. Put another way, for LLMs to reason like programmers, we need to provide them the context that a programmer would otherwise consider. Our paper presents a framework for doing just that, with program dependence graphs working in tandem with LLMs. We implement our ideas in a tool called MSA and evaluate it on real-world codebases ranging up to 20,000 lines of code. We observe that MSA can infer 86% of the annotations that state-of-the-art symbolic tools cannot. Although our paper focuses on memory safety, our methodology is more general and can be used to effectively leverage LLMs for scaling the use of formal verification to real software and, most importantly, to do so without compromising the soundness guarantees. We are really excited about this research direction.
Up next, my colleague Pantazis will tell you about how we are leveraging LLMs to make it easier for programmers to adopt Rust. Thank you.
PANTAZIS DELIGIANNIS: Hello, everyone. I’m Pantazis, and today I will be presenting our work on leveraging the power of large language models for safe low-level programming. Specifically, I will focus on our recent paper about RustAssistant, a tool that uses LLMs to automatically fix compilation errors in code written in Rust. This work was done together with the collaborators listed on the screen and will appear at the International Conference on Software Engineering later this spring.
OK, let’s dive in! Why do we care about safe low-level programming with Rust? The Rust programming language, with its memory and concurrency safety guarantees, has established itself as a viable choice for building low-level software systems over traditional, unsafe alternatives like C and C++. These guarantees come from a strong ownership-based type system, which enforces memory and concurrency safety at compile time. However, Rust poses a steep learning curve for developers, especially when they encounter compilation errors related to advanced language features such as ownership, lifetimes, or traits. At the same time, Rust is becoming more popular every year, so as more and more developers adopt Rust for writing critical software systems, it is essential to tackle the difficulty of writing Rust code.
At Microsoft Research, we created a tool called RustAssistant that leverages the power of state-of-the-art LLMs to help developers by automatically suggesting fixes for Rust compilation errors. Our tool uses a careful combination of prompting techniques and iteration between a large language model and the Rust compiler to deliver high-accuracy fixes. RustAssistant achieves an impressive peak accuracy of roughly 74% on real-world compilation errors in popular open-source Rust repositories on GitHub.
OK, let’s now see how RustAssistant works step by step. Let’s begin with the first step: building the code and parsing the build errors. Such errors can range from simple syntax mistakes to very complicated issues involving traits, lifetimes, or ownership rules in Rust code spread across multiple files. So when a developer writes Rust code that doesn’t compile, the Rust compiler generates detailed error messages that include the error code, the location of the error, as well as documentation and examples related to this error code.
To illustrate this process, let’s look at this very simple example on the screen. In this case, the developer is trying to compare a custom VerbosityLevel enumeration in their code using the greater-or-equal operator. However, the Rust compiler throws an error, stating that this binary operation cannot be applied to VerbosityLevel. The compiler suggests that the reason behind this error is that VerbosityLevel does not implement a trait required for performing such comparisons in Rust. This detailed error message is precisely what RustAssistant captures at this step, preparing it for the next stage of processing.
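The transcript does not include the code itself, but a rough reconstruction of the example being described might look like the following; the enum variants and the log_error body are assumed for illustration:

    // Rough reconstruction of the example described above (details assumed).
    enum VerbosityLevel {
        Error,
        Warning,
        Info,
    }

    fn log_error(level: VerbosityLevel, message: &str) {
        // error[E0369]: binary operation `>=` cannot be applied to type `VerbosityLevel`
        if level >= VerbosityLevel::Error {
            eprintln!("{}", message);
        }
    }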
In the next step, RustAssistant takes the detailed error information generated in the previous step and focuses on extracting the specific parts of the code that are directly relevant to this error. Looking at the example on the screen, the code snippets related to the enumeration and its use in the log_error function are automatically extracted by our tool. This includes not only the problematic line of code but also other code snippets that provide necessary context for understanding and resolving the error. The tool also captures the error details, such as the error code and the accompanying compiler suggestion about the missing trait for performing the comparison. These extracted code snippets and error details are then packaged into a prompt for the LLM. This ensures that the LLM receives only the essential information required to suggest an accurate fix without being overwhelmed by irrelevant parts of the codebase. This careful localization step is crucial for both efficiency and accuracy, especially when dealing with very large codebases.
Now let’s move to the last step. Here, RustAssistant sends the carefully localized prompt, which includes the error details and the relevant code snippets, to the large language model API. The LLM generates a proposed fix formatted as a code diff; in other words, for efficiency it does not include the entire code snippet but only the new, edited, or deleted code lines. For example, in the case of our build error, the LLM suggests adding the missing traits to the enumeration, as shown here on the screen. This fix ensures that the comparison using the greater-or-equal operator will now work as intended. Next, RustAssistant parses this suggested fix and applies the changes to the appropriate file in the codebase. Once the fixes are applied, our tool runs the Rust compiler again to verify whether the build error has been resolved. If the code compiles, then great news! The process is complete, and we can do further validations like running any unit tests.
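For this particular error, the suggested diff would amount to deriving the comparison traits on the enumeration, roughly as sketched below; the exact fix is only shown on the slide:

    // Deriving PartialEq and PartialOrd makes the `>=` comparison valid,
    // with variant order (Error < Warning < Info) determining the ordering.
    #[derive(PartialEq, PartialOrd)]
    enum VerbosityLevel {
        Error,
        Warning,
        Info,
    }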
However, if new errors appear or if the fix doesn’t fully resolve the issue, RustAssistant sends the updated context back to the LLM, iterating until the code compiles error-free. This iterative process allows our tool to handle complex, multi-step fixes while ensuring correctness and alignment with the developer’s intent. Of course, the example I showed here is a very simple one, but you can imagine the tool fixing much more complicated build errors.
To summarize, I presented a quick walkthrough of how RustAssistant can be used to help developers automatically fix build errors in their Rust codebases. In our paper, we evaluated RustAssistant on the top hundred Rust repositories on GitHub and showed that it can achieve an impressive peak accuracy of roughly 74% on real-world compilation errors. We invite you to read our ICSE paper as it not only discusses the evaluation results in detail but also dives into interesting technical details, such as how we designed our prompts as well as various techniques that we developed for scaling RustAssistant on very large codebases without losing accuracy.
Thank you for listening.
Series: Microsoft Research Forum
Date: February 25, 2025
Affiliation: Microsoft Research FoSSE

Speakers:
Aseem Rastogi, Senior Principal Researcher
Pantazis Deligiannis, Principal Research Software Engineer