LLexus: an AI agent system for incident management

When operating a software service in the cloud, keeping multiple distributed components responsive is a significant challenge for engineering teams. Engineers frequently rely on Troubleshooting Guides (TSGs) to mitigate performance or outage incidents. However, the effectiveness of TSGs is often hindered by their length, their implicit reliance on tribal knowledge, and the variable quality of their content. This paper introduces LLexus, an agent-based AI system that automates the execution of TSGs.