For example, your HR agent configuration might include three separate graders:
1. A general quality grader to assess whether the response is complete and addresses the full question.
2. A classification grader, where you describe the expected behavior using natural-language prompts.
3. A capability grader to confirm the agent uses the right topic or tool at the right time.

Even better, you can make these expectations explicit: what matters, what does not, and what “good behavior” looks like in this scenario. By defining evaluation logic upfront, you’ll reduce ambiguity, make success observable and explainable, and shift quality from subjective judgment to measurable signal.
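To make this concrete, here is a minimal sketch of how those three graders could be captured as declarative configuration before any runs happen. This is illustrative Python, not Copilot Studio’s actual schema; the `Grader` class and its fields are hypothetical stand-ins for the settings you define in the product.

```python
from dataclasses import dataclass

# Hypothetical grader definition -- a stand-in for the evaluation
# settings configured in the product, not an actual API.
@dataclass
class Grader:
    name: str
    kind: str          # "quality" | "classification" | "capability"
    instructions: str  # natural-language description of expected behavior

GRADERS = [
    Grader(
        name="completeness",
        kind="quality",
        instructions="The response is complete and addresses the full question.",
    ),
    Grader(
        name="expected_behavior",
        kind="classification",
        instructions="The agent answers from the cited HR policy rather than guessing.",
    ),
    Grader(
        name="tool_use",
        kind="capability",
        instructions="The agent uses the right topic or tool at the right time.",
    ),
]
```

Writing expectations down this way is what makes “good behavior” observable: each grader names one thing that matters and states it in plain language.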
### Step 4: Set the right identity context

Once you’ve outlined what you’re testing, you need to define the context in which the evaluation should run. Specifically, which user profile should the agent assume is sending the questions while it’s being evaluated?
The user context you select determines the agent’s behavior, including what data it can retrieve and reason over. It also ensures evaluations catch permission-related risks early, such as inappropriate data access.
Making this choice explicit helps avoid a common source of false confidence. When results are reviewed later, makers can trust that successes and failures are grounded in the same access boundaries their users will experience.
For example, an HR agent that references internal policy articles may behave very differently depending on whether it’s responding to a full-time employee or a contractor.
Running the evaluation only under the intended user identity ensures the results reflect real conditions rather than an idealized setup. This can help you identify and mitigate unexpected behavior, such as sharing your company’s healthcare options with a contractor.
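As a rough illustration of why this matters, the sketch below models a toy permission check. Everything here is hypothetical (`UserContext`, `POLICY_ARTICLES`, and `retrievable_articles` are invented for this example) and stands in for the access controls your real data sources enforce.

```python
from dataclasses import dataclass

# Hypothetical identity context for an evaluation run.
@dataclass(frozen=True)
class UserContext:
    user_id: str
    employment_type: str  # e.g. "full_time" or "contractor"

# Toy knowledge base: each policy article declares who may read it.
POLICY_ARTICLES = {
    "healthcare-options": {"full_time"},
    "holiday-schedule": {"full_time", "contractor"},
}

def retrievable_articles(ctx: UserContext) -> list[str]:
    """Return only the articles this identity may retrieve, so the
    evaluation sees the same access boundaries a real user would."""
    return [
        name
        for name, audiences in POLICY_ARTICLES.items()
        if ctx.employment_type in audiences
    ]

# A contractor identity should never surface full-time-only content.
contractor = UserContext(user_id="c-042", employment_type="contractor")
assert "healthcare-options" not in retrievable_articles(contractor)
```

If the same evaluation were run under a full-time identity instead, the healthcare article would be in scope, and a grader checking for inappropriate disclosure would never get the chance to fail.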
### Step 5: Evaluate the agent’s responses

Now it’s time to run your evaluation. Based on the data you provided, Copilot Studio simulates real user prompts and the agent generates responses, scoped to your prescribed user context. Each configured grader then evaluates a different aspect of the response, such as quality, correctness, or capability.

This evaluation process turns individual answers into structured signals. Together, these signals make agent behavior observable, repeatable, and explainable at scale.
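A minimal sketch of that loop, assuming hypothetical stand-ins for the agent and graders (`agent_respond` and `grade` are invented here and do not reflect Copilot Studio’s internals):

```python
# Each simulated prompt goes to the agent under the prescribed identity,
# and every grader scores one aspect of every response.

def agent_respond(prompt: str, employment_type: str) -> str:
    # Stand-in for the agent under test, answering as the given identity.
    return f"Stub answer to: {prompt}"

def grade(grader_name: str, prompt: str, response: str) -> dict:
    # Stand-in for one grader; real graders might be LLM judges or rules.
    passed = len(response) > 0  # trivial placeholder check
    return {"grader": grader_name, "prompt": prompt, "passed": passed}

PROMPTS = ["How many vacation days do I get?", "Can I enroll in the 401(k)?"]
GRADER_NAMES = ["completeness", "expected_behavior", "tool_use"]

signals = [
    grade(g, p, agent_respond(p, "full_time"))
    for p in PROMPTS
    for g in GRADER_NAMES
]

# One structured record per (prompt, grader) pair: 2 x 3 = 6 signals,
# each comparable and repeatable across runs.
print(f"{len(signals)} signals collected")
```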
The maker is no longer relying on intuition or spot checks to assess their agent’s quality. They’ve created a disciplined feedback loop that replaces assumptions with evidence and transforms agent quality from a subjective impression into a measurable outcome.
### Step 6: Step back to see the bigger picture

Once your evals gather sufficient signals, your focus shifts outward: “What does this tell me overall?”
Aggregated results provide a high-level view of quality, consistency, and trends across scenarios and graders. For the HR agent, this might reveal strong performance on common policy questions but weaknesses around edge cases or escalation behavior.

With these signals, you can better prioritize. Not every failure matters equally. Patterns matter more than anomalies. And evaluation becomes a decision-support tool, not just a reporting surface.
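As a rough illustration of that kind of roll-up, the sketch below aggregates per-run signals into a pass rate per scenario. The record shape and the sample values are invented for the example; real exports will differ.

```python
from collections import defaultdict

# Hypothetical per-run signals, grouped here to surface patterns
# rather than one-off anomalies.
signals = [
    {"scenario": "common_policy", "grader": "completeness", "passed": True},
    {"scenario": "common_policy", "grader": "tool_use", "passed": True},
    {"scenario": "edge_case", "grader": "completeness", "passed": False},
    {"scenario": "edge_case", "grader": "escalation", "passed": False},
]

by_scenario: dict[str, list[bool]] = defaultdict(list)
for s in signals:
    by_scenario[s["scenario"]].append(s["passed"])

# A pass rate per scenario makes the pattern visible at a glance:
# strong on common policy questions, weak on edge cases.
for scenario, results in by_scenario.items():
    rate = sum(results) / len(results)
    print(f"{scenario}: {rate:.0%} pass rate over {len(results)} checks")
```

A view like this is what turns evaluation into a decision-support tool: a scenario that fails consistently is a clearer priority than any single anomalous run.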