Overview of agent evaluation

As AI agents take on a critical role in business processes, the need for reliable and repeatable testing becomes essential. Agent evaluation allows you to create tests that simulate real-world scenarios for your agents.

These tests cover many questions and conversations far more quickly than manual, case-by-case testing. You can then measure the accuracy, relevance, and quality of the agent's responses, based on the information the agent has access to. By using the evaluation results, you can optimize agent behavior and verify that the agent meets your quality and business requirements.

Why should we use automated testing?

Agent evaluation provides automated, structured testing. It helps detect problems early, reduces the risk of false positives, and maintains quality as the agent evolves. Because the tests are repeatable, they act as a form of quality assurance: they help confirm that the agent meets your business's accuracy and reliability standards, and they make its performance transparent. This approach has distinct advantages over manual testing in the test chat.

You can run evaluations and view the results in the Copilot Studio interface, through the Power Platform REST API, or by adding actions to a tool, a flow, or Power Automate.

Agent evaluations measure accuracy and performance, not AI ethics or safety issues. An agent might pass every evaluation test and still give an inappropriate answer to a question. Customers should still use responsible AI reviews and content safety filters; evaluations do not replace them.

Limitations of Government Community Cloud

Evaluating agents in a Government Community Cloud (GCC) environment has the following limitations:

  • Creators cannot add user profiles to their test suites. However, they can still run evaluations without user profiles.
  • Creators cannot use similarity test methods in evaluations. All other test methods are available.

How agent evaluation works

Copilot Studio builds every agent evaluation from test cases. Each test case is a unique interaction that simulates how a user interacts with your agent. The interaction can be a single question or an entire conversation.

A test case can also include the answer you expect your agent to give. For example:

  • Question: What are your working hours?
  • Expected answer: We are open from 9 a.m. to 5 p.m., Monday through Friday.

With agent evaluation, you can create a group of test cases, import one, or write the test cases yourself. This group of test cases is called a test suite. A test suite allows you to:

  • Run multiple test cases covering a range of scenarios at once, instead of asking your agent each question individually.
  • Analyze agent performance with an easy-to-understand aggregate score, and drill into the details of each individual test case.
  • Test changes to your agent against the same test suite, so you have an objective benchmark for measuring and comparing performance.
  • Quickly create new test suites or modify existing ones to cover the agent's changing capabilities or requirements.
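To make the relationship between test cases and test suites concrete, here is a minimal Python sketch. The names `EvalCase` and `EvalSuite` and their fields are hypothetical, chosen for illustration only; they are not Copilot Studio's actual schema or import format.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass(frozen=True)
class EvalCase:
    """One simulated interaction (hypothetical shape, not the product schema)."""
    question: str
    expected_answer: Optional[str] = None  # optional, as described above


@dataclass
class EvalSuite:
    """A named, reusable group of test cases."""
    name: str
    cases: list[EvalCase] = field(default_factory=list)

    def add(self, question: str, expected_answer: Optional[str] = None) -> None:
        """Append a new test case to the suite."""
        self.cases.append(EvalCase(question, expected_answer))


suite = EvalSuite("store-faq")
suite.add(
    "What are your working hours?",
    "We are open from 9 a.m. to 5 p.m., Monday through Friday.",
)
# A case without an expected answer can still be judged on general quality.
suite.add("Do you ship internationally?")

print(f"{suite.name}: {len(suite.cases)} test cases")
```

Because the suite is an ordinary value, the same set of cases can be rerun unchanged after every agent update, which is what makes it usable as a benchmark.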

Each test can evaluate your agent using multiple testing methods simultaneously.

You can also select a user profile to act as the simulated user. An agent can be configured to respond to different users in different ways, or to give them different levels of access to resources.

When you select a test suite and run an agent evaluation, Copilot Studio sends the questions from the test cases, records the agent's responses, compares those responses to the expected responses or quality standards, and assigns a score to each test case. You can also view the details, logs, and activity map for each test case, along with the resources the agent used to generate its responses.
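The run described above (send each question, record the response, compare, score) can be sketched as a small loop. Everything here is an illustrative assumption: `fake_agent` stands in for a real agent, and `exact_match` and `word_overlap` are simple stand-in scoring functions, not the test methods Copilot Studio actually implements. The sketch also shows how one test case can be scored by several methods in the same run.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class EvalCase:
    question: str
    expected_answer: Optional[str] = None


def exact_match(response: str, case: EvalCase) -> Optional[float]:
    """1.0 if the response equals the expected answer, ignoring case and outer spaces."""
    if case.expected_answer is None:
        return None  # this method does not apply without an expected answer
    return float(response.strip().lower() == case.expected_answer.strip().lower())


def word_overlap(response: str, case: EvalCase) -> Optional[float]:
    """Fraction of the expected answer's words that also appear in the response."""
    if case.expected_answer is None:
        return None
    expected = set(case.expected_answer.lower().split())
    found = set(response.lower().split())
    return len(expected & found) / len(expected) if expected else 0.0


Method = Callable[[str, EvalCase], Optional[float]]


def run_suite(agent: Callable[[str], str], cases: list[EvalCase],
              methods: dict[str, Method]) -> list[dict]:
    """Send each question, record the response, and score it with every method."""
    results = []
    for case in cases:
        response = agent(case.question)
        scores = {}
        for name, method in methods.items():
            score = method(response, case)
            if score is not None:
                scores[name] = score
        results.append({"question": case.question, "response": response, "scores": scores})
    return results


def fake_agent(question: str) -> str:
    """Hypothetical stand-in for a real agent."""
    known = {
        "What are your working hours?":
            "We are open from 9 a.m. to 5 p.m., Monday through Friday.",
    }
    return known.get(question, "Sorry, I don't know.")


cases = [
    EvalCase("What are your working hours?",
             "We are open from 9 a.m. to 5 p.m., Monday through Friday."),
    EvalCase("Where is the store?", "Our store is at 1 Main Street."),
]
results = run_suite(fake_agent, cases,
                    {"exact_match": exact_match, "word_overlap": word_overlap})
for result in results:
    print(result["question"], result["scores"])
```

Keeping the scoring functions separate from the loop means a method can be added or removed without touching the test cases themselves.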

Develop a comprehensive evaluation strategy

Before running an evaluation, define what success means for your agent and decide which scenarios matter most to your business results. A clear strategy helps you choose the right test methods, prioritize high-impact test cases, and interpret results in the appropriate context.

  • Use the System Solution Architecture: Evaluation Framework to map business objectives to measurable evaluation dimensions and scoring methodologies.
  • Use System Design and Operational Assessment to build repeatable assessment processes that support continuous quality improvement.

Integrate evaluation into automated flows

Agent evaluation supports automation, so creators don't have to start each run by hand. By using the REST API or the Power Platform connector, you can programmatically trigger evaluation runs and integrate testing into automated workflows such as continuous integration and continuous deployment (CI/CD). This approach lets you run test suites at scale and validate agent behavior as changes are introduced, without manual steps in Copilot Studio.
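One way to use evaluation results in a CI/CD pipeline is as a quality gate that fails the build when too few test cases pass. The sketch below shows only the gating logic. The actual REST API endpoints, authentication, and response format are not covered here, so `fake_trigger_run` is a hypothetical placeholder for whatever call starts a run and returns one score per test case, and the thresholds are arbitrary example values.

```python
from typing import Callable, Sequence


def pass_rate(case_scores: Sequence[float], pass_mark: float = 0.5) -> float:
    """Share of test cases whose score reaches the pass mark."""
    if not case_scores:
        return 0.0  # treat an empty run as a failure rather than a pass
    passed = sum(1 for score in case_scores if score >= pass_mark)
    return passed / len(case_scores)


def gate(trigger_run: Callable[[], Sequence[float]], min_pass_rate: float = 0.9) -> int:
    """Return a process exit code: 0 if the quality bar is met, 1 otherwise."""
    scores = trigger_run()
    rate = pass_rate(scores)
    print(f"pass rate: {rate:.0%} (required: {min_pass_rate:.0%})")
    return 0 if rate >= min_pass_rate else 1


def fake_trigger_run() -> list[float]:
    """Hypothetical placeholder: a real version would start an evaluation run
    through the REST API or connector and collect the per-case scores."""
    return [1.0, 1.0, 0.8, 0.2]


exit_code = gate(fake_trigger_run, min_pass_rate=0.75)
# In a CI script, pass exit_code to sys.exit() so a failed gate fails the build.
```

Injecting the trigger as a function keeps the gate testable offline and makes it easy to swap in the real call once its details are known.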

Test chat versus agent evaluation

Each testing method provides you with different insights into the agent's qualities and behavior:

Test chat:

  • You send one question and receive one answer at a time. You're unlikely to repeat the same test multiple times.
  • You can view an entire session containing multiple messages.
  • You interact with your agent as a user would, through the chat interface.

Agent evaluation:

  • You can create and run multiple test cases simultaneously using a test suite. You can repeat the tests using the same test suite.
  • You can test one question and one answer per test case, or one conversation per test case. However, you have less control over the conversations compared to using the test chat feature.
  • You can choose different user profiles to simulate different users without having to complete the interactions yourself.

When testing agents, use both the test chat and agent evaluation to get a comprehensive view of your agent.
