AWS re:Invent 2025 - Improve agent quality in production with Bedrock AgentCore Evaluations (AIM3348)

Published: December 5, 2025 at 07:45 PM EST
2 min read
Source: Dev.to

Overview

In this session, Amanda Lester, Vivek Singh, and Ishan Singh introduce Amazon Bedrock AgentCore Evaluations, a fully managed solution for continuous AI agent quality assessment. They demonstrate how to evaluate agents across 13 built‑in dimensions—including correctness, helpfulness, and tool usage—as well as custom evaluators. The session covers both online evaluations for production monitoring and on‑demand evaluations for CI/CD pipelines, using a travel‑agent example to show how tool‑selection accuracy dropped from 0.91 to 0.3, enabling detection in hours rather than weeks. Live demos illustrate setup in under five minutes, trace‑level analysis with detailed reasoning, and CloudWatch dashboard integration for continuous monitoring.
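The tool-selection metric described above can be illustrated with a small sketch. This is not the AgentCore Evaluations API; it is a plain-Python illustration of how a tool-selection accuracy score and a regression threshold could flag the 0.91 → 0.3 drop from the travel-agent example. The trace data, tool names, and the 0.8 threshold are all hypothetical.

```python
from dataclasses import dataclass

@dataclass
class TraceStep:
    expected_tool: str  # the tool a correct agent run should call
    selected_tool: str  # the tool the agent actually called

def tool_selection_accuracy(steps: list[TraceStep]) -> float:
    """Fraction of trace steps where the agent chose the expected tool."""
    if not steps:
        return 0.0
    correct = sum(1 for s in steps if s.selected_tool == s.expected_tool)
    return correct / len(steps)

# Hypothetical traces mimicking the session's travel-agent example:
# a healthy batch (91/100 correct) and a degraded batch (30/100 correct).
healthy = [TraceStep("search_flights", "search_flights")] * 91 + \
          [TraceStep("search_flights", "book_hotel")] * 9
degraded = [TraceStep("search_flights", "search_flights")] * 30 + \
           [TraceStep("search_flights", "book_hotel")] * 70

THRESHOLD = 0.8  # assumed alerting threshold, not an AgentCore default
for name, steps in [("healthy", healthy), ("degraded", degraded)]:
    score = tool_selection_accuracy(steps)
    status = "OK" if score >= THRESHOLD else "ALERT: possible regression"
    print(f"{name}: {score:.2f} {status}")
```

Running such a check continuously against production traces is what turns a silent quality drop into an alert within hours instead of weeks.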

Introduction to Amazon Bedrock AgentCore Evaluations at re:Invent

Hello everyone and welcome to Amazon re:Invent. My name is Amanda Lester, worldwide go‑to‑market leader for Amazon Bedrock AgentCore. I’m joined by Vivek Singh, senior technical product manager for AgentCore, and Ishan Singh, senior GenAI data scientist at AWS.

In this session we will:

  1. Introduce Amazon Bedrock AgentCore.
  2. Discuss key challenges of operating agents at scale in production.
  3. Provide an overview of our solution—AgentCore Evaluations.
  4. Demonstrate the solution with live demos.
  5. Share best practices and resources to help you evaluate and deploy agents faster.

The Technological Revolution: Amazon Bedrock AgentCore Platform Overview

Amazon Bedrock AgentCore is AWS's most advanced agentic platform, providing a comprehensive set of services to develop, deploy, and securely operate agents at scale. Key capabilities include:

  • Tool and memory integration – enrich agents with external tools and persistent state.
  • Purpose‑built infrastructure – scalable, secure deployment environments.
  • Observability and control – detailed insights into agent operations.
  • Open‑source protocol support – Model Context Protocol (MCP) and A2A (agent‑to‑agent) communications.
  • Framework agnostic – works with any agentic framework of your choice.

These foundations give developers confidence to bring agents into production while addressing the non‑deterministic nature of generative AI.

The Non‑Deterministic Nature of Agents and the Trust Gap Challenge

Agents can reason, create workflows, and make autonomous decisions without direct supervision. While this autonomy unlocks productivity gains, it also introduces a trust gap:

  • Non‑determinism – the same prompt can yield different outputs, making behavior hard to predict.
  • Risk of incorrect or harmful actions – especially when agents interact with external tools or data.

AgentCore Evaluations addresses this gap by providing systematic, continuous quality assessment across multiple dimensions, enabling developers to detect regressions quickly and maintain confidence in production agents.
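To make "custom evaluators" concrete, here is a minimal sketch of what a scoring dimension could look like. This is an illustrative interface only, not the AgentCore custom-evaluator API: the `Evaluator` protocol, the keyword heuristic, and all names are assumptions introduced for this example.

```python
from typing import Protocol

class Evaluator(Protocol):
    """A custom evaluator maps one agent interaction to a score in [0, 1]."""
    def score(self, prompt: str, response: str) -> float: ...

class KeywordCoverage:
    """Toy evaluator: rewards responses that mention the expected keywords.
    A production evaluator would typically use an LLM-as-judge instead."""
    def __init__(self, keywords: list[str]):
        self.keywords = [k.lower() for k in keywords]

    def score(self, prompt: str, response: str) -> float:
        text = response.lower()
        hits = sum(1 for k in self.keywords if k in text)
        return hits / len(self.keywords) if self.keywords else 0.0

# Hypothetical usage with the travel-agent scenario.
evaluator = KeywordCoverage(["flight", "price", "date"])
result = evaluator.score(
    "Find me a flight to Lisbon next week",
    "The cheapest flight on that date departs at 9am",
)
print(f"keyword coverage: {result:.2f}")
```

Plugging such a scorer into a continuous pipeline, alongside built-in dimensions like correctness and helpfulness, is the pattern the session demonstrates.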
