AWS re:Invent 2025 - Customize models for agentic AI at scale with SageMaker AI and Bedrock (AIM381)

Published: December 5, 2025 at 08:37 PM EST
3 min read
Source: Dev.to

Overview

In this session, Amit Modi and Shelbee demonstrate Amazon SageMaker’s new capabilities for building agentic AI applications at scale. They introduce serverless model customization with a broad selection of foundation models and fine‑tuning techniques—including reinforcement learning—plus serverless MLflow for unified observability, and serverless model evaluation with industry benchmarks and AI‑as‑a‑judge metrics.

The demo walks through an end‑to‑end workflow:

  • Customizing Qwen 2.5 for a medical‑triage agent
  • Tracking experiments and datasets as versioned assets
  • Evaluating against MMLU clinical benchmarks
  • Deploying to SageMaker endpoints
  • Integrating with the AgentCore runtime via the Strands SDK

Key highlights

  • Automatic lineage tracking
  • SageMaker Pipelines integration with new deployment steps for Bedrock
  • Multi‑model endpoints with adapter‑based inference (≈ 50 % cost savings)
  • Speculative decoding (≈ 2.5× latency reduction)

The session addresses four critical production challenges: lack of standardized customization tools, fragmented observability, evolving ML‑asset tracking needs, and complex inference optimization.

  • Rapid adoption of agentic AI in enterprise software: the share of applications with agentic capabilities is projected to grow from roughly 1 % in 2024 to 33 % in 2028 (about a 33× increase).
  • By 2028, roughly 15 % of day‑to‑day work decisions are expected to be made autonomously by agents, driving demand for fast, cost‑effective inference at scale.

Production Challenges

  1. No standardized model‑customization tools

    • Teams build ad‑hoc workflows with glue code, then must rewrite them for production, causing delays and manual effort.
  2. Fragmented observability

    • Disparate tools make it hard to debug failures or detect deviations in model/agent behavior.
  3. Evolving ML‑asset tracking

    • Beyond models, teams must version reward functions, prompts, and other artifacts used in reinforcement learning, adding integration overhead.
  4. Cost‑effective, high‑quality inference

    • Selecting optimal instance types, containers, and frameworks requires extensive benchmarking, often leading to expensive or delayed deployments.

SageMaker Capabilities

Amit Modi (Senior Manager, Model Operations & Inference) and Shelbee (Worldwide Specialist Senior Manager, Gen AI) outline how SageMaker tackles these challenges:

Serverless Model Customization

  • Broad foundation‑model catalog (public models plus Bedrock models)
  • Fine‑tuning techniques: supervised fine‑tuning, reinforcement learning, and more (a training‑data sketch follows this list)
  • Fully serverless: no capacity planning or GPU management; SageMaker handles infrastructure, checkpointing, and node recovery automatically
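
As a concrete illustration of the supervised path, the snippet below writes training data in a chat‑messages JSONL layout that is commonly used for supervised fine‑tuning. This is a minimal sketch: the exact schema the serverless customization flow expects is not covered in the session, so the field names and example content are assumptions.

```python
import json

# Illustrative medical-triage examples in a chat-messages JSONL layout.
# The field names ("messages", "role", "content") follow a common SFT
# convention and are an assumption about what the customization job expects.
examples = [
    {
        "messages": [
            {"role": "system", "content": "You are a medical triage assistant."},
            {"role": "user", "content": "Patient reports chest pain radiating to the left arm."},
            {"role": "assistant", "content": "Triage level: emergent. Advise immediate emergency care."},
        ]
    },
]

with open("triage_sft.jsonl", "w") as f:
    for example in examples:
        f.write(json.dumps(example) + "\n")
```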

SageMaker Studio UI

  1. Navigate to SageMaker Studio → Models.
  2. Choose a foundation model and a fine‑tuning technique (selectable through the UI, SDK, or agent experience).
  3. Upload a dataset or select an existing, versioned dataset from SageMaker.
  4. Pick or define a reward function (a hedged Lambda sketch follows this list):
    • Write inline code, or
    • Attach a pre‑registered Lambda that implements the reward logic.
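
For the Lambda option, a minimal sketch of a reward handler is shown below. The event and response shapes that SageMaker passes to a pre‑registered reward Lambda, and the triage‑label scoring logic, are assumptions for illustration only.

```python
import json

# reward_function.py - hypothetical reward Lambda for the medical-triage use case.
# The event/response contract assumed here ("completion", "reference", "reward")
# is illustrative, not a documented SageMaker interface.

TRIAGE_LEVELS = {"emergent", "urgent", "non-urgent"}  # illustrative label set


def lambda_handler(event, context):
    completion = event["completion"].strip().lower()
    reference = event["reference"].strip().lower()

    # Full reward for an exact match on the reference triage answer,
    # partial credit if the completion at least names a valid triage level.
    if completion == reference:
        reward = 1.0
    elif any(level in completion for level in TRIAGE_LEVELS):
        reward = 0.3
    else:
        reward = 0.0

    return {"statusCode": 200, "body": json.dumps({"reward": reward})}
```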

SageMaker automatically checkpoints jobs, enabling seamless recovery from node failures and ensuring efficient compute usage.

Pipeline Integration

  • SageMaker Pipelines now include purpose‑built steps for:
    • Model customization
    • Deployment to SageMaker endpoints and Bedrock (inference‑as‑a‑service)
  • No glue code required: annotate notebook code with @step or upload via the UI to generate a fully functional pipeline (sketched below).
  • Pipelines are serverless, eliminating the need to manage underlying compute resources.
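
The @step pattern looks roughly like the sketch below, using the SageMaker Python SDK's function‑step decorator. The step bodies, S3 paths, instance types, and role ARN are placeholders; the actual customization and deployment steps shown in the session are not reproduced here.

```python
from sagemaker.workflow.function_step import step
from sagemaker.workflow.pipeline import Pipeline


# Placeholder steps: real bodies would invoke the model-customization and
# deployment capabilities demonstrated in the session.
@step(name="prepare-data", instance_type="ml.m5.xlarge")
def prepare_data(raw_uri: str) -> str:
    # ... transform raw records into the fine-tuning format ...
    return "s3://example-bucket/triage/processed/"  # placeholder output URI


@step(name="fine-tune", instance_type="ml.g5.12xlarge")
def fine_tune(processed_uri: str) -> str:
    # ... run fine-tuning against the processed dataset ...
    return "s3://example-bucket/triage/model/"  # placeholder model artifact URI


# Calling the decorated functions builds the DAG; passing the final result
# to Pipeline lets SageMaker infer the upstream dependencies.
model_uri = fine_tune(prepare_data("s3://example-bucket/triage/raw/"))

pipeline = Pipeline(name="triage-agent-customization", steps=[model_uri])
pipeline.upsert(role_arn="arn:aws:iam::111122223333:role/SageMakerExecutionRole")  # placeholder
pipeline.start()
```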

End‑to‑End Demo Highlights

  • Customizing Qwen 2.5 for a medical‑triage agent using supervised fine‑tuning.
  • Experiment tracking with MLflow, versioned datasets, and reward functions.
  • Evaluation against the MMLU clinical knowledge benchmark and AI‑as‑a‑judge metrics.
  • Deployment to a SageMaker endpoint and integration with the AgentCore runtime via the Strands SDK (an endpoint‑invocation sketch follows this list).
  • Cost‑saving features: adapter‑based multi‑model endpoints (≈ 50 % cheaper) and speculative decoding (≈ 2.5× lower latency).
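
Once the customized model is behind a SageMaker endpoint, any client, including the agent runtime, can call it over the SageMaker runtime API. The sketch below uses boto3; the endpoint name is a placeholder, the request schema assumes a text‑generation container that accepts an "inputs"/"parameters" payload, and the Strands/AgentCore wiring from the demo is not reproduced.

```python
import json

import boto3

ENDPOINT_NAME = "qwen2-5-medical-triage"  # placeholder; use the endpoint created during deployment

runtime = boto3.client("sagemaker-runtime")

# Payload shape assumes a text-generation container that accepts "inputs"/"parameters".
payload = {
    "inputs": "Patient reports chest pain radiating to the left arm. What triage level?",
    "parameters": {"max_new_tokens": 256, "temperature": 0.2},
}

response = runtime.invoke_endpoint(
    EndpointName=ENDPOINT_NAME,
    ContentType="application/json",
    Body=json.dumps(payload),
)
print(json.loads(response["Body"].read()))
```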

Key Takeaways

  • Standardized, serverless customization removes the need for manual infrastructure management.
  • Unified observability via serverless MLflow simplifies debugging across models and agents (a tracking sketch follows this list).
  • Versioned ML assets (models, datasets, reward functions, prompts) support governance and compliance.
  • Optimized inference (adapter‑based multi‑model endpoints, speculative decoding) delivers significant cost and latency improvements.
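
For the observability takeaway, a minimal sketch of logging a customization run to a SageMaker managed MLflow tracking server is shown below. The tracking‑server ARN, experiment name, parameters, and metric values are illustrative; connecting by ARN assumes the sagemaker-mlflow plugin is installed alongside MLflow.

```python
import mlflow

# Placeholder tracking-server ARN; resolving an ARN as the tracking URI
# requires the sagemaker-mlflow plugin.
mlflow.set_tracking_uri(
    "arn:aws:sagemaker:us-east-1:111122223333:mlflow-tracking-server/triage-observability"
)
mlflow.set_experiment("qwen2-5-medical-triage")

with mlflow.start_run(run_name="sft-baseline"):
    # Illustrative values; a real run would log the actual dataset and
    # reward-function versions plus benchmark results.
    mlflow.log_params({
        "base_model": "Qwen2.5",
        "technique": "supervised-fine-tuning",
        "dataset_version": "v3",
    })
    mlflow.log_metric("mmlu_clinical_knowledge", 0.71)
    mlflow.log_metric("llm_judge_score", 4.2)
```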

These advancements aim to accelerate the path from prototype to production for agentic AI applications, addressing the major bottlenecks that have historically slowed adoption.
