AWS re:Invent 2025 - AWS Trn3 UltraServers: Power next-generation enterprise AI performance (AIM3335)

Published: December 5, 2025 at 07:51 PM EST
3 min read
Source: Dev.to

Overview

AWS re:Invent 2025 introduced Trainium 3, the next‑generation AI training chip built for agentic workloads and reasoning models. Ron Diamant highlighted the Trainium 3 Ultra server's 362 peta‑FLOPS of compute and 20.7 TB of HBM3E memory, along with microscaling hardware circuits and accelerated softmax instructions that keep sustained performance close to peak specifications. The Ultra server scales to 144 chips interconnected by NeuronSwitches, providing low‑latency all‑to‑all communication.

Jonathan Gray from Anthropic demonstrated kernel optimizations that achieve 60 % tensor‑engine utilization on Trainium 2 and over 90 % on Trainium 3, while serving the majority of Claude traffic on nearly a million Trainium 2 chips. The session also covered usability enhancements such as native PyTorch support, the open‑source NKI compiler, and the Neuron Explorer profiling tool with nanosecond‑level observability.

Introduction

Joe Senerchia, EC2 product manager for Trainium, opened the session, introducing the two experts:

  • Ron Diamant – Chief Architect of Trainium
  • Jonathan Gray – Trainium inference lead at Anthropic

Agenda

  1. How AWS builds AI infrastructure.
  2. Ron’s deep dive into Trainium’s performance, scale, and usability.
  3. Jonathan’s walkthrough of kernel optimizations for Trainium.

The AI Revolution

Why the current AI buzz? Because AI represents a tectonic shift in how we build, deploy, and interact with technology. As Andy Jassy put it:

“We are at the beginning of the biggest technological transformation of our lifetime.”

AI is no longer an incremental improvement; it unlocks entirely new capabilities across scientific domains.

AI in Scientific Domains

  • Protein biology – Models now predict and design new proteins in minutes, a task that previously took hours.
  • Mathematics – Systems like AlphaGeometry compete at Olympiad level and solve formal proofs.
  • Software engineering – AI can write, debug, and reason across large codebases, becoming a breakthrough force in its own right.

These advances are driving scientific discovery rather than merely supporting it.

AI‑Driven Software Engineering

Traditional programming is evolving:

  • Code completions and chat‑based programming are now everyday tools.
  • Progress is measured by benchmarks such as SWE‑Bench Lite (≈ 80 % of real GitHub issues solved) and SWE‑Bench Verified (≈ 50 % solved with full correctness).

The next phase involves autonomous agent fleets that can operate at scale, collaborating with human developers to accelerate feature delivery.

AWS AI Infrastructure Stack

AWS has spent over a decade building a comprehensive, deeply integrated AI stack:

  • Compute – NVIDIA GPU instances, Inferentia, and Trainium (2 & 3)
  • Network – UltraClusters scaling to tens or hundreds of thousands of chips via the low‑latency Elastic Fabric Adapter (EFA)
  • Storage – High‑throughput FSx for Lustre and S3 Express One Zone (≈ 10× faster data access)
  • Security – The Nitro System for workload isolation and data protection
  • Management & observability – CloudWatch, Neuron Explorer, and other tooling

Custom silicon underpins the stack: multiple Nitro generations, Graviton CPUs, and a line of AI‑specific chips (Inferentia in 2019, Trainium 2 in 2023).
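
For readers who want to try the compute layer, here is a minimal boto3 sketch for launching a Trainium‑backed EC2 instance. The AMI ID is a placeholder, and trn2.48xlarge is used because the session did not name Trn3 instance types.

```python
import boto3

# Minimal sketch: launch a Trainium-backed EC2 instance with boto3.
# The AMI ID is a placeholder; in practice you would pick a
# Neuron-enabled Deep Learning AMI for your region.
ec2 = boto3.client("ec2", region_name="us-east-1")

response = ec2.run_instances(
    ImageId="ami-0123456789abcdef0",  # placeholder Neuron DLAMI
    InstanceType="trn2.48xlarge",     # Trn2 instance with 16 Trainium2 chips
    MinCount=1,
    MaxCount=1,
)
print(response["Instances"][0]["InstanceId"])
```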

Trainium 2 Recap

  • Compute: ~1,300 peta‑FLOPS of dense compute.
  • Server: First Trainium 2 Ultra server, scaling up to 64 chips via NeuronLink (1 TB/s connectivity).
  • Network: Tens of thousands of chips linked with EFA.

Engineering improvements reduced the time from silicon receipt to customer delivery by 70 %, enabling a 4× faster ramp‑up and a 33× larger capacity footprint compared to prior AWS AI instances.

Trainium 3 Ultra Server

  • Compute: 362 peta‑FLOPS (FP16) per Ultra server.
  • Memory: 20.7 TB of HBM3E per Ultra server.
  • Hardware innovations:
    • Microscaling circuits for power‑efficient performance.
    • Accelerated softmax instructions that keep performance near peak.
  • Scalability: Up to 144 chips per Ultra server, interconnected by NeuronSwitches for low‑latency all‑to‑all communication.

These advances target agentic workloads and large‑scale reasoning models.
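
To see why accelerated softmax matters, consider the operation itself: every attention layer runs a max/exp/sum/divide sequence over each row of scores, and without dedicated instructions those steps can stall the tensor engines. A plain PyTorch illustration of the operation (not Trainium‑specific code):

```python
import torch

def stable_softmax(scores: torch.Tensor) -> torch.Tensor:
    """Numerically stable softmax over the last dimension.

    Subtracting the row max keeps exp() from overflowing; this
    max/exp/sum/divide sequence runs for every attention row,
    which is why hardware-accelerated softmax helps keep the
    tensor engines fed.
    """
    shifted = scores - scores.max(dim=-1, keepdim=True).values
    exp = torch.exp(shifted)
    return exp / exp.sum(dim=-1, keepdim=True)

# Example: attention scores for 2 heads over 4x4 tokens
scores = torch.randn(2, 4, 4)
probs = stable_softmax(scores)
assert torch.allclose(probs.sum(dim=-1), torch.ones(2, 4))
```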

Kernel Optimizations (Anthropic)

Jonathan Gray demonstrated real‑world kernel tuning on Trainium:

  • Trainium 2 – Achieved ~60 % tensor‑engine utilization.
  • Trainium 3 – Pushed utilization > 90 %, delivering higher throughput for Claude models.

Anthropic currently runs the majority of Claude traffic on nearly 1 million Trainium 2 chips, showcasing the platform’s massive scale.
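
Tensor‑engine utilization is simply achieved FLOP/s divided by peak FLOP/s. A back‑of‑the‑envelope sketch, using a placeholder peak rather than any published Trainium figure:

```python
import time
import torch

# Back-of-the-envelope utilization estimate for a matmul.
# PEAK_FLOPS is a made-up placeholder, not a published Trainium spec.
PEAK_FLOPS = 1.3e15  # hypothetical peak, FLOP/s

def matmul_utilization(n: int = 2048, iters: int = 5) -> float:
    a = torch.randn(n, n)
    b = torch.randn(n, n)
    start = time.perf_counter()
    for _ in range(iters):
        a @ b
    elapsed = time.perf_counter() - start
    achieved = (2 * n**3 * iters) / elapsed  # a matmul costs ~2*n^3 FLOPs
    return achieved / PEAK_FLOPS

print(f"utilization: {matmul_utilization():.2%}")
```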

Ease‑of‑Use Improvements

  • Native PyTorch support – Simplifies model development and training.
  • Open‑source NKI compiler – Enables custom kernel generation and optimization (a minimal kernel sketch follows below).
  • Neuron Explorer – Profiling tool offering nanosecond‑level observability for performance debugging.

These tools lower the barrier to adopting Trainium for both training and inference workloads.
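
As a taste of the NKI programming model, here is a minimal kernel following the elementwise‑add pattern from the public NKI documentation; the tile shapes and decorator usage are assumptions drawn from those docs, not from the session.

```python
from neuronxcc import nki
import neuronxcc.nki.language as nl

@nki.jit
def tensor_add_kernel(a_input, b_input):
    """Elementwise add in the style of the public NKI examples.

    Loads one 128x512 tile of each operand from HBM into on-chip
    memory, adds them, and stores the result back to HBM.
    """
    c_output = nl.ndarray(a_input.shape, dtype=a_input.dtype,
                          buffer=nl.shared_hbm)
    ix = nl.arange(128)[:, None]  # partition dimension
    iy = nl.arange(512)[None, :]  # free dimension
    a_tile = nl.load(a_input[ix, iy])
    b_tile = nl.load(b_input[ix, iy])
    nl.store(c_output[ix, iy], value=a_tile + b_tile)
    return c_output
```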
