AWS re:Invent 2025 - Unleashing Generative AI for Amazon Ads at Scale (AMZ303)

Published: December 5, 2025 at 11:46 AM EST
3 min read
Source: Dev.to

Overview

In this session, Amazon Ads demonstrates how they built a large‑scale LLM inference system on AWS to process billions of daily requests for understanding shoppers and products. The architecture runs on Amazon ECS with over 10,000 GPUs and incorporates optimizations such as disaggregated inference and KV‑aware routing through NVIDIA Dynamo, delivering ≈ 50 % throughput improvement and 20‑40 % latency reduction. Three inference patterns are covered: offline batch, near‑real‑time, and real‑time, each tuned for latency requirements ranging from seconds to milliseconds. GPU capacity is dynamically allocated based on traffic patterns to maximize utilization.

Amazon Ads & AWS Relationship

  • Customer & Partner – Amazon Ads consumes AWS services to build its own products and also partners with AWS to bundle solutions for joint customers.
  • History – The first Amazon advertising product launched over a decade ago; the business now runs entirely in the cloud.
  • Recent AI Products – Creative Agent and Ads Agent, both agentic solutions built on AWS.

Services Used

Amazon Ads leverages a broad set of AWS services, including:

  • Compute & storage: EC2, S3
  • Orchestration: Step Functions, EMR
  • Machine learning: SageMaker, Bedrock
  • Container orchestration: Amazon ECS (with GPU support)

The team interacts with more than 180 different AWS services, providing feedback that helps mature those services for all customers.

Use Cases & Business Context

Shopper Journey Example – Halloween Costumes

  1. Brand Awareness – When a shopper searches “Halloween costumes,” a brand recommendation and featured products are shown.
  2. Consideration – As the shopper browses, more relevant products and attribute‑based groups appear in the results.
  3. Purchase Decision – After clicking a product, complementary or alternative items are suggested.
  • Scale – Each query must retrieve a relevant subset (tens of thousands of items) from a catalog of hundreds of millions, score those candidates with ML models, and narrow them down to a few hundred for final ranking (see the funnel sketch after this list).
  • Volume – Billions of such requests are processed daily.
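As a rough sketch of this funnel (hypothetical function and object names, not Amazon Ads' actual code), each stage shrinks the candidate set before the expensive final ranking:

```python
from typing import List

def rank_candidates(query: str, catalog_index, scoring_model,
                    top_k: int = 300) -> List[str]:
    """Illustrative three-stage funnel; `catalog_index` and `scoring_model`
    are hypothetical stand-ins for the real retrieval and ML components."""
    # Stage 1: cheap retrieval narrows hundreds of millions of catalog
    # items to tens of thousands of plausible candidates.
    candidates = catalog_index.retrieve(query, limit=50_000)

    # Stage 2: ML models score each candidate for relevance.
    scores = scoring_model.score(query, candidates)

    # Stage 3: keep only a few hundred top-scoring items for final ranking.
    ranked = sorted(zip(candidates, scores),
                    key=lambda pair: pair[1], reverse=True)
    return [item for item, _ in ranked[:top_k]]
```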

Model Architecture

  • Inputs: query text, search context, product features, shopper signals.

  • Neural network: typically includes attention mechanisms or mixture‑of‑experts designs.

  • Output: probability score indicating likelihood of click or purchase.

  • Size: models often contain billions of parameters; inference requires tens of billions of operations per request.
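A minimal sketch of such a scorer in PyTorch (toy dimensions and layer choices; the session does not publish the real architecture):

```python
import torch
import torch.nn as nn

class ClickScorer(nn.Module):
    """Toy attention-based relevance scorer (illustrative only)."""
    def __init__(self, dim: int = 128, heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.head = nn.Linear(dim, 1)

    def forward(self, query_emb, context_emb):
        # Attend from the query/shopper representation over product features.
        attended, _ = self.attn(query_emb, context_emb, context_emb)
        # Pool and map to a single click/purchase probability.
        return torch.sigmoid(self.head(attended.mean(dim=1)))

# Usage: embeddings for (query text + shopper signals) and product features.
scorer = ClickScorer()
q = torch.randn(2, 1, 128)   # batch of 2 queries, 1 token each
p = torch.randn(2, 8, 128)   # 8 product/context feature vectors per query
print(scorer(q, p).shape)    # torch.Size([2, 1])
```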

Inference Patterns

| Pattern | Latency Target | Typical Use |
| --- | --- | --- |
| Offline Batch | Seconds to minutes | Large‑scale scoring for catalog updates |
| Near Real‑Time | Hundreds of milliseconds | Daily/weekly refreshes, personalized recommendations |
| Real‑Time | < 50 ms | Immediate response to shopper queries |
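As a rough illustration (hypothetical thresholds, not from the session), a serving layer can map a request's latency budget to one of these three paths:

```python
import enum

class Pattern(enum.Enum):
    OFFLINE_BATCH = "offline-batch"    # seconds to minutes
    NEAR_REAL_TIME = "near-real-time"  # hundreds of milliseconds
    REAL_TIME = "real-time"            # < 50 ms

def choose_pattern(latency_budget_ms: float) -> Pattern:
    """Pick an inference path from a latency budget (illustrative cutoffs)."""
    if latency_budget_ms < 50:
        return Pattern.REAL_TIME
    if latency_budget_ms < 1_000:
        return Pattern.NEAR_REAL_TIME
    return Pattern.OFFLINE_BATCH

print(choose_pattern(30))      # Pattern.REAL_TIME
print(choose_pattern(200))     # Pattern.NEAR_REAL_TIME
print(choose_pattern(60_000))  # Pattern.OFFLINE_BATCH
```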

Optimizations

  1. Disaggregated Inference – Separates the prefill and decode phases of LLM inference onto different GPU pools, so each phase can be scaled and optimized independently.
  2. KV‑Aware Routing (NVIDIA Dynamo) – Routes requests to the GPU worker whose KV cache already holds the relevant context (e.g., a shared prompt prefix), avoiding redundant prefill computation (see the sketch after this list).
  3. Dynamic GPU Allocation – Monitors traffic patterns and scales GPU capacity up or down to maintain high utilization while meeting latency SLAs.
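To illustrate the intuition behind KV‑aware routing (a simplified toy, not NVIDIA Dynamo's actual API), a router can pin requests that share a prompt prefix to the worker whose KV cache already contains that prefix:

```python
import hashlib

class PrefixAffinityRouter:
    """Toy KV-aware router: keeps requests with a shared prompt prefix on
    the same worker so its KV cache can be reused (illustrative only)."""

    def __init__(self, workers):
        self.workers = workers
        self.prefix_owner = {}  # prefix hash -> worker

    def route(self, prompt: str, prefix_chars: int = 64) -> str:
        key = hashlib.sha256(prompt[:prefix_chars].encode()).hexdigest()
        if key not in self.prefix_owner:
            # First time we see this prefix: assign the least-loaded worker.
            self.prefix_owner[key] = min(self.workers, key=lambda w: w["load"])
        worker = self.prefix_owner[key]
        worker["load"] += 1
        return worker["name"]

workers = [{"name": "gpu-0", "load": 0}, {"name": "gpu-1", "load": 0}]
router = PrefixAffinityRouter(workers)
print(router.route("shopper context: halloween costumes ..."))  # gpu-0
print(router.route("shopper context: halloween costumes ..."))  # gpu-0 again
```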

These techniques collectively achieve:

  • ≈ 50 % higher throughput compared with a naïve deployment.
  • 20‑40 % reduction in latency across all inference patterns.

Lessons Learned

  • Determinism at Scale – Ensuring consistent inference results while handling billions of requests requires careful orchestration of model versions and routing logic.
  • Feedback Loop with AWS – Continuous collaboration with AWS service teams accelerates feature development (e.g., GPU scheduling, storage optimizations).
  • Resource Utilization – Dynamic capacity management is essential to balance cost and performance, especially during traffic spikes such as Prime Day events.

Conclusion

Amazon Ads has built a robust, large‑scale LLM inference platform on AWS that powers real‑time shopper understanding across billions of daily requests. By leveraging a mix of managed services, custom optimizations, and dynamic resource allocation, the solution delivers high throughput and low latency, enabling more relevant advertising experiences for shoppers and better outcomes for advertisers.
