AWS re:Invent 2025 - Unleashing Generative AI for Amazon Ads at Scale (AMZ303)

Published: December 5, 2025 at 11:46 AM EST
3 min read
Source: Dev.to

Overview

In this session, Amazon Ads demonstrates how they built a large‑scale LLM inference system on AWS to process billions of daily requests for understanding shoppers and products. The architecture runs on Amazon ECS with over 10,000 GPUs and incorporates optimizations such as disaggregated inference and KV‑aware routing through NVIDIA Dynamo, delivering ≈ 50 % throughput improvement and 20‑40 % latency reduction. Three inference patterns are covered: offline batch, near‑real‑time, and real‑time, each tuned for latency requirements ranging from seconds to milliseconds. GPU capacity is dynamically allocated based on traffic patterns to maximize utilization.

Amazon Ads & AWS Relationship

  • Customer & Partner – Amazon Ads consumes AWS services to build its own products and also partners with AWS to bundle solutions for joint customers.
  • History – The first Amazon advertising product launched over a decade ago; the business now runs entirely in the cloud.
  • Recent AI Products – Creative Agent and Ads Agent, both agentic solutions built on AWS.

Services Used

Amazon Ads leverages a broad set of AWS services, including:

  • Compute & storage: EC2, S3
  • Orchestration: Step Functions, EMR
  • Machine learning: SageMaker, Bedrock
  • Container orchestration: Amazon ECS (with GPU support)

The team interacts with more than 180 different AWS services, providing feedback that helps mature those services for all customers.

Use Cases & Business Context

Shopper Journey Example – Halloween Costumes

  1. Brand Awareness – When a shopper searches “Halloween costumes,” a brand recommendation and featured products are shown.
  2. Consideration – As the shopper browses, more relevant products and attribute‑based groups appear in the results.
  3. Purchase Decision – After clicking a product, complementary or alternative items are suggested.
  • Scale – Each query must retrieve a relevant subset (tens of thousands of items) from a catalog of hundreds of millions, score those candidates with ML models, and narrow them down to a few hundred for final ranking (see the funnel sketch after this list).
  • Volume – Billions of such requests are processed daily.
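As a rough sketch of this funnel (hypothetical function and object names, not Amazon Ads' actual code), each stage shrinks the candidate set before the expensive final ranking:

```python
from typing import List

def rank_candidates(query: str, catalog_index, scoring_model,
                    top_k: int = 300) -> List[str]:
    """Illustrative three-stage funnel; `catalog_index` and `scoring_model`
    are hypothetical stand-ins for the real retrieval and ML components."""
    # Stage 1: cheap retrieval narrows hundreds of millions of catalog
    # items to tens of thousands of plausible candidates.
    candidates = catalog_index.retrieve(query, limit=50_000)

    # Stage 2: ML models score each candidate for relevance.
    scores = scoring_model.score(query, candidates)

    # Stage 3: keep only a few hundred top-scoring items for final ranking.
    ranked = sorted(zip(candidates, scores),
                    key=lambda pair: pair[1], reverse=True)
    return [item for item, _ in ranked[:top_k]]
```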

Model Architecture

  • Inputs: query text, search context, product features, shopper signals.

  • Neural network: typically includes attention mechanisms or mixture‑of‑experts designs.

  • Output: probability score indicating likelihood of click or purchase.

  • Size: models often contain billions of parameters; inference requires tens of billions of operations per request.
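A minimal sketch of such a scorer in PyTorch (toy dimensions and layer choices; the session does not publish the real architecture):

```python
import torch
import torch.nn as nn

class ClickScorer(nn.Module):
    """Toy attention-based relevance scorer (illustrative only)."""
    def __init__(self, dim: int = 128, heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.head = nn.Linear(dim, 1)

    def forward(self, query_emb, context_emb):
        # Attend from the query/shopper representation over product features.
        attended, _ = self.attn(query_emb, context_emb, context_emb)
        # Pool and map to a single click/purchase probability.
        return torch.sigmoid(self.head(attended.mean(dim=1)))

# Usage: embeddings for (query text + shopper signals) and product features.
scorer = ClickScorer()
q = torch.randn(2, 1, 128)   # batch of 2 queries, 1 token each
p = torch.randn(2, 8, 128)   # 8 product/context feature vectors per query
print(scorer(q, p).shape)    # torch.Size([2, 1])
```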

Inference Patterns

| Pattern | Latency Target | Typical Use |
| --- | --- | --- |
| Offline Batch | Seconds to minutes | Large‑scale scoring for catalog updates |
| Near Real‑Time | Hundreds of milliseconds | Daily/weekly refreshes, personalized recommendations |
| Real‑Time | < 50 ms | Immediate response to shopper queries |
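As a rough illustration (hypothetical thresholds, not from the session), a serving layer can map a request's latency budget to one of these three paths:

```python
import enum

class Pattern(enum.Enum):
    OFFLINE_BATCH = "offline-batch"    # seconds to minutes
    NEAR_REAL_TIME = "near-real-time"  # hundreds of milliseconds
    REAL_TIME = "real-time"            # < 50 ms

def choose_pattern(latency_budget_ms: float) -> Pattern:
    """Pick an inference path from a latency budget (illustrative cutoffs)."""
    if latency_budget_ms < 50:
        return Pattern.REAL_TIME
    if latency_budget_ms < 1_000:
        return Pattern.NEAR_REAL_TIME
    return Pattern.OFFLINE_BATCH

print(choose_pattern(30))      # Pattern.REAL_TIME
print(choose_pattern(200))     # Pattern.NEAR_REAL_TIME
print(choose_pattern(60_000))  # Pattern.OFFLINE_BATCH
```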

Optimizations

  1. Disaggregated Inference – Separates the prefill and decode phases of LLM inference onto different GPU pools, so each phase can be scaled and optimized independently.
  2. KV‑Aware Routing (NVIDIA Dynamo) – Routes requests to the GPU worker whose KV cache already holds the relevant context (e.g., a shared prompt prefix), avoiding redundant prefill computation (see the sketch after this list).
  3. Dynamic GPU Allocation – Monitors traffic patterns and scales GPU capacity up or down to maintain high utilization while meeting latency SLAs.
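To illustrate the intuition behind KV‑aware routing (a simplified toy, not NVIDIA Dynamo's actual API), a router can pin requests that share a prompt prefix to the worker whose KV cache already contains that prefix:

```python
import hashlib

class PrefixAffinityRouter:
    """Toy KV-aware router: keeps requests with a shared prompt prefix on
    the same worker so its KV cache can be reused (illustrative only)."""

    def __init__(self, workers):
        self.workers = workers
        self.prefix_owner = {}  # prefix hash -> worker

    def route(self, prompt: str, prefix_chars: int = 64) -> str:
        key = hashlib.sha256(prompt[:prefix_chars].encode()).hexdigest()
        if key not in self.prefix_owner:
            # First time we see this prefix: assign the least-loaded worker.
            self.prefix_owner[key] = min(self.workers, key=lambda w: w["load"])
        worker = self.prefix_owner[key]
        worker["load"] += 1
        return worker["name"]

workers = [{"name": "gpu-0", "load": 0}, {"name": "gpu-1", "load": 0}]
router = PrefixAffinityRouter(workers)
print(router.route("shopper context: halloween costumes ..."))  # gpu-0
print(router.route("shopper context: halloween costumes ..."))  # gpu-0 again
```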

These techniques collectively achieve:

  • ≈ 50 % higher throughput compared with a naïve deployment.
  • 20‑40 % reduction in latency across all inference patterns.

Lessons Learned

  • Determinism at Scale – Ensuring consistent inference results while handling billions of requests requires careful orchestration of model versions and routing logic.
  • Feedback Loop with AWS – Continuous collaboration with AWS service teams accelerates feature development (e.g., GPU scheduling, storage optimizations).
  • Resource Utilization – Dynamic capacity management is essential to balance cost and performance, especially during traffic spikes such as Prime Day events.

Conclusion

Amazon Ads has built a robust, large‑scale LLM inference platform on AWS that powers real‑time shopper understanding across billions of daily requests. By leveraging a mix of managed services, custom optimizations, and dynamic resource allocation, the solution delivers high throughput and low latency, enabling more relevant advertising experiences for shoppers and better outcomes for advertisers.
