AWS re:Invent 2025 - Building scalable applications with text and multimodal understanding (AIM375)
Source: Dev.to
Overview
In this session, the Amazon AGI team introduced Amazon Nova 2.0, a family of multimodal foundation models that process text, images, video, audio, and speech natively. The talk covered three key areas:
- Document intelligence – optimized OCR and key‑information extraction.
- Image and video understanding – temporal awareness and reasoning capabilities.
- Amazon Nova Multimodal Embeddings – cross‑modal search across all content types.
Box’s Tyan Hynes demonstrated real‑world use cases, such as automated materials‑testing report analysis for engineering firms and continuity checks for production studios, highlighting how the 1 million‑token context window and native multimodal processing eliminate the need for separate models and manual annotation workflows.
This article is auto‑generated from the original presentation; minor typos or inaccuracies may be present.
Enterprise Challenges with Multimodal Data
Good morning, I’m Dinesh Rajput, Principal Product Manager, Amazon AGI. Together with Brandon Nair and Box’s Tyan Hynes, we’ll explore how to leverage multimodal data—images, documents, video, audio, and call recordings—to build accurate, context‑aware applications.
Agenda
- Enterprise needs and challenges for multimodal data.
- Overview of Amazon Nova 2.0 models.
- Deep dive: document‑intelligence optimizations.
- Deep dive: visual‑reasoning use cases.
- Multimodal embeddings for enterprise‑wide search.
- Customer success story – Box.
The Data Landscape
Organizations possess massive amounts of data: text, structured records, contracts, videos, and call recordings. Yet most AI workflows only use a small slice—typically text or structured fields. Multimodal foundation models enable you to:
- Extract content from images.
- Understand events across video frames.
- Capture sentiment and intent from audio or speech.
By reasoning over all modalities together, you can generate richer customer insights and streamline AI pipelines.
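As a concrete sketch of reasoning over modalities together, the snippet below builds a single request that mixes a text question with an image, following the message shape of the Amazon Bedrock Converse API. The model ID shown is a placeholder assumption, not a confirmed Nova 2 identifier; check the Bedrock console for the actual values.

```python
# Minimal sketch: one user message combining text and image content blocks,
# in the Amazon Bedrock Converse API message format.

def build_multimodal_message(question: str, image_bytes: bytes,
                             image_format: str = "png") -> dict:
    """Return a Converse-style user message mixing text and image blocks."""
    return {
        "role": "user",
        "content": [
            {"text": question},
            {"image": {"format": image_format,
                       "source": {"bytes": image_bytes}}},
        ],
    }

# Sending the request requires AWS credentials, so it is commented out here:
# import boto3
# client = boto3.client("bedrock-runtime")
# response = client.converse(
#     modelId="us.amazon.nova-2-lite-v1:0",  # placeholder model ID
#     messages=[build_multimodal_message("Summarize the test results.", img)],
# )
# print(response["output"]["message"]["content"][0]["text"])
```

Because the model receives both blocks in one message, it can answer the text question in light of the image rather than requiring a separate OCR or captioning step.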
Key Challenges
- Fragmented toolchains – separate models for text, images, video, etc., leading to costly, complex pipelines.
- Context integration – difficulty stitching together insights from different modalities (e.g., a document and a support call).
- Accuracy & scalability – many models lack sufficient precision, forcing human‑in‑the‑loop review and limiting scale.
Introducing Amazon Nova 2.0: Natively Multimodal Foundation Models
Amazon Nova 1.0 launched at last year’s re:Invent and quickly gained tens of thousands of customers. Building on that feedback, Amazon Nova 2.0 treats every modality as a first‑class citizen, offering native processing of text, images, video, audio, and speech, as well as generation capabilities.
Model Portfolio
| Model | Primary Use‑Case | Highlights |
|---|---|---|
| Nova 2 Lite | Fast, cost‑effective reasoning for most workloads | Low latency, economical |
| Nova 2 Pro | Complex, high‑accuracy tasks | Highest accuracy |
| Nova 2 Omni | Unified understanding & generation across all modalities | Text ↔ image ↔ video ↔ audio generation |
| Nova 2 Sonic | Conversational speech‑to‑speech with low latency | Real‑time voice interactions |
| Nova Multimodal Embeddings | Cross‑modal search & retrieval | Single embedding space for all data types |
These models support a context window of up to 1 million tokens, enabling long‑form reasoning across extensive multimodal inputs in a single request.
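To illustrate the retrieval side of cross‑modal search: once text, image, video, and audio assets share one embedding space, finding matches reduces to vector similarity against the query embedding. The vectors below are toy stand‑ins; in practice they would come from the Nova Multimodal Embeddings model, whose exact request format is not covered here.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def rank_assets(query_vec, assets):
    """Rank (asset_id, embedding) pairs by similarity to the query embedding."""
    scored = [(asset_id, cosine_similarity(query_vec, vec))
              for asset_id, vec in assets]
    return sorted(scored, key=lambda item: item[1], reverse=True)

# Toy embeddings for a text query and three assets of different modalities.
query = [0.9, 0.1, 0.0]
assets = [
    ("report.pdf", [0.8, 0.2, 0.1]),  # nearest to the query
    ("demo.mp4",   [0.1, 0.9, 0.3]),
    ("call.wav",   [0.0, 0.2, 0.9]),
]
print(rank_assets(query, assets)[0][0])  # prints "report.pdf"
```

Because every modality lands in the same vector space, a single index can answer a text query with a PDF, a video clip, or a call recording.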
Highlighted Applications
- Automated materials‑testing report analysis – extracting key metrics from PDFs, images, and video of test rigs.
- Continuity checks for production studios – verifying visual and audio consistency across episodes.
Both use cases demonstrate how a single multimodal model can replace a suite of specialized tools, reducing cost, complexity, and the need for manual annotation.
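The materials‑testing use case can be sketched as a structured‑extraction prompt: ask the model to return only a JSON object with a fixed set of keys. The field names and wording below are illustrative assumptions, not the prompts shown in the session.

```python
import json

# Hypothetical field list for a materials-testing report.
REPORT_FIELDS = [
    "specimen_id",
    "tensile_strength_mpa",
    "yield_strength_mpa",
    "test_date",
    "pass_fail",
]

def build_extraction_prompt(fields):
    """Instruct the model to emit only a JSON object with the given keys,
    using null for values absent from the document."""
    return (
        "Extract the following fields from the attached materials-testing "
        "report and respond with only a JSON object using these exact keys: "
        + json.dumps(fields)
        + ". Use null for any field not present in the document."
    )

prompt = build_extraction_prompt(REPORT_FIELDS)
```

The prompt would accompany the report pages (PDF or images) in one multimodal request, and the JSON response can be validated against the same key list before it enters downstream systems.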
References & Resources
- Session video: AWS re:Invent 2025 – Building scalable applications with text and multimodal understanding (AIM375)
- Amazon Nova product page: https://aws.amazon.com/amazon-nova/