AWS re:Invent 2025 - [NEW LAUNCH] Amazon Nova 2 Omni: A new frontier in multimodal AI (AIM3324)
Source: Dev.to
Introduction
AWS re:Invent 2025 introduced Amazon Nova 2 Omni, a unified multimodal AI model that can understand text, images, video, and audio while also generating high‑quality images. The session highlighted Omni’s superior performance in document understanding, OCR, audio transcription with three‑speaker diarization, and cross‑modal reasoning. Key capabilities include hybrid reasoning control, support for 200+ languages, and up to one million token context windows. Benchmarks show competitive results against Gemini 2.5 Flash and GPT‑4.
Agenda
- Overview of the Amazon Nova family of models.
- Recap of the Amazon Nova 2 launch announced by Matt Garman.
- Deep dive into Amazon Nova 2 Omni – multimodal understanding and generation.
- Demonstrations and example use cases.
- Performance comparison with leading models.
- Dentsu Digital’s real‑world applications presented by Chief AI Officer Satoru Yamamoto.
Amazon Nova Family Overview
Nova Understanding Models
- Accept text, images, and video.
- Provide metadata extraction, summarization, Q&A, and text generation.
- Available in Micro, Light, Pro, and the largest Premier tier.
Generative Models
- Nova Canvas – image generation.
- Nova Real – video generation.
Speech Models
- Nova Sonic – speech‑to‑speech for real‑time conversational AI (e.g., customer support).
Multimodal Embedding Model
- First native multimodal embedding model for semantic search and agentic RAG across documents, images, video, audio, and text.
These models are already used by tens of thousands of enterprises and startups.
Nova 2 Family Launch
Nova 2 Lite
- Fast, cost‑effective reasoning model for everyday workloads.
- Hybrid reasoning: developers can enable or disable reasoning per task, conserving tokens and latency when reasoning isn’t needed.
Nova 2 Pro (preview)
- Higher‑performance tier for complex tasks such as coding, multi‑agent scenarios, and advanced reasoning.
Nova 2 Omni (preview)
- Unified model for multimodal reasoning and image generation.
- First Bedrock model that can ingest any modality—including audio/speech—and generate high‑quality images within the same model.
- First industry reasoning model capable of cross‑modal reasoning and image generation in a single system.
Nova 2 Sonic (second generation)
- Improved performance, broader language support, more natural conversational experience, and additional voice options.
All four models support up to one million input tokens and 200+ languages for text. Nova 2 Pro and Omni also understand audio in up to 10 languages.
Amazon Nova 2 Omni Details
Core Properties
- Hybrid Reasoning: Developers control the level of reasoning or disable it entirely.
- Multimodal Input: Accepts text, images, video, and audio; can generate both text and images.
- Unified Architecture: Eliminates the need for multiple specialized models, reducing pipeline complexity, build costs, and time‑to‑market.
Typical Use Cases
- Instruction following, tool calling, and standard NLP tasks (sentiment analysis, classification).
- Document, image, video, and audio understanding.
- Cross‑modal reasoning (e.g., answering questions about a video clip while generating a related illustration).
Performance
- State‑of‑the‑art multimodal perception, optimized for any multimodal workload.
- Benchmark results published in the technical report show competitive or superior scores compared to Gemini 2.5 Flash and GPT‑4 across OCR, transcription, and multimodal reasoning tasks.
Dentsu Digital Use Case
Satoru Yamamoto presented several applications built with Nova 2 Omni under the Mugen AI solution:
- Video Creative Prediction – achieved a 0.88 correlation accuracy in forecasting creative performance.
- Automated Storyboard Generation – produced storyboards with accurate Japanese character rendering.
- AI‑Powered Workflow Automation (Nova Act) – streamlined end‑to‑end processes.
Dentsu developed seven applications in seven days, cutting development time from three months to one day per solution, demonstrating the model’s practical efficiency for enterprise deployment.
Conclusion
Amazon Nova 2 Omni represents a significant step toward fully multimodal AI, mirroring how humans interact through speech, visuals, and text. Its hybrid reasoning, massive context window, and ability to generate images alongside text position it as a versatile foundation model for a wide range of business applications.