AWS re:Invent 2025 - Create hyper-personalized voice interactions with Amazon Nova Sonic (AIM374)

Published: December 5, 2025 at 08:46 PM EST
3 min read
Source: Dev.to

Overview

In this session, Veerdhawal Pande (Principal Product Manager, Amazon Nova Sonic) and Ankur Gandhe (Principal Scientist, Amazon General Intelligence) introduce Amazon Nova 2 Sonic, a speech‑to‑speech foundation model designed for real‑time, human‑like conversational AI. The presentation covers core features, benchmark results, architecture, key use cases, developer tools, and a live demo of a customer implementation.

Core Features

  • Bidirectional streaming API on Amazon Bedrock with low user‑perceived latency, enabling real‑time audio input and output (see the event sketch after this list).
  • Best‑in‑class speech understanding, robust to diverse speaking styles, accents, and background noise.
  • Turn‑taking controllability: developers can configure pause durations that define the end of a turn, allowing natural interruptions and pauses.
  • Sentiment awareness: the model detects tonality and voice‑based sentiment, adapting responses to mirror user emotions.
  • Tool calling & knowledge grounding: responses can be backed by external knowledge bases or invoke APIs (e.g., reservations, membership upgrades) while maintaining factual correctness.
  • Privacy‑first design: built with responsible AI safeguards and compliance features.
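
To make the bidirectional streaming API concrete, here is a rough sketch of the JSON events a client sends when opening a session. The event and field names follow the event-stream pattern published for the first-generation Nova Sonic, so treat them as illustrative rather than authoritative, and the model ID and voice ID are placeholders:

```python
# Sketch: opening a Nova 2 Sonic session over Bedrock's bidirectional stream.
# Event names follow the first-generation Nova Sonic event-stream pattern;
# treat field names and the model ID as illustrative, not authoritative.
import json

MODEL_ID = "amazon.nova-2-sonic-v1:0"  # placeholder; check the Bedrock console
PROMPT_NAME = "demo-prompt"

session_start = {
    "event": {
        "sessionStart": {
            "inferenceConfiguration": {"maxTokens": 1024, "topP": 0.9, "temperature": 0.7}
        }
    }
}

prompt_start = {
    "event": {
        "promptStart": {
            "promptName": PROMPT_NAME,
            "audioOutputConfiguration": {
                "mediaType": "audio/lpcm",
                "sampleRateHertz": 24000,
                "voiceId": "matthew",  # assumed voice ID
            },
        }
    }
}

def audio_input_event(b64_chunk: str) -> bytes:
    """Wrap one base64-encoded microphone chunk as an audioInput event."""
    return json.dumps(
        {"event": {"audioInput": {"promptName": PROMPT_NAME, "content": b64_chunk}}}
    ).encode("utf-8")
```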

New Capabilities in Nova 2 Sonic

Expanded Language Support

  • Seven languages, each with masculine and feminine voices:
    • English (US, GB, India, Australia) – new Indian and Australian variants
    • Spanish
    • French
    • Italian
    • German
    • Hindi
    • Portuguese

Language Switching

  • Users can switch languages mid‑session; the model responds in the newly selected language without restarting the conversation.

Asynchronous Task Completion

  • Long‑running tool calls (e.g., 8–10 seconds) no longer block the dialogue. Users can continue the conversation or invoke other tools while previous calls complete in the background.
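
The session doesn't show the wire-level details, but the application-side pattern is plain asyncio: dispatch slow tool calls as background tasks so the audio loop never blocks. A minimal sketch, where `run_tool` is a stand-in for a slow backend call and `send_tool_result` is a hypothetical callback that writes the result back onto the stream:

```python
import asyncio

async def run_tool(name: str, payload: dict) -> dict:
    """Stand-in for a slow backend call (e.g., a reservation API taking ~8-10 s)."""
    await asyncio.sleep(8)
    return {"status": "ok", "tool": name, "input": payload}

async def handle_tool_use(event: dict, send_tool_result) -> None:
    """Dispatch a tool call without blocking the live audio loop.

    `send_tool_result` is a hypothetical async callback that writes a tool
    result event back onto the bidirectional stream.
    """
    async def run_and_report():
        result = await run_tool(event["toolName"], event["input"])
        await send_tool_result(event["toolUseId"], result)

    asyncio.create_task(run_and_report())  # fire-and-forget; the dialogue continues
```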

Cross‑Modal Input/Output

  • Supports speech ↔ speech, speech ↔ text, and text ↔ text within the same session, preserving conversational context across modalities.
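
As a rough illustration, a text turn inside an otherwise voice session could look like the following. The `textInput` event mirrors the `audioInput` event from the earlier sketch; the exact shape is an assumption based on the first-generation event stream:

```python
# Hypothetical text turn in the same session: textInput mirrors audioInput
# but carries plain text; treat the exact shape as an assumption.
text_turn = {
    "event": {
        "textInput": {
            "promptName": "demo-prompt",  # same prompt as the audio turns
            "content": "Switch to text for the confirmation number, please.",
        }
    }
}
```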

Turn‑Taking Controllability

  • Developers can set the maximum pause length before the model assumes the user’s turn has ended, improving fluidity in multi‑turn dialogues.
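
The talk names this capability but not the exact API field, so the following shape is purely hypothetical; both the configuration key and its placement are placeholders to show where such a knob might live:

```python
# Purely hypothetical shape: the session describes a configurable pause
# threshold, but the field name and placement below are placeholders.
turn_taking_config = {
    "turnDetectionConfiguration": {
        "endOfTurnSilenceThresholdMs": 800,  # silence length that ends the user's turn
    }
}
```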

Architectural Highlights

Ankur Gandhe explains that the unified speech‑to‑speech model replaces traditional cascaded pipelines (ASR → NLU → TTS) with a single end‑to‑end network. Benefits include:

  • Reduced latency and error propagation.
  • Consistent voice characteristics across input and output.
  • Simplified deployment and scaling via Bedrock’s managed service.

Benchmark Results

  • Speech understanding accuracy improves by ~50% on alphanumeric content compared to the previous generation.
  • Achieves state‑of‑the‑art ASR performance on public benchmarks.
  • Demonstrates a competitive quality‑to‑price ratio, positioning Nova 2 Sonic as a cost‑effective solution for enterprise workloads.

Key Use Cases

  • Customer service automation (e.g., call center agents, Amazon Connect).
  • Voice assistants for consumer and enterprise applications.
  • Education apps that require interactive, multimodal dialogue.
  • AI receptionists and appointment booking systems.

Integration Options

Developers can integrate Nova 2 Sonic through:

  • LiveKit and Pipecat SDKs for real‑time streaming (see the Pipecat sketch after this list).
  • Amazon Connect for contact‑center workflows.
  • Telephony partners such as Twilio for PSTN connectivity.
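
As a sketch of the Pipecat route: the import path and constructor arguments below follow Pipecat's Nova Sonic integration as of this writing, so verify them against the current Pipecat docs before depending on them; the voice ID and region are assumptions:

```python
# Sketch: wiring Nova Sonic into a Pipecat pipeline. Import path and
# constructor arguments follow Pipecat's Nova Sonic integration as of this
# writing; verify against the current Pipecat docs.
import os

from pipecat.pipeline.pipeline import Pipeline
from pipecat.services.aws_nova_sonic.aws import AWSNovaSonicLLMService


def build_pipeline(transport) -> Pipeline:
    """Assemble a speech-to-speech pipeline; `transport` is any Pipecat
    transport (LiveKit, Daily, WebRTC, ...) supplying mic/speaker frames."""
    llm = AWSNovaSonicLLMService(
        access_key_id=os.environ["AWS_ACCESS_KEY_ID"],
        secret_access_key=os.environ["AWS_SECRET_ACCESS_KEY"],
        region="us-east-1",          # assumed region
        voice_id="matthew",          # assumed voice ID; confirm available voices
    )
    return Pipeline([transport.input(), llm, transport.output()])
```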

Customer Demo: AI Receptionist

Amma Pandekar from Cisco showcases a real‑world implementation for a tire retail chain:

  • An AI receptionist handles appointment scheduling, answers queries, and switches between voice and text as needed.
  • Demonstrates multi‑modal dialogue and language switching in a live call.

Getting Started

  1. Enable Nova 2 Sonic in a supported Bedrock Region (IAD, PDX, ARN, NRT — i.e., us‑east‑1, us‑west‑2, eu‑north‑1, and ap‑northeast‑1).
  2. Use the bidirectional streaming API to send audio or text and receive real‑time responses.
  3. Configure turn‑taking parameters and language preferences via the API payload.
  4. Leverage tool‑calling by defining callable functions in your application logic.
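
For step 4, a tool is declared as a JSON schema the model can invoke. A hedged sketch of a toolSpec, mirroring the Converse-style tool configuration used by the first-generation Nova Sonic; the `book_appointment` tool and its fields are hypothetical:

```python
import json

# Hedged sketch of step 4: declaring a callable tool. The toolSpec shape
# mirrors Bedrock's Converse-style tool configuration as used by the
# first-generation Nova Sonic; book_appointment is a hypothetical tool.
tool_configuration = {
    "tools": [
        {
            "toolSpec": {
                "name": "book_appointment",
                "description": "Book a service appointment for a given date and time.",
                "inputSchema": {
                    "json": json.dumps(
                        {
                            "type": "object",
                            "properties": {
                                "date": {"type": "string", "description": "ISO 8601 date"},
                                "time": {"type": "string", "description": "24-hour HH:MM"},
                            },
                            "required": ["date", "time"],
                        }
                    )
                },
            }
        }
    ]
}
```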

For detailed integration guides, see the Amazon Bedrock documentation and the SDK references for LiveKit and Pipecat.

This content reflects the material presented at AWS re:Invent 2025 (session AIM374).
