Building the Next Generation of Voice Agents with Strands

Published: January 8, 2026 at 06:35 AM EST
7 min read
Source: Dev.to

Intro

In today’s hyper‑paced tech landscape, new frameworks drop almost daily. The real challenge isn’t just keeping up—it’s deciding which tools actually deserve your deep‑dive time.

If you’ve been hearing buzz about Strands Agents and their new bidirectional streaming (the BidiAgent), this guide is for you. I’ll break down what this feature is in simple terms, explore real‑world examples like real‑time voice assistants, and honestly weigh the disadvantages. By the end, you’ll know exactly if this is the right fit for your next high‑concurrency project.

Voice‑controlled concierge

Strands Agents

Strands Agents is an open‑source SDK developed by AWS that simplifies building scalable AI agents. While it originates in the AWS ecosystem, it isn’t a “walled garden.” You aren’t restricted to Amazon Bedrock; the framework is fully model‑agnostic, meaning you can integrate it with other cloud providers or even run it locally using Ollama.
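
To make that flexibility concrete, here is a minimal sketch of a plain, non‑streaming Strands agent pointed at a local Ollama model instead of Bedrock. Treat the provider class, host URL, and model ID as assumptions to verify against the current Strands docs (along with whichever package extra the Ollama provider needs).

from strands import Agent
from strands.models.ollama import OllamaModel  # assumes the Ollama provider is available in your install

# Point the agent at a locally running Ollama server instead of Amazon Bedrock.
local_model = OllamaModel(
    host="http://localhost:11434",  # default Ollama endpoint
    model_id="llama3",              # any model you've pulled with `ollama pull`
)

agent = Agent(
    model=local_model,
    system_prompt="You are a concise assistant.",
)

# Classic "ping-pong" request/response -- no streaming involved yet.
print(agent("In one sentence, what does a model-agnostic agent SDK buy you?"))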

Strands Agents Bidirectional Streaming (Experimental)

Typically we interact with AI through a “ping‑pong” text exchange: you send a message, wait, and the agent replies. The new Bidirectional Streaming feature (currently experimental) flips this script.

Imagine a conversation that feels… well, human. By leveraging full‑duplex communication, you can now interact with agents via voice in real time.

  • Unlike traditional setups that chain separate Speech‑to‑Text and Text‑to‑Speech models—often feeling laggy—Strands utilizes native speech‑to‑speech models like Amazon Nova Sonic.
  • This reduces latency and cost, allowing the agent to listen and speak simultaneously.
  • The result? You can finally interrupt your AI assistant (a feature called “barge‑in”) just like you would a friend in a natural conversation. A simplified sketch of that full‑duplex loop follows below.
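
To make “full duplex” and barge‑in concrete, here is a deliberately simplified, framework‑free sketch of the loop’s shape: one task streams microphone audio up while another plays the agent’s audio down, and detecting speech on the uplink cancels playback. The mic, agent‑audio, and speech‑detection helpers are hypothetical stand‑ins (fake data, no real audio); the actual BidiAgent handles this plumbing for you.

import asyncio

# Hypothetical stand-ins (fake data, no real audio) -- the real BidiAgent wires this up for you.
async def mic_chunks():
    """Simulate a microphone stream: mostly silence, then the user barges in."""
    for i in range(15):
        await asyncio.sleep(0.1)
        yield b"speech" if i == 8 else b"silence"

async def agent_audio():
    """Simulate the agent's long spoken answer arriving as audio chunks."""
    for i in range(30):
        await asyncio.sleep(0.1)
        yield f"agent-audio-chunk-{i}".encode()

def looks_like_speech(chunk: bytes) -> bool:
    # A real system would run voice-activity detection here.
    return chunk == b"speech"

async def speak() -> None:
    """Downlink: play the agent's audio until cancelled."""
    try:
        async for chunk in agent_audio():
            print("playing", chunk.decode())
    except asyncio.CancelledError:
        print("playback stopped mid-sentence (barge-in)")
        raise

async def listen(playback: asyncio.Task) -> None:
    """Uplink: keep listening while the agent talks; cancel playback on user speech."""
    async for chunk in mic_chunks():
        if looks_like_speech(chunk) and not playback.done():
            playback.cancel()  # the user interrupted -- stop talking and hand back the turn

async def main() -> None:
    playback = asyncio.create_task(speak())
    await listen(playback)  # both "directions" run at the same time

asyncio.run(main())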

Use Cases

Where does bidirectional streaming move from “cool tech” to an essential tool? Below are two high‑impact scenarios where the Strands BidiAgent transforms a frustrating task into a seamless conversation.

Use Case 1 – The Parking‑Location Assistant

The Problem – “Machine Interface” Barrier
We’ve all been there: wandering through a massive, multi‑level parking garage, completely forgetting where we left the car. Some high‑end malls have digital kiosks, but the experience is often frustrating. You have to locate the machine, navigate a clunky touchscreen UI, and manually type in your license plate. It feels like a cold, mechanical interaction that forces you to stop and “talk” to a computer on its terms.

The Solution – Conversational Environment
Because of bidirectional streaming, you can simply speak to the system as if a helpful concierge were standing right next to you. The interaction is fluid, real‑time, and—most importantly—doesn’t feel like a transaction with a machine.

Sample Conversation

User: “I’m completely lost—can you help me find my car?”
System: “I can certainly try! What is your license plate number?”
User: “It’s EHU 62E.”
System: “Got it. You’re actually on the opposite side of the mall, so you have a bit of a walk ahead of you. Take the elevator to your right, then—”
User (interrupting): “Wait, the elevator near the Starbucks?”
System: “Exactly! Go past the Starbucks, turn left at the exit, and your car will be the fifth one on your right.”

Why this is a game‑changer – In a traditional AI setup, the system would have to finish its long set of directions before you could ask for clarification. With the Strands BidiAgent, the system “hears” your interruption instantly and pivots the conversation, turning a rigid database query into a helpful, human‑like interaction.

How you choose to bring this conversation to life—whether through physical installations, integrated audio, or custom hardware—is where the real innovation happens.

Use Case 2 – The Interactive Mall Directory

The Problem – “Static Maze”
Most malls still rely on static or semi‑interactive “You Are Here” boards. Navigating these feels like using a paper map in a world where we expect GPS. You have to find your orientation, scan a list of 200 shops, and then mentally map out a path. It’s high‑friction and often leads to more confusion.

The Solution – Upgrading the Edge with BidiAgents
Instead of replacing the entire board, we “upgrade” it for 2026. By installing a small System‑on‑Chip (SoC) board such as the NVIDIA Jetson Nano and adding a simple microphone/speaker array, we can transform that static board into a voice‑first assistant powered by a Strands BidiAgent.
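
A practical first step on that hardware, before any agent code runs, is confirming the OS can actually see the microphone/speaker array. The sketch below uses the third‑party sounddevice library, which is my own choice here rather than anything Strands requires, and the device indexes are placeholders you would read off the listing on your own box.

import sounddevice as sd  # third-party: pip install sounddevice

# List every audio device the edge box exposes (USB mic array, HDMI/analog speakers, ...).
for index, device in enumerate(sd.query_devices()):
    roles = []
    if device["max_input_channels"] > 0:
        roles.append("input")
    if device["max_output_channels"] > 0:
        roles.append("output")
    print(f"[{index}] {device['name']} ({'/'.join(roles)})")

# Make the mic array and speakers the system defaults before starting the agent.
# The indexes are placeholders -- use whatever the listing above shows on your device.
sd.default.device = (1, 0)  # (input_device_index, output_device_index)
print("Default audio devices:", sd.default.device)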

Sample Conversation

User: “I’m looking for Starbucks… I think it’s on the third floor?”
System (responding instantly while the user pauses): “Actually, you’re in luck! It’s much closer. Just take the elevator to your right to the second floor.”
User: “Wait, the one near the fountain?”
System: “Exactly. Once you step out, turn left and it’ll be right in front of you.”

Why this wins – By running a speech‑to‑speech (S2S) model at the edge or as a low‑latency stream, the “map” becomes a proactive guide. It eliminates the need for touchscreens (more hygienic and accessible) and provides a human‑first interface in a machine‑driven environment.

I’ve actually built a prototype to show you how this looks in the real world. Below is a working example of this exact use case.

Mall Assistant in Action – Leveraging Strands and a Bidirectional Audio Loop

Working Example: Shop Location Assistant

import asyncio
from strands.experimental.bidi import BidiAgent, BidiAudioIO
from strands.experimental.bidi.io import BidiTextIO
from strands.experimental.bidi.models import BidiNovaSonicModel
from strands import tool
from strands_tools import calculator, current_time

# Create a bidirectional streaming model
model = BidiNovaSonicModel()

# Define a custom tool
@tool
def get_shop_location(shop: str) -> str:
    """
    Get the shop location for a given shop name.

    Args:
        shop: Name of the shop to locate

    Returns:
        A string with the directions to find the shop.
    """
    print("get_shop_location called with shop:", shop)
    # In a real application, call the location API that returns these instructions
    locations = {
        "starbucks": (
            "Take the elevator at your right, go to the second floor and turn left "
            "in that hall – you will find it on the right."
        ),
        "apple store": (
            "Go straight ahead from the main entrance, take the escalator to the "
            "first floor, and it's on your left."
        ),
        "food court": (
            "Head to the centre of the mall, take the stairs to the third floor, "
            "and you'll see it right in front of you."
        ),
        "bookstore": (
            "From the main entrance, turn right and walk past the clothing stores; "
            "it's next to the toy store."
        ),
    }
    if shop.lower() in locations:
        print("Found location for shop:", shop)
        return locations[shop.lower()]
    else:
        return "Sorry, we don't have that shop in the mall."

# Create the agent
agent = BidiAgent(
    model=model,
    tools=[calculator, current_time, get_shop_location],
    system_prompt=(
        "You are a mall assistant that helps people find any shop in a mall. "
        "Keep responses concise and natural."
    ),
)

# Set up audio I/O for microphone and speakers
audio_io = BidiAudioIO()
text_io = BidiTextIO()

# Run the conversation
async def main():
    await agent.run(
        inputs=[audio_io.input()],
        outputs=[audio_io.output(), text_io.output()],
    )

asyncio.run(main())

Note: The shop locations are hard‑coded inside the tool. In this use case, that’s a deliberate choice: the mall’s map is static, so hard‑coding gives 100% accuracy and near‑zero latency. While you could hook into an external API or use a Retrieval‑Augmented Generation (RAG) approach with a digital map, those methods add cost and increase the risk of the model “hallucinating” directions. For a high‑traffic mall, a simple, local “source of truth” is often the most robust solution.
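
If your layout does change often and the trade‑off tips the other way, only the tool body needs to change; the agent wiring stays identical. The sketch below swaps the dictionary for a lookup against a hypothetical mall‑map endpoint (the URL and response shape are made up for illustration), keeping the static table as a fallback so the voice agent never goes silent.

import json
import urllib.parse
import urllib.request

from strands import tool

# Kept as a fallback "source of truth" if the live lookup fails.
FALLBACK_DIRECTIONS = {
    "starbucks": (
        "Take the elevator at your right, go to the second floor and turn left "
        "in that hall – you will find it on the right."
    ),
}

@tool
def get_shop_location(shop: str) -> str:
    """Get directions to a shop, preferring a live mall-map API over the local table."""
    # Hypothetical endpoint -- replace with your mall's real map service.
    url = f"https://maps.example-mall.com/api/shops/{urllib.parse.quote(shop.lower())}"
    try:
        with urllib.request.urlopen(url, timeout=2) as response:
            payload = json.load(response)
            return payload["directions"]  # assumed response field
    except Exception:
        return FALLBACK_DIRECTIONS.get(
            shop.lower(), "Sorry, we don't have that shop in the mall."
        )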

Demo Video: (embed or link the video here)

The Challenge: Tackling Environmental Noise

No experimental feature is without its growing pains. During testing I hit a significant hurdle: white‑noise sensitivity.

Bidirectional streaming is designed for natural conversation and constantly listens for a “barge‑in” (when a user interrupts the AI). If the computer fan kicked in, the built‑in microphone picked up that hum as an interruption, causing the agent to stop mid‑sentence.

Technical Note on Nova Sonic Versions

  • Current State: At the time of writing, the Strands integration is primarily optimized for nova‑sonic‑v1. This version lacks granular settings to adjust the “interruption threshold.”
  • Future: The upcoming nova‑sonic‑v2 promises better configurations for noise suppression and sensitivity.

What to do for real‑world deployments (e.g., our mall assistant):

  1. Use high‑quality directional microphones, or
  2. Wait for the broader integration of Nova Sonic v2, or
  3. Switch to a provider that already offers adjustable sensitivity (e.g., OpenAI’s real‑time voice models); see the configuration sketch below.
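
As a concrete illustration of option 3, OpenAI’s Realtime API lets you tune server‑side voice‑activity detection on the session, including its sensitivity threshold. The snippet below is just that configuration payload as a Python dict (not a Strands integration), and the values are starting points you would tune against your environment’s noise floor.

# Turn-detection settings for OpenAI's Realtime API, sent as a "session.update"
# event over the Realtime WebSocket connection.
session_update = {
    "type": "session.update",
    "session": {
        "turn_detection": {
            "type": "server_vad",        # server-side voice activity detection
            "threshold": 0.7,            # raise above the 0.5 default so fan hum isn't treated as speech
            "prefix_padding_ms": 300,    # audio retained from just before speech is detected
            "silence_duration_ms": 700,  # how long the user must pause before their turn ends
        }
    },
}

# ws.send(json.dumps(session_update))  # where `ws` is your open Realtime WebSocket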

Conclusion: The Future Is Conversational

We’re moving away from clicking buttons toward natural dialogue. While bidirectional streaming is still experimental—especially regarding sensitivity—the potential to humanize technology is immense. From smarter mall directories to interactive industrial assistants, the shift from “Text‑In/Text‑Out” to “Live Conversation” is the next frontier.

The complexity of these implementations—especially when dealing with hardware like a Jetson Nano or tuning model sensitivity—is where the real work begins. If you’re curious about bringing this “human experience” to your hardware or project, let’s talk. I’m actively exploring these architectures and would love to help you navigate the nuances of building your next intelligent agent.
