Beyond Dictation: Building Software Just by Talking

Published: February 23, 2026 at 08:40 PM EST
9 min read
Source: Dev.to

TL;DR: Kiro Steering Studio is a voice‑powered tool that generates structured Kiro steering files through natural conversation — not dictation. Built on Amazon Nova 2 Sonic’s bidirectional streaming, it routes what you say to the right files, tracks open questions, and produces AI‑optimized markdown context for your workspace. This post covers how it’s built, why it’s different from the voice‑AI tools you already use, and what I learned along the way.

Voice is becoming a first‑class interface in developer tooling

The voice‑to‑text space has exploded in 2026, with several products competing to make typing obsolete. Some treat it as a faster way to input text, while others use it to drive structured, agentic workflows.

The Voice AI Landscape for Devs in Q1 2026

  • OpenAI – Codex
    Voice is a supported feature, but the scope is intentionally narrow.
    The official Codex macOS app (released Feb 2026) includes voice commands that let developers speak prompts directly into the agent interface. The VS Code extension similarly supports voice‑to‑text dictation for entering instructions. In both cases, voice is a prompt‑delivery mechanism: you speak a task, Codex executes it in an isolated sandbox, and proposes a PR. The voice interface doesn’t change the interaction model; it merely removes your keyboard from the loop.

  • Anthropic – Claude
    Anthropic has introduced an official Voice Mode for the general Claude app (mobile & web). Voice capabilities for Claude Code are largely based on community‑developed, third‑party integrations.

  • Cursor
    Cursor 2.0 introduced official voice support. Voice Mode lets you control the editor and its AI features with spoken commands such as “open file app.ts”, “extract function”, or “refactor this to use async/await”. The AI drafts a patch in response. This is a meaningful step beyond pure dictation because spoken instructions can trigger multi‑step edits.

  • SuperWhisper & WisprFlow
    These sit at the other end of the spectrum—general‑purpose dictation tools that developers adopt for everything from crafting prompts to drafting documentation. WisprFlow wins on seamless “flow” with auto‑edits that make dictation feel natural. Both integrate via keyboard shortcuts and excel at transcription.

All of these tools validate the same insight: voice can be faster and more natural than typing. But every one of them treats voice as an input mechanism.

When you use any of these tools to build software, you’re still doing the cognitive work of:

  • Structuring information into the right format
  • Maintaining consistency in terminology and conventions
  • Organizing content into logical sections

You might speak faster than you type, but you’re still manually authoring markdown files.

What Kiro Steering Files Actually Do

Before explaining how Kiro Steering Studio works, it’s worth understanding what steering files are and why they matter. At its core, steering gives Kiro persistent knowledge about your workspace through markdown files. Instead of explaining your conventions in every chat, steering files ensure Kiro consistently follows your established patterns, libraries, and standards.

Kiro Steering Files

The three core files that capture project context

| File | Purpose |
| --- | --- |
| `product.md` | Defines what you’re building: a one‑liner, target users, MVP journeys & features, non‑goals, success metrics, and a domain glossary. |
| `tech.md` | Defines how to build it: frontend stack, backend approach, authentication, data storage, IaC, observability, and styling guide. |
| `structure.md` | Defines project organization: repository layout, naming conventions, import patterns, architecture patterns, and testing approach. |

Writing these by hand is tedious. Kiro offers an “easy‑button” to auto‑generate them if you already have a well‑established codebase, but that’s not the case when you’re building a new application from scratch.

How Steering Studio Is Different

Kiro Steering Studio treats voice as an interface to structured knowledge generation, not just simple transcription. Instead of manually writing steering files, you talk about your project. The AI asks clarifying questions, probes for details you might have overlooked, and generates properly structured steering files in real time. The conversation becomes the documentation.

Conversational Extraction

You have a natural conversation instead of dictating pre‑structured content. Example:

“I’m building a task‑management app for internal engineering teams using React with TypeScript and Node.js.”

The AI doesn’t just transcribe verbatim; it asks clarifying questions you might not have considered:

  • “What’s your state‑management approach — Redux, React Query, or Context API?”
  • “How do you handle authentication?”
  • “What’s your testing approach?”
  • “Should authentication use OAuth or magic links?”

Each answer updates the appropriate steering file as the conversation probes for completeness.

Intelligent Routing

The AI understands where information belongs:

  • Mentioning “React with TypeScript” automatically updates the frontend section of tech.md.
  • Describing user journeys populates product.md.
  • Explaining your directory structure updates structure.md.

Active Gap Detection

The AI tracks what’s missing. If you haven’t specified your frontend stack or naming conventions, it logs open questions and prompts you to resolve them. When you answer, it closes out the question and records the decision in the appropriate steering file.

How It’s Built

The architecture splits into four concerns: streaming, session management, steering state, and tool handling.

Architecture Diagram: Kiro Steering Studio

NovaSonicClient: Bidirectional Audio Streaming

At the center of our app is real‑time, bidirectional streaming with Amazon Bedrock, specifically with Nova 2 Sonic – Amazon’s speech‑to‑speech foundation model.

Unlike a request‑response flow where you record speech, send it as one request, wait for a response, then execute tool calls in a batch, Nova 2 Sonic processes audio as you speak and interleaves tool execution with the conversation.

Traditional voice‑AI flow

  1. Record all speech
  2. Send complete audio
  3. Wait for response
  4. Execute tool calls
  5. Send results
  6. Wait for final response

Bidirectional streaming flow

  1. Audio streams continuously – no waiting for speech to finish
  2. Model responds while you’re still talking
  3. Tool calls happen mid‑conversation, not after
  4. Results flow back immediately; model continues speaking

Audio buffers queue up (max 220 chunks) and are processed in batches of five to prevent overwhelming the stream. When the queue fills under pressure, old chunks are shed to maintain real‑time responsiveness.

The client handles the session lifecycle—start, audio content, prompts, tool results, and graceful shutdown—through a state machine that tracks which events have been sent.
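A lifecycle state machine like the one described might look like the sketch below. The state names are assumptions inferred from the events listed (start, audio content, prompts, tool results, shutdown), not the actual client implementation.

```typescript
// Illustrative session lifecycle state machine with explicit legal transitions.
type SessionState = "idle" | "started" | "streaming" | "closing" | "closed";

const transitions: Record<SessionState, SessionState[]> = {
  idle: ["started"],
  started: ["streaming", "closing"],
  streaming: ["streaming", "closing"],
  closing: ["closed"],
  closed: [],
};

class SessionLifecycle {
  state: SessionState = "idle";

  // Reject out-of-order events, e.g. audio sent before the session starts.
  transition(next: SessionState): boolean {
    if (!transitions[this.state].includes(next)) return false;
    this.state = next;
    return true;
  }
}
```

Tracking which events have been sent as explicit states makes graceful shutdown a first-class transition rather than an afterthought.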

Tool System: Synchronous Execution, Interleaved with Speech

Tool calls don’t wait until you finish speaking. The model might be mid‑sentence describing your project, realize it should update the product steering, emit a toolUse event, get the result back, and continue talking. This happens in the toolEnd event handler:

session.onEvent('toolEnd', async (d: unknown) => {
  const toolData = d as ToolEndData;
  const result = runTool(store, toolData.toolName, toolData.toolUseContent); // Synchronous
  await sonic.sendToolResult(socket.id, toolData.toolUseId, result);          // Send back to model
});

Available tools

| Tool | Purpose |
| --- | --- |
| `set_product_steering` | App description, user journeys, MVP features, success metrics |
| `set_tech_steering` | Frontend/backend stack, auth, data, infrastructure, constraints |
| `set_structure_steering` | Repo layout, naming conventions, architecture patterns |
| `add_open_question` | Log decisions that need resolution |
| `resolve_open_question` | Close out questions with documented decisions |
| `get_steering_summary` | Check what’s missing |
| `checkpoint_steering_files` | Persist to disk |
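The `runTool()` call from the handler above can be sketched as a synchronous dispatcher over these tools. This is a simplified illustration, with the store shape and handler signatures assumed; only the tool names come from the table above.

```typescript
// Illustrative synchronous tool dispatcher, shaped like runTool() above.
type SteeringStore = {
  product: Record<string, string>;
  tech: Record<string, string>;
  openQuestions: string[];
};

type ToolHandler = (store: SteeringStore, input: any) => string;

const handlers: Record<string, ToolHandler> = {
  set_product_steering: (store, input) => {
    Object.assign(store.product, input); // in-memory update, no blocking I/O
    return JSON.stringify({ status: "ok" });
  },
  add_open_question: (store, input) => {
    store.openQuestions.push(input.question); // log a decision to resolve later
    return JSON.stringify({ status: "ok", open: store.openQuestions.length });
  },
  get_steering_summary: (store) =>
    JSON.stringify({ openQuestions: store.openQuestions }),
};

function runTool(store: SteeringStore, name: string, input: any): string {
  const handler = handlers[name];
  if (!handler) {
    return JSON.stringify({ status: "error", message: `unknown tool: ${name}` });
  }
  return handler(store, input); // synchronous: the result goes straight back to the model
}
```

Because every handler is a pure in-memory mutation, the dispatcher can return within the same event-loop tick, which is what keeps tool calls inaudible mid-conversation.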

Each tool description guides the AI toward LLM‑friendly output: terse bullets, exact versions, explicit anti‑patterns, and a clear statement of each file’s purpose. The descriptions are the secret sauce:

const techDescription = `Write in terse bullet-point format. For each field include:
- Exact versions (e.g., "Next.js 14.2" not "Next.js")
- Key conventions to follow
- What NOT to do (anti-patterns)
- Relevant CLI commands where applicable`;

SteeringStore: In‑Memory State with Atomic Writes

The store maintains steering state in memory and writes atomically to disk. The merge mode (merge vs. replace) controls whether updates extend existing content or overwrite it. Session state persists to a JSON file for recovery:

{
  "version": 1,
  "updatedAt": "2025-01-26T18:30:00.000Z",
  "product": {
    "appOneLiner": "A task management app for remote teams",
    "targetUsers": "Distributed engineering teams"
  },
  "tech": {
    "frontend": "React with TypeScript",
    "backend": "Node.js with Express"
  }
}

Restart the server, and the conversation picks up where you left off.
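The atomic-write and merge-mode behavior can be sketched like this. The function names and paths are illustrative assumptions; the pattern itself (write a temp file, then rename over the target) is the standard way to guarantee readers never see a half-written `state.json`.

```typescript
// Illustrative atomic persistence and merge-vs-replace update logic.
import { writeFileSync, renameSync } from "node:fs";

function persistState(path: string, state: object): void {
  const tmp = `${path}.tmp`;
  writeFileSync(tmp, JSON.stringify(state, null, 2)); // full write to a temp file
  renameSync(tmp, path); // rename is atomic on the same filesystem
}

// Merge mode extends existing content; replace mode overwrites it wholesale.
function applyUpdate<T extends object>(
  current: T,
  update: Partial<T>,
  mode: "merge" | "replace"
): T {
  return mode === "merge" ? { ...current, ...update } : (update as T);
}
```

Merge is the safer default mid-conversation: a follow-up answer about the backend shouldn’t wipe out what was already captured about the frontend.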

Design Decisions

A few things I learned building this:

Conversational state is harder than it looks

The first major challenge was maintaining conversational state—tracking what topics have been covered, what remains outstanding, and storing open questions for later follow‑up. The solution was a state‑file management system combined with Zod‑validated tool calling. This lets the AI atomically update steering files mid‑conversation while persisting session context to a state.json file that enables recovery across interruptions. The schemas validate structure; the state file captures continuity.

Humans think (long) before they answer

Human decision‑making often involves pauses while thinking about architecture and tech‑stack choices. Those pauses can timeout a streaming session. We addressed this with a keep‑alive timer in the session manager that sends periodic signals to maintain the Bedrock connection during extended thinking pauses without prematurely terminating active conversations.
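A keep-alive timer of the kind described is simple to sketch. The interval and the session interface here are assumptions for illustration; the real session manager presumably sends a protocol-appropriate signal on each tick.

```typescript
// Illustrative keep-alive timer for long thinking pauses.
class KeepAlive {
  private timer: ReturnType<typeof setInterval> | null = null;

  constructor(private ping: () => void, private intervalMs = 15_000) {}

  start(): void {
    this.stop(); // never stack timers for one session
    this.timer = setInterval(this.ping, this.intervalMs);
  }

  stop(): void {
    if (this.timer) {
      clearInterval(this.timer);
      this.timer = null;
    }
  }
}
```

Stopping the timer on session shutdown matters as much as starting it: a leaked interval would keep a dead Bedrock connection pinned open.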

Tool descriptions matter more than schemas

Early versions had minimal tool descriptions with detailed JSON schemas. The AI called tools correctly but produced generic content. The fix was treating tool descriptions as prompts: provide specific format guidance, examples of good output, and explicit anti‑patterns. Schemas still validate structure, but descriptions shape quality. In short, good prompt engineering makes all the difference!

Tool execution must be fast

Because our tools execute synchronously and interleaved with speech, slow tools would create audible pauses. All seven steering tools are designed for sub‑millisecond execution: in‑memory state updates with no blocking I/O. File persistence happens via checkpoint_steering_files(), which the model calls at natural conversation breaks. If you add custom tools, keep them fast or make them async with immediate acknowledgment.
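The "async with immediate acknowledgment" pattern suggested above for slow custom tools can be sketched in a few lines. The function name and ack payload are hypothetical; the point is that the synchronous path returns immediately while the slow work continues in the background.

```typescript
// Illustrative fire-and-forget wrapper for a slow custom tool.
function runSlowToolAsync(work: () => Promise<void>): string {
  // Kick off the slow work without awaiting it; log failures out-of-band.
  work().catch((err) => console.error("background tool failed:", err));
  // Return an acknowledgment immediately so speech is never blocked.
  return JSON.stringify({ status: "accepted" });
}
```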

If you’re building something voice‑powered, I want to hear about it – leave a comment below!

Interested in giving Kiro Steering Studio a try? The code is available at our GitHub repo:

https://github.com/aws-samples/sample-kiro-steering-studio?sc_channel=sm&sc_publisher=YOUTUBE&sc_country=global&sc_geo=GLOBAL&sc_outcome=awareness&trkCampaign=78b97721-98e7-4499-a2db-d7f66c04e460&sc_content=2026_developer_campaigns_kiro_NAMER&sc_category=Amazon%20Nova&trk=78b97721-98e7-4499-a2db-d7f66c04e460&linkId=909040901
