Beyond Dictation: Building Software Just by Talking

Published: February 23, 2026 at 08:40 PM EST
9 min read
Source: Dev.to

TL;DR: Kiro Steering Studio is a voice‑powered tool that generates structured Kiro steering files through natural conversation — not dictation. Built on Amazon Nova 2 Sonic’s bidirectional streaming, it routes what you say to the right files, tracks open questions, and produces AI‑optimized markdown context for your workspace. This post covers how it’s built, why it’s different from the voice‑AI tools you already use, and what I learned along the way.

Voice is becoming a first‑class interface in developer tooling

The voice‑to‑text space has exploded in 2026, with several products competing to make typing obsolete. Some treat it as a faster way to input text, while others use it to drive structured, agentic workflows.

The Voice AI Landscape for Devs in Q1 2026

  • OpenAI – Codex
    Voice is a supported feature, but the scope is intentionally narrow.
    The official Codex macOS app (released Feb 2026) includes voice commands that let developers speak prompts directly into the agent interface. The VS Code extension similarly supports voice‑to‑text dictation for entering instructions. In both cases, voice is a prompt‑delivery mechanism: you speak a task, Codex executes it in an isolated sandbox, and proposes a PR. The voice interface doesn’t change the interaction model; it merely removes your keyboard from the loop.

  • Anthropic – Claude
    Anthropic has introduced an official Voice Mode for the general Claude app (mobile & web). Voice capabilities for Claude Code are largely based on community‑developed, third‑party integrations.

  • Cursor
    Cursor 2.0 introduced official voice support. Voice Mode lets you control the editor and its AI features with spoken commands such as “open file app.ts”, “extract function”, or “refactor this to use async/await”. The AI drafts a patch in response. This is a meaningful step beyond pure dictation because spoken instructions can trigger multi‑step edits.

  • SuperWhisper & WisprFlow
    These sit at the other end of the spectrum—general‑purpose dictation tools that developers adopt for everything from crafting prompts to drafting documentation. WisprFlow wins on seamless “flow” with auto‑edits that make dictation feel natural. Both integrate via keyboard shortcuts and excel at transcription.

All of these tools validate the same insight: voice can be faster and more natural than typing. But every one of them treats voice as an input mechanism.

When you use any of these tools to build software, you’re still doing the cognitive work of:

  • Structuring information into the right format
  • Maintaining consistency in terminology and conventions
  • Organizing content into logical sections

You might speak faster than you type, but you’re still manually authoring markdown files.

What Kiro Steering Files Actually Do

Before explaining how Kiro Steering Studio works, it’s worth understanding what steering files are and why they matter. At its core, steering gives Kiro persistent knowledge about your workspace through markdown files. Instead of explaining your conventions in every chat, steering files ensure Kiro consistently follows your established patterns, libraries, and standards.

Kiro Steering Files

The three core files that capture project context

| File | Purpose |
| --- | --- |
| `product.md` | Defines what you’re building: a one‑liner, target users, MVP journeys & features, non‑goals, success metrics, and a domain glossary. |
| `tech.md` | Defines how to build it: frontend stack, backend approach, authentication, data storage, IaC, observability, and styling guide. |
| `structure.md` | Defines project organization: repository layout, naming conventions, import patterns, architecture patterns, and testing approach. |

Writing these by hand is tedious. Kiro offers an “easy‑button” to auto‑generate them if you already have a well‑established codebase, but that’s not the case when you’re building a new application from scratch.

How Steering Studio Is Different

Kiro Steering Studio treats voice as an interface to structured knowledge generation, not just simple transcription. Instead of manually writing steering files, you talk about your project. The AI asks clarifying questions, probes for details you might have overlooked, and generates properly structured steering files in real time. The conversation becomes the documentation.

Conversational Extraction

You have a natural conversation instead of dictating pre‑structured content. Example:

“I’m building a task‑management app for internal engineering teams using React with TypeScript and Node.js.”

The AI doesn’t just transcribe verbatim; it asks clarifying questions you might not have considered:

  • “What’s your state‑management approach — Redux, React Query, or Context API?”
  • “How do you handle authentication?”
  • “What’s your testing approach?”
  • “Should authentication use OAuth or magic links?”

Each answer updates the appropriate steering file as the conversation probes for completeness.

Intelligent Routing

The AI understands where information belongs:

  • Mentioning “React with TypeScript” automatically updates the frontend section of tech.md.
  • Describing user journeys populates product.md.
  • Explaining your directory structure updates structure.md.

Active Gap Detection

The AI tracks what’s missing. If you haven’t specified your frontend stack or naming conventions, it logs open questions and prompts you to resolve them. When you answer, it closes out the question and records the decision in the appropriate steering file.

How It’s Built

The architecture splits into four concerns: streaming, session management, steering state, and tool handling.

Architecture Diagram: Kiro Steering Studio

NovaSonicClient: Bidirectional Audio Streaming

At the center of our app is real‑time, bidirectional streaming with Amazon Bedrock, specifically with Nova 2 Sonic – Amazon’s speech‑to‑speech foundation model.

Unlike a request‑response flow where you record speech, send it as one request, wait for a response, then execute tool calls in a batch, Nova 2 Sonic processes audio as you speak and interleaves tool execution with the conversation.

Traditional voice‑AI flow

  1. Record all speech
  2. Send complete audio
  3. Wait for response
  4. Execute tool calls
  5. Send results
  6. Wait for final response

Bidirectional streaming flow

  1. Audio streams continuously – no waiting for speech to finish
  2. Model responds while you’re still talking
  3. Tool calls happen mid‑conversation, not after
  4. Results flow back immediately; model continues speaking

Audio buffers queue up (max 220 chunks) and are processed in batches of five to prevent overwhelming the stream. When the queue fills under pressure, old chunks are shed to maintain real‑time responsiveness.

The client handles the session lifecycle—start, audio content, prompts, tool results, and graceful shutdown—through a state machine that tracks which events have been sent.
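A lifecycle state machine like the one described might look like the sketch below. The state names are assumptions inferred from the events listed (start, audio content, prompts, tool results, shutdown), not the actual client implementation.

```typescript
// Illustrative session lifecycle state machine with explicit legal transitions.
type SessionState = "idle" | "started" | "streaming" | "closing" | "closed";

const transitions: Record<SessionState, SessionState[]> = {
  idle: ["started"],
  started: ["streaming", "closing"],
  streaming: ["streaming", "closing"],
  closing: ["closed"],
  closed: [],
};

class SessionLifecycle {
  state: SessionState = "idle";

  // Reject out-of-order events, e.g. audio sent before the session starts.
  transition(next: SessionState): boolean {
    if (!transitions[this.state].includes(next)) return false;
    this.state = next;
    return true;
  }
}
```

Tracking which events have been sent as explicit states makes graceful shutdown a first-class transition rather than an afterthought.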

Tool System: Synchronous Execution, Interleaved with Speech

Tool calls don’t wait until you finish speaking. The model might be mid‑sentence describing your project, realize it should update the product steering, emit a toolUse event, get the result back, and continue talking. This happens in the toolEnd event handler:

session.onEvent('toolEnd', async (d: unknown) => {
  const toolData = d as ToolEndData;
  const result = runTool(store, toolData.toolName, toolData.toolUseContent); // Synchronous
  await sonic.sendToolResult(socket.id, toolData.toolUseId, result);          // Send back to model
});

Available tools

| Tool | Purpose |
| --- | --- |
| `set_product_steering` | App description, user journeys, MVP features, success metrics |
| `set_tech_steering` | Frontend/backend stack, auth, data, infrastructure, constraints |
| `set_structure_steering` | Repo layout, naming conventions, architecture patterns |
| `add_open_question` | Log decisions that need resolution |
| `resolve_open_question` | Close out questions with documented decisions |
| `get_steering_summary` | Check what’s missing |
| `checkpoint_steering_files` | Persist to disk |
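The `runTool()` call from the handler above can be sketched as a synchronous dispatcher over these tools. This is a simplified illustration, with the store shape and handler signatures assumed; only the tool names come from the table above.

```typescript
// Illustrative synchronous tool dispatcher, shaped like runTool() above.
type SteeringStore = {
  product: Record<string, string>;
  tech: Record<string, string>;
  openQuestions: string[];
};

type ToolHandler = (store: SteeringStore, input: any) => string;

const handlers: Record<string, ToolHandler> = {
  set_product_steering: (store, input) => {
    Object.assign(store.product, input); // in-memory update, no blocking I/O
    return JSON.stringify({ status: "ok" });
  },
  add_open_question: (store, input) => {
    store.openQuestions.push(input.question); // log a decision to resolve later
    return JSON.stringify({ status: "ok", open: store.openQuestions.length });
  },
  get_steering_summary: (store) =>
    JSON.stringify({ openQuestions: store.openQuestions }),
};

function runTool(store: SteeringStore, name: string, input: any): string {
  const handler = handlers[name];
  if (!handler) {
    return JSON.stringify({ status: "error", message: `unknown tool: ${name}` });
  }
  return handler(store, input); // synchronous: the result goes straight back to the model
}
```

Because every handler is a pure in-memory mutation, the dispatcher can return within the same event-loop tick, which is what keeps tool calls inaudible mid-conversation.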

Each tool description guides the AI toward LLM‑friendly output: terse bullets, exact versions, explicit anti‑patterns, and a clear statement of each file’s purpose. The descriptions are the secret sauce:

const techDescription = `Write in terse bullet-point format. For each field include:
- Exact versions (e.g., "Next.js 14.2" not "Next.js")
- Key conventions to follow
- What NOT to do (anti-patterns)
- Relevant CLI commands where applicable`;

SteeringStore: In‑Memory State with Atomic Writes

The store maintains steering state in memory and writes atomically to disk. The merge mode (merge vs. replace) controls whether updates extend existing content or overwrite it. Session state persists to a JSON file for recovery:

{
  "version": 1,
  "updatedAt": "2025-01-26T18:30:00.000Z",
  "product": {
    "appOneLiner": "A task management app for remote teams",
    "targetUsers": "Distributed engineering teams"
  },
  "tech": {
    "frontend": "React with TypeScript",
    "backend": "Node.js with Express"
  }
}

Restart the server, and the conversation picks up where you left off.
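The atomic-write and merge-mode behavior can be sketched like this. The function names and paths are illustrative assumptions; the pattern itself (write a temp file, then rename over the target) is the standard way to guarantee readers never see a half-written `state.json`.

```typescript
// Illustrative atomic persistence and merge-vs-replace update logic.
import { writeFileSync, renameSync } from "node:fs";

function persistState(path: string, state: object): void {
  const tmp = `${path}.tmp`;
  writeFileSync(tmp, JSON.stringify(state, null, 2)); // full write to a temp file
  renameSync(tmp, path); // rename is atomic on the same filesystem
}

// Merge mode extends existing content; replace mode overwrites it wholesale.
function applyUpdate<T extends object>(
  current: T,
  update: Partial<T>,
  mode: "merge" | "replace"
): T {
  return mode === "merge" ? { ...current, ...update } : (update as T);
}
```

Merge is the safer default mid-conversation: a follow-up answer about the backend shouldn’t wipe out what was already captured about the frontend.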

Design Decisions

A few things I learned building this:

Conversational state is harder than it looks

The first major challenge was maintaining conversational state—tracking what topics have been covered, what remains outstanding, and storing open questions for later follow‑up. The solution was a state‑file management system combined with Zod‑validated tool calling. This lets the AI atomically update steering files mid‑conversation while persisting session context to a state.json file that enables recovery across interruptions. The schemas validate structure; the state file captures continuity.

Humans think (long) before they answer

Human decision‑making often involves pauses while thinking about architecture and tech‑stack choices. Those pauses can timeout a streaming session. We addressed this with a keep‑alive timer in the session manager that sends periodic signals to maintain the Bedrock connection during extended thinking pauses without prematurely terminating active conversations.
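A keep-alive timer of the kind described is simple to sketch. The interval and the session interface here are assumptions for illustration; the real session manager presumably sends a protocol-appropriate signal on each tick.

```typescript
// Illustrative keep-alive timer for long thinking pauses.
class KeepAlive {
  private timer: ReturnType<typeof setInterval> | null = null;

  constructor(private ping: () => void, private intervalMs = 15_000) {}

  start(): void {
    this.stop(); // never stack timers for one session
    this.timer = setInterval(this.ping, this.intervalMs);
  }

  stop(): void {
    if (this.timer) {
      clearInterval(this.timer);
      this.timer = null;
    }
  }
}
```

Stopping the timer on session shutdown matters as much as starting it: a leaked interval would keep a dead Bedrock connection pinned open.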

Tool descriptions matter more than schemas

Early versions had minimal tool descriptions with detailed JSON schemas. The AI called tools correctly but produced generic content. The fix was treating tool descriptions as prompts: provide specific format guidance, examples of good output, and explicit anti‑patterns. Schemas still validate structure, but descriptions shape quality. In short, good prompt engineering makes all the difference!

Tool execution must be fast

Because our tools execute synchronously and interleaved with speech, slow tools would create audible pauses. All seven steering tools are designed for sub‑millisecond execution: in‑memory state updates with no blocking I/O. File persistence happens via checkpoint_steering_files(), which the model calls at natural conversation breaks. If you add custom tools, keep them fast or make them async with immediate acknowledgment.
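The "async with immediate acknowledgment" pattern suggested above for slow custom tools can be sketched in a few lines. The function name and ack payload are hypothetical; the point is that the synchronous path returns immediately while the slow work continues in the background.

```typescript
// Illustrative fire-and-forget wrapper for a slow custom tool.
function runSlowToolAsync(work: () => Promise<void>): string {
  // Kick off the slow work without awaiting it; log failures out-of-band.
  work().catch((err) => console.error("background tool failed:", err));
  // Return an acknowledgment immediately so speech is never blocked.
  return JSON.stringify({ status: "accepted" });
}
```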

If you’re building something voice‑powered, I want to hear about it – leave a comment below!

Interested in giving Kiro Steering Studio a try? The code is available at our GitHub repo:

https://github.com/aws-samples/sample-kiro-steering-studio?sc_channel=sm&sc_publisher=YOUTUBE&sc_country=global&sc_geo=GLOBAL&sc_outcome=awareness&trkCampaign=78b97721-98e7-4499-a2db-d7f66c04e460&sc_content=2026_developer_campaigns_kiro_NAMER&sc_category=Amazon%20Nova&trk=78b97721-98e7-4499-a2db-d7f66c04e460&linkId=909040901
