Building a Browser-Based Voice-to-Text App with the Web Speech API

Published: December 12, 2025 at 12:49 PM EST
2 min read
Source: Dev.to

Why Browser-Based?

Privacy is the main sell. When recognition runs on-device, audio never leaves the user's machine: no uploads, no storage, no GDPR headaches. For a simple transcription tool, this is a huge advantage (though, as we'll see, not every browser keeps it fully local).

The Web Speech API Basics

The API is surprisingly simple:

const recognition = new (window.SpeechRecognition || window.webkitSpeechRecognition)();

recognition.continuous = true;
recognition.interimResults = true;
recognition.lang = 'en-US';

recognition.onresult = (event) => {
  const transcript = Array.from(event.results)
    .map(result => result[0].transcript)
    .join('');
  console.log(transcript);
};

recognition.start();

That’s it. You now have live speech‑to‑text.
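One refinement worth making early: with `continuous` and `interimResults` both on, `event.results` mixes finalized results with in-progress guesses that will change. The mapping above can be factored into a pure helper that separates the two (the helper name and return shape are mine, not part of the API):

```javascript
// Split a SpeechRecognitionResultList-like structure into final and
// interim text. Works on any array-like of results where each result
// has an isFinal flag and its best alternative at index 0.
function splitTranscript(results) {
  let final = '';
  let interim = '';
  for (const result of Array.from(results)) {
    if (result.isFinal) {
      final += result[0].transcript;
    } else {
      interim += result[0].transcript;
    }
  }
  return { final, interim };
}
```

Inside `onresult`, you would append `final` to the saved transcript and render `interim` in a lighter style, since it may still be revised.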

The Gotchas Nobody Warns You About

1. Browser support is inconsistent

Chrome uses Google’s servers (ironically, not fully local). Safari uses on‑device processing. Firefox support is limited. Always check:

if (!('SpeechRecognition' in window || 'webkitSpeechRecognition' in window)) {
  // Show fallback UI
}

2. It stops listening randomly

The API has a habit of stopping after silence. You need to restart it:

recognition.onend = () => {
  if (shouldKeepListening) {
    recognition.start();
  }
};
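The `shouldKeepListening` flag needs to be wired to explicit start/stop controls, or a user's deliberate stop will immediately restart recognition. A minimal wrapper, as one possible sketch (not a pattern from the spec):

```javascript
// Wraps any recognition-like object (start/stop methods, onend hook)
// so it auto-restarts after silence but stays stopped when the user
// explicitly stops it.
function makeKeepAlive(recognition) {
  let shouldKeepListening = false;
  recognition.onend = () => {
    // The browser fires onend after silence; restart unless the user
    // asked to stop.
    if (shouldKeepListening) recognition.start();
  };
  return {
    start() { shouldKeepListening = true; recognition.start(); },
    stop() { shouldKeepListening = false; recognition.stop(); },
  };
}
```

The UI's mic button then calls `start()`/`stop()` on the wrapper instead of touching `recognition` directly.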

3. Punctuation doesn’t exist

The API returns raw words with no periods, commas, or capitalization. You’ll need to handle this yourself:

function addAutoPunctuation(text) {
  return text
    // Handle spoken punctuation like "question mark" → "?"
    .replace(/\s*\bquestion mark\b/gi, '?')
    .replace(/\s*\bcomma\b/gi, ',')
    .replace(/\s*\bperiod\b/gi, '.')
    // Capitalize the first letter of each sentence
    .replace(/(^|[.?!]\s+)([a-z])/g, (m, boundary, letter) => boundary + letter.toUpperCase());
}

4. Language switching is manual

You need to build your own language selector and set recognition.lang accordingly. The API supports 100+ languages but won’t auto‑detect.
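A minimal version of that selector logic, assuming a hand-picked map of UI labels to BCP 47 tags (the labels and helper function are illustrative; the tags themselves are standard):

```javascript
// Map UI labels to BCP 47 language tags accepted by recognition.lang.
const LANGUAGES = {
  English: 'en-US',
  Hindi: 'hi-IN',
  Spanish: 'es-ES',
};

// Stop, switch, and restart so the new language applies to
// subsequent audio rather than the in-flight session.
function switchLanguage(recognition, label) {
  const tag = LANGUAGES[label];
  if (!tag) throw new Error(`Unsupported language: ${label}`);
  recognition.stop();
  recognition.lang = tag;
  recognition.start();
  return tag;
}
```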

When to NOT Use Web Speech API

For anything beyond basic dictation, you’ll hit walls:

  • Audio file transcription — API only does live mic input
  • Speaker identification — Not supported
  • Timestamps — Not provided
  • Accuracy requirements — Enterprise use cases need Whisper, AssemblyAI, or Deepgram

I ended up building a hybrid: free tier uses Web Speech API for live dictation, Pro tier uses Whisper for file uploads and higher accuracy.
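The routing decision in that hybrid is simple enough to sketch (the tier flags and function are illustrative, not the actual production code):

```javascript
// Pick a transcription engine based on input type and plan.
// Web Speech handles live mic input only; anything else
// (file uploads, accuracy-critical jobs) goes to Whisper.
function chooseEngine({ liveMic, proTier }) {
  if (liveMic && !proTier) return 'web-speech';
  return 'whisper';
}
```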

Native Language SEO Bonus

One unexpected win: I built language‑specific pages with native script UI. The Hindi page is actually in Hindi (हिंदी में वॉइस टू टेक्स्ट), not just “Hindi Voice to Text” in English.

Result: Started ranking for native‑language searches with far less competition than English keywords.

Try It

I built this into voicetotextonline.com — free to use, no signup for basic transcription.

If you’re building something similar, happy to answer questions in the comments.
