ASR (Automatic Speech Recognition)

Published: (December 18, 2025 at 05:30 PM EST)
Source: Dev.to

Overview

Yesterday I shared the full Voice AI pipeline.
Today we’re diving deep into Stage 1: ASR (Automatic Speech Recognition) – turning spoken words into text.

Feature Extraction

Raw audio waveform → compact numerical features

  • MFCCs (Mel‑Frequency Cepstral Coefficients)
  • Spectrograms
  • Filter Banks
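
As a rough illustration of this stage, here is a minimal NumPy sketch that turns a waveform into a log-magnitude spectrogram. The function name `log_spectrogram` and the 25 ms frame / 10 ms hop values are assumptions chosen for the example, not settings from any particular toolkit. Filter banks and MFCCs build on this same representation by applying mel-spaced filters (and, for MFCCs, a discrete cosine transform) on top.

```python
import numpy as np

def log_spectrogram(signal, sample_rate=16000, frame_ms=25, hop_ms=10):
    """Return (log-magnitude spectrogram, frequency axis in Hz).

    Frame and hop sizes are illustrative defaults, not a standard.
    """
    frame_len = int(sample_rate * frame_ms / 1000)  # samples per frame
    hop_len = int(sample_rate * hop_ms / 1000)      # samples between frame starts
    if len(signal) < frame_len:
        raise ValueError("signal is shorter than one frame")

    # Slice the waveform into overlapping frames and taper each with a Hann window.
    n_frames = 1 + (len(signal) - frame_len) // hop_len
    window = np.hanning(frame_len)
    frames = np.stack([
        signal[i * hop_len : i * hop_len + frame_len] * window
        for i in range(n_frames)
    ])

    # One FFT per frame; keep magnitudes and compress them with a log.
    magnitude = np.abs(np.fft.rfft(frames, axis=1))
    freqs = np.fft.rfftfreq(frame_len, d=1.0 / sample_rate)
    return np.log(magnitude + 1e-10), freqs

# Usage: one second of a 1 kHz tone sampled at 16 kHz.
t = np.arange(16000) / 16000
spec, freqs = log_spectrogram(np.sin(2 * np.pi * 1000 * t))
print(spec.shape)  # (frames, frequency bins)
```

Each row of `spec` is one short slice of time and each column is one frequency band, which is the shape the acoustic model consumes.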

Acoustic Modeling

Maps audio features to phonemes

  • Traditional: HMM‑GMM, DNN‑HMM
  • Modern: Transformers, Conformers
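
To make the input and output of this stage concrete, the sketch below pushes a few feature frames through a single untrained linear layer followed by a softmax. This is only a stand-in: real acoustic models (HMM-GMM, DNN-HMM, Transformers, Conformers) are trained on large amounts of audio, and the phoneme inventory, the feature size of 13, and the random weights here are all made-up assumptions.

```python
import numpy as np

# Hypothetical toy inventory; real systems use dozens of phonemes or subword units.
PHONEMES = ["<blank>", "k", "ae", "t"]

def frame_posteriors(features, weights, bias):
    """Map (frames, feature_dim) features to (frames, n_phonemes) probabilities."""
    logits = features @ weights + bias
    logits -= logits.max(axis=1, keepdims=True)  # subtract row max for numerical stability
    exp = np.exp(logits)
    return exp / exp.sum(axis=1, keepdims=True)  # softmax over phonemes

rng = np.random.default_rng(0)
features = rng.normal(size=(5, 13))             # 5 frames of 13 features each (assumed sizes)
weights = rng.normal(size=(13, len(PHONEMES)))  # untrained, random weights
bias = np.zeros(len(PHONEMES))

posteriors = frame_posteriors(features, weights, bias)
print(posteriors.shape)  # one probability distribution per frame
```

The point is the interface rather than the model: features go in frame by frame, and a probability distribution over sound units comes out for each frame.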

Decoding & Language Modeling

Phoneme probabilities → the most likely word sequence

  • Beam Search
  • CTC (Connectionist Temporal Classification)
  • Attention mechanisms
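
One simple way to see how per-frame probabilities become text is greedy CTC decoding: take the most likely label in each frame, merge consecutive repeats, then drop the blank symbol. The sketch below assumes a toy label set and hand-built probabilities. Beam search follows the same idea but keeps several candidate paths alive instead of one, which also leaves room to mix in language model scores.

```python
import numpy as np

def ctc_greedy_decode(posteriors, labels, blank=0):
    """Greedy CTC decoding: best label per frame, collapse repeats, remove blanks."""
    best = posteriors.argmax(axis=1)
    decoded = []
    prev = None
    for idx in best:
        # Emit a label only when it differs from the previous frame and is not blank.
        if idx != prev and idx != blank:
            decoded.append(labels[idx])
        prev = idx
    return decoded

# Toy example: characters stand in for phonemes, "_" is the CTC blank.
labels = ["_", "c", "a", "t"]
path = [1, 1, 0, 2, 2, 0, 3]  # frame-level path: c c _ a a _ t
posteriors = np.full((len(path), len(labels)), 0.1)
posteriors[np.arange(len(path)), path] = 0.7

print("".join(ctc_greedy_decode(posteriors, labels)))  # cat
```

The blank symbol is what lets CTC represent a genuinely doubled label: the path `a _ a` decodes to two separate `a` outputs, while `a a` collapses to one.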

Post‑Processing

Clean up the output

  • Spell checking
  • Punctuation
  • Capitalization
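
A minimal rule-based sketch of the clean-up step is shown below. The function name `tidy_transcript` and its rules are illustrative assumptions: it only handles whitespace, capitalization, and a closing period, and it leaves spell checking out entirely. Production systems often rely on trained punctuation and casing models rather than hand-written rules like these.

```python
import re

def tidy_transcript(text):
    """Apply a few simple clean-up rules to a raw lowercase transcript."""
    text = re.sub(r"\s+", " ", text).strip()  # collapse runs of whitespace
    if not text:
        return text
    text = re.sub(r"\bi\b", "I", text)        # capitalize the standalone pronoun "i"
    text = text[0].upper() + text[1:]         # capitalize the first character
    if text[-1] not in ".?!":
        text += "."                           # close with a period if nothing ends the sentence
    return text

print(tidy_transcript("  hello   world i am here "))  # Hello world I am here.
```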

Evolution of ASR

Traditional (1980s‑2010s)

  • HMM + GMM
  • Required phonetic alignment
  • Separate components stitched together

State‑of‑the‑art (Now)

  • Whisper: trained on 680,000 hours of audio, covers roughly 100 languages
  • wav2vec 2.0: self‑supervised pre‑training, fine‑tunes well with limited labeled data

ASR is the foundation of any Voice AI system: a transcription error at this stage carries through every later stage, so getting ASR wrong can cause the entire voice pipeline to fail.

What ASR model are you using? Any surprises with accuracy or latency?
