ASR (Automatic Speech Recognition)
Source: Dev.to
Overview

Yesterday I shared the full Voice AI pipeline.
Today we’re diving deep into Stage 1: ASR (Automatic Speech Recognition) – turning spoken words into text.

Feature Extraction
Raw waveform → compact acoustic features (see the sketch after this list)
- MFCCs (Mel‑Frequency Cepstral Coefficients)
- Spectrograms
- Filter Banks
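
To make this concrete, here's a minimal feature-extraction sketch using librosa; the file name `utterance.wav` is a placeholder:

```python
import librosa

# Load audio at 16 kHz, the typical sample rate for ASR
y, sr = librosa.load("utterance.wav", sr=16000)  # hypothetical file

# 13 MFCCs per frame: the classic feature for HMM-GMM systems
mfccs = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)

# 80-band log-mel spectrogram: the usual input for modern neural models
mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=80)
log_mel = librosa.power_to_db(mel)

print(mfccs.shape, log_mel.shape)  # (13, n_frames), (80, n_frames)
```
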
Acoustic Modeling
Maps audio features to phonemes (or directly to characters/subwords in end-to-end models); a minimal sketch follows the list
- Traditional: HMM‑GMM, DNN‑HMM
- Modern: Transformers, Conformers
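
As a rough sketch of the modern approach, here's a toy transformer encoder over log-mel frames in PyTorch. All dimensions are illustrative, not taken from any specific model:

```python
import torch
import torch.nn as nn

class TinyAcousticModel(nn.Module):
    """Toy encoder: log-mel frames -> per-frame token logits."""
    def __init__(self, n_mels=80, d_model=256, n_tokens=32):
        super().__init__()
        self.proj = nn.Linear(n_mels, d_model)
        layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=4)
        self.head = nn.Linear(d_model, n_tokens)  # e.g. characters + CTC blank

    def forward(self, x):            # x: (batch, frames, n_mels)
        h = self.encoder(self.proj(x))
        return self.head(h)          # (batch, frames, n_tokens)

model = TinyAcousticModel()
logits = model(torch.randn(1, 200, 80))  # 200 frames of fake features
print(logits.shape)  # torch.Size([1, 200, 32])
```

A Conformer adds convolution modules inside each encoder layer to capture local patterns alongside the transformer's global attention.
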
Decoding & Language Modeling
Phonemes → words using probabilities; a toy CTC decoder follows the list
- Beam Search
- CTC (Connectionist Temporal Classification)
- Attention mechanisms
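
To show the core CTC idea (collapse repeated tokens, then drop blanks), here's a toy greedy decoder. Real systems would use beam search, often fused with a language model:

```python
import numpy as np

def ctc_greedy_decode(logits, vocab, blank=0):
    """Greedy CTC: best token per frame, collapse repeats, drop blanks."""
    best = logits.argmax(axis=-1)             # (frames,)
    out, prev = [], blank
    for idx in best:
        if idx != prev and idx != blank:      # new non-blank token
            out.append(vocab[idx])
        prev = idx
    return "".join(out)

vocab = ["_", "c", "a", "t"]                  # index 0 is the CTC blank
frames = np.array([[0.1, 0.8, 0.05, 0.05],   # c
                   [0.1, 0.8, 0.05, 0.05],   # c (repeat, collapsed)
                   [0.7, 0.1, 0.1, 0.1],     # blank
                   [0.1, 0.05, 0.8, 0.05],   # a
                   [0.1, 0.05, 0.05, 0.8]])  # t
print(ctc_greedy_decode(frames, vocab))       # cat
```
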
Post‑Processing
Clean up the raw transcript for human readers (small example after the list)
- Spell checking
- Punctuation
- Capitalization
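
A deliberately simple, rule-based sketch of this step; production systems typically use dedicated punctuation and truecasing models instead:

```python
import re

def basic_postprocess(text):
    """Naive cleanup: capitalize sentence starts, fix lone 'i', add final period."""
    text = re.sub(r"\bi\b", "I", text.strip())
    sentences = re.split(r"(?<=[.?!])\s+", text)
    sentences = [s[:1].upper() + s[1:] for s in sentences if s]
    text = " ".join(sentences)
    if text and text[-1] not in ".?!":
        text += "."
    return text

print(basic_postprocess("hello there. i think asr is fascinating"))
# Hello there. I think asr is fascinating.
```
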
Evolution of ASR
Traditional (1980s‑2010s)
- HMM + GMM
- Required phonetic alignment
- Separate components stitched together
State‑of‑the‑art (Now)
- Whisper: trained on 680,000 hours of audio, supports 99 languages (see the snippet after this list)
- Wav2Vec 2.0: self-supervised pretraining, strong even with limited labeled data
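
If you want to try Whisper yourself, the open-source package gets you a transcript in a few lines; the model size and file name below are placeholders:

```python
import whisper  # pip install openai-whisper

model = whisper.load_model("base")      # tiny / base / small / medium / large
result = model.transcribe("audio.wav")  # hypothetical file path
print(result["text"])
```
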
Getting ASR wrong can cause the entire voice pipeline to fail; it’s the foundation of any Voice AI system.
What ASR model are you using? Any surprises with accuracy or latency?