ASR (Automatic Speech Recognition)
Source: Dev.to
Overview

Yesterday I shared the full Voice AI pipeline.
Today we’re diving deep into Stage 1: ASR (Automatic Speech Recognition) – turning spoken words into text.

Feature Extraction
Raw waveform → compact acoustic features (see the sketch after this list)
- MFCCs (Mel‑Frequency Cepstral Coefficients)
- Spectrograms
- Filter Banks
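
To make this concrete, here's a minimal feature-extraction sketch using librosa; the file name `utterance.wav` is a placeholder:

```python
import librosa

# Load audio at 16 kHz, the typical sample rate for ASR
y, sr = librosa.load("utterance.wav", sr=16000)  # hypothetical file

# 13 MFCCs per frame: the classic feature for HMM-GMM systems
mfccs = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)

# 80-band log-mel spectrogram: the usual input for modern neural models
mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=80)
log_mel = librosa.power_to_db(mel)

print(mfccs.shape, log_mel.shape)  # (13, n_frames), (80, n_frames)
```
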
Acoustic Modeling
Maps audio features to phonemes (or directly to characters/subwords in end-to-end models); a minimal sketch follows the list
- Traditional: HMM‑GMM, DNN‑HMM
- Modern: Transformers, Conformers
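
As a rough sketch of the modern approach, here's a toy transformer encoder over log-mel frames in PyTorch. All dimensions are illustrative, not taken from any specific model:

```python
import torch
import torch.nn as nn

class TinyAcousticModel(nn.Module):
    """Toy encoder: log-mel frames -> per-frame token logits."""
    def __init__(self, n_mels=80, d_model=256, n_tokens=32):
        super().__init__()
        self.proj = nn.Linear(n_mels, d_model)
        layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=4)
        self.head = nn.Linear(d_model, n_tokens)  # e.g. characters + CTC blank

    def forward(self, x):            # x: (batch, frames, n_mels)
        h = self.encoder(self.proj(x))
        return self.head(h)          # (batch, frames, n_tokens)

model = TinyAcousticModel()
logits = model(torch.randn(1, 200, 80))  # 200 frames of fake features
print(logits.shape)  # torch.Size([1, 200, 32])
```

A Conformer adds convolution modules inside each encoder layer to capture local patterns alongside the transformer's global attention.
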
Decoding & Language Modeling
Phonemes → words using probabilities; a toy CTC decoder follows the list
- Beam Search
- CTC (Connectionist Temporal Classification)
- Attention mechanisms
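
To show the core CTC idea (collapse repeated tokens, then drop blanks), here's a toy greedy decoder. Real systems would use beam search, often fused with a language model:

```python
import numpy as np

def ctc_greedy_decode(logits, vocab, blank=0):
    """Greedy CTC: best token per frame, collapse repeats, drop blanks."""
    best = logits.argmax(axis=-1)             # (frames,)
    out, prev = [], blank
    for idx in best:
        if idx != prev and idx != blank:      # new non-blank token
            out.append(vocab[idx])
        prev = idx
    return "".join(out)

vocab = ["_", "c", "a", "t"]                  # index 0 is the CTC blank
frames = np.array([[0.1, 0.8, 0.05, 0.05],   # c
                   [0.1, 0.8, 0.05, 0.05],   # c (repeat, collapsed)
                   [0.7, 0.1, 0.1, 0.1],     # blank
                   [0.1, 0.05, 0.8, 0.05],   # a
                   [0.1, 0.05, 0.05, 0.8]])  # t
print(ctc_greedy_decode(frames, vocab))       # cat
```
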
Post‑Processing
Clean up the raw transcript for human readers (small example after the list)
- Spell checking
- Punctuation
- Capitalization
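
A deliberately simple, rule-based sketch of this step; production systems typically use dedicated punctuation and truecasing models instead:

```python
import re

def basic_postprocess(text):
    """Naive cleanup: capitalize sentence starts, fix lone 'i', add final period."""
    text = re.sub(r"\bi\b", "I", text.strip())
    sentences = re.split(r"(?<=[.?!])\s+", text)
    sentences = [s[:1].upper() + s[1:] for s in sentences if s]
    text = " ".join(sentences)
    if text and text[-1] not in ".?!":
        text += "."
    return text

print(basic_postprocess("hello there. i think asr is fascinating"))
# Hello there. I think asr is fascinating.
```
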
Evolution of ASR
Traditional (1980s‑2010s)
- HMM + GMM
- Required phonetic alignment
- Separate components stitched together
State‑of‑the‑art (Now)
- Whisper: trained on 680,000 hours of audio, supports 99 languages (see the snippet after this list)
- Wav2Vec 2.0: self-supervised pretraining, strong even with limited labeled data
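
If you want to try Whisper yourself, the open-source package gets you a transcript in a few lines; the model size and file name below are placeholders:

```python
import whisper  # pip install openai-whisper

model = whisper.load_model("base")      # tiny / base / small / medium / large
result = model.transcribe("audio.wav")  # hypothetical file path
print(result["text"])
```
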
Getting ASR wrong can cause the entire voice pipeline to fail; it’s the foundation of any Voice AI system.
What ASR model are you using? Any surprises with accuracy or latency?