[Paper] Speech Emotion Recognition Leveraging OpenAI's Whisper Representations and Attentive Pooling Methods

Published: February 5, 2026 at 01:46 PM EST
4 min read
Source: arXiv - 2602.06000v1

Overview

This paper investigates whether Whisper — OpenAI’s open‑source speech‑to‑text model — can serve as a powerful feature extractor for Speech Emotion Recognition (SER). By pairing Whisper’s deep acoustic embeddings with two novel attention‑based pooling layers, the authors achieve state‑of‑the‑art performance on both English and Persian emotion datasets while keeping the model footprint small enough for real‑time applications.

Key Contributions

  • Repurposing Whisper for SER: Demonstrates that Whisper’s encoder outputs contain rich emotional cues, even though the model was trained solely for automatic speech recognition.
  • Two attention‑based pooling schemes:
    1. Multi‑head Attentive Average Pooling (MH‑AAP) – aggregates frame‑level embeddings using multiple attention heads before averaging.
    2. QKV Pooling – computes query, key, value projections on Whisper embeddings and performs a single‑step self‑attention to produce a compact utterance‑level vector.
  • Layer‑wise analysis: Shows that intermediate Whisper encoder layers (rather than the final layer) often yield the most discriminative emotion features, especially for Persian.
  • Lightweight SER pipeline: Achieves a 2.47 % absolute gain in unweighted accuracy on the Persian ShEMO benchmark using Whisper‑Tiny, beating much larger models such as HuBERT X‑Large.
  • Cross‑lingual validation: Experiments on both IEMOCAP (English) and ShEMO (Persian) confirm the approach’s generality across languages.
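The multi‑head attentive pooling idea can be sketched in PyTorch roughly as follows. This is a minimal illustration of the MH‑AAP concept described above, not the authors’ implementation; the module name, head count, and scoring layer are assumptions.

```python
import torch
import torch.nn as nn

class MultiHeadAttentiveAvgPool(nn.Module):
    """Sketch of MH-AAP: split embeddings into heads, attend per head, concat."""
    def __init__(self, dim: int, num_heads: int = 4):
        super().__init__()
        assert dim % num_heads == 0
        self.num_heads = num_heads
        self.head_dim = dim // num_heads
        # one scalar attention score per frame, per head
        self.score = nn.Linear(self.head_dim, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, frames, dim) -> (batch, frames, heads, head_dim)
        b, t, _ = x.shape
        x = x.view(b, t, self.num_heads, self.head_dim)
        w = torch.softmax(self.score(x), dim=1)  # soft attention over frames
        pooled = (w * x).sum(dim=1)              # (batch, heads, head_dim)
        return pooled.flatten(1)                 # concat heads -> (batch, dim)

# e.g. 100 frames of 384-dim Whisper-Tiny embeddings -> one 384-dim vector
feats = torch.randn(2, 100, 384)
vec = MultiHeadAttentiveAvgPool(384)(feats)
print(vec.shape)  # torch.Size([2, 384])
```

Note how the frame axis disappears entirely: each head learns its own weighting over time, so different heads can specialize in different acoustic cues.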

Methodology

  1. Feature Extraction: Audio recordings are passed through Whisper (Tiny or Small). The model’s transformer encoder produces a sequence of 384‑dimensional (Tiny) or 768‑dimensional (Small) frame‑level embeddings.
  2. Attention‑Based Pooling:
    • MH‑AAP splits the embedding space into several heads, computes a soft attention weight for each frame per head, averages the weighted frames, and finally concatenates the heads.
    • QKV Pooling projects the sequence into query (Q), key (K), and value (V) matrices, computes a self‑attention score softmax(QKᵀ/√d), and multiplies it by V to obtain a single pooled vector.
      Both methods dramatically shrink the temporal dimension (hundreds of frames collapse into a single vector) while preserving the most emotion‑relevant information.
  3. Classification Head: The pooled vector feeds a simple feed‑forward network (two linear layers + ReLU + dropout) that outputs probabilities over emotion classes.
  4. Training & Evaluation: Standard cross‑entropy loss, Adam optimizer, and early stopping on a validation split. Experiments compare: (a) different Whisper encoder layers, (b) Tiny vs. Small model size, and (c) the two pooling strategies.
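Steps 2–3 of the pipeline can be sketched end to end. The snippet below is a hedged approximation: the paper describes a single‑step self‑attention over Q, K, V projections, which is realized here with a learned utterance‑level query (the query parameterization, hidden sizes, and dropout rate are assumptions, not the authors’ exact design).

```python
import math
import torch
import torch.nn as nn

class QKVPool(nn.Module):
    """Sketch of QKV pooling: one learned query attends over all frames."""
    def __init__(self, dim: int):
        super().__init__()
        self.query = nn.Parameter(torch.randn(1, dim))  # utterance-level query
        self.key = nn.Linear(dim, dim)
        self.value = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, frames, dim)
        k, v = self.key(x), self.value(x)
        # softmax(QK^T / sqrt(d)) over the frame axis -> (batch, 1, frames)
        scores = self.query @ k.transpose(1, 2) / math.sqrt(k.size(-1))
        return (torch.softmax(scores, dim=-1) @ v).squeeze(1)  # (batch, dim)

class EmotionHead(nn.Module):
    """Two linear layers + ReLU + dropout, as described in the paper."""
    def __init__(self, dim: int, num_classes: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim, dim // 2), nn.ReLU(), nn.Dropout(0.3),
            nn.Linear(dim // 2, num_classes),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)

feats = torch.randn(2, 100, 384)  # stand-in for Whisper-Tiny frame embeddings
logits = EmotionHead(384, 4)(QKVPool(384)(feats))
print(logits.shape)  # torch.Size([2, 4])
```

Trained with cross‑entropy and Adam as in step 4, only these small modules carry gradients, which is what keeps the pipeline lightweight.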

Results & Findings

| Dataset | Whisper Model | Pooling | Unweighted Accuracy (UWA) | Relative Gain vs. Baseline |
|---|---|---|---|---|
| IEMOCAP (English) | Small | QKV | 71.3 % | +1.8 % |
| ShEMO (Persian) | Tiny | QKV (multi‑head) | 78.9 % | +2.47 % (state‑of‑the‑art) |
| ShEMO (Persian) | Tiny | MH‑AAP | 77.4 % | +1.9 % |

  • Intermediate layers win: For Persian, intermediate encoder layers consistently outperformed the final layer, suggesting that early acoustic patterns (prosody, pitch) are more emotion‑rich than the later ASR‑optimized representations.
  • Pooling matters: QKV pooling gave the best trade‑off between dimensionality reduction and performance, outperforming simple mean‑pooling by ~1.5 % absolute.
  • Model size vs. performance: Whisper‑Tiny + QKV already surpasses HuBERT X‑Large (≈ 1 B parameters) on ShEMO, highlighting the efficiency of the proposed pipeline.

Practical Implications

  • Edge‑ready SER: Developers can embed Whisper‑Tiny (≈ 39 M parameters) plus a lightweight attention pooler on smartphones, wearables, or in‑car systems to detect user emotions in real time without needing massive GPU resources.
  • Cross‑language deployment: Since Whisper is trained on 99+ languages, the same feature extractor can be reused for new languages with only a small fine‑tuning of the pooling and classifier layers, accelerating multilingual SER product roll‑outs.
  • Modular architecture: The attention pooling modules are framework‑agnostic (PyTorch, TensorFlow, ONNX) and can be swapped into existing ASR pipelines, turning any Whisper‑based transcription service into an emotion‑aware interface.
  • Reduced data collection burden: By leveraging a pre‑trained ASR model, teams can achieve high SER accuracy with relatively modest emotion‑labeled datasets, cutting down on costly annotation efforts.
  • Potential use‑cases: Customer‑service bots that adapt tone based on caller mood, mental‑health monitoring apps, interactive gaming NPCs, and driver‑state monitoring for safety systems.

Limitations & Future Work

  • Dataset scope: Experiments are limited to IEMOCAP (English) and ShEMO (Persian). Broader validation on more diverse corpora (e.g., spontaneous speech, noisy environments) is needed.
  • Emotion granularity: The study focuses on categorical emotions (e.g., happy, sad). Extending to dimensional models (valence‑arousal) or mixed emotions could improve real‑world relevance.
  • Temporal dynamics: The pooling collapses the entire utterance into a single vector, potentially discarding fine‑grained temporal cues useful for detecting emotion shifts within a conversation. Future work could explore hierarchical or segment‑level attention.
  • Fine‑tuning Whisper: The authors kept Whisper frozen. Joint fine‑tuning of Whisper’s encoder with the SER objective might unlock further gains, albeit at higher computational cost.
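Because Whisper stays frozen, the training loop only updates the pooling and classifier parameters. A minimal sketch of that setup, using a stand‑in linear layer in place of the real Whisper encoder:

```python
import torch
import torch.nn as nn

# stand-in for the frozen Whisper encoder (the real one emits frame embeddings)
encoder = nn.Linear(80, 384)
for p in encoder.parameters():
    p.requires_grad = False  # freeze: no gradients flow into the encoder

# lightweight trainable head (pooling omitted here for brevity)
head = nn.Sequential(nn.Linear(384, 128), nn.ReLU(), nn.Linear(128, 4))
opt = torch.optim.Adam(head.parameters(), lr=1e-4)  # optimize the head only
loss_fn = nn.CrossEntropyLoss()

mel, labels = torch.randn(8, 80), torch.randint(0, 4, (8,))
with torch.no_grad():          # encoder weights stay fixed
    feats = encoder(mel)
loss = loss_fn(head(feats), labels)
loss.backward()                # gradients reach only the head
opt.step()
```

Unfreezing the encoder and adding its parameters to the optimizer would implement the joint fine‑tuning the authors leave to future work, at a much higher compute cost.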

Overall, the paper provides a compelling recipe for turning a state‑of‑the‑art ASR model into a lightweight, high‑performing emotion recognizer—an attractive proposition for any developer building next‑generation voice‑centric products.

Authors

  • Ali Shendabadi
  • Parnia Izadirad
  • Mostafa Salehi
  • Mahmoud Bijankhan

Paper Information

  • arXiv ID: 2602.06000v1
  • Categories: cs.AI, cs.CL
  • Published: February 5, 2026
  • PDF: Download PDF