[Paper] Investigating the Viability of Employing Multi-modal Large Language Models in the Context of Audio Deepfake Detection

Published: January 2, 2026 at 01:17 PM EST
4 min read

Source: arXiv - 2601.00777v1

Overview

The paper investigates whether multimodal large language models (MLLMs)—which excel at image‑ and video‑deepfake detection—can be repurposed for audio deepfake detection. By feeding audio clips together with carefully crafted text prompts, the authors test if these models can learn robust cross‑modal representations that flag synthetic speech. Their findings suggest that, with minimal task‑specific tuning, MLLMs can achieve competitive performance on in‑domain audio deepfake data, opening a new avenue for security‑focused AI tools.

Key Contributions

  • First systematic study of Vision‑Language/Multimodal LLMs applied to audio deepfake detection.
  • Introduced a multi‑prompt strategy that combines audio inputs with text‑based queries (question‑answer style and binary decisions) to guide the model’s reasoning.
  • Evaluated two state‑of‑the‑art MLLMs—Qwen2‑Audio‑7B‑Instruct and SALMONN—in both zero‑shot and fine‑tuned settings.
  • Demonstrated that minimal supervision (few‑shot fine‑tuning) yields strong in‑domain detection while highlighting the models’ struggle with out‑of‑domain generalisation.
  • Provided an empirical baseline for future research on multimodal approaches to audio deepfake detection.

Methodology

  1. Data Preparation

    • Collected a benchmark of genuine and synthetic speech samples (e.g., from ASVspoof, WaveFake).
    • Split the data into in‑domain (same distribution as training) and out‑of‑domain (different speakers, recording conditions) sets.
  2. Prompt Design

    • Crafted text prompts that act as queries to the model, e.g.,
      • “Is this audio clip real or generated?” (binary)
      • “Explain why this speech might be a deepfake.” (reasoning)
    • Multiple prompts per audio sample were concatenated to provide richer context.
  3. Model Configurations

    • Zero‑shot: Feed audio + prompt directly to the pretrained MLLM without any weight updates.
    • Fine‑tuned: Lightly fine‑tune the entire model (or just the projection heads) on a small labeled subset (few‑shot). Illustrative sketches of both settings follow this list.
  4. Evaluation Metrics

    • Primary: Equal Error Rate (EER) and Area Under the ROC Curve (AUC) for binary detection (an EER snippet also appears after this list).
    • Secondary: Qualitative analysis of model explanations generated by the reasoning prompts.
  5. Implementation Details

    • Audio encoded using the model’s built‑in front‑end (e.g., wav2vec‑style encoder).
    • Text prompts tokenized with the same tokenizer as the LLM, ensuring seamless multimodal fusion.
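
To make the prompt design and zero‑shot configuration (steps 2–3) concrete, below is a minimal inference sketch. It assumes the public Hugging Face checkpoint `Qwen/Qwen2-Audio-7B-Instruct` and its documented processor/chat‑template interface; the prompts, file path, and generation settings are illustrative choices rather than the authors' exact setup, and keyword names may vary with the transformers version.

```python
# Zero-shot audio deepfake query with an audio-capable MLLM.
# Assumes the public checkpoint "Qwen/Qwen2-Audio-7B-Instruct" (Hugging Face);
# prompts and file names below are illustrative, not taken from the paper.
import librosa
import torch
from transformers import AutoProcessor, Qwen2AudioForConditionalGeneration

MODEL_ID = "Qwen/Qwen2-Audio-7B-Instruct"
processor = AutoProcessor.from_pretrained(MODEL_ID)
model = Qwen2AudioForConditionalGeneration.from_pretrained(
    MODEL_ID, torch_dtype=torch.float16, device_map="auto"
)

# Multi-prompt strategy: a binary query plus a reasoning query per clip.
PROMPTS = [
    "Is this audio clip real or generated? Answer with 'real' or 'fake'.",
    "Explain why this speech might be a deepfake.",
]

def query(audio_path: str, prompt: str) -> str:
    """Run one audio+text query and return the model's text answer."""
    waveform, _ = librosa.load(
        audio_path, sr=processor.feature_extractor.sampling_rate
    )
    conversation = [
        {"role": "user", "content": [
            {"type": "audio", "audio_url": audio_path},
            {"type": "text", "text": prompt},
        ]},
    ]
    text = processor.apply_chat_template(
        conversation, add_generation_prompt=True, tokenize=False
    )
    inputs = processor(
        text=text, audios=[waveform], return_tensors="pt", padding=True
    ).to(model.device)
    out = model.generate(**inputs, max_new_tokens=128)
    out = out[:, inputs["input_ids"].shape[1]:]  # keep only the generated tokens
    return processor.batch_decode(out, skip_special_tokens=True)[0]

answers = {p: query("sample.wav", p) for p in PROMPTS}
```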
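
For the fine‑tuned setting, the paper updates either the full model or its projection layers on a small labelled subset. As a lightweight stand‑in (not the authors' recipe), the sketch below attaches LoRA adapters via the `peft` library; `few_shot_loader` is a hypothetical DataLoader yielding model‑ready batches with labels.

```python
# Parameter-efficient few-shot adaptation (illustrative substitute for the
# paper's full-model / projection-head fine-tuning). `model` is the
# Qwen2-Audio instance loaded in the zero-shot sketch above.
import torch
from peft import LoraConfig, get_peft_model

lora_cfg = LoraConfig(
    r=8,
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # attention projections in the LLM
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_cfg)
model.print_trainable_parameters()

optimizer = torch.optim.AdamW(
    (p for p in model.parameters() if p.requires_grad), lr=1e-4
)
for batch in few_shot_loader:      # hypothetical labelled batches (inputs + labels)
    loss = model(**batch).loss     # causal-LM loss on the "real"/"fake" answer
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```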
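
For step 4, the EER is the operating point on the ROC curve where the false‑acceptance and false‑rejection rates are equal. The snippet below computes EER and AUC with scikit‑learn on placeholder scores (higher score = more likely fake); it is not the paper's evaluation code.

```python
# EER and AUC for binary deepfake-detection scores (placeholder data).
import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

def compute_eer(labels: np.ndarray, scores: np.ndarray) -> float:
    """EER: threshold where the false-positive and false-negative rates meet."""
    fpr, tpr, _ = roc_curve(labels, scores)
    fnr = 1 - tpr
    idx = np.nanargmin(np.abs(fnr - fpr))
    return float((fpr[idx] + fnr[idx]) / 2)

labels = np.array([1, 0, 1, 0, 1, 0])               # 1 = fake, 0 = real (placeholder)
scores = np.array([0.9, 0.2, 0.7, 0.4, 0.6, 0.1])   # model scores (placeholder)
print("EER:", compute_eer(labels, scores), "AUC:", roc_auc_score(labels, scores))
```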

Results & Findings

| Model | Setting | In‑Domain EER ↓ | Out‑of‑Domain EER ↓ |
| --- | --- | --- | --- |
| Qwen2‑Audio‑7B‑Instruct | Zero‑shot | ~28% | >45% |
| Qwen2‑Audio‑7B‑Instruct | Fine‑tuned (few‑shot) | 12% | ~30% |
| SALMONN | Zero‑shot | ~31% | >48% |
| SALMONN | Fine‑tuned (few‑shot) | 14% | ~33% |

  • Fine‑tuning with a handful of labeled examples dramatically reduces EER on the same domain, confirming that the models can quickly adapt when given task‑specific signals.
  • Zero‑shot performance is weak, indicating that raw multimodal knowledge alone isn’t sufficient for audio deepfake detection.
  • Out‑of‑domain degradation remains significant, highlighting a need for better generalisation techniques (e.g., domain‑adaptive prompting or data augmentation).
  • The reasoning prompts produce interpretable explanations, though the accuracy of those explanations tracks the model’s detection performance.

Practical Implications

  • Rapid Prototyping: Developers can leverage existing MLLMs (e.g., Qwen2‑Audio) as a starting point for audio deepfake detectors, requiring only a small, curated fine‑tuning dataset.
  • Unified Security Stack: Organizations already using vision‑based deepfake detectors can extend the same multimodal infrastructure to audio, simplifying deployment pipelines.
  • Explainability: The question‑answer prompts yield human‑readable rationales, useful for compliance audits or user‑facing trust signals.
  • Edge‑Ready Variants: Since the models are in the 7B‑parameter range, they can be distilled or quantised for on‑device inference in voice assistants, call‑center monitoring, or streaming platforms (a quantised‑loading sketch follows this list).
  • Prompt Engineering as a Feature: The multi‑prompt approach demonstrates that thoughtful prompt design can act as a lightweight “feature extractor,” reducing the need for heavy‑weight acoustic feature engineering.
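
As one hedged example of an edge‑oriented variant (not evaluated in the paper), a 7B audio MLLM can be loaded in 4‑bit precision through transformers’ bitsandbytes integration to cut memory before further distillation or optimisation:

```python
# 4-bit loading of a 7B audio MLLM for memory-constrained deployment.
# Illustrative only: the paper does not report quantised or distilled variants.
import torch
from transformers import (
    AutoProcessor,
    BitsAndBytesConfig,
    Qwen2AudioForConditionalGeneration,
)

bnb_cfg = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_quant_type="nf4",
)
model = Qwen2AudioForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2-Audio-7B-Instruct",
    quantization_config=bnb_cfg,
    device_map="auto",
)
processor = AutoProcessor.from_pretrained("Qwen/Qwen2-Audio-7B-Instruct")
# Inference then proceeds exactly as in the zero-shot sketch above.
```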

Limitations & Future Work

  • Generalisation Gap: Models still falter on out‑of‑domain audio, suggesting that larger, more diverse training corpora or domain‑adaptive prompting are needed.
  • Data Efficiency: While few‑shot fine‑tuning helps, the exact amount of labeled data required for stable performance isn’t fully explored.
  • Model Size vs. Latency: 7B‑parameter models may be too heavy for real‑time, high‑throughput services without further optimisation.
  • Prompt Sensitivity: Performance varies with prompt phrasing; systematic prompt‑search methods could be investigated.
  • Broader Modalities: Extending the approach to audio‑visual deepfakes (e.g., lip‑sync attacks) could unlock more comprehensive anti‑spoofing solutions.

Overall, the study shows that multimodal LLMs hold promise for audio deepfake detection, especially when paired with smart prompting and modest fine‑tuning, but further work is required to make them robust to the varied, out‑of‑domain audio encountered in the wild.

Authors

  • Akanksha Chuchra
  • Shukesh Reddy
  • Sudeepta Mishra
  • Abhijit Das
  • Abhinav Dhall

Paper Information

  • arXiv ID: 2601.00777v1
  • Categories: cs.SD, cs.CV
  • Published: January 2, 2026