[Paper] Adaptive Evidence Weighting for Audio-Spatiotemporal Fusion
Source: arXiv - 2602.03817v1
Overview
The paper presents FINCH (Fusion under INdependent Conditional Hypotheses), a lightweight framework that adaptively blends predictions from an audio‑only classifier with a separate spatiotemporal model (e.g., location, season) for bioacoustic species identification. By learning a per‑sample gating function that gauges the trustworthiness of contextual cues, FINCH can automatically fall back to the audio‑only model when context is noisy, delivering more reliable and interpretable predictions.
Key Contributions
- Adaptive log‑linear fusion: Introduces a mathematically grounded way to combine discriminative audio and contextual predictors without needing calibrated generative models.
- Per‑sample reliability estimator: A gating network that uses uncertainty (e.g., entropy) and informativeness scores to decide how much weight to give the spatiotemporal evidence for each input.
- Risk‑contained hypothesis class: Guarantees that the fused model never performs worse than the audio‑only baseline because the audio predictor is always a fallback option.
- State‑of‑the‑art results: Sets new performance records on the CBI benchmark and matches or exceeds prior work on multiple BirdSet subsets, all with a modest computational footprint.
- Open‑source implementation: Provides a reproducible codebase that can be plugged into existing audio classification pipelines.
Methodology
Base models – Train (or reuse) two independent classifiers (a minimal sketch follows this list):
- Audio model $f_a(x)$ that maps raw spectrograms to species probabilities.
- Context model $f_c(s)$ that maps spatiotemporal metadata (latitude, month, etc.) to the same probability space.
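A minimal PyTorch sketch of the two base predictors. The architectures, `NUM_SPECIES`, and the context feature dimension are illustrative placeholders, not the paper's actual models:

```python
import torch
import torch.nn as nn

NUM_SPECIES = 500  # hypothetical label-space size; not specified in this summary


class AudioModel(nn.Module):
    """Stand-in for f_a: maps a (batch, 1, mel, time) spectrogram to class logits."""

    def __init__(self, num_classes: int = NUM_SPECIES):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(16, num_classes),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)


class ContextModel(nn.Module):
    """Stand-in for f_c: maps spatiotemporal features (lat, lon, month encoding, ...) to class logits."""

    def __init__(self, ctx_dim: int = 4, num_classes: int = NUM_SPECIES):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(ctx_dim, 64), nn.ReLU(),
            nn.Linear(64, num_classes),
        )

    def forward(self, s: torch.Tensor) -> torch.Tensor:
        return self.net(s)
```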
Evidence representation – Convert each model’s output into log‑probabilities (the log‑softmax of its logits) to enable additive combination, mirroring the Bayesian multiplication of independent evidence; see the derivation below.
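The additive rule is exact under a conditional‑independence assumption. A one‑step derivation (treating the class prior $p(y)$ as uniform is an assumption of this sketch; the summary's fusion rule does not carry the prior term explicitly):

```latex
% Assuming audio x and context s are conditionally independent given the class y:
p(y \mid x, s) \;\propto\; p(x \mid y)\, p(s \mid y)\, p(y)
             \;\propto\; \frac{p(y \mid x)\, p(y \mid s)}{p(y)}
% Taking logs gives the additive form that FINCH generalizes with a learned weight:
\log p(y \mid x, s) \;=\; \log p(y \mid x) + \log p(y \mid s) - \log p(y) + \text{const.}
```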
Gating function – A small neural network $g(x, s)$ takes the audio input and its metadata, and computes:
- Uncertainty (e.g., predictive entropy of $f_a$).
- Informativeness (e.g., mutual information between context and class).
The gate outputs a scalar $\alpha \in [0, 1]$ that scales the contribution of the context logits.
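A sketch of one plausible gate. Using the predictive entropies of both heads as input features is an assumption: the summary names entropy as the uncertainty signal but does not specify the informativeness estimator, so the context head's entropy stands in for it here:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


def entropy(logits: torch.Tensor) -> torch.Tensor:
    """Predictive entropy of a categorical distribution, shape (batch, 1)."""
    p = F.softmax(logits, dim=-1)
    return -(p * torch.log(p.clamp_min(1e-12))).sum(dim=-1, keepdim=True)


class Gate(nn.Module):
    """g(x, s): maps reliability features to a scalar alpha in [0, 1]."""

    def __init__(self, hidden: int = 32):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(2, hidden), nn.ReLU(), nn.Linear(hidden, 1))

    def forward(self, audio_logits: torch.Tensor, ctx_logits: torch.Tensor) -> torch.Tensor:
        # High audio entropy -> uncertain audio -> lean more on context;
        # high context entropy -> uninformative context -> lean on audio.
        feats = torch.cat([entropy(audio_logits), entropy(ctx_logits)], dim=-1)
        return torch.sigmoid(self.net(feats))  # alpha in [0, 1], shape (batch, 1)
```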
Fusion rule – The final log‑probability vector is

$$\log p_{\text{FINCH}} = \log f_a(x) + \alpha \cdot \log f_c(s)$$

(renormalized over classes).
When $\alpha = 0$ the model reduces to the audio‑only predictor; when $\alpha = 1$ it fully trusts the context.
Training – The audio and context models are frozen (or fine‑tuned) while the gating network is trained end‑to‑end on the classification loss, encouraging it to learn when context helps and when it should be ignored.
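Putting the pieces together, a hedged end‑to‑end sketch reusing the `AudioModel`, `ContextModel`, and `Gate` classes above (the optimizer, learning rate, and loss wiring are assumptions, not the paper's training recipe):

```python
import torch
import torch.nn.functional as F

audio_model, ctx_model, gate = AudioModel(), ContextModel(), Gate()

# Freeze the base predictors; only the gate is trained.
for model in (audio_model, ctx_model):
    for p in model.parameters():
        p.requires_grad_(False)

optimizer = torch.optim.Adam(gate.parameters(), lr=1e-3)


def finch_log_probs(x: torch.Tensor, s: torch.Tensor) -> torch.Tensor:
    logits_a, logits_c = audio_model(x), ctx_model(s)
    alpha = gate(logits_a, logits_c)  # per-sample weight, shape (batch, 1)
    fused = F.log_softmax(logits_a, dim=-1) + alpha * F.log_softmax(logits_c, dim=-1)
    return F.log_softmax(fused, dim=-1)  # renormalize over classes


def train_step(x: torch.Tensor, s: torch.Tensor, y: torch.Tensor) -> float:
    optimizer.zero_grad()
    loss = F.nll_loss(finch_log_probs(x, s), y)  # classification loss on fused output
    loss.backward()
    optimizer.step()
    return loss.item()
```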
Results & Findings
All figures are classification accuracy.

| Dataset | Audio‑only baseline | Fixed‑weight fusion | FINCH (adaptive) |
|---|---|---|---|
| CBI (large‑scale) | 78.4 % | 80.1 % | 84.7 % (new SOTA) |
| BirdSet – Forest subset | 71.2 % | 73.0 % | 75.5 % |
| BirdSet – Urban subset | 68.9 % | 69.4 % | 70.2 % |
- Robustness: When contextual features are deliberately corrupted (e.g., random GPS), FINCH’s performance drops only to the audio‑only level, whereas fixed‑weight fusion degrades significantly.
- Interpretability: The learned $\alpha$ values correlate strongly with intuitive measures (high $\alpha$ for recordings during species‑specific migration windows, low $\alpha$ for ambiguous audio).
- Efficiency: The gating network adds < 0.5 M parameters and incurs < 2 ms latency per sample on a CPU, making it suitable for edge deployment.
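A hypothetical sanity check mirroring the corruption experiment, reusing `finch_log_probs` and `audio_model` from the training sketch: shuffle the spatiotemporal features across the batch (simulating random GPS) and confirm accuracy falls no further than the audio‑only level:

```python
import torch
import torch.nn.functional as F


@torch.no_grad()
def accuracy(log_probs: torch.Tensor, y: torch.Tensor) -> float:
    return (log_probs.argmax(dim=-1) == y).float().mean().item()


@torch.no_grad()
def robustness_probe(x: torch.Tensor, s: torch.Tensor, y: torch.Tensor) -> dict:
    audio_only = F.log_softmax(audio_model(x), dim=-1)
    s_corrupt = s[torch.randperm(s.size(0))]  # random context per recording
    return {
        "audio_only": accuracy(audio_only, y),
        "finch_clean": accuracy(finch_log_probs(x, s), y),
        # For a well-behaved gate this should stay near the audio_only figure.
        "finch_corrupt": accuracy(finch_log_probs(x, s_corrupt), y),
    }
```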
Practical Implications
- Deployable wildlife monitoring: Conservation teams can run a single model on low‑power devices (e.g., Raspberry Pi) that automatically leverages location/season data when reliable, but safely defaults to audio‑only predictions otherwise.
- Generalizable to other domains: Any classification task with heterogeneous evidence (e.g., medical imaging + patient history, video + sensor metadata) can adopt FINCH’s gating‑based fusion without redesigning the underlying classifiers.
- Risk‑aware AI services: SaaS platforms that expose model predictions to end‑users can use FINCH to guarantee a “safe fallback” performance level, reducing liability when auxiliary data are noisy or missing.
- Rapid prototyping: Because FINCH works with pre‑trained black‑box models, developers can experiment with new contextual signals (e.g., weather, time‑of‑day) without retraining the heavy audio backbone.
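As a hedged illustration of that workflow, a new context head with extra (hypothetical) weather and time‑of‑day features can be dropped in while the frozen audio backbone stays untouched; only the gate and the new head need optimization, exactly as in the earlier `train_step` sketch:

```python
import torch

# Hypothetical richer context: lat, lon, month sin/cos, temperature, hour-of-day.
ctx_model = ContextModel(ctx_dim=6)
gate = Gate()

# The audio backbone remains frozen; retrain only the lightweight parts.
optimizer = torch.optim.Adam(
    list(gate.parameters()) + list(ctx_model.parameters()), lr=1e-3
)
```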
Limitations & Future Work
- Assumption of conditional independence: FINCH treats audio and context as independent evidence sources; strong correlations (e.g., location influencing acoustic properties) could violate this and limit optimality.
- Reliance on good uncertainty estimates: The gating function’s decisions hinge on calibrated uncertainty from the audio model; poorly calibrated networks may mis‑weight context.
- Scalability to many modalities: Extending the log‑linear fusion to more than two evidence streams (e.g., multi‑sensor IoT) may require more sophisticated gating architectures.
- Future directions: The authors suggest exploring Bayesian calibration techniques for the audio predictor, hierarchical gating for multi‑modal fusion, and applying FINCH to real‑time streaming scenarios where context arrives asynchronously.
Authors
- Oscar Ovanger
- Levi Harris
- Timothy H. Keitt
Paper Information
- arXiv ID: 2602.03817v1
- Categories: cs.SD, cs.AI
- Published: February 3, 2026