[Paper] Adaptive Evidence Weighting for Audio-Spatiotemporal Fusion
Source: arXiv - 2602.03817v1
Overview
The paper presents FINCH (Fusion under INdependent Conditional Hypotheses), a lightweight framework that adaptively blends predictions from an audio‑only classifier with a separate spatiotemporal model (e.g., location, season) for bioacoustic species identification. By learning a per‑sample gating function that gauges the trustworthiness of contextual cues, FINCH can automatically fall back to the audio‑only model when context is noisy, delivering more reliable and interpretable predictions.
Key Contributions
- Adaptive log‑linear fusion: Introduces a mathematically grounded way to combine discriminative audio and contextual predictors without needing calibrated generative models.
- Per‑sample reliability estimator: A gating network that uses uncertainty (e.g., entropy) and informativeness scores to decide how much weight to give the spatiotemporal evidence for each input.
- Risk‑contained hypothesis class: Guarantees that the fused model never performs worse than the audio‑only baseline because the audio predictor is always a fallback option.
- State‑of‑the‑art results: Sets new performance records on the CBI benchmark and matches or exceeds prior work on multiple BirdSet subsets, all with a modest computational footprint.
- Open‑source implementation: Provides a reproducible codebase that can be plugged into existing audio classification pipelines.
Methodology
Base models – Train (or reuse) two independent classifiers (a minimal sketch follows this list):
- Audio model $f_a(x)$ that maps raw spectrograms to species probabilities.
- Context model $f_c(s)$ that maps spatiotemporal metadata (latitude, month, etc.) to the same probability space.
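A minimal PyTorch sketch of the two base predictors. The architectures, `NUM_SPECIES`, and the context feature dimension are illustrative placeholders, not the paper's actual models:

```python
import torch
import torch.nn as nn

NUM_SPECIES = 500  # hypothetical label-space size; not specified in this summary


class AudioModel(nn.Module):
    """Stand-in for f_a: maps a (batch, 1, mel, time) spectrogram to class logits."""

    def __init__(self, num_classes: int = NUM_SPECIES):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(16, num_classes),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)


class ContextModel(nn.Module):
    """Stand-in for f_c: maps spatiotemporal features (lat, lon, month encoding, ...) to class logits."""

    def __init__(self, ctx_dim: int = 4, num_classes: int = NUM_SPECIES):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(ctx_dim, 64), nn.ReLU(),
            nn.Linear(64, num_classes),
        )

    def forward(self, s: torch.Tensor) -> torch.Tensor:
        return self.net(s)
```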
Evidence representation – Convert each model’s output into log‑probabilities (the log‑softmax of its logits) to enable additive combination, mirroring the Bayesian multiplication of independent evidence; see the derivation below.
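The additive rule is exact under a conditional‑independence assumption. A one‑step derivation (treating the class prior $p(y)$ as uniform is an assumption of this sketch; the summary's fusion rule does not carry the prior term explicitly):

```latex
% Assuming audio x and context s are conditionally independent given the class y:
p(y \mid x, s) \;\propto\; p(x \mid y)\, p(s \mid y)\, p(y)
             \;\propto\; \frac{p(y \mid x)\, p(y \mid s)}{p(y)}
% Taking logs gives the additive form that FINCH generalizes with a learned weight:
\log p(y \mid x, s) \;=\; \log p(y \mid x) + \log p(y \mid s) - \log p(y) + \text{const.}
```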
Gating function – A small neural network $g(x, s)$ takes the audio input and its metadata, and computes:
- Uncertainty (e.g., predictive entropy of $f_a$).
- Informativeness (e.g., mutual information between context and class).
The gate outputs a scalar $\alpha \in [0, 1]$ that scales the contribution of the context logits.
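A sketch of one plausible gate. Using the predictive entropies of both heads as input features is an assumption: the summary names entropy as the uncertainty signal but does not specify the informativeness estimator, so the context head's entropy stands in for it here:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


def entropy(logits: torch.Tensor) -> torch.Tensor:
    """Predictive entropy of a categorical distribution, shape (batch, 1)."""
    p = F.softmax(logits, dim=-1)
    return -(p * torch.log(p.clamp_min(1e-12))).sum(dim=-1, keepdim=True)


class Gate(nn.Module):
    """g(x, s): maps reliability features to a scalar alpha in [0, 1]."""

    def __init__(self, hidden: int = 32):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(2, hidden), nn.ReLU(), nn.Linear(hidden, 1))

    def forward(self, audio_logits: torch.Tensor, ctx_logits: torch.Tensor) -> torch.Tensor:
        # High audio entropy -> uncertain audio -> lean more on context;
        # high context entropy -> uninformative context -> lean on audio.
        feats = torch.cat([entropy(audio_logits), entropy(ctx_logits)], dim=-1)
        return torch.sigmoid(self.net(feats))  # alpha in [0, 1], shape (batch, 1)
```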
Fusion rule – The final log‑probability vector is

$$\log p_{\text{FINCH}} = \log f_a(x) + \alpha \cdot \log f_c(s)$$

(renormalized over classes).
When $\alpha = 0$ the model reduces to the audio‑only predictor; when $\alpha = 1$ it fully trusts the context.
Training – The audio and context models are frozen (or fine‑tuned) while the gating network is trained end‑to‑end on the classification loss, encouraging it to learn when context helps and when it should be ignored.
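Putting the pieces together, a hedged end‑to‑end sketch reusing the `AudioModel`, `ContextModel`, and `Gate` classes above (the optimizer, learning rate, and loss wiring are assumptions, not the paper's training recipe):

```python
import torch
import torch.nn.functional as F

audio_model, ctx_model, gate = AudioModel(), ContextModel(), Gate()

# Freeze the base predictors; only the gate is trained.
for model in (audio_model, ctx_model):
    for p in model.parameters():
        p.requires_grad_(False)

optimizer = torch.optim.Adam(gate.parameters(), lr=1e-3)


def finch_log_probs(x: torch.Tensor, s: torch.Tensor) -> torch.Tensor:
    logits_a, logits_c = audio_model(x), ctx_model(s)
    alpha = gate(logits_a, logits_c)  # per-sample weight, shape (batch, 1)
    fused = F.log_softmax(logits_a, dim=-1) + alpha * F.log_softmax(logits_c, dim=-1)
    return F.log_softmax(fused, dim=-1)  # renormalize over classes


def train_step(x: torch.Tensor, s: torch.Tensor, y: torch.Tensor) -> float:
    optimizer.zero_grad()
    loss = F.nll_loss(finch_log_probs(x, s), y)  # classification loss on fused output
    loss.backward()
    optimizer.step()
    return loss.item()
```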
Results & Findings
All figures are classification accuracy.

| Dataset | Audio‑only baseline | Fixed‑weight fusion | FINCH (adaptive) |
|---|---|---|---|
| CBI (large‑scale) | 78.4 % | 80.1 % | 84.7 % (new SOTA) |
| BirdSet – Forest subset | 71.2 % | 73.0 % | 75.5 % |
| BirdSet – Urban subset | 68.9 % | 69.4 % | 70.2 % |
- Robustness: When contextual features are deliberately corrupted (e.g., random GPS), FINCH’s performance drops only to the audio‑only level, whereas fixed‑weight fusion degrades significantly.
- Interpretability: The learned $\alpha$ values correlate strongly with intuitive measures (high $\alpha$ for recordings during species‑specific migration windows, low $\alpha$ for ambiguous audio).
- Efficiency: The gating network adds < 0.5 M parameters and incurs < 2 ms latency per sample on a CPU, making it suitable for edge deployment.
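A hypothetical sanity check mirroring the corruption experiment, reusing `finch_log_probs` and `audio_model` from the training sketch: shuffle the spatiotemporal features across the batch (simulating random GPS) and confirm accuracy falls no further than the audio‑only level:

```python
import torch
import torch.nn.functional as F


@torch.no_grad()
def accuracy(log_probs: torch.Tensor, y: torch.Tensor) -> float:
    return (log_probs.argmax(dim=-1) == y).float().mean().item()


@torch.no_grad()
def robustness_probe(x: torch.Tensor, s: torch.Tensor, y: torch.Tensor) -> dict:
    audio_only = F.log_softmax(audio_model(x), dim=-1)
    s_corrupt = s[torch.randperm(s.size(0))]  # random context per recording
    return {
        "audio_only": accuracy(audio_only, y),
        "finch_clean": accuracy(finch_log_probs(x, s), y),
        # For a well-behaved gate this should stay near the audio_only figure.
        "finch_corrupt": accuracy(finch_log_probs(x, s_corrupt), y),
    }
```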
Practical Implications
- Deployable wildlife monitoring: Conservation teams can run a single model on low‑power devices (e.g., Raspberry Pi) that automatically leverages location/season data when reliable, but safely defaults to audio‑only predictions otherwise.
- Generalizable to other domains: Any classification task with heterogeneous evidence (e.g., medical imaging + patient history, video + sensor metadata) can adopt FINCH’s gating‑based fusion without redesigning the underlying classifiers.
- Risk‑aware AI services: SaaS platforms that expose model predictions to end‑users can use FINCH to guarantee a “safe fallback” performance level, reducing liability when auxiliary data are noisy or missing.
- Rapid prototyping: Because FINCH works with pre‑trained black‑box models, developers can experiment with new contextual signals (e.g., weather, time‑of‑day) without retraining the heavy audio backbone.
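As a hedged illustration of that workflow, a new context head with extra (hypothetical) weather and time‑of‑day features can be dropped in while the frozen audio backbone stays untouched; only the gate and the new head need optimization, exactly as in the earlier `train_step` sketch:

```python
import torch

# Hypothetical richer context: lat, lon, month sin/cos, temperature, hour-of-day.
ctx_model = ContextModel(ctx_dim=6)
gate = Gate()

# The audio backbone remains frozen; retrain only the lightweight parts.
optimizer = torch.optim.Adam(
    list(gate.parameters()) + list(ctx_model.parameters()), lr=1e-3
)
```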
Limitations & Future Work
- Assumption of conditional independence: FINCH treats audio and context as independent evidence sources; strong correlations (e.g., location influencing acoustic properties) could violate this and limit optimality.
- Reliance on good uncertainty estimates: The gating function’s decisions hinge on calibrated uncertainty from the audio model; poorly calibrated networks may mis‑weight context.
- Scalability to many modalities: Extending the log‑linear fusion to more than two evidence streams (e.g., multi‑sensor IoT) may require more sophisticated gating architectures.
- Future directions: The authors suggest exploring Bayesian calibration techniques for the audio predictor, hierarchical gating for multi‑modal fusion, and applying FINCH to real‑time streaming scenarios where context arrives asynchronously.
Authors
- Oscar Ovanger
- Levi Harris
- Timothy H. Keitt
Paper Information
- arXiv ID: 2602.03817v1
- Categories: cs.SD, cs.AI
- Published: February 3, 2026