[Paper] Self-Supervised Neural Architecture Search for Multimodal Deep Neural Networks
Source: arXiv - 2512.24793v1
Overview
The paper introduces a self‑supervised neural architecture search (NAS) framework tailored for multimodal deep neural networks. By leveraging unlabeled data for both the search and pre‑training phases, the authors show that it’s possible to automatically discover high‑performing multimodal architectures without the massive labeled datasets that traditional NAS methods demand.
Key Contributions
- Self‑supervised NAS pipeline that jointly optimizes architecture and representation learning on unlabeled multimodal data.
- Unified SSL objective applied during the search phase, enabling the controller to evaluate candidate architectures without ground‑truth labels.
- Empirical validation on benchmark multimodal tasks (e.g., audio‑visual and text‑image fusion) demonstrating comparable or superior performance to supervised NAS baselines.
- Analysis of search efficiency, showing reduced computational overhead thanks to the elimination of label‑dependent evaluation loops.
Methodology
- Search Space Definition – A flexible search space that includes modality‑specific encoders, cross‑modal fusion blocks, and task‑specific heads.
- Self‑Supervised Proxy Task – A contrastive SSL objective (e.g., SimCLR‑style instance discrimination) replaces the target task loss, encouraging modality‑invariant embeddings (a minimal loss sketch follows this list).
- Controller Architecture – An RL or differentiable controller samples candidate architectures; each candidate is briefly trained on the SSL task, and its validation SSL loss serves as the reward signal (see the search‑loop sketch after this list).
- Weight Sharing & Early Stopping – Weight sharing across candidates and early stopping after a few epochs keep the search tractable, similar to ENAS/PDARTS.
- Final Model Fine‑Tuning – The best‑found architecture is fully trained (still self‑supervised) and optionally fine‑tuned on a small labeled set if available.
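To make the proxy task concrete, here is a minimal sketch of a cross‑modal contrastive (InfoNCE / SimCLR‑style) objective of the kind the search could optimize instead of a labeled loss. It is an illustration of the general technique, not the authors' exact formulation; the function name, temperature value, and embedding shapes are assumptions.

```python
# Cross-modal contrastive proxy loss: a sketch, not the paper's exact objective.
import torch
import torch.nn.functional as F


def cross_modal_info_nce(z_a: torch.Tensor, z_b: torch.Tensor,
                         temperature: float = 0.1) -> torch.Tensor:
    """Contrast paired embeddings from two modalities of the same instances.

    z_a, z_b: (batch, dim) embeddings, e.g. outputs of the audio and visual
    encoders of one candidate architecture. The matching row in the other
    modality is the positive; every other row in the batch is a negative.
    """
    z_a = F.normalize(z_a, dim=-1)
    z_b = F.normalize(z_b, dim=-1)
    logits = z_a @ z_b.t() / temperature          # (batch, batch) similarities
    targets = torch.arange(z_a.size(0), device=z_a.device)
    # Symmetrize so each modality serves both as anchor and as target.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))


# Usage with dummy embeddings standing in for encoder outputs.
audio_emb = torch.randn(32, 128)
video_emb = torch.randn(32, 128)
loss = cross_modal_info_nce(audio_emb, video_emb)
```

Because this loss needs only paired (unlabeled) multimodal samples, it can score candidate architectures during the search without any ground‑truth annotations.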
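The search loop itself can be sketched as follows. This is a simplified, assumed interface rather than the paper's implementation: `FUSION_OPS` is a toy stand‑in for the fusion‑operator search space, uniform random sampling replaces the learned RL/differentiable controller, and `supernet` is a hypothetical weight‑shared module that routes the two modality inputs through the operators named in `arch` and returns one embedding per modality. It reuses `cross_modal_info_nce` from the previous sketch.

```python
# ENAS-style weight-shared search driven by the SSL proxy loss (illustrative only).
import random
import torch
import torch.nn as nn

FUSION_OPS = ["concat_mlp", "gated_sum", "cross_attention"]   # toy search space


def sample_architecture(num_fusion_layers: int = 2) -> list[str]:
    """Sample one candidate: a fusion operator choice per fusion layer."""
    return [random.choice(FUSION_OPS) for _ in range(num_fusion_layers)]


def search(supernet: nn.Module, train_batches, val_batches,
           num_candidates: int = 50, proxy_epochs: int = 3):
    """Score sampled sub-architectures with the label-free SSL proxy loss."""
    optimizer = torch.optim.Adam(supernet.parameters(), lr=1e-3)
    best_arch, best_val = None, float("inf")
    for _ in range(num_candidates):
        arch = sample_architecture()
        # Weight sharing: every candidate reuses the same supernet parameters,
        # and only the weights on its active path receive gradients.
        for _ in range(proxy_epochs):            # early stopping: a few epochs only
            for x_a, x_b in train_batches:       # paired unlabeled modality inputs
                z_a, z_b = supernet(x_a, x_b, arch=arch)
                loss = cross_modal_info_nce(z_a, z_b)   # SSL proxy, no labels
                optimizer.zero_grad()
                loss.backward()
                optimizer.step()
        # Label-free reward signal: validation SSL loss of this candidate.
        with torch.no_grad():
            val = sum(cross_modal_info_nce(*supernet(x_a, x_b, arch=arch)).item()
                      for x_a, x_b in val_batches) / max(len(val_batches), 1)
        if val < best_val:
            best_arch, best_val = arch, val
    return best_arch
```

The best architecture returned by `search` would then be trained from scratch with the same self‑supervised objective and, optionally, fine‑tuned on whatever small labeled set is available, mirroring the final step described above.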
Results & Findings
- Performance: On multimodal benchmarks, the self‑supervised NAS discovered architectures that improved absolute accuracy by 2–4% over hand‑crafted baselines and matched supervised NAS results, all while using no labeled data during the search.
- Search Cost: The SSL‑driven search required ≈30% fewer GPU‑hours than a comparable supervised NAS run, thanks to the cheaper proxy loss and weight sharing.
- Robustness: Architectures found with SSL showed greater resilience to modality dropout (e.g., missing audio) than those found via supervised search, indicating better learned cross‑modal representations.
Practical Implications
- Label‑Scarce Domains: Companies working with sensor fusion (e.g., autonomous vehicles, robotics) can now automate architecture design without the costly collection of annotated multimodal datasets.
- Rapid Prototyping: Development teams can plug in their own unlabeled multimodal streams (video + telemetry, text + images, etc.) and obtain a ready‑to‑train architecture in days rather than weeks.
- Resource Efficiency: Reducing reliance on labeled data cuts both annotation budgets and the compute needed for exhaustive NAS, making the process feasible on mid‑range GPU clusters.
- Transferability: The discovered architectures can serve as strong starting points for downstream tasks (e.g., sentiment analysis from video + audio) with minimal fine‑tuning, accelerating product cycles.
Limitations & Future Work
- Proxy Task Alignment: The SSL objective may not align perfectly with the downstream task, potentially yielding sub‑optimal architectures for highly specialized applications.
- Search Space Scope: The study focuses on a relatively constrained set of fusion operators; expanding to more exotic attention‑based or graph‑structured fusion blocks could yield further gains.
- Scalability to Very Large Datasets: While the method cuts label dependence, the SSL pre‑training itself can still be compute‑intensive on massive multimodal corpora; future work could explore more lightweight contrastive losses or curriculum‑based search.
Bottom line: By marrying self‑supervised learning with neural architecture search, this work opens a practical pathway for developers to auto‑design powerful multimodal models without the traditional bottleneck of large labeled datasets.
Authors
- Shota Suzuki
- Satoshi Ono
Paper Information
- arXiv ID: 2512.24793v1
- Categories: cs.LG, cs.NE
- Published: December 31, 2025