[Paper] Self-Supervised Neural Architecture Search for Multimodal Deep Neural Networks
Source: arXiv - 2512.24793v1
Overview
The paper introduces a self‑supervised neural architecture search (NAS) framework tailored for multimodal deep neural networks. By leveraging unlabeled data for both the search and pre‑training phases, the authors show that it’s possible to automatically discover high‑performing multimodal architectures without the massive labeled datasets that traditional NAS methods demand.
Key Contributions
- Self‑supervised NAS pipeline that jointly optimizes architecture and representation learning on unlabeled multimodal data.
- Unified SSL objective applied during the search phase, enabling the controller to evaluate candidate architectures without ground‑truth labels.
- Empirical validation on benchmark multimodal tasks (e.g., audio‑visual and text‑image fusion) demonstrating comparable or superior performance to supervised NAS baselines.
- Analysis of search efficiency, showing reduced computational overhead thanks to the elimination of label‑dependent evaluation loops.
Methodology
- Search Space Definition – A flexible search space that includes modality‑specific encoders, cross‑modal fusion blocks, and task‑specific heads.
- Self‑Supervised Proxy Task – A contrastive SSL objective (e.g., SimCLR‑style instance discrimination) replaces the target task loss, encouraging modality‑invariant embeddings (a minimal loss sketch follows this list).
- Controller Architecture – An RL or differentiable controller samples candidate architectures; each candidate is briefly trained on the SSL task, and its validation SSL loss serves as the reward signal (see the search‑loop sketch after this list).
- Weight Sharing & Early Stopping – Weight sharing across candidates and early stopping after a few epochs keep the search tractable, similar to ENAS/PDARTS.
- Final Model Fine‑Tuning – The best‑found architecture is fully trained (still self‑supervised) and optionally fine‑tuned on a small labeled set if available.
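To make the proxy task concrete, here is a minimal sketch of a cross‑modal contrastive (InfoNCE / SimCLR‑style) objective of the kind the search could optimize instead of a labeled loss. It is an illustration of the general technique, not the authors' exact formulation; the function name, temperature value, and embedding shapes are assumptions.

```python
# Cross-modal contrastive proxy loss: a sketch, not the paper's exact objective.
import torch
import torch.nn.functional as F


def cross_modal_info_nce(z_a: torch.Tensor, z_b: torch.Tensor,
                         temperature: float = 0.1) -> torch.Tensor:
    """Contrast paired embeddings from two modalities of the same instances.

    z_a, z_b: (batch, dim) embeddings, e.g. outputs of the audio and visual
    encoders of one candidate architecture. The matching row in the other
    modality is the positive; every other row in the batch is a negative.
    """
    z_a = F.normalize(z_a, dim=-1)
    z_b = F.normalize(z_b, dim=-1)
    logits = z_a @ z_b.t() / temperature          # (batch, batch) similarities
    targets = torch.arange(z_a.size(0), device=z_a.device)
    # Symmetrize so each modality serves both as anchor and as target.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))


# Usage with dummy embeddings standing in for encoder outputs.
audio_emb = torch.randn(32, 128)
video_emb = torch.randn(32, 128)
loss = cross_modal_info_nce(audio_emb, video_emb)
```

Because this loss needs only paired (unlabeled) multimodal samples, it can score candidate architectures during the search without any ground‑truth annotations.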
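The search loop itself can be sketched as follows. This is a simplified, assumed interface rather than the paper's implementation: `FUSION_OPS` is a toy stand‑in for the fusion‑operator search space, uniform random sampling replaces the learned RL/differentiable controller, and `supernet` is a hypothetical weight‑shared module that routes the two modality inputs through the operators named in `arch` and returns one embedding per modality. It reuses `cross_modal_info_nce` from the previous sketch.

```python
# ENAS-style weight-shared search driven by the SSL proxy loss (illustrative only).
import random
import torch
import torch.nn as nn

FUSION_OPS = ["concat_mlp", "gated_sum", "cross_attention"]   # toy search space


def sample_architecture(num_fusion_layers: int = 2) -> list[str]:
    """Sample one candidate: a fusion operator choice per fusion layer."""
    return [random.choice(FUSION_OPS) for _ in range(num_fusion_layers)]


def search(supernet: nn.Module, train_batches, val_batches,
           num_candidates: int = 50, proxy_epochs: int = 3):
    """Score sampled sub-architectures with the label-free SSL proxy loss."""
    optimizer = torch.optim.Adam(supernet.parameters(), lr=1e-3)
    best_arch, best_val = None, float("inf")
    for _ in range(num_candidates):
        arch = sample_architecture()
        # Weight sharing: every candidate reuses the same supernet parameters,
        # and only the weights on its active path receive gradients.
        for _ in range(proxy_epochs):            # early stopping: a few epochs only
            for x_a, x_b in train_batches:       # paired unlabeled modality inputs
                z_a, z_b = supernet(x_a, x_b, arch=arch)
                loss = cross_modal_info_nce(z_a, z_b)   # SSL proxy, no labels
                optimizer.zero_grad()
                loss.backward()
                optimizer.step()
        # Label-free reward signal: validation SSL loss of this candidate.
        with torch.no_grad():
            val = sum(cross_modal_info_nce(*supernet(x_a, x_b, arch=arch)).item()
                      for x_a, x_b in val_batches) / max(len(val_batches), 1)
        if val < best_val:
            best_arch, best_val = arch, val
    return best_arch
```

The best architecture returned by `search` would then be trained from scratch with the same self‑supervised objective and, optionally, fine‑tuned on whatever small labeled set is available, mirroring the final step described above.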
Results & Findings
- Performance: On multimodal benchmarks, the self‑supervised NAS discovered architectures that improved absolute accuracy by 2–4% over hand‑crafted baselines and matched supervised NAS results, all while using no labeled data during the search.
- Search Cost: The SSL‑driven search required ≈30% fewer GPU‑hours than a comparable supervised NAS run, thanks to the cheaper proxy loss and weight sharing.
- Robustness: Architectures found with SSL showed greater resilience to modality dropout (e.g., missing audio) than those found via supervised search, indicating better learned cross‑modal representations.
Practical Implications
- Label‑Scarce Domains: Companies working with sensor fusion (e.g., autonomous vehicles, robotics) can now automate architecture design without the costly collection of annotated multimodal datasets.
- Rapid Prototyping: Development teams can plug in their own unlabeled multimodal streams (video + telemetry, text + images, etc.) and obtain a ready‑to‑train architecture in days rather than weeks.
- Resource Efficiency: Reducing reliance on labeled data cuts both annotation budgets and the compute needed for exhaustive NAS, making the process feasible on mid‑range GPU clusters.
- Transferability: The discovered architectures can serve as strong starting points for downstream tasks (e.g., sentiment analysis from video + audio) with minimal fine‑tuning, accelerating product cycles.
Limitations & Future Work
- Proxy Task Alignment: The SSL objective may not align perfectly with the downstream task, potentially yielding sub‑optimal architectures for highly specialized applications.
- Search Space Scope: The study focuses on a relatively constrained set of fusion operators; expanding to more exotic attention‑based or graph‑structured fusion blocks could yield further gains.
- Scalability to Very Large Datasets: While the method cuts label dependence, the SSL pre‑training itself can still be compute‑intensive on massive multimodal corpora; future work could explore more lightweight contrastive losses or curriculum‑based search.
Bottom line: By marrying self‑supervised learning with neural architecture search, this work opens a practical pathway for developers to auto‑design powerful multimodal models without the traditional bottleneck of large labeled datasets.
Authors
- Shota Suzuki
- Satoshi Ono
Paper Information
- arXiv ID: 2512.24793v1
- Categories: cs.LG, cs.NE
- Published: December 31, 2025