[Paper] Leveraging Model Soups to Classify Intangible Cultural Heritage Images from the Mekong Delta
Source: arXiv - 2603.02181v1
Overview
A new study tackles the notoriously hard problem of classifying Intangible Cultural Heritage (ICH) photographs from Vietnam’s Mekong Delta. By pairing a hybrid CoAtNet vision model with a lightweight ensembling trick called model soups, the authors gain roughly 3.5 points of top‑1 accuracy over a single checkpoint, with no extra inference cost, on a small, highly imbalanced dataset.
Key Contributions
- Hybrid CoAtNet backbone that fuses convolutional and self‑attention layers to capture both local texture and global context in heritage images.
- Model soups (greedy and uniform checkpoint averaging) applied to a single training run, delivering variance reduction comparable to full ensembles but with zero runtime overhead.
- Bias‑variance analysis that quantifies how soups stabilize predictions while keeping bias low, offering a theoretical lens for practitioners.
- Geometric diversity diagnostics using cross‑entropy distance and Multidimensional Scaling (MDS) to show soups pick truly diverse checkpoints, unlike naïve soft‑voting ensembles.
- State‑of‑the‑art results on the ICH‑17 dataset (7,406 images, 17 classes): 72.36 % top‑1 accuracy and 69.28 % macro F1, beating ResNet‑50, DenseNet‑121, and Vision Transformers.
Methodology
- Data & Challenge – The ICH‑17 collection is small and visually homogeneous (many classes share similar colors, patterns, and backgrounds). Traditional deep nets tend to overfit or latch onto spurious cues.
- CoAtNet Backbone – The network is built in stages: early layers use depth‑wise convolutions for fine‑grained texture, later stages switch to multi‑head self‑attention for global scene understanding. This hybrid design is more data‑efficient than pure CNNs or pure Transformers.
- Training Trajectory & Checkpoints – During a single training run, the model is saved at several epochs after the learning‑rate schedule plateaus (e.g., epochs 30, 35, 40, 45). Each checkpoint represents a slightly different local optimum.
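The checkpoint-collection step can be sketched in a few lines. This is a minimal framework-agnostic illustration, not the authors' code: `model_params` and `train_one_epoch` are hypothetical stand-ins for a real model state and training step.

```python
import copy

def collect_checkpoints(model_params, train_one_epoch, total_epochs=45,
                        save_epochs=(30, 35, 40, 45)):
    """Run training and snapshot the weights at the given epochs.

    `model_params` is a dict of weight tensors/arrays mutated in place by
    `train_one_epoch` (both are placeholders for a real framework's objects).
    """
    checkpoints = []
    for epoch in range(1, total_epochs + 1):
        train_one_epoch(model_params, epoch)
        if epoch in save_epochs:
            # Deep-copy so later training updates don't mutate the snapshot.
            checkpoints.append(copy.deepcopy(model_params))
    return checkpoints
```

Each saved snapshot is a candidate ingredient for the soup built in the next step.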
- Model Soups
- Uniform Soup: simple arithmetic mean of all selected checkpoints’ weights.
- Greedy Soup: iteratively adds the checkpoint that most improves validation loss when averaged with the current soup, stopping when no further gain is observed.
The resulting “soup” is a single set of weights that can be loaded once for inference.
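Both soup variants reduce to weight-space averaging. The sketch below assumes checkpoints are dicts of NumPy arrays and that a `val_loss` callable scoring a weight dict is available (an assumption; the paper works with full model states in a deep-learning framework).

```python
import numpy as np

def uniform_soup(checkpoints):
    """Element-wise mean of corresponding weights across all checkpoints."""
    return {k: np.mean([c[k] for c in checkpoints], axis=0)
            for k in checkpoints[0]}

def greedy_soup(checkpoints, val_loss):
    """Greedily add checkpoints that lower validation loss when averaged in.

    `val_loss` maps a weight dict to a scalar (lower is better).
    Starts from the best single checkpoint, then tries the rest in order.
    """
    order = sorted(checkpoints, key=val_loss)
    soup, best = [order[0]], val_loss(order[0])
    for ckpt in order[1:]:
        candidate = uniform_soup(soup + [ckpt])
        loss = val_loss(candidate)
        if loss < best:        # keep the checkpoint only if it helps
            soup.append(ckpt)
            best = loss
    return uniform_soup(soup)
```

The returned dict is a single set of weights, so inference cost is identical to a single model.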
- Evaluation – Standard top‑1 accuracy and macro‑averaged F1 are reported, alongside a bias‑variance decomposition (using the classic decomposition of expected error into bias² + variance + irreducible noise).
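For 0–1 classification loss, a common way to operationalize this decomposition is Domingos-style: the "main prediction" is the majority vote across predictors, bias is whether the main prediction is wrong, and variance is the average disagreement with it. The sketch below is one such estimator, not the paper's exact procedure (noise is assumed zero).

```python
import numpy as np

def zero_one_bias_variance(preds, labels):
    """Estimate bias and variance for 0-1 loss over a set of predictors.

    `preds`: (n_models, n_samples) array of predicted class ids.
    `labels`: (n_samples,) ground-truth class ids.
    Returns (bias, variance), each averaged over samples.
    """
    preds = np.asarray(preds)
    labels = np.asarray(labels)
    # Main prediction = per-sample majority vote across models.
    main = np.array([np.bincount(col).argmax() for col in preds.T])
    bias = float(np.mean(main != labels))            # main prediction is wrong
    variance = float(np.mean(preds != main[None, :]))  # disagreement with main
    return bias, variance
```

Averaging checkpoints into a soup shrinks the variance term while leaving the bias term essentially unchanged, which matches the paper's reported ~30 % variance reduction.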
Results & Findings
| Model | Top‑1 Acc. | Macro F1 |
|---|---|---|
| ResNet‑50 | 61.2 % | 58.1 % |
| DenseNet‑121 | 63.5 % | 60.4 % |
| ViT‑Base/16 | 66.8 % | 63.9 % |
| CoAtNet (single checkpoint) | 68.9 % | 65.7 % |
| CoAtNet + Uniform Soup | 71.4 % | 68.1 % |
| CoAtNet + Greedy Soup | 72.36 % | 69.28 % |
- Variance reduction: The soup models show a ~30 % drop in the variance component of the error decomposition, confirming that averaging diverse snapshots stabilizes predictions.
- Bias impact: Added bias is negligible (<1 % of total error), meaning the ensemble does not “wash out” the learned features.
- Diversity matters: MDS plots of checkpoint embeddings reveal that greedy soup selects checkpoints spread across the output space, whereas soft‑voting ensembles cluster tightly, explaining the superior performance of soups.
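The diversity diagnostic can be reproduced with a pairwise distance matrix over checkpoint output distributions, which is then embedded in 2-D via MDS (e.g. scikit-learn's `MDS` with `dissimilarity='precomputed'`). The symmetrised cross-entropy below is an assumed form of the paper's "cross-entropy distance", shown for illustration only.

```python
import numpy as np

def cross_entropy_distance(p, q, eps=1e-12):
    """Symmetrised cross-entropy between two (n_samples, n_classes)
    predicted-probability matrices. Larger values = more diverse outputs."""
    p = np.clip(p, eps, 1.0)
    q = np.clip(q, eps, 1.0)
    ce_pq = -np.mean(np.sum(p * np.log(q), axis=1))
    ce_qp = -np.mean(np.sum(q * np.log(p), axis=1))
    return 0.5 * (ce_pq + ce_qp)

def pairwise_distance_matrix(prob_list):
    """Distance matrix over checkpoints, suitable as precomputed MDS input."""
    n = len(prob_list)
    d = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            d[i, j] = d[j, i] = cross_entropy_distance(prob_list[i],
                                                       prob_list[j])
    return d
```

Checkpoints whose points spread out in the MDS embedding contribute complementary errors, which is exactly what greedy soup exploits.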
Practical Implications
- Zero‑cost ensembles: Developers can get ensemble‑level gains without the memory or latency penalties of running multiple models—perfect for edge devices or mobile apps that need to classify cultural‑heritage photos on‑device.
- Low‑resource domains: The approach shines where labeled data are scarce (e.g., heritage preservation, medical imaging, niche industrial inspection). By simply saving a few extra checkpoints, teams can squeeze extra accuracy out of existing training pipelines.
- Model‑agnostic recipe: While the paper uses CoAtNet, the soup technique works with any architecture (CNN, Transformer, hybrid). Teams can plug it into their current CI/CD training workflow with minimal code changes.
- Interpretability boost: The bias‑variance analysis and checkpoint‑space visualizations give engineers a diagnostic tool to understand why a model is over‑fitting, guiding data‑augmentation or regularization decisions.
Limitations & Future Work
- Dataset size & diversity: Results are validated on a single 7k‑image dataset; broader generalization to other cultural‑heritage collections remains to be proven.
- Checkpoint selection heuristics: The greedy algorithm is simple but may miss globally optimal combinations; more sophisticated search (e.g., Bayesian optimization) could yield further gains.
- Real‑time constraints: Although inference cost is unchanged, the training phase requires storing multiple checkpoints, which could be memory‑intensive for very large models.
- Future directions suggested by the authors include: extending soups to multi‑task settings (e.g., simultaneous classification and segmentation), exploring adaptive weighting of checkpoints rather than uniform averaging, and testing the pipeline on other low‑resource vision problems such as rare species identification.
Authors
- Quoc‑Khang Tran
- Minh‑Thien Nguyen
- Nguyen‑Khang Pham
Paper Information
- arXiv ID: 2603.02181v1
- Categories: cs.CV, cs.AI, cs.LG
- Published: March 2, 2026