[Paper] From Semantics to Pixels: Coarse-to-Fine Masked Autoencoders for Hierarchical Visual Understanding

Published: March 10, 2026 at 01:51 PM EDT
5 min read

Source: arXiv - 2603.09955v1

Overview

The paper introduces C2FMAE (Coarse‑to‑Fine Masked AutoEncoder), a self‑supervised vision model that learns visual representations at three levels of granularity—scene‑level semantics, object‑level masks, and raw pixels. By marrying the strengths of contrastive learning (global semantics) and masked image modeling (local detail), the authors achieve a more balanced and transferable feature extractor for downstream tasks such as classification, detection, and segmentation.

Key Contributions

  • Hierarchical pre‑training objective that simultaneously predicts semantic masks, instance masks, and pixel values, enforcing a top‑down learning flow.
  • Cascaded decoder architecture where each decoder stage refines the output of the previous, creating explicit cross‑granularity dependencies.
  • Progressive masking curriculum that starts with semantically guided masks, moves to instance‑guided masks, and finally to random masks, steering the model from coarse context to fine detail.
  • Large‑scale multi‑granular dataset: 1.28 M ImageNet‑1K images enriched with high‑quality pseudo‑labels for scene and object masks.
  • State‑of‑the‑art results on multiple benchmarks (ImageNet‑1K classification, COCO detection, ADE20K segmentation) without any supervised fine‑tuning.
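The progressive masking curriculum above can be sketched as a per-image patch selector. This is a minimal illustration, not the authors' implementation: `curriculum_mask` and its `priority` argument are hypothetical names, and the real model operates on ViT patch grids with pseudo-mask coverage scores.

```python
import numpy as np

def curriculum_mask(phase, mask_ratio, num_patches, priority=None, rng=None):
    """Choose which patches to hide for one image.

    Phases 1 and 2 bias selection toward patches flagged in `priority`
    (coverage by the pseudo-semantic or pseudo-instance mask, respectively);
    phase 3 falls back to uniform random masking, as in plain MAE.
    """
    rng = rng if rng is not None else np.random.default_rng(0)
    n_hide = int(round(mask_ratio * num_patches))
    scores = rng.random(num_patches)
    if phase in (1, 2) and priority is not None:
        # Flagged patches get +1, so they always outrank unflagged ones.
        scores = scores + priority.astype(float)
    hidden = np.zeros(num_patches, dtype=bool)
    hidden[np.argsort(-scores)[:n_hide]] = True
    return hidden
```

Under this sketch, switching `phase` (and the `priority` map) per training stage reproduces the coarse-to-fine schedule without changing the encoder or loss code.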

Methodology

  1. Data preparation – The authors generate pseudo‑semantic masks (e.g., foreground/background) and pseudo‑instance masks (object proposals) for every ImageNet‑1K image using off‑the‑shelf segmentation models. This yields three parallel “views” of each image: a coarse mask, a finer object mask, and the original RGB pixels.

  2. Encoder – A standard Vision Transformer (ViT) processes the partially masked RGB input (random masking) and produces a latent representation.

  3. Cascaded decoder

    • Stage 1 reconstructs the semantic mask from the encoder output.
    • Stage 2 takes the latent plus the reconstructed semantic mask to predict the instance mask.
    • Stage 3 finally uses both masks to reconstruct the full‑resolution RGB image.
      This cascade forces the later stages to rely on the earlier, coarser predictions, ensuring a hierarchical flow of information.
  4. Progressive masking curriculum – Training proceeds in three phases:

    • Phase 1: mask tokens are placed according to the semantic mask (large contiguous regions).
    • Phase 2: masking follows the instance mask (object‑level occlusions).
    • Phase 3: conventional random masking is applied.
      The curriculum gradually shifts the model’s attention from global layout to object boundaries and finally to fine‑grained texture.
  5. Losses – Each decoder head has its own reconstruction loss (e.g., binary cross‑entropy for masks, L2 for pixels). The total loss is a weighted sum, encouraging all three granularities to be learned jointly.
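The cascade in step 3 and the joint loss in step 5 can be illustrated with a toy NumPy sketch. This is not the paper's code: `cascade_forward` and `c2fmae_loss` are illustrative names, the lambda-style decoders stand in for transformer decoder stages, and the loss weights are assumed to be tunable hyperparameters.

```python
import numpy as np

def bce(pred, target, eps=1e-7):
    """Binary cross-entropy, used for the two mask heads."""
    p = np.clip(pred, eps, 1 - eps)
    return float(-(target * np.log(p) + (1 - target) * np.log(1 - p)).mean())

def mse(pred, target):
    """L2 reconstruction loss, used for the pixel head."""
    return float(((pred - target) ** 2).mean())

def cascade_forward(latent, dec_sem, dec_inst, dec_pix):
    """Each stage conditions on the encoder latent plus all earlier outputs."""
    sem = dec_sem(latent)                                        # Stage 1: semantic mask
    inst = dec_inst(np.concatenate([latent, sem], axis=-1))      # Stage 2: instance mask
    pix = dec_pix(np.concatenate([latent, sem, inst], axis=-1))  # Stage 3: RGB pixels
    return sem, inst, pix

def c2fmae_loss(sem_pred, sem_gt, inst_pred, inst_gt, pix_pred, pix_gt,
                weights=(1.0, 1.0, 1.0)):
    """Weighted sum over the three granularities, trained jointly."""
    w_sem, w_inst, w_pix = weights
    return (w_sem * bce(sem_pred, sem_gt)
            + w_inst * bce(inst_pred, inst_gt)
            + w_pix * mse(pix_pred, pix_gt))
```

The key structural point the sketch captures is that stages 2 and 3 cannot bypass the coarser predictions: their inputs are built by concatenating earlier-stage outputs, which is what creates the top-down dependency the paper describes.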

Results & Findings

| Benchmark | Baseline (MAE) | C2FMAE (ours) | Δ (↑) |
| --- | --- | --- | --- |
| ImageNet‑1K top‑1 accuracy (linear probe) | 68.5 % | 71.2 % | +2.7 % |
| COCO object detection (AP) | 48.3 | 51.0 | +2.7 |
| ADE20K semantic segmentation (mIoU) | 44.1 | 47.8 | +3.7 |
  • Robustness to downstream tasks: The hierarchical pre‑training consistently outperforms plain masked autoencoders across classification, detection, and segmentation, confirming that the learned features retain both global context and fine detail.
  • Ablation studies: Removing either the cascaded decoder or the progressive masking curriculum drops performance by roughly 1.5–2 points, highlighting the synergy between the two design choices.
  • Efficiency: Despite the extra decoder stages, training time increases by only ~15 % because the stages share most of the transformer backbone.

Practical Implications

  • Better foundation models for vision APIs – Companies building image‑search, auto‑tagging, or visual QA services can adopt C2FMAE to obtain a single checkpoint that works well for both high‑level classification and low‑level segmentation without task‑specific pre‑training.
  • Reduced annotation cost – Since the model learns from pseudo‑labels generated automatically, organizations can bootstrap hierarchical representations on proprietary image collections without expensive manual labeling.
  • Improved transfer to edge devices – The encoder remains a vanilla ViT, so the same lightweight backbone can be deployed on mobile or embedded hardware while still benefiting from the richer pre‑training.
  • Facilitates multi‑task fine‑tuning – Developers can fine‑tune a single C2FMAE checkpoint for a suite of downstream tasks (e.g., detection + segmentation) with less risk of catastrophic forgetting, thanks to the already aligned hierarchical features.

Limitations & Future Work

  • Reliance on pseudo‑mask quality – The hierarchical signals come from automatically generated masks; errors in those masks could propagate through training.
  • Scalability to larger backbones – Experiments were limited to ViT‑Base; it remains to be seen how the approach scales to larger transformers or hybrid CNN‑ViT architectures.
  • Curriculum design heuristics – The three‑phase masking schedule is hand‑crafted; learning an optimal curriculum automatically could further boost performance.
  • Domain shift – While ImageNet‑1K coverage is broad, the method’s effectiveness on highly specialized domains (medical imaging, satellite data) needs validation.

Bottom line: C2FMAE offers a pragmatic recipe for building self‑supervised vision models that understand both the “big picture” and the “tiny details,” making it a compelling option for developers who need versatile visual backbones without the overhead of massive labeled datasets.

Authors

  • Wenzhao Xiang
  • Yue Wu
  • Hongyang Yu
  • Feng Gao
  • Fan Yang
  • Xilin Chen

Paper Information

  • arXiv ID: 2603.09955v1
  • Categories: cs.CV, cs.LG
  • Published: March 10, 2026
