[Paper] MaCo-GAN: Manifold-Contrastive Adversarial Learning for Single Image Super-Resolution
Source: arXiv - 2606.05068v1
Overview
The paper introduces MaCo‑GAN, a new way to train single‑image super‑resolution (SISR) models that reduces the “hallucinated” artifacts often seen in GAN‑based upscalers. By swapping the classic adversarial loss for a supervised contrastive objective, the authors create a more disciplined game between generator and discriminator, leading to sharper, more realistic high‑resolution outputs without sacrificing fidelity to the low‑resolution input.
Key Contributions
- Manifold‑Contrastive GAN (MaCo‑GAN): Replaces the conventional adversarial loss with a contrastive loss that explicitly separates “on‑manifold” (plausible) and “off‑manifold” (implausible) fake samples.
- Dynamic Fake Sample Synthesizer: Generates a continuum of fake HR images from the ground‑truth by applying controlled degradations, guaranteeing that every fake still corresponds to the same LR input.
- Contrastive Minimax Game: Formulates the generator and discriminator objectives as a push‑pull contrastive problem: the generator is encouraged to move toward on‑manifold fakes and away from off‑manifold ones, while the discriminator does the opposite.
- Drop‑in Replacement: The new loss can be swapped into existing SR pipelines (e.g., ESRGAN, RCAN) without architectural changes, delivering consistent perception‑distortion improvements across multiple datasets.
- Extensive Ablations & Analysis: Provides thorough experiments that dissect the impact of each component (fake synthesis strength, contrastive temperature, batch composition) and visualizes the evolving feature space during training.
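The Dynamic Fake Sample Synthesizer has to produce fakes of varying distortion that all map to the same LR input. The following is a minimal sketch of one way to obtain that property, not the authors' implementation. It assumes average pooling as the downsampling operator, uses Gaussian noise as the only degradation, and represents a grayscale image as a plain list of rows to stay dependency‑free. The names `downsample`, `upsample`, `synthesize_fake`, and `mse` are hypothetical.

```python
import random

def downsample(img, s):
    """Average-pool an image (list of rows of floats) by integer factor s."""
    h, w = len(img) // s, len(img[0]) // s
    return [
        [sum(img[i * s + a][j * s + b] for a in range(s) for b in range(s)) / (s * s)
         for j in range(w)]
        for i in range(h)
    ]

def upsample(img, s):
    """Nearest-neighbour upsampling by integer factor s."""
    return [
        [img[i // s][j // s] for j in range(len(img[0]) * s)]
        for i in range(len(img) * s)
    ]

def synthesize_fake(gt, strength, s=4, rng=random):
    """Perturb the GT image, then project back so the LR image is unchanged.

    `strength` is the noise standard deviation (an assumption of this sketch):
    small values give low-distortion fakes, large values high-distortion ones.
    Image height and width must be divisible by s.
    """
    noisy = [[p + rng.gauss(0.0, strength) for p in row] for row in gt]
    lr_gt, lr_noisy = downsample(gt, s), downsample(noisy, s)
    # LR-space error introduced by the perturbation.
    lr_err = [[n - g for n, g in zip(rn, rg)] for rn, rg in zip(lr_noisy, lr_gt)]
    # Average-pooling a nearest-neighbour upsampled image returns the input,
    # so subtracting the upsampled error restores downsample(fake) == downsample(gt).
    corr = upsample(lr_err, s)
    return [[n - c for n, c in zip(rn, rc)] for rn, rc in zip(noisy, corr)]

def mse(a, b):
    """Mean squared error between two images of equal size."""
    n = len(a) * len(a[0])
    return sum((x - y) ** 2 for ra, rb in zip(a, b) for x, y in zip(ra, rb)) / n

rng = random.Random(0)
gt = [[rng.random() for _ in range(8)] for _ in range(8)]   # toy 8x8 "HR" image
mild = synthesize_fake(gt, strength=0.01, rng=rng)          # low distortion
harsh = synthesize_fake(gt, strength=0.30, rng=rng)         # high distortion
print("MSE to GT:", mse(mild, gt), mse(harsh, gt))
print("LR gap:", mse(downsample(mild, 4), downsample(gt, 4)))
```

Pixel values are deliberately not clipped here, because clipping after the correction step would break the exact LR consistency. A real synthesizer would presumably also include blur and compression, as the summary describes.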
Methodology
- Fake Sample Synthesis:
- Starting from a high‑resolution ground‑truth (GT) image, the authors apply a set of stochastic degradations (blur, noise, compression) to create a spectrum of fake HR images.
- All fakes share the same low‑resolution (LR) counterpart, ensuring the conditional relationship (LR → HR) stays intact.
- Contrastive Objective:
- For each training step, a batch contains: the real GT, several on‑manifold fakes (low distortion), and several off‑manifold fakes (high distortion).
- The discriminator is trained with a supervised contrastive loss that pulls the embeddings of on‑manifold samples together into one cluster and pushes the embeddings of off‑manifold samples away from that cluster.
- The generator receives the opposite signal: it tries to make its output’s embedding close to the on‑manifold cluster and far from the off‑manifold cluster.
- Training Loop:
- The generator produces an SR image from the LR input.
- The discriminator processes the SR image together with the synthesized fakes and the GT, computing the contrastive loss.
- Gradients are back‑propagated to both networks, forming a contrastive minimax game that replaces the usual GAN binary cross‑entropy loss.
- Integration:
- The authors plug the MaCo‑GAN loss into several state‑of‑the‑art SR backbones, keeping all other hyper‑parameters unchanged, demonstrating the method’s plug‑and‑play nature.
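The contrastive objective described above can be illustrated with the standard supervised contrastive (SupCon) loss form. This is a sketch under assumptions, not the paper's exact formulation: the labelling (1 for the GT and on‑manifold fakes, 0 for off‑manifold fakes), the toy 2‑D embeddings, and the function names `normalize` and `supcon_loss` are all hypothetical, and plain Python lists stand in for discriminator feature tensors.

```python
import math

def normalize(v):
    """Scale a vector to unit L2 norm."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def supcon_loss(embeddings, labels, tau=0.07):
    """Supervised contrastive loss, averaged over anchors that have a positive."""
    z = [normalize(e) for e in embeddings]
    n = len(z)
    total, anchors = 0.0, 0
    for i in range(n):
        positives = [p for p in range(n) if p != i and labels[p] == labels[i]]
        if not positives:
            continue
        # Temperature-scaled cosine similarity to every other sample.
        logits = {a: sum(x * y for x, y in zip(z[i], z[a])) / tau
                  for a in range(n) if a != i}
        # Log-sum-exp with max subtraction for numerical stability.
        m = max(logits.values())
        log_denom = m + math.log(sum(math.exp(v - m) for v in logits.values()))
        # Average negative log-probability of picking each positive.
        total += -sum(logits[p] - log_denom for p in positives) / len(positives)
        anchors += 1
    return total / anchors if anchors else 0.0

# Toy 2-D embeddings: GT plus two on-manifold fakes, then two off-manifold fakes.
batch = [[1.0, 0.0], [0.9, 0.1], [1.0, 0.05], [0.0, 1.0], [0.1, 0.9]]
labels = [1, 1, 1, 0, 0]   # assumed labelling: 1 = on-manifold, 0 = off-manifold
print(supcon_loss(batch, labels))
```

The loss is low when samples with the same label sit close together and far from the other group, which is the clustering behaviour the summary attributes to the discriminator. In a generator step, the same function could plausibly be evaluated with the SR output's embedding labelled as on‑manifold, so that minimizing it pulls the output toward that cluster; that reading is an interpretation of this summary, not a confirmed detail of the paper.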
Results & Findings
| Model | PSNR (dB) ↑ / SSIM ↑ | LPIPS ↓ (perceptual) | MOS ↑ (Mean Opinion Score) |
|---|---|---|---|
| ESRGAN | 27.8 / 0.81 | 0.12 | 3.4 |
| ESRGAN + MaCo‑GAN | 27.5 / 0.80 | 0.09 | 3.9 |
| RCAN | 28.3 / 0.84 | 0.11 | 3.2 |
| RCAN + MaCo‑GAN | 28.0 / 0.83 | 0.08 | 3.8 |
- Perception‑Distortion Trade‑off: In the table above, adding MaCo‑GAN lowers LPIPS by 0.03 for both backbones (0.12 → 0.09 for ESRGAN, 0.11 → 0.08 for RCAN) and raises MOS (3.4 → 3.9 and 3.2 → 3.8), while PSNR drops by 0.3 dB and SSIM by 0.01. Perceptual quality improves without a large fidelity penalty.
- Ablation Insights:
- Removing the dynamic fake synthesizer (using only a single fake) degrades performance, confirming the need for a diverse fake manifold.
- Varying the contrastive temperature shows a sweet spot (τ≈0.07) where the discriminator’s feature space is neither too tight nor too loose.
- Feature Space Evolution: t‑SNE visualizations reveal that on‑manifold samples form a tight cluster that the generator gradually learns to inhabit, while off‑manifold samples remain well separated throughout training.
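To make the temperature ablation more concrete, the sketch below shows how τ reshapes the softmax that a contrastive loss applies to similarity scores. The similarity values and the helper name `softmax_with_temperature` are illustrative assumptions, not taken from the paper.

```python
import math

def softmax_with_temperature(sims, tau):
    """Softmax over similarities divided by temperature tau."""
    logits = [s / tau for s in sims]
    m = max(logits)                          # subtract max for stability
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical cosine similarities of one anchor to three other samples.
sims = [0.9, 0.7, 0.1]
for tau in (0.01, 0.07, 1.0):
    probs = softmax_with_temperature(sims, tau)
    print(f"tau={tau}: {[round(p, 3) for p in probs]}")
```

A small τ concentrates almost all probability on the most similar sample, so the loss focuses on the hardest comparisons. A large τ flattens the distribution, so every sample contributes more evenly. This is consistent with the reported sweet spot lying between a feature space that is too tight and one that is too loose.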
Practical Implications
- Cleaner Upscaling in Production: Developers integrating SR into video streaming, gaming, or medical imaging pipelines can adopt MaCo‑GAN to reduce the “ghosting” and texture artifacts that often plague GAN‑based upscalers.
- Plug‑and‑Play Upgrade: Since the method only swaps the loss function, existing SR models can adopt it without architectural changes. The model still has to be retrained or fine‑tuned with the new loss, but the rest of the pipeline stays the same, saving engineering effort.
- Better User Experience: For UI/UX teams, higher perceptual quality translates to sharper thumbnails, more realistic texture rendering in AR/VR, and improved visual fidelity in low‑bandwidth scenarios.
- Potential for Edge Deployment: The contrastive loss is computationally comparable to a standard GAN loss, so the training overhead is modest. Because only the training objective changes, inference cost is unchanged, which is critical for on‑device SR on smartphones or embedded systems.
Limitations & Future Work
- Synthetic Fake Diversity: The quality of the contrastive game hinges on the fake synthesizer’s ability to span the true distribution of plausible HR images. Hand‑crafted degradations may miss more complex real‑world artifacts (e.g., sensor noise patterns).
- Scalability to Ultra‑High Res: Experiments stop at 4× upscaling (e.g., 480×270 → 1920×1080). It remains unclear how the method scales to 8× or 16× super‑resolution, where hallucination risk is higher.
- Training Stability: While contrastive losses are less prone to mode collapse than binary GAN losses, the authors note occasional oscillations when the fake‑to‑real ratio is extreme, suggesting a need for adaptive sampling strategies.
- Future Directions: The authors propose extending the fake synthesizer with learned degradation models, integrating perceptual metrics directly into the contrastive objective, and exploring multi‑modal contrastive setups for video SR where temporal consistency matters.
Authors
- Daeyoung Han
- Seongmin Hwang
- Moongu Jeon
Paper Information
- arXiv ID: 2606.05068v1
- Categories: cs.CV
- Published: June 3, 2026