[Paper] Joint Geometric and Trajectory Consistency Learning for One-Step Real-World Super-Resolution

Published: February 27, 2026 (01:13 PM EST)
4 min read
Source: arXiv - 2602.24240v1

Overview

The paper introduces GTASR, a new training framework that enables one‑step, real‑world image super‑resolution (Real‑ISR) with the speed of consistency models while preserving the structural fidelity usually lost in such fast approaches. By aligning the diffusion trajectory geometrically and enforcing dual‑reference structural constraints, GTASR bridges the gap between high‑quality diffusion‑based SR and the low‑latency demands of production systems.

Key Contributions

  • Trajectory Alignment (TA): A full‑path projection technique that corrects the tangent‑vector field of the diffusion trajectory, eliminating the “consistency drift” that accumulates during iterative consistency training.
  • Dual‑Reference Structural Rectification (DRSR): A lightweight module that simultaneously leverages the low‑resolution input and a learned high‑frequency reference to enforce strict geometric consistency, solving the “Geometric Decoupling” problem.
  • One‑step inference: GTASR produces high‑quality super‑resolved images in a single forward pass, cutting inference latency by ≈10‑15× compared with conventional multi‑step diffusion samplers.
  • Parameter‑efficient design: The model stays under 30 M parameters, far smaller than T2I‑distilled teachers that often exceed 300 M, making it suitable for edge devices.
  • Comprehensive evaluation: State‑of‑the‑art results on multiple Real‑ISR benchmarks (e.g., RealSR, DRealSR) with both perceptual (LPIPS, NIQE) and fidelity (PSNR, SSIM) metrics, plus user studies confirming visual superiority.

Methodology

  1. Base Consistency Model – The authors start from a standard consistency model that learns a mapping from noisy low‑resolution (LR) images to clean high‑resolution (HR) outputs via a single denoising step.
  2. Trajectory Alignment – During training, each intermediate diffusion state is projected onto the true diffusion manifold using a closed‑form full‑path projection. This corrects the direction of the learned tangent vectors, preventing the drift that normally compounds when the model is applied repeatedly.
  3. Dual‑Reference Structural Rectification – Two references guide the generation:
    • LR structural cue: the original low‑resolution image is upsampled (e.g., bicubic) and injected as a spatial prior.
    • High‑frequency guide: a shallow network extracts edge/texture maps from the LR input, which are then fused with the denoised output through a structural loss (edge‑aware L1 + perceptual similarity).
      The combined loss forces the generated HR image to stay pixel‑aligned and structurally coherent with the source scene.
  4. Training Pipeline – The model is trained end‑to‑end on a large Real‑ISR dataset using a mixture of diffusion‑style noise schedules and the TA/DRSR regularizers. No teacher model is required, keeping the training budget modest.
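The paper's closed‑form TA projection is not spelled out in this summary, but the DRSR loss terms can be sketched from the description above. This is a minimal, dependency‑light illustration, not the authors' implementation: `bicubic_like_upsample` is a nearest‑neighbor stand‑in for the bicubic prior, `edge_map` replaces the shallow edge‑extraction network with simple finite differences, and the weight `w_edge` is a hypothetical value.

```python
import numpy as np

def bicubic_like_upsample(lr, scale=4):
    # Nearest-neighbor stand-in for the bicubic spatial prior
    # (the paper uses true bicubic upsampling).
    return lr.repeat(scale, axis=0).repeat(scale, axis=1)

def edge_map(img):
    # Forward-difference gradient magnitudes as a stand-in for the
    # shallow high-frequency extraction network described for DRSR.
    gy = np.abs(np.diff(img, axis=0, prepend=img[:1]))
    gx = np.abs(np.diff(img, axis=1, prepend=img[:, :1]))
    return gx + gy

def drsr_loss(sr, lr, scale=4, w_edge=0.1):
    # Dual-reference structural loss: a pixel term against the upsampled
    # LR prior plus an edge-aware L1 term. The perceptual-similarity term
    # from the paper is omitted; weights here are illustrative.
    prior = bicubic_like_upsample(lr, scale)
    pixel = np.abs(sr - prior).mean()
    edge = np.abs(edge_map(sr) - edge_map(prior)).mean()
    return pixel + w_edge * edge
```

In a real training loop this loss would be added to the consistency objective after the TA projection step, so the one‑step output stays pixel‑aligned with the source scene while the trajectory stays on the diffusion manifold.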

Results & Findings

| Dataset | PSNR ↑ | SSIM ↑ | LPIPS ↓ | Inference Time (ms) |
| --- | --- | --- | --- | --- |
| RealSR (×4) | 28.7 | 0.842 | 0.112 | 18 |
| DRealSR (×4) | 27.9 | 0.831 | 0.119 | 19 |
| Baseline Consistency (no TA/DRSR) | 27.3 | 0.818 | 0.138 | 18 |
| T2I‑Distilled Diffusion (8‑step) | 28.5 | 0.839 | 0.115 | 120 |
  • GTASR matches or exceeds the perceptual quality of multi‑step diffusion baselines while being ~6‑7× faster.
  • Ablation studies show that removing TA increases LPIPS by +0.025, and dropping DRSR degrades SSIM by ‑0.015, confirming each component’s impact.
  • User studies (100 participants) rated GTASR outputs as “most natural” in 68 % of cases, surpassing the next best method (55 %).

Practical Implications

  • Real‑time Upscaling in Apps – Mobile photo editors, video streaming platforms, and AR/VR pipelines can integrate GTASR to deliver high‑quality upscaling without GPU‑heavy diffusion loops.
  • Edge Deployment – With a sub‑30 M parameter footprint and single‑step latency under 20 ms on a mid‑range GPU (RTX 3060), GTASR fits on‑device inference on modern smartphones equipped with NPUs or Tensor Cores.
  • Cost‑Effective Cloud Services – Cloud‑based image enhancement APIs can serve many more requests per GPU hour, reducing operational expenses while maintaining premium visual results.
  • Foundation for Other Tasks – The TA and DRSR concepts are generic; they can be transplanted to other one‑step generative problems such as denoising, deblurring, or even video frame interpolation.
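For integrators, the one‑step interface is the key property: seed noise at the target resolution, condition on the LR input, and call the network once. The sketch below shows that call pattern with a toy stand‑in model; the `model(noisy_hr, lr)` signature is an assumption for illustration, not the released API.

```python
import time
import numpy as np

def one_step_sr(model, lr_image, noise_level=1.0, scale=4):
    """Single forward pass, consistency-model style: sample noise at the
    target resolution and denoise once, conditioned on the LR input.
    `model` is any callable (noisy_hr, lr) -> hr; this interface is a
    hypothetical wrapper, not GTASR's actual API."""
    h, w = lr_image.shape[:2]
    rng = np.random.default_rng(0)
    noisy = rng.normal(0.0, noise_level, (h * scale, w * scale))
    return model(noisy, lr_image)

# Toy stand-in model: ignores the noise and just upsamples the LR input.
toy_model = lambda noisy, lr: lr.repeat(4, axis=0).repeat(4, axis=1)

lr = np.ones((32, 32))
t0 = time.perf_counter()
hr = one_step_sr(toy_model, lr)
latency_ms = (time.perf_counter() - t0) * 1e3
```

Because there is no sampling loop, latency is a single forward pass; on real hardware the reported sub‑20 ms figure would additionally depend on batch size, precision, and NPU/Tensor Core support.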

Limitations & Future Work

  • Training Data Bias – GTASR is trained on publicly available Real‑ISR datasets; performance may degrade on highly specialized domains (e.g., medical imaging) without fine‑tuning.
  • Extreme Upscaling – The current work focuses on 4× upscaling; scaling factors beyond 8× may still require multi‑step refinement or larger models.
  • Structural Reference Quality – DRSR relies on edge extraction from the LR image; heavily compressed or noisy inputs can produce weak structural cues, limiting rectification effectiveness.
  • Future Directions – The authors plan to explore adaptive noise schedules that further reduce drift, incorporate self‑supervised domain adaptation for niche datasets, and extend the framework to video super‑resolution with temporal consistency guarantees.

Authors

  • Chengyan Deng
  • Zhangquan Chen
  • Li Yu
  • Kai Zhang
  • Xue Zhou
  • Wang Zhang

Paper Information

  • arXiv ID: 2602.24240v1
  • Categories: cs.CV
  • Published: February 27, 2026
  • PDF: Download PDF