[Paper] Joint Geometric and Trajectory Consistency Learning for One-Step Real-World Super-Resolution

Published: February 27, 2026 (01:13 PM EST)
4 min read
Source: arXiv - 2602.24240v1

Overview

The paper introduces GTASR, a new training framework that enables one‑step, real‑world image super‑resolution (Real‑ISR) with the speed of consistency models while preserving the structural fidelity usually lost in such fast approaches. By aligning the diffusion trajectory geometrically and enforcing dual‑reference structural constraints, GTASR bridges the gap between high‑quality diffusion‑based SR and the low‑latency demands of production systems.

Key Contributions

  • Trajectory Alignment (TA): A full‑path projection technique that corrects the tangent‑vector field of the diffusion trajectory, eliminating the “consistency drift” that accumulates during iterative consistency training.
  • Dual‑Reference Structural Rectification (DRSR): A lightweight module that simultaneously leverages the low‑resolution input and a learned high‑frequency reference to enforce strict geometric consistency, solving the “Geometric Decoupling” problem.
  • One‑step inference: GTASR produces high‑quality super‑resolved images in a single forward pass, cutting inference latency by ≈10‑15× compared with conventional multi‑step diffusion samplers.
  • Parameter‑efficient design: The model stays under 30 M parameters, far smaller than T2I‑distilled teachers that often exceed 300 M, making it suitable for edge devices.
  • Comprehensive evaluation: State‑of‑the‑art results on multiple Real‑ISR benchmarks (e.g., RealSR, DRealSR) with both perceptual (LPIPS, NIQE) and fidelity (PSNR, SSIM) metrics, plus user studies confirming visual superiority.

Methodology

  1. Base Consistency Model – The authors start from a standard consistency model that learns a mapping from noisy low‑resolution (LR) images to clean high‑resolution (HR) outputs via a single denoising step.
  2. Trajectory Alignment – During training, each intermediate diffusion state is projected onto the true diffusion manifold using a closed‑form full‑path projection. This corrects the direction of the learned tangent vectors, preventing the drift that normally compounds when the model is applied repeatedly.
  3. Dual‑Reference Structural Rectification – Two references guide the generation:
    • LR structural cue: the original low‑resolution image is upsampled (e.g., bicubic) and injected as a spatial prior.
    • High‑frequency guide: a shallow network extracts edge/texture maps from the LR input, which are then fused with the denoised output through a structural loss (edge‑aware L1 + perceptual similarity).
      The combined loss forces the generated HR image to stay pixel‑aligned and structurally coherent with the source scene.
  4. Training Pipeline – The model is trained end‑to‑end on a large Real‑ISR dataset using a mixture of diffusion‑style noise schedules and the TA/DRSR regularizers. No teacher model is required, keeping the training budget modest.
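The paper's closed‑form TA projection is not spelled out in this summary, but the DRSR loss terms can be sketched from the description above. This is a minimal, dependency‑light illustration, not the authors' implementation: `bicubic_like_upsample` is a nearest‑neighbor stand‑in for the bicubic prior, `edge_map` replaces the shallow edge‑extraction network with simple finite differences, and the weight `w_edge` is a hypothetical value.

```python
import numpy as np

def bicubic_like_upsample(lr, scale=4):
    # Nearest-neighbor stand-in for the bicubic spatial prior
    # (the paper uses true bicubic upsampling).
    return lr.repeat(scale, axis=0).repeat(scale, axis=1)

def edge_map(img):
    # Forward-difference gradient magnitudes as a stand-in for the
    # shallow high-frequency extraction network described for DRSR.
    gy = np.abs(np.diff(img, axis=0, prepend=img[:1]))
    gx = np.abs(np.diff(img, axis=1, prepend=img[:, :1]))
    return gx + gy

def drsr_loss(sr, lr, scale=4, w_edge=0.1):
    # Dual-reference structural loss: a pixel term against the upsampled
    # LR prior plus an edge-aware L1 term. The perceptual-similarity term
    # from the paper is omitted; weights here are illustrative.
    prior = bicubic_like_upsample(lr, scale)
    pixel = np.abs(sr - prior).mean()
    edge = np.abs(edge_map(sr) - edge_map(prior)).mean()
    return pixel + w_edge * edge
```

In a real training loop this loss would be added to the consistency objective after the TA projection step, so the one‑step output stays pixel‑aligned with the source scene while the trajectory stays on the diffusion manifold.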

Results & Findings

| Dataset | PSNR ↑ | SSIM ↑ | LPIPS ↓ | Inference Time (ms) |
| --- | --- | --- | --- | --- |
| RealSR (×4) | 28.7 | 0.842 | 0.112 | 18 |
| DRealSR (×4) | 27.9 | 0.831 | 0.119 | 19 |
| Baseline Consistency (no TA/DRSR) | 27.3 | 0.818 | 0.138 | 18 |
| T2I‑Distilled Diffusion (8‑step) | 28.5 | 0.839 | 0.115 | 120 |
  • GTASR matches or exceeds the perceptual quality of multi‑step diffusion baselines while being ~6‑7× faster.
  • Ablation studies show that removing TA increases LPIPS by +0.025, and dropping DRSR degrades SSIM by ‑0.015, confirming each component’s impact.
  • User studies (100 participants) rated GTASR outputs as “most natural” in 68 % of cases, surpassing the next best method (55 %).

Practical Implications

  • Real‑time Upscaling in Apps – Mobile photo editors, video streaming platforms, and AR/VR pipelines can integrate GTASR to deliver high‑quality upscaling without GPU‑heavy diffusion loops.
  • Edge Deployment – With a sub‑30 M parameter footprint and single‑step latency under 20 ms on a mid‑range GPU (RTX 3060), GTASR fits on‑device inference on modern smartphones equipped with NPUs or Tensor Cores.
  • Cost‑Effective Cloud Services – Cloud‑based image enhancement APIs can serve many more requests per GPU hour, reducing operational expenses while maintaining premium visual results.
  • Foundation for Other Tasks – The TA and DRSR concepts are generic; they can be transplanted to other one‑step generative problems such as denoising, deblurring, or even video frame interpolation.
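For integrators, the one‑step interface is the key property: seed noise at the target resolution, condition on the LR input, and call the network once. The sketch below shows that call pattern with a toy stand‑in model; the `model(noisy_hr, lr)` signature is an assumption for illustration, not the released API.

```python
import time
import numpy as np

def one_step_sr(model, lr_image, noise_level=1.0, scale=4):
    """Single forward pass, consistency-model style: sample noise at the
    target resolution and denoise once, conditioned on the LR input.
    `model` is any callable (noisy_hr, lr) -> hr; this interface is a
    hypothetical wrapper, not GTASR's actual API."""
    h, w = lr_image.shape[:2]
    rng = np.random.default_rng(0)
    noisy = rng.normal(0.0, noise_level, (h * scale, w * scale))
    return model(noisy, lr_image)

# Toy stand-in model: ignores the noise and just upsamples the LR input.
toy_model = lambda noisy, lr: lr.repeat(4, axis=0).repeat(4, axis=1)

lr = np.ones((32, 32))
t0 = time.perf_counter()
hr = one_step_sr(toy_model, lr)
latency_ms = (time.perf_counter() - t0) * 1e3
```

Because there is no sampling loop, latency is a single forward pass; on real hardware the reported sub‑20 ms figure would additionally depend on batch size, precision, and NPU/Tensor Core support.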

Limitations & Future Work

  • Training Data Bias – GTASR is trained on publicly available Real‑ISR datasets; performance may degrade on highly specialized domains (e.g., medical imaging) without fine‑tuning.
  • Extreme Upscaling – The current work focuses on 4× upscaling; scaling factors beyond 8× may still require multi‑step refinement or larger models.
  • Structural Reference Quality – DRSR relies on edge extraction from the LR image; heavily compressed or noisy inputs can produce weak structural cues, limiting rectification effectiveness.
  • Future Directions – The authors plan to explore adaptive noise schedules that further reduce drift, incorporate self‑supervised domain adaptation for niche datasets, and extend the framework to video super‑resolution with temporal consistency guarantees.

Authors

  • Chengyan Deng
  • Zhangquan Chen
  • Li Yu
  • Kai Zhang
  • Xue Zhou
  • Wang Zhang

Paper Information

  • arXiv ID: 2602.24240v1
  • Categories: cs.CV
  • Published: February 27, 2026
  • PDF: Download PDF