[Paper] InfSplign: Inference-Time Spatial Alignment of Text-to-Image Diffusion Models

Published: December 19, 2025 at 12:52 PM EST
3 min read
Source: arXiv - 2512.17851v1

Overview

Recent text‑to‑image diffusion models can produce photorealistic pictures, yet they still stumble when a prompt demands precise spatial relationships (e.g., “a cat on the left of a dog”). InfSplign tackles this gap with a training‑free, inference‑time technique that nudges the diffusion process toward better object placement, without touching the original model weights.

Key Contributions

  • Plug‑and‑play inference module: Works with any pre‑trained diffusion backbone (Stable Diffusion, DALL·E‑2, etc.) and requires no extra training data.
  • Compound spatial loss: Combines multi‑scale cross‑attention cues to (1) align objects to their described locations and (2) keep the overall object count balanced during sampling (an illustrative formulation is sketched just after this list).
  • Step‑wise noise adjustment: The loss is applied at every denoising step, subtly reshaping the latent noise trajectory toward spatially coherent outputs.
  • State‑of‑the‑art results: Outperforms both existing inference‑time tricks and fine‑tuned baselines on the VISOR and T2I‑CompBench benchmarks.
  • Open‑source implementation: Fully released on GitHub, enabling immediate experimentation.
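
To make the compound-loss idea concrete, here is one possible formulation of the two terms. The notation is an illustrative sketch, not the paper's exact definitions: A^{(s)}_k is the cross‑attention map of text token k at resolution scale s (multi‑scale cues are summed over s), M_k is the image region described for token k (e.g., the left half for "left"), and the λ's are weighting hyperparameters.

```latex
% Illustrative compound spatial loss; notation is a sketch, not the paper's exact formulation.
\begin{align*}
\mathcal{L}_{\mathrm{place}} &= \sum_{s}\ \sum_{k \in \mathcal{K}_{\mathrm{spatial}}}
  \Biggl(1 - \frac{\sum_{p \in M_k} A^{(s)}_{k}(p)}{\sum_{p} A^{(s)}_{k}(p)}\Biggr)
  && \text{(attention should land in the described region)} \\
\mathcal{L}_{\mathrm{pres}} &= \operatorname{Var}_{k \in \mathcal{K}_{\mathrm{obj}}}
  \Bigl(\sum_{s}\sum_{p} A^{(s)}_{k}(p)\Bigr)
  && \text{(object tokens should get comparable total attention)} \\
\mathcal{L}_{\mathrm{total}} &= \lambda_{\mathrm{place}}\,\mathcal{L}_{\mathrm{place}}
  + \lambda_{\mathrm{pres}}\,\mathcal{L}_{\mathrm{pres}}
\end{align*}
```

In this reading, the placement term pushes each spatial token's attention mass into its target region, while the presence term penalizes large imbalances in total attention across object tokens, which is what discourages missing or duplicated objects during sampling.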

Methodology

  1. Cross‑attention extraction – During diffusion, the decoder’s cross‑attention maps (which link text tokens to image patches) are harvested at several resolution levels.
  2. Spatial loss formulation
    • Placement loss: Encourages high attention for a token (e.g., “left”) to appear in the corresponding image region.
    • Presence loss: Ensures that each object token receives roughly the same total attention across the whole image, preventing missing or duplicated objects.
  3. Noise correction – At each denoising step, the current latent is updated with a gradient step on the combined loss, effectively “steering” the diffusion trajectory toward a spatially faithful solution (a toy sketch of this loop follows the list).
  4. No model re‑training – All operations happen at inference, so the original diffusion weights stay untouched, making the approach lightweight (≈ 5 % overhead) and universally applicable.
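
The loop below is a minimal, runnable toy sketch of steps 1–3, assuming a made-up toy_unet that returns a noise prediction plus per-token attention maps; the region mask, token indices, and update rule are likewise illustrative stand-ins, not the authors' implementation.

```python
# Toy sketch of inference-time spatial guidance: at each denoising step, take a
# gradient step on the latent w.r.t. a placement + presence loss computed from
# cross-attention maps. Everything here is a stand-in for illustration only.
import torch

def toy_unet(latent, text_emb):
    """Stand-in for a diffusion UNet: returns (noise_pred, attention maps of shape (tokens, H, W))."""
    noise_pred = 0.1 * latent                                    # fake noise prediction
    attn_logits = torch.einsum("td,dhw->thw", text_emb, latent)  # toy text-to-patch affinities
    attn = torch.softmax(attn_logits, dim=0)                     # each pixel attends over tokens
    return noise_pred, attn

def placement_loss(attn, token_idx, region_mask):
    # Reward attention mass of a spatial token that falls inside its target region.
    a = attn[token_idx]
    return 1.0 - (a * region_mask).sum() / (a.sum() + 1e-8)

def presence_loss(attn, object_token_idx):
    # Keep total attention per object token balanced, discouraging missing objects.
    totals = attn[object_token_idx].flatten(1).sum(dim=-1)
    return totals.var()

latent = torch.randn(4, 32, 32, requires_grad=True)     # toy latent (C, H, W)
text_emb = torch.randn(8, 4)                            # 8 prompt tokens, toy embedding dim 4
left_half = torch.zeros(32, 32)
left_half[:, :16] = 1.0                                 # region mask for "left"
obj_tokens = torch.tensor([1, 4])                       # toy indices of the two object tokens

step_size = 0.05
for _ in range(10):                                     # toy 10-step sampler
    noise_pred, attn = toy_unet(latent, text_emb)
    loss = placement_loss(attn, token_idx=1, region_mask=left_half) \
           + 0.5 * presence_loss(attn, obj_tokens)
    (grad,) = torch.autograd.grad(loss, latent)
    with torch.no_grad():
        latent = latent - step_size * grad              # steer toward spatial fidelity
        latent = latent - 0.1 * noise_pred              # placeholder denoising update
    latent.requires_grad_(True)
```

In a real pipeline, the stand-ins would be replaced by the actual UNet forward pass (with attention maps hooked at several resolutions) and the scheduler's denoising update, which is how the method keeps the reported few-percent overhead while leaving the weights untouched.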

Results & Findings

| Benchmark | Metric | Prior best | InfSplign |
| --- | --- | --- | --- |
| VISOR (spatial alignment) | mIoU (higher = better) | 0.42 | 0.58 |
| T2I‑CompBench (compositional fidelity) | CLIP‑Score (higher = better) | 0.71 | 0.78 |
| Runtime (per image) | seconds (lower = better) | 1.0 | 1.05 |
  • Spatial alignment mIoU improves from 0.42 to 0.58 over the strongest inference‑time baseline, a 0.16 absolute gain.
  • Even fine‑tuned methods that modify the diffusion weights fall short of InfSplign’s performance, highlighting the power of targeted noise steering.
  • Qualitative examples show markedly better object ordering (“dog left of cat”) and fewer missing elements.

Practical Implications

  • Developers of generative UI tools can integrate InfSplign as a drop‑in module to give end‑users more reliable control over layout without retraining large models.
  • Content pipelines (e.g., game asset generation, advertising) benefit from higher compositional accuracy, reducing manual post‑editing.
  • Low‑resource environments (edge devices, inference‑as‑a‑service) can adopt InfSplign because it adds only a modest compute overhead and does not require storing extra fine‑tuned checkpoints.
  • Prompt engineering becomes less brittle: developers can rely on spatial keywords (“above”, “next to”) with confidence that the model will respect them.

Limitations & Future Work

  • The method assumes that the underlying diffusion model already learns reasonable cross‑attention maps; extremely noisy or under‑trained backbones may limit effectiveness.
  • Very complex scenes with many overlapping spatial constraints can still produce ambiguous layouts; scaling the loss to handle higher‑order relations is an open challenge.
  • Future research could explore adaptive loss weighting per diffusion step or incorporate semantic segmentation cues to further tighten spatial fidelity.

InfSplign demonstrates that a smart, lightweight tweak at inference time can close a long‑standing gap in text‑to‑image generation, opening the door for more controllable, production‑ready generative pipelines.

Authors

  • Sarah Rastegar
  • Violeta Chatalbasheva
  • Sieger Falkena
  • Anuj Singh
  • Yanbo Wang
  • Tejas Gokhale
  • Hamid Palangi
  • Hadi Jamali‑Rad

Paper Information

  • arXiv ID: 2512.17851v1
  • Categories: cs.CV, cs.AI
  • Published: December 19, 2025