[Paper] InfSplign: Inference-Time Spatial Alignment of Text-to-Image Diffusion Models
Source: arXiv - 2512.17851v1
Overview
Recent text‑to‑image diffusion models can produce photorealistic pictures, yet they still stumble when a prompt demands precise spatial relationships (e.g., “a cat on the left of a dog”). InfSplign tackles this gap with a training‑free, inference‑time technique that nudges the diffusion process toward better object placement, without touching the original model weights.
Key Contributions
- Plug‑and‑play inference module: Works with any pre‑trained diffusion backbone (Stable Diffusion, DALL·E‑2, etc.) and requires no extra training data.
- Compound spatial loss: Combines multi‑scale cross‑attention cues to (1) align objects to their described locations and (2) keep the overall object count balanced during sampling (an illustrative sketch of both terms follows this list).
- Step‑wise noise adjustment: The loss is applied at every denoising step, subtly reshaping the latent noise trajectory toward spatially coherent outputs.
- State‑of‑the‑art results: Outperforms both existing inference‑time tricks and fine‑tuned baselines on the VISOR and T2I‑CompBench benchmarks.
- Open‑source implementation: Fully released on GitHub, enabling immediate experimentation.
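The compound loss above is only described at a high level in this summary. As a rough, hedged sketch, the PyTorch snippet below shows one plausible form of the two terms operating on cross‑attention maps; the tensor shapes, region‑mask convention, and normalization choices are assumptions for illustration, not the authors' released code.

```python
# A minimal, illustrative sketch of the two loss terms (assumed shapes and
# region-mask convention; not the authors' released code).
import torch
import torch.nn.functional as F


def placement_loss(attn_maps, token_idx, region_mask):
    """Encourage the cross-attention mass of one text token to fall inside
    its target image region (e.g., the left half for the word "left").

    attn_maps:   (B, H*W, T) cross-attention, averaged over heads/layers
    token_idx:   index of the spatial/object token in the prompt
    region_mask: (B, H*W) binary mask of the desired region
    """
    attn = attn_maps[:, :, token_idx]                      # (B, H*W)
    attn = attn / (attn.sum(dim=-1, keepdim=True) + 1e-8)  # per-image distribution
    mass_inside = (attn * region_mask).sum(dim=-1)         # fraction of mass inside region
    return (1.0 - mass_inside).mean()                      # 0 when all mass is inside


def presence_loss(attn_maps, object_token_indices):
    """Keep total attention roughly balanced across object tokens so that no
    object is dropped or duplicated."""
    totals = torch.stack(
        [attn_maps[:, :, i].sum(dim=-1) for i in object_token_indices], dim=-1
    )                                                      # (B, num_objects)
    target = totals.mean(dim=-1, keepdim=True)             # equal-share target
    return F.mse_loss(totals, target.expand_as(totals))
```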
Methodology
- Cross‑attention extraction – During diffusion, the decoder’s cross‑attention maps (which link text tokens to image patches) are harvested at several resolution levels.
- Spatial loss formulation – combines two terms:
  - Placement loss: Encourages the attention for a spatial or object token (e.g., “left”) to concentrate in the corresponding image region.
  - Presence loss: Ensures that each object token receives roughly the same total attention across the whole image, preventing missing or duplicated objects.
- Noise correction – At each denoising step, the current latent is updated with a gradient‑descent step on the combined loss, effectively steering the diffusion trajectory toward a spatially faithful output (see the sketch after this list).
- No model re‑training – All operations happen at inference, so the original diffusion weights stay untouched, making the approach lightweight (≈ 5 % overhead) and universally applicable.
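Building on the loss sketch above, the following is a minimal sketch of how the per‑step noise correction could be wired up, assuming a denoiser callable that caches its cross‑attention maps during the forward pass. The function names, hyperparameters, and single‑correction schedule are illustrative assumptions rather than the paper's released implementation.

```python
# A minimal, illustrative steering loop reusing placement_loss / presence_loss
# from the sketch above. `denoise_step`, `get_attn_maps`, and all weights /
# step sizes are hypothetical stand-ins, not the released implementation.
import torch


def steer_latent(latent, timestep, denoise_step, get_attn_maps,
                 token_idx, region_mask, object_token_indices,
                 step_size=0.1, num_corrections=1,
                 placement_weight=1.0, presence_weight=0.5):
    """Return a spatially corrected latent for one denoising step.

    denoise_step(latent, timestep): runs the frozen UNet once and, as a side
        effect, caches its cross-attention maps.
    get_attn_maps(): returns the cached (B, H*W, T) cross-attention maps.
    """
    for _ in range(num_corrections):
        with torch.enable_grad():
            latent = latent.detach().requires_grad_(True)
            denoise_step(latent, timestep)          # forward pass fills the attention cache
            attn = get_attn_maps()
            loss = (placement_weight * placement_loss(attn, token_idx, region_mask)
                    + presence_weight * presence_loss(attn, object_token_indices))
            grad, = torch.autograd.grad(loss, latent)
        latent = (latent - step_size * grad).detach()  # nudge toward lower spatial loss
    return latent                                      # scheduler step proceeds from here
```

In a full sampling loop, this correction would be applied to the latent at every timestep just before the scheduler's usual update, so the pre‑trained UNet weights and the sampler itself remain untouched.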
Results & Findings
| Benchmark | Metric | Prior best | InfSplign |
|---|---|---|---|
| VISOR (spatial alignment) | mIoU (higher is better) | 0.42 | 0.58 |
| T2I‑CompBench (compositional fidelity) | CLIP‑Score (higher is better) | 0.71 | 0.78 |
| Runtime per image | seconds (lower is better) | 1.0 | 1.05 |
- Spatial alignment improves by 0.16 mIoU absolute (roughly a 38 % relative gain) over the strongest inference‑time baseline.
- Even fine‑tuned methods that modify the diffusion weights fall short of InfSplign’s performance, highlighting the power of targeted noise steering.
- Qualitative examples show markedly better object ordering (“dog left of cat”) and fewer missing elements.
Practical Implications
- Developers of generative UI tools can integrate InfSplign as a drop‑in module to give end‑users more reliable control over layout without retraining large models.
- Content pipelines (e.g., game asset generation, advertising) benefit from higher compositional accuracy, reducing manual post‑editing.
- Low‑resource environments (edge devices, inference‑as‑a‑service) can adopt InfSplign because it adds only a modest compute overhead and does not require storing extra fine‑tuned checkpoints.
- Prompt engineering becomes less brittle: developers can rely on spatial keywords (“above”, “next to”) with confidence that the model will respect them.
Limitations & Future Work
- The method assumes that the underlying diffusion model already learns reasonable cross‑attention maps; extremely noisy or under‑trained backbones may limit effectiveness.
- Very complex scenes with many overlapping spatial constraints can still produce ambiguous layouts; scaling the loss to handle higher‑order relations is an open challenge.
- Future research could explore adaptive loss weighting per diffusion step or incorporate semantic segmentation cues to further tighten spatial fidelity.
InfSplign demonstrates that a smart, lightweight tweak at inference time can close a long‑standing gap in text‑to‑image generation, opening the door for more controllable, production‑ready generative pipelines.
Authors
- Sarah Rastegar
- Violeta Chatalbasheva
- Sieger Falkena
- Anuj Singh
- Yanbo Wang
- Tejas Gokhale
- Hamid Palangi
- Hadi Jamali‑Rad
Paper Information
- arXiv ID: 2512.17851v1
- Categories: cs.CV, cs.AI
- Published: December 19, 2025
- PDF: https://arxiv.org/pdf/2512.17851v1