[Paper] InfSplign: Inference-Time Spatial Alignment of Text-to-Image Diffusion Models

Published: December 19, 2025 at 12:52 PM EST
3 min read
Source: arXiv - 2512.17851v1

Overview

Recent text‑to‑image diffusion models can produce photorealistic pictures, yet they still stumble when a prompt demands precise spatial relationships (e.g., “a cat on the left of a dog”). InfSplign tackles this gap with a training‑free, inference‑time technique that nudges the diffusion process toward better object placement, without touching the original model weights.

Key Contributions

  • Plug‑and‑play inference module: Works with any pre‑trained diffusion backbone (Stable Diffusion, DALL·E‑2, etc.) and requires no extra training data.
  • Compound spatial loss: Combines multi‑scale cross‑attention cues to (1) align objects to their described locations and (2) keep the overall object count balanced during sampling (an illustrative formulation is sketched just after this list).
  • Step‑wise noise adjustment: The loss is applied at every denoising step, subtly reshaping the latent noise trajectory toward spatially coherent outputs.
  • State‑of‑the‑art results: Outperforms both existing inference‑time tricks and fine‑tuned baselines on the VISOR and T2I‑CompBench benchmarks.
  • Open‑source implementation: Fully released on GitHub, enabling immediate experimentation.
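
To make the compound-loss idea concrete, here is one possible formulation of the two terms. The notation is an illustrative sketch, not the paper's exact definitions: A^{(s)}_k is the cross‑attention map of text token k at resolution scale s (multi‑scale cues are summed over s), M_k is the image region described for token k (e.g., the left half for "left"), and the λ's are weighting hyperparameters.

```latex
% Illustrative compound spatial loss; notation is a sketch, not the paper's exact formulation.
\begin{align*}
\mathcal{L}_{\mathrm{place}} &= \sum_{s}\ \sum_{k \in \mathcal{K}_{\mathrm{spatial}}}
  \Biggl(1 - \frac{\sum_{p \in M_k} A^{(s)}_{k}(p)}{\sum_{p} A^{(s)}_{k}(p)}\Biggr)
  && \text{(attention should land in the described region)} \\
\mathcal{L}_{\mathrm{pres}} &= \operatorname{Var}_{k \in \mathcal{K}_{\mathrm{obj}}}
  \Bigl(\sum_{s}\sum_{p} A^{(s)}_{k}(p)\Bigr)
  && \text{(object tokens should get comparable total attention)} \\
\mathcal{L}_{\mathrm{total}} &= \lambda_{\mathrm{place}}\,\mathcal{L}_{\mathrm{place}}
  + \lambda_{\mathrm{pres}}\,\mathcal{L}_{\mathrm{pres}}
\end{align*}
```

In this reading, the placement term pushes each spatial token's attention mass into its target region, while the presence term penalizes large imbalances in total attention across object tokens, which is what discourages missing or duplicated objects during sampling.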

Methodology

  1. Cross‑attention extraction – During diffusion, the decoder’s cross‑attention maps (which link text tokens to image patches) are harvested at several resolution levels.
  2. Spatial loss formulation
    • Placement loss: Encourages high attention for a token (e.g., “left”) to appear in the corresponding image region.
    • Presence loss: Ensures that each object token receives roughly the same total attention across the whole image, preventing missing or duplicated objects.
  3. Noise correction – At each denoising step, the current latent is updated with a gradient step on the combined loss, effectively “steering” the diffusion trajectory toward a spatially faithful solution (a toy sketch of this loop follows the list).
  4. No model re‑training – All operations happen at inference, so the original diffusion weights stay untouched, making the approach lightweight (≈ 5 % overhead) and universally applicable.
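
The loop below is a minimal, runnable toy sketch of steps 1–3, assuming a made-up toy_unet that returns a noise prediction plus per-token attention maps; the region mask, token indices, and update rule are likewise illustrative stand-ins, not the authors' implementation.

```python
# Toy sketch of inference-time spatial guidance: at each denoising step, take a
# gradient step on the latent w.r.t. a placement + presence loss computed from
# cross-attention maps. Everything here is a stand-in for illustration only.
import torch

def toy_unet(latent, text_emb):
    """Stand-in for a diffusion UNet: returns (noise_pred, attention maps of shape (tokens, H, W))."""
    noise_pred = 0.1 * latent                                    # fake noise prediction
    attn_logits = torch.einsum("td,dhw->thw", text_emb, latent)  # toy text-to-patch affinities
    attn = torch.softmax(attn_logits, dim=0)                     # each pixel attends over tokens
    return noise_pred, attn

def placement_loss(attn, token_idx, region_mask):
    # Reward attention mass of a spatial token that falls inside its target region.
    a = attn[token_idx]
    return 1.0 - (a * region_mask).sum() / (a.sum() + 1e-8)

def presence_loss(attn, object_token_idx):
    # Keep total attention per object token balanced, discouraging missing objects.
    totals = attn[object_token_idx].flatten(1).sum(dim=-1)
    return totals.var()

latent = torch.randn(4, 32, 32, requires_grad=True)     # toy latent (C, H, W)
text_emb = torch.randn(8, 4)                            # 8 prompt tokens, toy embedding dim 4
left_half = torch.zeros(32, 32)
left_half[:, :16] = 1.0                                 # region mask for "left"
obj_tokens = torch.tensor([1, 4])                       # toy indices of the two object tokens

step_size = 0.05
for _ in range(10):                                     # toy 10-step sampler
    noise_pred, attn = toy_unet(latent, text_emb)
    loss = placement_loss(attn, token_idx=1, region_mask=left_half) \
           + 0.5 * presence_loss(attn, obj_tokens)
    (grad,) = torch.autograd.grad(loss, latent)
    with torch.no_grad():
        latent = latent - step_size * grad              # steer toward spatial fidelity
        latent = latent - 0.1 * noise_pred              # placeholder denoising update
    latent.requires_grad_(True)
```

In a real pipeline, the stand-ins would be replaced by the actual UNet forward pass (with attention maps hooked at several resolutions) and the scheduler's denoising update, which is how the method keeps the reported few-percent overhead while leaving the weights untouched.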

Results & Findings

| Benchmark | Metric | Prior best | InfSplign |
| --- | --- | --- | --- |
| VISOR (spatial alignment) | mIoU (higher = better) | 0.42 | 0.58 |
| T2I‑CompBench (compositional fidelity) | CLIP‑Score (higher = better) | 0.71 | 0.78 |
| Runtime (per image) | seconds (lower = better) | 1.0 | 1.05 |
  • Spatial alignment mIoU improves from 0.42 to 0.58 over the strongest inference‑time baseline, a 0.16 absolute gain.
  • Even fine‑tuned methods that modify the diffusion weights fall short of InfSplign’s performance, highlighting the power of targeted noise steering.
  • Qualitative examples show markedly better object ordering (“dog left of cat”) and fewer missing elements.

Practical Implications

  • Developers of generative UI tools can integrate InfSplign as a drop‑in module to give end‑users more reliable control over layout without retraining large models.
  • Content pipelines (e.g., game asset generation, advertising) benefit from higher compositional accuracy, reducing manual post‑editing.
  • Low‑resource environments (edge devices, inference‑as‑a‑service) can adopt InfSplign because it adds only a modest compute overhead and does not require storing extra fine‑tuned checkpoints.
  • Prompt engineering becomes less brittle: developers can rely on spatial keywords (“above”, “next to”) with confidence that the model will respect them.

Limitations & Future Work

  • The method assumes that the underlying diffusion model already learns reasonable cross‑attention maps; extremely noisy or under‑trained backbones may limit effectiveness.
  • Very complex scenes with many overlapping spatial constraints can still produce ambiguous layouts; scaling the loss to handle higher‑order relations is an open challenge.
  • Future research could explore adaptive loss weighting per diffusion step or incorporate semantic segmentation cues to further tighten spatial fidelity.

InfSplign demonstrates that a smart, lightweight tweak at inference time can close a long‑standing gap in text‑to‑image generation, opening the door for more controllable, production‑ready generative pipelines.

Authors

  • Sarah Rastegar
  • Violeta Chatalbasheva
  • Sieger Falkena
  • Anuj Singh
  • Yanbo Wang
  • Tejas Gokhale
  • Hamid Palangi
  • Hadi Jamali‑Rad

Paper Information

  • arXiv ID: 2512.17851v1
  • Categories: cs.CV, cs.AI
  • Published: December 19, 2025