[Paper] Accelerating Text-to-Video Generation with Calibrated Sparse Attention

Published: March 5, 2026 at 01:59 PM EST
5 min read
Source: arXiv

Overview

A new paper tackles one of the biggest pain points in modern text‑to‑video generation: the sluggish inference speed of diffusion‑based models. By observing that many attention connections are essentially “dead weight,” the authors devise a training‑free technique called CalibAtt that prunes these connections on the fly, delivering up to 1.58× faster video synthesis without sacrificing visual quality or text‑video alignment.

Key Contributions

  • Empirical discovery of stable sparsity: Across multiple diffusion models and prompts, a large fraction of token‑to‑token attention scores are consistently near‑zero and follow repeatable block‑level patterns.
  • Calibrated Sparse Attention (CalibAtt): A two‑stage pipeline—offline calibration to learn reusable sparsity masks, followed by an optimized runtime that only computes the “important” attention links.
  • Hardware‑aware implementation: The method compiles layer‑, head‑, and timestep‑specific sparse kernels that fit modern GPU/CPU memory hierarchies, avoiding costly dynamic masking.
  • Broad evaluation: Tested on Wan 2.1 14B, Mochi 1, and several distilled few‑step models at 256‑720p resolutions, showing up to 1.58× end‑to‑end speedup while preserving FVD, CLIP‑Score, and human preference metrics.
  • Training‑free advantage: No extra fine‑tuning or data‑dependent retraining is required, making the approach instantly applicable to existing pipelines.
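The stable-sparsity observation can be illustrated with a toy measurement. The sketch below (assuming numpy; the 8×8 block size and the negligibility threshold `eps` are illustrative choices, not values from the paper) computes an attention matrix and reports what fraction of block-level tiles carry negligible total mass:

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def block_sparsity(attn, block=8, eps=1e-4):
    """Fraction of block x block tiles whose total attention mass
    falls below eps per entry. Illustrative helper, not the paper's code."""
    n = attn.shape[0]
    nb = n // block
    tiles = attn[:nb * block, :nb * block].reshape(nb, block, nb, block)
    mass = tiles.sum(axis=(1, 3))              # total mass per tile
    return float((mass < eps * block * block).mean())

rng = np.random.default_rng(0)
q = rng.standard_normal((64, 32))
k = rng.standard_normal((64, 32))
attn = softmax(q @ k.T / np.sqrt(32))
print(block_sparsity(attn))                    # fraction of negligible tiles
```

In the paper's setting this measurement is repeated per layer, head, and diffusion timestep; the finding is that the negligible tiles recur in the same positions across prompts.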

Methodology

  1. Sparsity Analysis – The authors run a diagnostic pass on a set of representative video prompts, recording attention matrices for every transformer layer, head, and diffusion timestep. They identify block‑level groups of tokens (e.g., 8×8 spatial patches across time) that repeatedly receive negligible attention weights.
  2. Calibration Pass – Using the diagnostic data, a static mask is generated for each layer/head/timestep that marks which token pairs can be safely ignored. The masks are calibrated to guarantee that the cumulative attention mass loss stays below a tiny threshold (e.g., 0.1 %).
  3. Compiled Sparse Kernels – The masks are baked into custom CUDA kernels (or equivalent CPU kernels) that skip the computation of the masked connections while still performing dense attention for the selected ones. Because the pattern is fixed per timestep, the kernels can be heavily optimized (e.g., using shared memory tiling).
  4. Inference – At generation time, the model runs exactly as before, except the attention operation now consults the pre‑compiled sparse kernels. No extra forward‑pass overhead is introduced, and the diffusion process proceeds unchanged.

The whole pipeline requires no retraining, no extra data, and only a one‑time offline calibration step that takes a few minutes on a single GPU.
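The calibration pass (step 2) can be sketched as a greedy mass-budget procedure: drop the lowest-mass blocks first, stopping before the cumulative dropped mass exceeds the threshold. This is a minimal numpy sketch; the function name, block size, and the single-matrix input are assumptions, since the paper calibrates separate masks per layer, head, and timestep:

```python
import numpy as np

def calibrate_mask(avg_attn, block=8, max_loss=0.001):
    """Build a static block mask whose dropped attention mass stays
    below max_loss (e.g., 0.1%) of the total. Illustrative sketch."""
    n = avg_attn.shape[0]
    nb = n // block
    tiles = avg_attn[:nb * block, :nb * block].reshape(nb, block, nb, block)
    mass = tiles.sum(axis=(1, 3)).ravel()      # mass per block
    order = np.argsort(mass)                   # drop smallest blocks first
    total = mass.sum()
    dropped = 0.0
    keep = np.ones(nb * nb, dtype=bool)
    for idx in order:
        if dropped + mass[idx] > max_loss * total:
            break                              # budget exhausted
        dropped += mass[idx]
        keep[idx] = False
    return keep.reshape(nb, nb)                # True = compute this block

rng = np.random.default_rng(1)
q = rng.standard_normal((64, 32))
k = rng.standard_normal((64, 32))
attn = np.exp(q @ k.T / np.sqrt(32))
attn /= attn.sum(axis=-1, keepdims=True)
mask = calibrate_mask(attn)
print(mask.shape, mask.mean())                 # (8, 8) and kept fraction
```

Because the resulting mask is fixed per layer/head/timestep, it can be baked into the compiled kernels of step 3 rather than evaluated dynamically.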

Results & Findings

| Model / Resolution | Baseline (FPS) | CalibAtt (FPS) | Speed‑up | FVD Δ (%) | CLIP‑Score Δ |
| --- | --- | --- | --- | --- | --- |
| Wan 2.1 14B @ 256p | 1.2 | 1.9 | 1.58× | –0.3 | +0.01 |
| Mochi 1 @ 480p | 0.8 | 1.3 | 1.62× | –0.4 | +0.02 |
| Distilled 8‑step @ 720p | 0.5 | 0.78 | 1.56× | –0.2 | +0.00 |
  • Quality preservation: Across all benchmarks, the visual fidelity (measured by Fréchet Video Distance) and text‑video alignment (CLIP‑Score) remain statistically indistinguishable from the dense baseline.
  • Robustness to prompts: The sparsity patterns hold for diverse textual inputs (e.g., “a cat dancing in rain” vs. “a futuristic cityscape”), confirming that the calibration is not over‑fitted to a narrow prompt set.
  • Comparison to other training‑free tricks: Methods like static token pruning or low‑rank approximations achieve only ~1.2× speedup and often degrade quality; CalibAtt consistently outperforms them.

Practical Implications

  • Faster prototyping: Developers can now iterate on text‑to‑video ideas in near‑real time, dramatically reducing the feedback loop for creative tools, ad‑generation platforms, or game asset pipelines.
  • Cost savings in production: Cloud inference charges are directly proportional to GPU time; a 1.5× speedup translates to ~30 % lower compute bills for large‑scale video generation services.
  • Edge‑friendly deployments: Because the sparsity masks are static, they can be baked into lightweight inference runtimes, opening the door to on‑device video synthesis on high‑end mobiles or workstations without sacrificing latency.
  • Compatibility: Since CalibAtt works as a drop‑in replacement for the attention operator, existing diffusion pipelines (e.g., Diffusers, OpenAI’s video models) can adopt it with minimal code changes.
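The "drop-in replacement" framing can be made concrete with a dense reference implementation of the masked operator. The sketch below (numpy; function name and the mask construction are illustrative) sets the logits of masked-out blocks to −inf before the softmax, which is numerically what skipping those blocks achieves; a real deployment would instead skip them inside a fused kernel:

```python
import numpy as np

def sparse_attention(q, k, v, block_mask, block=8):
    """Dense reference of block-masked attention. Masked blocks get
    -inf logits, so they contribute zero weight after softmax."""
    n, d = q.shape
    logits = q @ k.T / np.sqrt(d)
    # Expand the (nb, nb) block mask to full (n, n) resolution.
    full = np.repeat(np.repeat(block_mask, block, axis=0), block, axis=1)
    logits = np.where(full[:n, :n], logits, -np.inf)
    w = np.exp(logits - logits.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    return w @ v

rng = np.random.default_rng(2)
q, k, v = (rng.standard_normal((64, 32)) for _ in range(3))
# Keep diagonal blocks so every query attends to something.
mask = np.eye(8, dtype=bool) | (rng.random((8, 8)) < 0.3)
out = sparse_attention(q, k, v, mask)
print(out.shape)                               # (64, 32)
```

Swapping this operator in place of a pipeline's dense attention leaves shapes and semantics unchanged, which is why no other code needs to be touched.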

Limitations & Future Work

  • Calibration dependency: The offline sparsity masks are derived from a representative prompt set; extreme out‑of‑distribution prompts could expose hidden dense connections, potentially harming quality.
  • Memory layout constraints: The current implementation assumes block‑aligned token layouts; models that use irregular tokenization (e.g., variable‑size patches) may need additional engineering.
  • Scalability to ultra‑high resolution: While the paper shows results up to 720p, the block‑level sparsity may diminish at 4K resolutions where long‑range temporal dependencies become more critical.
  • Future directions: Extending CalibAtt to dynamic, data‑dependent sparsity (e.g., using a lightweight predictor at inference) could capture rare but important connections while retaining most of the speed gains. The authors also suggest exploring joint calibration with quantization or mixed‑precision techniques for even larger throughput improvements.

Authors

  • Shai Yehezkel
  • Shahar Yadin
  • Noam Elata
  • Yaron Ostrovsky-Berman
  • Bahjat Kawar

Paper Information

  • arXiv ID: 2603.05503v1
  • Categories: cs.CV
  • Published: March 5, 2026