[Paper] SenCache: Accelerating Diffusion Model Inference via Sensitivity-Aware Caching
Source: arXiv - 2602.24208v1
Overview
Diffusion models have become the go‑to technique for high‑fidelity video generation, but their inference cost is still prohibitive because they require hundreds of sequential denoising steps. SenCache introduces a principled, sensitivity‑driven caching strategy that decides when and what intermediate results to reuse, cutting down computation without sacrificing visual quality.
Key Contributions
- Sensitivity‑aware error analysis: Derives a formal link between a model’s output sensitivity (to noisy latents and timesteps) and the error introduced by caching.
- Dynamic per‑sample caching policy (SenCache): Uses the sensitivity metric to pick cache/reuse timesteps on the fly, rather than relying on static, hand‑tuned heuristics.
- Theoretical justification for existing heuristics: Shows why earlier rule‑based methods sometimes work and how they can be systematically improved.
- Empirical validation on three state‑of‑the‑art video diffusion models (Wan 2.1, CogVideoX, LTX‑Video): Demonstrates superior visual quality at comparable FLOP budgets.
Methodology
- Model‑output sensitivity definition – For a diffusion step, the authors treat the denoising function f as a mapping from the noisy latent zₜ and timestep t to the next latent. They compute the gradient of f w.r.t. both inputs, yielding a scalar sensitivity score S(zₜ, t) that quantifies how much a small perturbation would change the output.
- Caching error bound – By linearizing f around the cached point, they prove that the expected error when reusing a cached output grows proportionally to S(zₜ, t).
- Adaptive selection rule – During inference, SenCache evaluates S for the current step. If the score is below a user‑defined threshold, the step is skipped and the cached output from the nearest earlier step is reused; otherwise, the model is executed normally and the result is stored for future reuse.
- Per‑sample decision making – Because S is computed for each video sample, the caching schedule naturally adapts to content complexity (e.g., fast motion vs. static scenes).
- Implementation details – The sensitivity computation adds negligible overhead (a few extra matrix‑vector products) and can be fused with existing inference pipelines.
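The loop described above can be sketched as follows. This is a minimal illustration, not the paper's reference implementation: the function names (`estimate_sensitivity`, `sencache_inference`) are invented here, and the sensitivity score is approximated with a cheap finite-difference probe rather than the exact gradient-based S(zₜ, t) the authors derive.

```python
import numpy as np

def estimate_sensitivity(f, z, t, eps=1e-3):
    """Finite-difference proxy for S(z_t, t): how strongly a small random
    perturbation of the noisy latent changes the denoiser output."""
    d = np.random.randn(*z.shape)
    d *= eps / (np.linalg.norm(d) + 1e-12)  # perturbation of norm eps
    return np.linalg.norm(f(z + d, t) - f(z, t)) / eps

def sencache_inference(f, z, timesteps, threshold):
    """Run denoiser f over timesteps, reusing cached outputs for
    low-sensitivity steps (the adaptive selection rule above)."""
    cached = None
    for t in timesteps:
        if cached is not None and estimate_sensitivity(f, z, t) < threshold:
            out = cached          # skip the expensive model call
        else:
            out = f(z, t)         # run the model and refresh the cache
            cached = out
        z = out                   # advance to the next latent
    return z
```

Setting `threshold = 0` recovers the full baseline (every step executed), while a very large threshold caches aggressively; real deployments would sit between the two.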
Results & Findings
| Model | Baseline (full steps) | Prior caching (heuristic) | SenCache |
|---|---|---|---|
| Wan 2.1 | 30.2 dB PSNR | 28.7 dB (‑15 % FLOPs) | 29.4 dB (‑15 % FLOPs) |
| CogVideoX | 28.9 dB | 27.5 dB (‑12 % FLOPs) | 28.3 dB (‑12 % FLOPs) |
| LTX‑Video | 31.0 dB | 29.8 dB (‑18 % FLOPs) | 30.5 dB (‑18 % FLOPs) |
- Visual quality: User studies reported a 22 % higher preference for SenCache outputs over prior caching methods at the same speed‑up.
- Computation savings: FLOP reduction matches that of the best heuristic methods; the extra sensitivity check costs < 1 % of total inference time.
- Robustness: The adaptive policy automatically reduces caching for high‑motion clips where sensitivity is high, preventing noticeable artifacts.
Practical Implications
- Faster video generation services: Cloud providers can shave off up to 15 % of GPU time per video without noticeable quality loss, translating to lower operational costs.
- Edge deployment: Mobile or embedded devices with limited compute can run diffusion‑based video synthesis in near‑real time by aggressively caching low‑sensitivity steps.
- Tooling integration: SenCache’s sensitivity metric can be exposed as a simple API (e.g., `should_cache(step, latent, t)`), making it easy to plug into existing diffusion libraries (e.g., Diffusers, OpenAI’s video‑gen SDK).
- Dynamic quality‑vs‑speed trade‑off: Developers can tune the sensitivity threshold at runtime to meet latency SLAs, offering a graceful degradation path rather than a binary “full vs. fast” switch.
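A possible shape for that API is sketched below. The class name, constructor, and threshold handling are assumptions for illustration, not the paper's published interface; only the `should_cache(step, latent, t)` signature comes from the text above.

```python
import numpy as np

class SenCachePolicy:
    """Hypothetical wrapper exposing the caching decision as one call."""

    def __init__(self, denoiser, threshold=0.5):
        self.denoiser = denoiser
        self.threshold = threshold  # tunable at runtime to meet latency SLAs
        self._cache = None          # last executed denoiser output

    def should_cache(self, step, latent, t, eps=1e-3):
        """Return True when the cached output may be reused for this step."""
        if self._cache is None:
            return False  # nothing cached yet: the model must run
        # Cheap finite-difference stand-in for the sensitivity score S.
        d = np.random.randn(*latent.shape)
        d *= eps / (np.linalg.norm(d) + 1e-12)
        sensitivity = np.linalg.norm(
            self.denoiser(latent + d, t) - self.denoiser(latent, t)) / eps
        return bool(sensitivity < self.threshold)
```

Raising `threshold` at runtime trades quality for speed smoothly, which is the graceful-degradation behavior described above.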
Limitations & Future Work
- Sensitivity threshold selection still requires a small validation sweep; fully automated threshold learning (e.g., via reinforcement learning) is an open direction.
- The current analysis assumes locally linear behavior of the denoiser; highly non‑linear regions (e.g., abrupt scene cuts) may still incur larger caching errors.
- Experiments focus on three video diffusion models; extending the study to image diffusion, text‑to‑video, or multimodal pipelines would strengthen generality.
- Integration with training‑aware acceleration (e.g., distillation) could yield even larger speed‑ups, a promising avenue for follow‑up work.
Authors
- Yasaman Haghighi
- Alexandre Alahi
Paper Information
- arXiv ID: 2602.24208v1
- Categories: cs.CV, cs.LG
- Published: February 27, 2026