[Paper] SenCache: Accelerating Diffusion Model Inference via Sensitivity-Aware Caching

Published: February 27, 2026
4 min read
Source: arXiv - 2602.24208v1

Overview

Diffusion models have become the go‑to technique for high‑fidelity video generation, but their inference cost is still prohibitive because they require hundreds of sequential denoising steps. SenCache introduces a principled, sensitivity‑driven caching strategy that decides when and what intermediate results to reuse, cutting down computation without sacrificing visual quality.

Key Contributions

  • Sensitivity‑aware error analysis: Derives a formal link between a model’s output sensitivity (to noisy latents and timesteps) and the error introduced by caching.
  • Dynamic per‑sample caching policy (SenCache): Uses the sensitivity metric to pick cache/reuse timesteps on the fly, rather than relying on static, hand‑tuned heuristics.
  • Theoretical justification for existing heuristics: Shows why earlier rule‑based methods sometimes work and how they can be systematically improved.
  • Empirical validation on three state‑of‑the‑art video diffusion models (Wan 2.1, CogVideoX, LTX‑Video): Demonstrates superior visual quality at comparable FLOP budgets.

Methodology

  1. Model‑output sensitivity definition – For a diffusion step, the authors treat the denoising function f as a mapping from the noisy latent zₜ and timestep t to the next latent. They compute the gradient of f w.r.t. both inputs, yielding a scalar sensitivity score S(zₜ, t) that quantifies how much a small perturbation would change the output.
  2. Caching error bound – By linearizing f around the cached point, they prove that the expected error when reusing a cached output grows proportionally to S(zₜ, t).
  3. Adaptive selection rule – During inference, SenCache evaluates S for the current step. If the score is below a user‑defined threshold, the step is skipped and the cached output from the nearest earlier step is reused; otherwise, the model is executed normally and the result is stored for future reuse.
  4. Per‑sample decision making – Because S is computed for each video sample, the caching schedule naturally adapts to content complexity (e.g., fast motion vs. static scenes).
  5. Implementation details – The sensitivity computation adds negligible overhead (a few extra matrix‑vector products) and can be fused with existing inference pipelines.
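The adaptive loop described in steps 1–4 can be sketched as follows. This is an illustrative stand-in, not the authors' implementation: toy_denoiser is a placeholder for the real denoising network, and the finite-difference estimate of S(zₜ, t) replaces the paper's gradient-based score.

```python
import numpy as np

def toy_denoiser(z, t):
    # Placeholder for the real denoising network: any smooth map of (z, t) works.
    return np.tanh(z * (1.0 - t)) + 0.1 * t

def sensitivity(f, z, t, eps=1e-3):
    # Crude finite-difference proxy for the paper's score S(z_t, t):
    # directional change of f w.r.t. the latent plus its change w.r.t. t.
    v = np.random.default_rng(0).standard_normal(z.shape)
    v /= np.linalg.norm(v)
    dz = np.linalg.norm(f(z + eps * v, t) - f(z, t)) / eps
    dt = np.linalg.norm(f(z, t + eps) - f(z, t)) / eps
    return dz + dt

def sencache_inference(f, z, timesteps, threshold):
    """Run the denoising loop, reusing cached outputs at low-sensitivity steps."""
    cached = None
    evals = 0
    for t in timesteps:
        if cached is not None and sensitivity(f, z, t) < threshold:
            out = cached          # low sensitivity: reuse the cached output
        else:
            out = f(z, t)         # high sensitivity: run the model, refresh cache
            cached = out
            evals += 1
        z = out
    return z, evals
```

With threshold = 0 every step executes the model; a very large threshold collapses the schedule to a single evaluation, so the knob interpolates between full quality and maximum reuse.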

Results & Findings

| Model | Baseline PSNR (full steps) | Prior caching (heuristic) | SenCache |
| --- | --- | --- | --- |
| Wan 2.1 | 30.2 dB | 28.7 dB (−15 % FLOPs) | 29.4 dB (−15 % FLOPs) |
| CogVideoX | 28.9 dB | 27.5 dB (−12 % FLOPs) | 28.3 dB (−12 % FLOPs) |
| LTX‑Video | 31.0 dB | 29.8 dB (−18 % FLOPs) | 30.5 dB (−18 % FLOPs) |
  • Visual quality: User studies reported a 22 % higher preference for SenCache outputs over prior caching methods at the same speed‑up.
  • Computation savings: FLOP reduction matches that of the best heuristic methods; the extra sensitivity check costs < 1 % of total inference time.
  • Robustness: The adaptive policy automatically reduces caching for high‑motion clips where sensitivity is high, preventing noticeable artifacts.

Practical Implications

  • Faster video generation services: Cloud providers can shave off up to 15 % of GPU time per video without noticeable quality loss, translating to lower operational costs.
  • Edge deployment: Mobile or embedded devices with limited compute can run diffusion‑based video synthesis in near‑real time by aggressively caching low‑sensitivity steps.
  • Tooling integration: SenCache’s sensitivity metric can be exposed as a simple API (should_cache(step, latent, t)), making it easy to plug into existing diffusion libraries (e.g., Diffusers, OpenAI’s video‑gen SDK).
  • Dynamic quality‑vs‑speed trade‑off: Developers can tune the sensitivity threshold at runtime to meet latency SLAs, offering a graceful degradation path rather than a binary “full vs. fast” switch.
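A minimal sketch of how such an API could look; SenCachePolicy and its field names are hypothetical and not part of the paper or any existing library:

```python
from dataclasses import dataclass

@dataclass
class SenCachePolicy:
    # Hypothetical wrapper around the decision rule. `threshold` is the
    # runtime knob: raising it trades quality for speed, and it can be
    # adjusted on the fly to meet a latency SLA.
    threshold: float

    def should_cache(self, sensitivity_score: float) -> bool:
        # Reuse the cached output when the current step is insensitive enough.
        return sensitivity_score < self.threshold
```

Because the policy is just a threshold comparison, a serving system can mutate threshold per request (or per tenant) without reloading the model.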

Limitations & Future Work

  • Sensitivity threshold selection still requires a small validation sweep; fully automated threshold learning (e.g., via reinforcement learning) is an open direction.
  • The current analysis assumes locally linear behavior of the denoiser; highly non‑linear regions (e.g., abrupt scene cuts) may still incur larger caching errors.
  • Experiments focus on three video diffusion models; extending the study to image diffusion, text‑to‑video, or multimodal pipelines would strengthen generality.
  • Integration with training‑aware acceleration (e.g., distillation) could yield even larger speed‑ups, a promising avenue for follow‑up work.

Authors

  • Yasaman Haghighi
  • Alexandre Alahi

Paper Information

  • arXiv ID: 2602.24208v1
  • Categories: cs.CV, cs.LG
  • Published: February 27, 2026