[Paper] DiffusionBrowser: Interactive Diffusion Previews via Multi-Branch Decoders
Source: arXiv - 2512.13690v1
Overview
DiffusionBrowser introduces a lightweight, model‑agnostic decoder that lets users peek inside the denoising steps of video diffusion models and even steer the generation on the fly. By producing high‑fidelity RGB previews and scene‑intrinsic maps in under a second for a 4‑second clip, it turns a traditionally opaque, slow process into an interactive experience.
Key Contributions
- Interactive preview decoder: A multi‑branch decoder that can generate RGB frames and auxiliary modalities (depth, segmentation, optical flow, etc.) from any intermediate timestep or transformer block.
- Real‑time performance: Achieves >4× faster‑than‑real‑time preview generation (≈ 0.2 s per second of video).
- Stochasticity reinjection & modal steering: Enables users to re‑introduce randomness or bias specific modalities (e.g., depth) at intermediate steps, giving fine‑grained control over the final video.
- Model‑agnostic design: Works with any pretrained video diffusion backbone without retraining the backbone itself.
- Interpretability toolkit: Uses the learned decoders to probe how scene layout, object identity, and motion are progressively assembled during denoising.
Methodology
- Base diffusion model – The authors start with any off‑the‑shelf video diffusion model (e.g., Imagen‑Video, Make‑A‑Video) that iteratively denoises a noisy latent sequence.
- Multi‑branch decoder – A small, trainable network is attached to the diffusion backbone. It receives the hidden states from a chosen timestep or transformer layer and simultaneously predicts:
- RGB frames (the visual preview)
- Scene intrinsics such as depth, semantic masks, and optical flow.
The decoder is trained with a lightweight supervision loss that aligns its outputs with the ground‑truth video and its derived modalities (a minimal sketch of such a decoder appears after this list).
- Interactive loop – During inference, a user can pause the diffusion process at any step, request a preview from the decoder, and optionally modify the latent (e.g., add noise back or inject a depth cue). The diffusion then continues from the altered state (see the loop sketch after this list).
- Probing analysis – By extracting decoder outputs across timesteps, the authors visualize how high‑level concepts (objects, layout) emerge, offering a new lens into the black‑box denoising dynamics.
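The decoder is only described here at the architectural level, so the following PyTorch sketch shows one plausible way to organize such a multi-branch decoder. The class name PreviewDecoder, the hidden dimension, the trunk depth, and the particular set of heads and channel counts are illustrative assumptions, not the authors' actual design.

```python
# Minimal sketch of a multi-branch preview decoder (assumptions: PyTorch, a
# shared convolutional trunk over backbone hidden states, one light head per
# modality; the paper's real architecture and losses may differ).
import torch
import torch.nn as nn

class PreviewDecoder(nn.Module):
    def __init__(self, hidden_dim=1024, out_channels=None):
        super().__init__()
        # Modalities and channel counts here are illustrative choices.
        out_channels = out_channels or {"rgb": 3, "depth": 1, "segmentation": 32, "flow": 2}
        self.trunk = nn.Sequential(
            nn.Conv2d(hidden_dim, 256, kernel_size=3, padding=1),
            nn.GELU(),
            nn.Conv2d(256, 256, kernel_size=3, padding=1),
            nn.GELU(),
        )
        # One small prediction head per modality, all sharing the trunk features.
        self.heads = nn.ModuleDict(
            {name: nn.Conv2d(256, ch, kernel_size=1) for name, ch in out_channels.items()}
        )

    def forward(self, hidden_states):
        # hidden_states: (batch * frames, hidden_dim, H, W) features taken from
        # a chosen diffusion timestep or transformer block.
        feats = self.trunk(hidden_states)
        return {name: head(feats) for name, head in self.heads.items()}
```

Each head would be supervised against the ground‑truth video and its derived modalities (e.g., regression losses for RGB, depth, and flow, a classification loss for segmentation); the exact loss weighting is not spelled out in this summary.

The interactive loop can likewise be sketched as a callback wrapped around the sampling loop. Everything backbone-facing below (denoise_step, hidden_states, scheduler.add_noise, and the on_preview callback contract) is a hypothetical interface standing in for whatever the real pipeline exposes.

```python
import torch

@torch.no_grad()
def interactive_generate(backbone, scheduler, decoder, latents, timesteps, on_preview=None):
    """Run reverse diffusion, pausing after each step to offer a preview."""
    for i, t in enumerate(timesteps):
        # One reverse-diffusion (denoising) step on the latent video.
        latents = backbone.denoise_step(latents, t)

        # Decode a fast preview (RGB plus depth/segmentation/flow) from the
        # backbone's intermediate hidden states.
        preview = decoder(backbone.hidden_states)

        if on_preview is None:
            continue
        action = on_preview(step=i, timestep=t, preview=preview) or {}

        if action.get("reinject_noise"):
            # Stochasticity reinjection: push the latent back toward noise so
            # later steps can re-explore scene composition.
            noise = torch.randn_like(latents)
            latents = scheduler.add_noise(latents, noise, t)

        if "edited_latents" in action:
            # Modal steering: the UI maps a user edit (e.g., a depth cue) back
            # into latent space and hands back an altered latent to resume from.
            latents = action["edited_latents"]

    return latents
```

A UI would implement on_preview to display the preview dictionary and return either nothing (continue), a noise-reinjection request, or an edited latent.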
Results & Findings
- Speed: The decoder renders a preview of a 4‑second clip in under 1 s, more than 4× faster than real time and far quicker than waiting for the original diffusion model to finish the full denoising run.
- Quality: Preview frames preserve the color palette, motion trajectories, and coarse geometry of the final output, with an average LPIPS reduction of 0.12 relative to the full‑resolution video.
- Control: Stochasticity reinjection at early steps can dramatically alter scene composition, while modal steering (e.g., fixing depth) preserves the layout but still allows style changes (a hedged formulation of the reinjection step follows this list).
- Interpretability: Visualizations show that scene layout (depth, segmentation) solidifies early (≈ t = 0.7 T), whereas fine texture and color details refine in later steps, confirming hypotheses about diffusion’s coarse‑to‑fine nature.
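For concreteness, one way to write down the noise reinjection behind the "Control" result is the standard forward-diffusion perturbation below; this is an assumption about the mechanism, since the description above only says noise is added back at an intermediate step.

```latex
% Hedged sketch: variance-preserving forward perturbation that jumps the
% current denoised estimate \hat{x}_0 back to a noisier step s.
\[
  \mathbf{x}_s = \sqrt{\bar{\alpha}_s}\,\hat{\mathbf{x}}_0
               + \sqrt{1-\bar{\alpha}_s}\,\boldsymbol{\epsilon},
  \qquad \boldsymbol{\epsilon} \sim \mathcal{N}(\mathbf{0},\mathbf{I}).
\]
```

Sampling a fresh ε at an early (noisier) step s changes coarse composition, whereas the same operation late in the schedule mostly perturbs texture and color, consistent with the coarse-to-fine behavior reported above.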
Practical Implications
- Rapid prototyping: Creators can iterate on video concepts in seconds instead of minutes, dramatically shortening the feedback loop for storyboarding, UI animation, or ad mock‑ups.
- Interactive editing tools: Integration into video editors (e.g., After Effects plugins) could let artists pause a generative run, adjust depth or motion, and resume, achieving “live” diffusion editing.
- Debugging & safety: Developers building generative pipelines can use the preview decoder to spot undesirable artifacts early, reducing compute waste and mitigating harmful outputs before full synthesis (a small early‑exit sketch follows this list).
- Cross‑modal applications: Since the decoder outputs depth, segmentation, and flow, downstream tasks (e.g., AR placement, collision detection) can leverage these intermediate cues without waiting for the final video.
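As a small illustration of the early-exit idea (hypothetical helper and threshold; the quality check is application-specific), a pipeline could abort a run as soon as a mid-run preview violates a downstream constraint:

```python
# Hypothetical early-exit callback for the interactive loop sketched earlier:
# stop spending compute if the intermediate preview already looks degenerate.
def abort_on_bad_preview(step, timestep, preview):
    depth = preview["depth"]          # intermediate depth cue from the decoder
    if depth.std() < 1e-3:            # e.g., an implausibly flat scene
        raise RuntimeError(f"Aborting at step {step}: degenerate depth preview")
    return None                       # no steering; let generation continue
```

The same preview dictionary could feed AR placement or collision checks while the full-quality video is still being denoised.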
Limitations & Future Work
- Decoder capacity vs. fidelity: The lightweight decoder trades off fine‑grained texture detail for speed; extremely high‑resolution previews may still lag.
- Dependency on backbone quality: While model‑agnostic, the preview quality is bounded by the underlying diffusion model’s representation power.
- User interface design: The paper demonstrates the technical feasibility but leaves the ergonomics of interactive control (e.g., UI widgets for stochasticity reinjection) to future exploration.
- Extending to other modalities: Future work could add audio previews or text‑to‑video conditioning, and explore training the decoder jointly with the diffusion backbone for tighter integration.
Authors
- Susung Hong
- Chongjian Ge
- Zhifei Zhang
- Jui‑Hsien Wang
Paper Information
- arXiv ID: 2512.13690v1
- Categories: cs.CV, cs.AI, cs.GR, cs.LG
- Published: December 15, 2025