# Tiny Diffusion
Source: Dev.to
## Forward Diffusion
In physics, diffusion is the process by which particles spread from regions of high concentration to regions of low concentration, driven by the thermodynamic tendency toward increasing entropy. In diffusion models we mimic this by gradually adding noise to an image.
Imagine dropping ink into a glass of water: the ink starts as a distinct shape and then spreads until the water becomes uniformly colored.
Analogously, we start with a clean image and repeatedly add a small amount of noise at each step until the image becomes pure noise. This is called forward diffusion.
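This forward process can be sketched in a few lines (a minimal illustration: the step size `beta` and the function name are assumptions, not the project's code):

```python
import torch

def forward_diffuse(x0, num_steps=1000, beta=1e-2):
    """Gradually corrupt a clean image x0 with Gaussian noise, one step at a time."""
    x = x0.clone()
    for _ in range(num_steps):
        noise = torch.randn_like(x)
        # Each step slightly shrinks the signal and mixes in fresh noise.
        x = (1 - beta) ** 0.5 * x + beta ** 0.5 * noise
    return x

x0 = torch.randn(1, 1, 28, 28)  # stand-in for a clean MNIST frame
xT = forward_diffuse(x0)
# After many steps, xT is statistically close to pure Gaussian noise.
```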
## Reverse Diffusion
If we filmed the ink‑diffusing process and played the video backward, we would see the uniform color reform into the original ink dot.
During training we teach the model to predict the added noise and subtract it, effectively learning how to reverse the diffusion process. This is reverse diffusion.
A scheduler determines how much noise is added at each timestep, and an algorithm (often a U‑Net or a diffusion transformer) learns to remove it.
## Extending Diffusion to Video
For 2‑D diffusion we only need to denoise a single frame. Video adds a third dimension—time—so we must enforce temporal consistency.
If frame 1 is denoised into a “3” and frame 2 into a “5”, the resulting animation would flicker. To avoid this, the model uses temporal attention, looking at neighboring frames before and after the current one.
In our experiment we converted a static 28×28 MNIST digit into a batch of 15 frames using a Euclidean‑distance transform with masking. Each digit thus became a sequence of 15 images representing its transformation over time, providing the data needed for a video‑aware architecture.
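One way to build such sequences might look like this (a sketch only: `digit_to_frames`, the linear thresholds, and the toy digit are assumptions, not the exact transform used in the experiment):

```python
import numpy as np
from scipy.ndimage import distance_transform_edt

def digit_to_frames(img, num_frames=15):
    """Turn a binary 28x28 digit into num_frames masks that grow outward
    from the digit according to Euclidean distance."""
    mask = img > 0.5
    # Distance of every background pixel to the nearest digit pixel.
    dist = distance_transform_edt(~mask)
    thresholds = np.linspace(0, dist.max(), num_frames)
    # Frame k keeps everything within the k-th distance threshold.
    return np.stack([(dist <= t).astype(np.float32) for t in thresholds])

digit = np.zeros((28, 28))
digit[10:18, 10:18] = 1.0  # toy stand-in for an MNIST digit
frames = digit_to_frames(digit)
```

Each frame is a superset of the previous one, so the sequence reads as the digit "growing" over time.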
## Architecture
The backbone follows the DDPM paper but is adapted for video:
- Spatial branch – processes spatial information with a kernel of size (1 × 3 × 3).
- Temporal branch – processes temporal information with a kernel of size (3 × 1 × 1).
```python
import torch.nn as nn

# Dual-branch block: spatial mixing within each frame, then temporal mixing across frames.
nn.Sequential(
    # Spatial convolution: a 1 x 3 x 3 kernel touches only pixels inside one frame.
    nn.Conv3d(in_ch, out_ch, kernel_size=(1, 3, 3), padding=(0, 1, 1)),
    nn.BatchNorm3d(out_ch),
    nn.ReLU(),
    # Temporal convolution: a 3 x 1 x 1 kernel touches the same pixel across frames.
    nn.Conv3d(out_ch, out_ch, kernel_size=(3, 1, 1), padding=(1, 0, 0)),
)
```
## Time Embedding
The original DDPM uses sinusoidal embeddings (similar to transformer positional encodings). For our simple task we employed a basic MLP:
```python
self.time_mlp = nn.Sequential(
    nn.Linear(1, t_dim),   # scalar timestep in, t_dim features out
    nn.ReLU(),
    nn.Linear(t_dim, t_dim),
)
```
Because the task involved predicting geometry derived from a Euclidean distance transform (a smooth, low‑frequency signal), the simple embedding was sufficient. Real video diffusion, which mixes low‑ and high‑frequency content with complex dynamics, typically benefits from sinusoidal embeddings.
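For comparison, a standard sinusoidal time embedding looks like this (this helper is an illustration in the spirit of the DDPM/transformer encoding, not the project's code):

```python
import math
import torch

def sinusoidal_embedding(t, dim=64):
    """Map integer timesteps t (shape [B]) to [B, dim] sin/cos features
    over a geometric range of frequencies."""
    half = dim // 2
    freqs = torch.exp(-math.log(10000.0) * torch.arange(half) / half)
    angles = t.float()[:, None] * freqs[None, :]
    return torch.cat([torch.sin(angles), torch.cos(angles)], dim=-1)

emb = sinusoidal_embedding(torch.tensor([0, 250, 999]))
```

The mix of frequencies lets the network distinguish both nearby and far-apart timesteps, which a small MLP on a raw scalar struggles with.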
## Scheduler and Training Loop
A scheduler decides how much noise to add at each step. In the original DDPM, reaching step 500 requires 500 successive noise additions, which is slow. By exploiting the property that adding Gaussian noise to Gaussian noise yields another Gaussian, we can “teleport” from the clean image (x_0) directly to any noisy step (x_t).
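That "teleport" is the standard closed-form forward step, x_t = sqrt(ᾱ_t) · x_0 + sqrt(1 − ᾱ_t) · ε. A sketch, assuming the common linear beta schedule (the schedule values are an assumption, not the project's exact settings):

```python
import torch

T = 1000
betas = torch.linspace(1e-4, 0.02, T)               # linear noise schedule
alphas_cumprod = torch.cumprod(1.0 - betas, dim=0)  # cumulative product, a.k.a. alpha-bar

def q_sample(x0, t, noise):
    """Jump directly from clean x0 to the noisy x_t in a single step."""
    a_bar = alphas_cumprod[t]
    return a_bar.sqrt() * x0 + (1 - a_bar).sqrt() * noise

x0 = torch.randn(1, 1, 28, 28)
xt = q_sample(x0, t=500, noise=torch.randn_like(x0))
```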
The loss is straightforward: take a clean frame, add noise at a random timestep, predict the noise, compute the mean‑squared error (MSE) against the true noise, and back‑propagate.
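That loss reduces to a short training step (the model and schedule below are placeholders, not the project's code):

```python
import torch
import torch.nn.functional as F

def training_step(model, x0, alphas_cumprod):
    """One DDPM training step: noise a clean batch at random timesteps,
    predict the noise, and return the MSE loss."""
    T = alphas_cumprod.shape[0]
    t = torch.randint(0, T, (x0.shape[0],))
    noise = torch.randn_like(x0)
    a_bar = alphas_cumprod[t].view(-1, 1, 1, 1)
    xt = a_bar.sqrt() * x0 + (1 - a_bar).sqrt() * noise
    return F.mse_loss(model(xt, t), noise)

# Trivial stand-in "model" that just echoes its input:
alphas_cumprod = torch.cumprod(1 - torch.linspace(1e-4, 0.02, 1000), dim=0)
loss = training_step(lambda x, t: x, torch.randn(4, 1, 28, 28), alphas_cumprod)
```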
During training we can jump straight to any timestep with the closed‑form expression, but during sampling we must walk the reverse diffusion chain step by step. Standard DDPM sampling takes ~1000 steps; DDIM is a faster alternative that can generate samples in as few as 50. At each step the model predicts the noise to subtract and the sampler performs the update.
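A single reverse (denoising) step in the DDPM formulation, sketched with a placeholder model:

```python
import torch

def p_sample_step(model, xt, t, betas, alphas_cumprod):
    """One reverse-diffusion step: subtract the predicted noise, then
    (for t > 0) re-inject a small amount of fresh noise."""
    beta = betas[t]
    alpha = 1.0 - beta
    a_bar = alphas_cumprod[t]
    eps = model(xt, t)  # predicted noise
    mean = (xt - beta / (1 - a_bar).sqrt() * eps) / alpha.sqrt()
    if t == 0:
        return mean
    return mean + beta.sqrt() * torch.randn_like(xt)

betas = torch.linspace(1e-4, 0.02, 1000)
alphas_cumprod = torch.cumprod(1 - betas, dim=0)
x = torch.randn(1, 1, 28, 28)
x_prev = p_sample_step(lambda x, t: torch.zeros_like(x), x, 999, betas, alphas_cumprod)
```

Iterating this from t = T − 1 down to 0 turns pure noise into a sample.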
## Temporal Consistency (The “Secret Sauce”)
Ensuring that consecutive frames remain coherent is crucial for video diffusion. Temporal attention and the dual‑branch (spatial + temporal) convolution design help maintain this consistency throughout the reverse diffusion process.
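A minimal sketch of temporal attention, assuming (B, C, T, H, W) tensors: fold the spatial dimensions into the batch so self-attention runs only along the time axis (the module and layer sizes are illustrative, not the project's exact design):

```python
import torch
import torch.nn as nn

class TemporalAttention(nn.Module):
    """Self-attention over the time axis of a (B, C, T, H, W) video tensor,
    letting each frame look at its neighbors before and after."""
    def __init__(self, channels, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(channels, heads, batch_first=True)

    def forward(self, x):
        B, C, T, H, W = x.shape
        # Fold space into the batch: each pixel attends over its own timeline.
        seq = x.permute(0, 3, 4, 2, 1).reshape(B * H * W, T, C)
        out, _ = self.attn(seq, seq, seq)
        return out.reshape(B, H, W, T, C).permute(0, 4, 3, 1, 2)

x = torch.randn(2, 32, 15, 7, 7)  # (batch, channels, frames, height, width)
y = TemporalAttention(32)(x)
```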
## Conclusion
The project demonstrates a simplified approach to video diffusion:
- Data: transformed static MNIST digits into short temporal sequences.
- Architecture: split 3‑D convolutions into separate spatial and temporal branches.
- Embedding: simple MLP time embedding sufficed for the low‑frequency task.
- Training: standard DDPM loss with a scheduler; optional DDIM sampling for speed.
While this prototype does not capture the full complexity of modern video diffusion models (which often condition on text and sophisticated temporal priors), it provides a clear conceptual foundation.
For a deeper conceptual understanding, check out the videos by 3Blue1Brown and Welch Labs on how AI image and video generation works.
You can also explore the full code on my GitHub repository.