[Paper] TeCoNeRV: Leveraging Temporal Coherence for Compressible Neural Representations for Videos
Source: arXiv - 2602.16711v1
Overview
The paper “TeCoNeRV: Leveraging Temporal Coherence for Compressible Neural Representations for Videos” tackles a core bottleneck of neural‑based video compression: the need to train a separate implicit neural representation (INR) for every video, which quickly becomes impractical for high‑resolution content. By redesigning how hypernetworks predict INR weights across space and time, the authors achieve dramatically lower memory footprints, faster encoding, and higher visual quality—making neural video codecs a realistic option for real‑world pipelines.
Key Contributions
- Spatial‑Temporal Weight Decomposition: Breaks a video into short patch‑tubelets (small spatial patches over a few frames) and predicts INR weights for each tubelet independently, cutting pre‑training memory by ~20×.
- Residual‑Based Storage Scheme: Stores only the differences between consecutive segment representations, shrinking the final bitstream without sacrificing fidelity.
- Temporal Coherence Regularization: Adds a loss that aligns changes in the weight space with actual video motion, encouraging smoother, more predictable weight updates across frames.
- State‑of‑the‑Art Performance: Delivers +2.47 dB (480p) and +5.35 dB (720p) PSNR over the previous hypernetwork baseline, with 36 % lower bitrates and 1.5‑3× faster encoding.
- Scalable to 1080p: First hypernetwork‑based method to demonstrate competitive results on 480p, 720p, and 1080p benchmarks (UVG, HEVC, MCL‑JCV) while staying within modest GPU memory limits.
Methodology
1. Patch‑Tubelet Partitioning
- The input video is sliced into overlapping spatial patches (e.g., 32×32 pixels).
- For each patch, a short temporal window (typically 4‑8 frames) forms a tubelet.
- This reduces the dimensionality of the weight‑prediction problem because each hypernetwork only needs to model a tiny spatio‑temporal chunk rather than the whole frame sequence.
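The partitioning step can be sketched as plain array slicing. This is an illustrative implementation, not the authors' code: function and parameter names (`extract_tubelets`, `stride`) are assumptions, and the spatial overlap is realized here by using a stride smaller than the patch size.

```python
import numpy as np

def extract_tubelets(video, patch=32, window=4, stride=16):
    """Slice a (T, H, W, C) video into overlapping patch-tubelets.

    Each tubelet covers a patch x patch spatial region over `window`
    consecutive frames; a spatial stride smaller than `patch` produces
    the overlap used to soften seams between neighboring tubelets.
    """
    T, H, W, C = video.shape
    tubelets = []
    for t0 in range(0, T - window + 1, window):    # step through time in windows
        for y in range(0, H - patch + 1, stride):  # overlapping rows
            for x in range(0, W - patch + 1, stride):  # overlapping columns
                tubelets.append(video[t0:t0 + window, y:y + patch, x:x + patch])
    return np.stack(tubelets)  # (N, window, patch, patch, C)

# Example: an 8-frame 64x64 RGB clip yields 2 temporal windows x 3x3 spatial positions
clip = np.random.rand(8, 64, 64, 3).astype(np.float32)
tubes = extract_tubelets(clip)
```

Each tubelet is then handled independently, which is what keeps per-sample memory small during hypernetwork pre-training.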
2. Hypernetwork Design
- A lightweight hypernetwork takes the raw pixel values of a tubelet and outputs the parameters of a tiny INR (a multilayer perceptron that maps (x, y, t) → RGB).
- Because tubelets are small, the hypernetwork can be trained on a single GPU with far less memory than a monolithic video‑wide hypernetwork.
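The shape of this mapping can be sketched as follows. The sizes, the single-linear-layer "hypernetwork", and all names here are illustrative stand-ins, not the paper's architecture; the point is only that a tubelet's pixels are mapped to a flat parameter vector that is then unpacked into a tiny (x, y, t) → RGB MLP.

```python
import numpy as np

rng = np.random.default_rng(0)

# Target INR: one hidden layer, (x, y, t) -> RGB. Sizes are illustrative.
IN, HID, OUT = 3, 16, 3
n_params = IN * HID + HID + HID * OUT + OUT  # weights + biases of the tiny MLP

# "Hypernetwork": a single linear map from flattened tubelet pixels to the
# INR parameter vector (a toy stand-in for the paper's lightweight network).
tubelet = rng.random((4, 32, 32, 3)).astype(np.float32)
W_hyper = rng.standard_normal((tubelet.size, n_params)).astype(np.float32) * 1e-3
theta = tubelet.reshape(-1) @ W_hyper  # predicted INR parameters for this tubelet

def inr_forward(theta, coords):
    """Unpack the flat parameter vector and evaluate the INR at (x, y, t)."""
    i = 0
    W1 = theta[i:i + IN * HID].reshape(IN, HID); i += IN * HID
    b1 = theta[i:i + HID]; i += HID
    W2 = theta[i:i + HID * OUT].reshape(HID, OUT); i += HID * OUT
    b2 = theta[i:i + OUT]
    h = np.tanh(coords @ W1 + b1)
    return h @ W2 + b2  # RGB prediction per coordinate

rgb = inr_forward(theta, np.array([[0.5, 0.5, 0.0]], dtype=np.float32))
```

Because `n_params` is tiny relative to a whole-video INR, predicting one such vector per tubelet is what makes single-GPU pre-training feasible.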
3. Residual Weight Encoding
- After the hypernetwork predicts weights for tubelet i, the system computes the residual with respect to tubelet i‑1.
- Only these residuals are entropy‑coded, exploiting the fact that adjacent tubelets (both spatially and temporally) often have very similar weight patterns.
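The delta-encoding idea is simple to sketch. This version stores the first weight vector verbatim and only differences thereafter; the entropy-coding stage the paper applies on top of the residuals is omitted, and the function names are assumptions.

```python
import numpy as np

def encode_residuals(weight_vectors):
    """Delta-encode a sequence of per-tubelet weight vectors.

    Keeps the first vector as-is; every later tubelet is represented by
    its difference from the previous one. Adjacent tubelets have similar
    weights, so the residuals are small and compress well.
    """
    w = np.asarray(weight_vectors)
    return w[0], np.diff(w, axis=0)

def decode_residuals(first, residuals):
    """Invert the delta encoding by cumulative summation."""
    return np.concatenate([first[None], first + np.cumsum(residuals, axis=0)])

# Five consecutive weight vectors drifting slowly, as coherent tubelets would
weights = np.cumsum(np.random.default_rng(1).normal(0, 0.01, (5, 115)), axis=0)
first, res = encode_residuals(weights)
recovered = decode_residuals(first, res)
```

Lossless round-tripping holds here by construction; in practice the residuals would additionally be quantized before entropy coding, trading a controlled amount of fidelity for rate.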
4. Temporal Coherence Regularizer
- An auxiliary loss penalizes weight changes that are not aligned with the underlying motion field (estimated via a simple optical‑flow or block‑matching step).
- This encourages the hypernetwork to produce weight trajectories that “follow” the video’s actual temporal dynamics, leading to smoother reconstructions and easier residual compression.
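One plausible form of such a regularizer, written as a sketch rather than the paper's exact loss: penalize weight updates whose magnitude is out of proportion to the estimated motion between consecutive tubelets, so static content keeps near-constant weights while moving content is allowed larger updates. The scalar motion magnitude and the `alpha` scale are assumptions for illustration.

```python
import numpy as np

def temporal_coherence_loss(theta_t, theta_prev, motion_mag, alpha=1.0):
    """Sketch of a motion-aligned weight regularizer (not the paper's exact loss).

    theta_t, theta_prev: INR weight vectors for consecutive tubelets.
    motion_mag: scalar motion estimate (e.g., mean optical-flow magnitude).
    Zero loss when the weight change matches alpha * motion; quadratic
    penalty otherwise.
    """
    delta = np.linalg.norm(theta_t - theta_prev)
    return (delta - alpha * motion_mag) ** 2

# Static content, no weight change -> no penalty
static = temporal_coherence_loss(np.ones(4), np.ones(4), 0.0)
# Weight change of norm 2 matching a motion estimate of 2 -> no penalty
moving = temporal_coherence_loss(np.ones(4), np.zeros(4), 2.0)
```

A side benefit of any loss in this family is that it directly shrinks the residuals of the previous step, since smooth weight trajectories are exactly what delta-encodes well.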
5. Training & Inference Pipeline
- The hypernetwork is pre‑trained on a large corpus of video patches.
- At test time, for a new video, the hypernetwork is fine‑tuned on its own patches (a few gradient steps) to adapt to the specific content, then the residuals are encoded and streamed.
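The per-video adaptation phase amounts to a handful of gradient steps from the pre-trained initialization. The sketch below uses a toy quadratic "loss" so it runs standalone; in the actual pipeline the objective would be reconstruction error on the new video's own tubelets, and the step count and learning rate here are arbitrary.

```python
import numpy as np

def fine_tune(theta, target, steps=5, lr=0.5):
    """Few-step test-time adaptation sketch.

    Starts from pre-trained parameters `theta` and takes a few gradient
    steps toward a video-specific optimum. The toy loss is
    ||theta - target||^2, standing in for reconstruction error.
    """
    for _ in range(steps):
        grad = 2.0 * (theta - target)  # gradient of the toy quadratic loss
        theta = theta - lr * grad
    return theta

theta0 = np.zeros(8)                 # "pre-trained" initialization
adapted = fine_tune(theta0, np.ones(8))
```

After adaptation, the per-tubelet weights are delta-encoded and entropy-coded as described above; only the residual bitstream is transmitted.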
Results & Findings
| Resolution | Dataset | PSNR (baseline) | PSNR (TeCoNeRV) | Bitrate Reduction | Encoding Speedup |
|---|---|---|---|---|---|
| 480p | UVG | 31.2 dB | 33.7 dB | 36 % | 1.8× |
| 720p | UVG | 28.9 dB | 34.2 dB | 36 % | 2.2× |
| 1080p | HEVC | — | ≈34 dB | — | 1.5× |
- Quality boost stems mainly from the temporal coherence regularizer, which reduces flickering and ringing artifacts.
- Memory usage drops from >30 GB (full‑frame hypernetwork) to <1.5 GB, enabling training on a single RTX 3090.
- Bitstream size shrinks because residuals are highly compressible; entropy coding achieves near‑optimal rates compared to raw weight storage.
Practical Implications
- Edge‑Device Video Streaming: The low‑memory, fast‑encoding pipeline makes it feasible to generate neural‑compressed streams on‑the‑fly on devices with limited VRAM (e.g., smartphones, embedded GPUs).
- Adaptive Bitrate (ABR) Systems: Since each tubelet can be encoded independently, a server could dynamically adjust the residual bitrate per segment based on network conditions, similar to modern DASH/HLS chunking.
- Content‑Aware Editing: Because the INR parameters are explicitly tied to spatio‑temporal patches, developers can manipulate individual tubelets (e.g., replace a patch with a higher‑quality version) without re‑encoding the whole video.
- Integration with Existing Codecs: TeCoNeRV’s residuals can be fused with traditional codecs (e.g., as a supplemental enhancement layer), offering a hybrid approach that leverages the robustness of HEVC while gaining the flexibility of neural representations.
- Research‑to‑Product Path: The modular design (patch‑tubelet hypernetwork + residual encoder) aligns well with micro‑service architectures, allowing teams to replace or upgrade components (e.g., swapping the optical‑flow estimator) without redesigning the whole system.
Limitations & Future Work
- Fine‑Tuning Overhead: Although encoding is faster than prior hypernetwork methods, a short fine‑tuning phase is still required per video, which may be a hurdle for ultra‑low‑latency scenarios.
- Patch Boundary Artifacts: The independent treatment of tubelets can introduce seams at patch borders; the authors mitigate this with overlap‑and‑average but a more sophisticated blending could improve visual continuity.
- Scalability Beyond 1080p: While 1080p results are promising, memory and compute demands still grow with higher resolutions; hierarchical tubelet schemes or mixed‑precision training are potential remedies.
- Generalization to Diverse Content: The method was evaluated on standard benchmark datasets; performance on highly dynamic or procedurally generated content (e.g., video games, VR) remains an open question.
Future research directions include end‑to‑end joint optimization of the hypernetwork and residual coder, learned motion estimation for tighter temporal coherence, and exploring transformer‑based hypernetworks that can capture longer‑range dependencies without exploding memory.
Authors
- Namitha Padmanabhan
- Matthew Gwilliam
- Abhinav Shrivastava
Paper Information
- arXiv ID: 2602.16711v1
- Categories: cs.CV
- Published: February 18, 2026