[Paper] ToTMNet: FFT-Accelerated Toeplitz Temporal Mixing Network for Lightweight Remote Photoplethysmography
Source: arXiv - 2601.04159v1
Overview
Remote photoplethysmography (rPPG) extracts a pulse waveform from ordinary facial video, opening the door to contact‑less health monitoring on smartphones, laptops, and IoT cameras. The new ToTMNet architecture shows that you can achieve state‑of‑the‑art heart‑rate accuracy with a model that fits comfortably on edge devices, thanks to a clever replacement of the usual attention‑based temporal encoder with an FFT‑accelerated Toeplitz mixing layer.
Key Contributions
- Toeplitz Temporal Mixing Layer – Introduces a linear‑parameter, full‑sequence temporal operator that can be executed in near‑linear time via FFT‑based convolution.
- Gated Temporal Mixer – Combines a lightweight depthwise temporal convolution (local context) with the global Toeplitz mixer, letting the network adaptively balance short‑ and long‑range temporal information.
- Ultra‑lightweight Design – The whole network contains only 63 k parameters, far fewer than typical attention‑based rPPG models, while still delivering sub‑1.1 bpm mean absolute error (MAE).
- Cross‑Domain Robustness – Demonstrates strong generalization from synthetic training data (SCAMPS) to real‑world videos (UBFC‑rPPG), highlighting the gating mechanism’s role in handling domain shift.
- Open‑source‑ready Implementation – The authors provide a PyTorch implementation that can be integrated into existing video‑processing pipelines with minimal overhead.
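To make the first contribution concrete: a Toeplitz matrix-vector product can be computed as a circular convolution after embedding the N×N Toeplitz matrix in a 2N×2N circulant, which the FFT evaluates in O(N log N). The following NumPy sketch (not the authors' code; kernel values are random placeholders) verifies this against the dense O(N²) product:

```python
import numpy as np

def toeplitz_matvec_fft(first_col, first_row, x):
    """Multiply the Toeplitz matrix defined by first_col / first_row with x.
    The matrix is embedded in a 2N x 2N circulant, whose matvec is a
    circular convolution: a pointwise product in Fourier space, O(N log N)."""
    n = len(x)
    # Circulant's first column: [t_0..t_{n-1}, 0, t_{-(n-1)}..t_{-1}]
    c = np.concatenate([first_col, [0.0], first_row[:0:-1]])
    y = np.fft.ifft(np.fft.fft(c) * np.fft.fft(x, len(c)))
    return y[:n].real

rng = np.random.default_rng(0)
n = 64
first_col = rng.standard_normal(n)
first_row = np.concatenate([first_col[:1], rng.standard_normal(n - 1)])
x = rng.standard_normal(n)

# Dense O(N^2) reference: T[i, j] = first_col[i-j] (i >= j) else first_row[j-i]
T = np.array([[first_col[i - j] if i >= j else first_row[j - i]
               for j in range(n)] for i in range(n)])
assert np.allclose(T @ x, toeplitz_matvec_fft(first_col, first_row, x))
```

Because the matrix is fully determined by its first row and column, only 2N−1 values are learned, which is the "linear-parameter" property the paper exploits.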
Methodology
- Input preprocessing – Face regions are detected and cropped from each video frame, then converted to a compact spatio‑temporal tensor (e.g., RGB channels over time).
- Feature extraction backbone – A shallow CNN extracts per‑frame spatial embeddings (color and texture cues linked to blood volume changes).
- Temporal modeling – two parallel branches combined by a gate:
  - Local branch: a depthwise 1‑D convolution with a small kernel (e.g., 3–5 frames) captures short‑range dynamics.
  - Global branch: the Toeplitz mixing layer builds a Toeplitz matrix from a learned kernel vector. Because a Toeplitz matrix is fully defined by its first row and column, the number of learnable parameters grows linearly with clip length, not quadratically.
  - FFT acceleration: multiplication with the Toeplitz matrix is performed as a convolution using circulant embedding, computable via the Fast Fourier Transform (FFT) in O(N log N) time instead of O(N²).
  - Gating: a sigmoid gate learns to weight the local and global branches per channel, letting the network emphasize whichever temporal scale is most informative for a given video segment.
- Regression head – The mixed temporal representation passes through a tiny fully‑connected head that outputs the blood‑volume pulse (BVP) waveform, from which heart rate is derived via standard peak detection.
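The two‑branch temporal mixer described above can be sketched in a few lines of NumPy. This is a simplified stand‑in for the paper's PyTorch layer: the kernel values, shapes, and gate parameterization here are illustrative assumptions, not trained weights.

```python
import numpy as np

def depthwise_conv1d(x, kernel):
    """Local branch: per-channel 1-D convolution with a small kernel
    (here 'same'-padded), capturing short-range temporal dynamics."""
    pad = len(kernel) // 2
    xp = np.pad(x, ((pad, pad), (0, 0)))
    return np.stack([np.convolve(xp[:, c], kernel, mode="valid")
                     for c in range(x.shape[1])], axis=1)

def toeplitz_mix(x, kernel):
    """Global branch: full-sequence (causal) Toeplitz mixing, computed as
    a linear convolution via FFT in O(T log T) instead of a dense T x T matmul."""
    T = x.shape[0]
    L = 2 * T  # zero-pad so the circular convolution equals the linear one
    y = np.fft.ifft(np.fft.fft(kernel, L)[:, None] * np.fft.fft(x, L, axis=0), axis=0)
    return y[:T].real

def gated_mixer(x, local_kernel, global_kernel, gate_logits):
    """A per-channel sigmoid gate blends the local and global branches."""
    g = 1.0 / (1.0 + np.exp(-gate_logits))  # shape (C,), broadcast over time
    return g * depthwise_conv1d(x, local_kernel) + (1.0 - g) * toeplitz_mix(x, global_kernel)

rng = np.random.default_rng(1)
x = rng.standard_normal((120, 8))                 # 120 frames x 8 feature channels
out = gated_mixer(x,
                  local_kernel=np.array([0.25, 0.5, 0.25]),   # 3-frame smoother
                  global_kernel=rng.standard_normal(120) / 120,
                  gate_logits=rng.standard_normal(8))
assert out.shape == x.shape
```

With the gate saturated toward 1 the output reduces to the local branch alone, and toward 0 to the global Toeplitz branch, which is how the network can rebalance temporal scales per channel.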
Results & Findings
| Dataset | Training | Test | MAE (bpm) | Pearson r |
|---|---|---|---|---|
| UBFC‑rPPG (intra‑dataset) | UBFC‑rPPG | UBFC‑rPPG | 1.055 | 0.996 |
| SCAMPS → UBFC‑rPPG (cross‑domain) | SCAMPS (synthetic) | UBFC‑rPPG (real) | 1.582 | 0.994 |
- Parameter efficiency: 63 k parameters vs. >1 M for many attention‑based rPPG nets.
- Speed: FFT‑based mixing runs at ~30 fps on a mid‑range mobile GPU (e.g., Snapdragon 8 Gen 2), well within real‑time constraints.
- Ablation: Removing the gating mechanism degrades cross‑domain MAE by ~0.4 bpm, confirming its importance for adapting to domain shift.
- Robustness: The model maintains high correlation even when video length varies, thanks to the full‑sequence receptive field of the Toeplitz operator.
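The MAE figures above compare heart rates derived from the predicted and reference BVP waveforms. As a reminder of how that final step works, here is a deliberately simple peak‑counting sketch (a stand‑in for the standard peak‑detection post‑processing, not the paper's exact pipeline):

```python
import numpy as np

def heart_rate_bpm(bvp, fps):
    """Estimate heart rate from a BVP waveform by counting peaks.
    A peak is a sample above its neighbours and above the signal mean;
    rate = peaks per second * 60."""
    thresh = bvp.mean()
    peaks = [t for t in range(1, len(bvp) - 1)
             if bvp[t] > bvp[t - 1] and bvp[t] > bvp[t + 1] and bvp[t] > thresh]
    duration_s = len(bvp) / fps
    return 60.0 * len(peaks) / duration_s

# Synthetic 72-bpm pulse (1.2 Hz sinusoid) sampled at 30 fps for 10 s
fps, f_hz = 30, 1.2
t = np.arange(0, 10, 1 / fps)
bvp = np.sin(2 * np.pi * f_hz * t)
print(heart_rate_bpm(bvp, fps))  # → 72.0
```

Real BVP signals are noisier than a clean sinusoid, so practical pipelines typically band‑pass filter first and use a more robust peak detector.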
Practical Implications
- Edge deployment – With a sub‑100 k parameter footprint and FFT‑friendly operations, ToTMNet can run on smartphones, wearables, or embedded cameras without offloading to the cloud.
- Real‑time health apps – Developers can embed heart‑rate monitoring into video‑chat, fitness, or telemedicine platforms, delivering instant vitals without extra hardware.
- Low‑power IoT – The linear‑time complexity translates to lower CPU/GPU utilization, extending battery life for continuous monitoring devices.
- Domain‑agnostic training – The gating‑enhanced Toeplitz mixer tolerates synthetic‑to‑real transfer, meaning you can pre‑train on large, cheap synthetic datasets and still achieve high accuracy on real user footage.
- Plug‑and‑play component – The Toeplitz mixing layer can replace attention modules in other video‑sequence models (e.g., action recognition, video captioning) where long‑range temporal dependencies matter but resources are limited.
Limitations & Future Work
- Dataset scope – Evaluation is limited to two datasets (one real, one synthetic). Wider testing on diverse lighting, motion, and skin tones is needed to confirm generalization.
- Fixed clip length – The current implementation assumes a predetermined sequence length for the Toeplitz kernel; handling variable‑length streams more gracefully could improve flexibility.
- Hardware‑specific FFT overhead – While FFT is fast on GPUs, on some microcontrollers the overhead may outweigh benefits; exploring alternative fast convolution schemes could broaden applicability.
- Extended vitals – Future work could extend the architecture to estimate respiration rate, blood oxygen saturation, or stress markers from the same video stream.
Bottom line: ToTMNet demonstrates that a mathematically elegant Toeplitz‑based temporal mixer, accelerated by FFT, can replace heavyweight attention while delivering high‑precision rPPG on resource‑constrained devices—an exciting development for developers building the next generation of contact‑less health monitoring solutions.
Authors
- Vladimir Frants
- Sos Agaian
- Karen Panetta
Paper Information
- arXiv ID: 2601.04159v1
- Categories: cs.CV
- Published: January 7, 2026