[Paper] ToTMNet: FFT-Accelerated Toeplitz Temporal Mixing Network for Lightweight Remote Photoplethysmography
Source: arXiv - 2601.04159v1
Overview
Remote photoplethysmography (rPPG) extracts a pulse waveform from ordinary facial video, opening the door to contact‑less health monitoring on smartphones, laptops, and IoT cameras. The new ToTMNet architecture shows that you can achieve state‑of‑the‑art heart‑rate accuracy with a model that fits comfortably on edge devices, thanks to a clever replacement of the usual attention‑based temporal encoder with an FFT‑accelerated Toeplitz mixing layer.
Key Contributions
- Toeplitz Temporal Mixing Layer – Introduces a linear‑parameter, full‑sequence temporal operator that can be executed in near‑linear time via FFT‑based convolution.
- Gated Temporal Mixer – Combines a lightweight depthwise temporal convolution (local context) with the global Toeplitz mixer, letting the network adaptively balance short‑ and long‑range temporal information.
- Ultra‑lightweight Design – The whole network contains only 63 k parameters, far fewer than typical attention‑based rPPG models, while still delivering sub‑1.1 bpm mean absolute error (MAE).
- Cross‑Domain Robustness – Demonstrates strong generalization from synthetic training data (SCAMPS) to real‑world videos (UBFC‑rPPG), highlighting the gating mechanism’s role in handling domain shift.
- Open‑source‑ready Implementation – The authors provide a PyTorch implementation that can be integrated into existing video‑processing pipelines with minimal overhead.
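To make the first contribution concrete: a Toeplitz matrix-vector product can be computed as a circular convolution after embedding the N×N Toeplitz matrix in a 2N×2N circulant, which the FFT evaluates in O(N log N). The following NumPy sketch (not the authors' code; kernel values are random placeholders) verifies this against the dense O(N²) product:

```python
import numpy as np

def toeplitz_matvec_fft(first_col, first_row, x):
    """Multiply the Toeplitz matrix defined by first_col / first_row with x.
    The matrix is embedded in a 2N x 2N circulant, whose matvec is a
    circular convolution: a pointwise product in Fourier space, O(N log N)."""
    n = len(x)
    # Circulant's first column: [t_0..t_{n-1}, 0, t_{-(n-1)}..t_{-1}]
    c = np.concatenate([first_col, [0.0], first_row[:0:-1]])
    y = np.fft.ifft(np.fft.fft(c) * np.fft.fft(x, len(c)))
    return y[:n].real

rng = np.random.default_rng(0)
n = 64
first_col = rng.standard_normal(n)
first_row = np.concatenate([first_col[:1], rng.standard_normal(n - 1)])
x = rng.standard_normal(n)

# Dense O(N^2) reference: T[i, j] = first_col[i-j] (i >= j) else first_row[j-i]
T = np.array([[first_col[i - j] if i >= j else first_row[j - i]
               for j in range(n)] for i in range(n)])
assert np.allclose(T @ x, toeplitz_matvec_fft(first_col, first_row, x))
```

Because the matrix is fully determined by its first row and column, only 2N−1 values are learned, which is the "linear-parameter" property the paper exploits.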
Methodology
- Input preprocessing – Face regions are detected and cropped from each video frame, then converted to a compact spatio‑temporal tensor (e.g., RGB channels over time).
- Feature extraction backbone – A shallow CNN extracts per‑frame spatial embeddings (color and texture cues linked to blood volume changes).
- Temporal modeling – two parallel branches combined by a gate:
  - Local branch: a depthwise 1‑D convolution with a small kernel (e.g., 3–5 frames) captures short‑range dynamics.
  - Global branch: the Toeplitz mixing layer builds a Toeplitz matrix from a learned kernel vector. Because a Toeplitz matrix is fully defined by its first row and column, the number of learnable parameters grows linearly with clip length, not quadratically.
  - FFT acceleration: multiplication with the Toeplitz matrix is performed as a convolution using circulant embedding, computable via the Fast Fourier Transform (FFT) in O(N log N) time instead of O(N²).
  - Gating: a sigmoid gate learns to weight the local and global branches per channel, letting the network emphasize whichever temporal scale is most informative for a given video segment.
- Regression head – The mixed temporal representation passes through a tiny fully‑connected head that outputs the blood‑volume pulse (BVP) waveform, from which heart rate is derived via standard peak detection.
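The two‑branch temporal mixer described above can be sketched in a few lines of NumPy. This is a simplified stand‑in for the paper's PyTorch layer: the kernel values, shapes, and gate parameterization here are illustrative assumptions, not trained weights.

```python
import numpy as np

def depthwise_conv1d(x, kernel):
    """Local branch: per-channel 1-D convolution with a small kernel
    (here 'same'-padded), capturing short-range temporal dynamics."""
    pad = len(kernel) // 2
    xp = np.pad(x, ((pad, pad), (0, 0)))
    return np.stack([np.convolve(xp[:, c], kernel, mode="valid")
                     for c in range(x.shape[1])], axis=1)

def toeplitz_mix(x, kernel):
    """Global branch: full-sequence (causal) Toeplitz mixing, computed as
    a linear convolution via FFT in O(T log T) instead of a dense T x T matmul."""
    T = x.shape[0]
    L = 2 * T  # zero-pad so the circular convolution equals the linear one
    y = np.fft.ifft(np.fft.fft(kernel, L)[:, None] * np.fft.fft(x, L, axis=0), axis=0)
    return y[:T].real

def gated_mixer(x, local_kernel, global_kernel, gate_logits):
    """A per-channel sigmoid gate blends the local and global branches."""
    g = 1.0 / (1.0 + np.exp(-gate_logits))  # shape (C,), broadcast over time
    return g * depthwise_conv1d(x, local_kernel) + (1.0 - g) * toeplitz_mix(x, global_kernel)

rng = np.random.default_rng(1)
x = rng.standard_normal((120, 8))                 # 120 frames x 8 feature channels
out = gated_mixer(x,
                  local_kernel=np.array([0.25, 0.5, 0.25]),   # 3-frame smoother
                  global_kernel=rng.standard_normal(120) / 120,
                  gate_logits=rng.standard_normal(8))
assert out.shape == x.shape
```

With the gate saturated toward 1 the output reduces to the local branch alone, and toward 0 to the global Toeplitz branch, which is how the network can rebalance temporal scales per channel.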
Results & Findings
| Dataset | Training | Test | MAE (bpm) | Pearson r |
|---|---|---|---|---|
| UBFC‑rPPG (intra‑dataset) | UBFC‑rPPG | UBFC‑rPPG | 1.055 | 0.996 |
| SCAMPS → UBFC‑rPPG (cross‑domain) | SCAMPS (synthetic) | UBFC‑rPPG (real) | 1.582 | 0.994 |
- Parameter efficiency: 63 k parameters vs. >1 M for many attention‑based rPPG nets.
- Speed: FFT‑based mixing runs at ~30 fps on a mid‑range mobile GPU (e.g., Snapdragon 8 Gen 2), well within real‑time constraints.
- Ablation: Removing the gating mechanism degrades cross‑domain MAE by ~0.4 bpm, confirming its importance for adapting to domain shift.
- Robustness: The model maintains high correlation even when video length varies, thanks to the full‑sequence receptive field of the Toeplitz operator.
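The MAE figures above compare heart rates derived from the predicted and reference BVP waveforms. As a reminder of how that final step works, here is a deliberately simple peak‑counting sketch (a stand‑in for the standard peak‑detection post‑processing, not the paper's exact pipeline):

```python
import numpy as np

def heart_rate_bpm(bvp, fps):
    """Estimate heart rate from a BVP waveform by counting peaks.
    A peak is a sample above its neighbours and above the signal mean;
    rate = peaks per second * 60."""
    thresh = bvp.mean()
    peaks = [t for t in range(1, len(bvp) - 1)
             if bvp[t] > bvp[t - 1] and bvp[t] > bvp[t + 1] and bvp[t] > thresh]
    duration_s = len(bvp) / fps
    return 60.0 * len(peaks) / duration_s

# Synthetic 72-bpm pulse (1.2 Hz sinusoid) sampled at 30 fps for 10 s
fps, f_hz = 30, 1.2
t = np.arange(0, 10, 1 / fps)
bvp = np.sin(2 * np.pi * f_hz * t)
print(heart_rate_bpm(bvp, fps))  # → 72.0
```

Real BVP signals are noisier than a clean sinusoid, so practical pipelines typically band‑pass filter first and use a more robust peak detector.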
Practical Implications
- Edge deployment – With a sub‑100 k parameter footprint and FFT‑friendly operations, ToTMNet can run on smartphones, wearables, or embedded cameras without offloading to the cloud.
- Real‑time health apps – Developers can embed heart‑rate monitoring into video‑chat, fitness, or telemedicine platforms, delivering instant vitals without extra hardware.
- Low‑power IoT – The linear‑time complexity translates to lower CPU/GPU utilization, extending battery life for continuous monitoring devices.
- Domain‑agnostic training – The gating‑enhanced Toeplitz mixer tolerates synthetic‑to‑real transfer, meaning you can pre‑train on large, cheap synthetic datasets and still achieve high accuracy on real user footage.
- Plug‑and‑play component – The Toeplitz mixing layer can replace attention modules in other video‑sequence models (e.g., action recognition, video captioning) where long‑range temporal dependencies matter but resources are limited.
Limitations & Future Work
- Dataset scope – Evaluation is limited to two datasets (one real, one synthetic). Wider testing on diverse lighting, motion, and skin tones is needed to confirm generalization.
- Fixed clip length – The current implementation assumes a predetermined sequence length for the Toeplitz kernel; handling variable‑length streams more gracefully could improve flexibility.
- Hardware‑specific FFT overhead – While FFT is fast on GPUs, on some microcontrollers the overhead may outweigh benefits; exploring alternative fast convolution schemes could broaden applicability.
- Extended vitals – Future work could extend the architecture to estimate respiration rate, blood oxygen saturation, or stress markers from the same video stream.
Bottom line: ToTMNet demonstrates that a mathematically elegant Toeplitz‑based temporal mixer, accelerated by FFT, can replace heavyweight attention while delivering high‑precision rPPG on resource‑constrained devices—an exciting development for developers building the next generation of contact‑less health monitoring solutions.
Authors
- Vladimir Frants
- Sos Agaian
- Karen Panetta
Paper Information
- arXiv ID: 2601.04159v1
- Categories: cs.CV
- Published: January 7, 2026