[Paper] Temporal parallelisation of continuous-time maximum-a-posteriori trajectory estimation

Published: December 15, 2025 at 08:37 AM EST

Source: arXiv - 2512.13319v1

Overview

The paper introduces a parallel‑in‑time algorithm for estimating continuous‑time trajectories of stochastic systems under the maximum‑a‑posteriori (MAP) criterion. By recasting MAP estimation as an optimal‑control problem, the authors obtain a formulation that maps onto parallel hardware such as GPUs, with reported speed‑ups of roughly an order of magnitude and accuracy comparable to classic sequential filters and smoothers.
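As a point of reference, one common textbook form of the objective is sketched here. The notation is assumed for illustration and is not taken from the paper: drift f, measurement function h, process‑ and measurement‑noise diffusion matrices Q and R, prior density p_0, final time t_f, and the formal measurement derivative z(t) = dy/dt.

```latex
% Assumed model (illustrative notation, not the paper's):
%   dx = f(x,t)\,dt + d\beta, \qquad dy = h(x,t)\,dt + d\eta
\begin{aligned}
J[x] ={}& -\log p_0\bigl(x(0)\bigr)
  + \frac{1}{2}\int_0^{t_f} \bigl(\dot{x} - f(x,t)\bigr)^{\top} Q^{-1} \bigl(\dot{x} - f(x,t)\bigr)\,dt \\
 &+ \frac{1}{2}\int_0^{t_f} \nabla\!\cdot f(x,t)\,dt
  + \frac{1}{2}\int_0^{t_f} \bigl(z(t) - h(x,t)\bigr)^{\top} R^{-1} \bigl(z(t) - h(x,t)\bigr)\,dt
\end{aligned}
% Optimal-control reading: define the control u(t) = \dot{x}(t) - f(x,t), so that
\dot{x}(t) = f(x,t) + u(t), \qquad
\text{with control cost } \tfrac{1}{2}\, u^{\top} Q^{-1} u .
```

The MAP trajectory is the minimizer of J. The divergence term is what separates an Onsager‑Machlup functional from a plain least‑squares energy; for a linear drift it is constant and does not affect the minimizer.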

Key Contributions

Time‑parallel MAP formulation: Rewrites continuous‑time MAP estimation as an optimal‑control problem based on the Onsager‑Machlup functional, enabling the use of parallel scan techniques.
Parallel associative‑scan solver: Adapts a previously proposed parallel‑in‑time optimal‑control solver to the MAP setting, yielding a fully parallel algorithm for the entire trajectory.
Parallel Kalman‑Bucy filter & RTS smoother: In the linear‑Gaussian case, the method reduces to a parallel version of the continuous‑time Kalman‑Bucy filter and the Rauch‑Tung‑Striebel smoother.
Extension to nonlinear models: Uses first‑order (and optionally higher‑order) Taylor expansions to apply the parallel framework to nonlinear stochastic differential equations (SDEs).
Two‑filter smoother: Provides a parallel implementation of the classic forward‑backward (filter‑smoother) pair for continuous‑time systems.
GPU performance results: Demonstrates up to an order‑of‑magnitude speed‑up on GPUs for both linear and nonlinear examples, with negligible loss in estimation accuracy.
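For readers unfamiliar with the sequential baseline mentioned in the list above, the following is a minimal sketch of a scalar Kalman‑Bucy filter integrated with a simple Euler scheme. It is the classic sequential filter, not the paper's parallel version; the function names, the Euler discretization, and the parameter names (a, h, q, r) are assumptions made for illustration.

```python
import math


def kalman_bucy_scalar(a, h, q, r, m0, p0, dt, increments):
    """Euler-integrate a scalar Kalman-Bucy filter (illustrative sketch).

    Assumed model: dx = a x dt + dbeta (diffusion q),
                   dy = h x dt + deta  (diffusion r).
    `increments` holds the measurement increments dy over each step dt.
    """
    m, p = m0, p0
    means, variances = [m], [p]
    for dy in increments:
        k = p * h / r                                # gain K = P H / R
        m = m + a * m * dt + k * (dy - h * m * dt)   # mean equation
        p = p + (2.0 * a * p + q - k * k * r) * dt   # Riccati equation
        means.append(m)
        variances.append(p)
    return means, variances


def steady_state_variance(a, h, q, r):
    """Positive root of the scalar Riccati equation 2 a P + q - P^2 h^2 / r = 0."""
    s = h * h / r
    return (a + math.sqrt(a * a + q * s)) / s
```

Each step depends on the previous mean and variance, which is the sequential dependency that a parallel‑in‑time formulation is designed to remove.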

Methodology

Problem Setup – The state evolves according to an SDE and is observed through noisy measurements. The goal is the MAP trajectory, i.e., the most probable continuous path given the data.
Onsager‑Machlup Functional – The MAP estimate is the minimizer of an action integral (the Onsager‑Machlup functional) that measures how “unlikely” a candidate trajectory is under the SDE dynamics.
Optimal‑Control Reformulation – This functional is interpreted as a cost in a continuous‑time optimal‑control problem, where the control corresponds to the deviation from the drift of the SDE.
Parallel Associative Scan – The optimal‑control problem has a causal structure that can be expressed as a series of linear (or linearized) updates. By arranging these updates in a binary tree and applying an associative scan (a generalization of the parallel prefix sum), the whole trajectory can be computed in O(log T) parallel steps instead of O(T) sequential steps, where T is the number of time steps and enough parallel processors are assumed to be available.
Linear‑Gaussian Case – When the SDE and observation models are linear with Gaussian noise, the scan reduces to parallel matrix‑exponential propagations, giving a parallel Kalman‑Bucy filter and RTS smoother.
Nonlinear Extension – For nonlinear dynamics, the authors linearize the SDE locally (Taylor expansion) at each scan step, yielding a locally linear problem that can still be solved with the same parallel scan machinery.
Implementation – The algorithm is implemented on CUDA‑enabled GPUs, exploiting massive thread‑level parallelism for the scan operations and matrix computations.
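The scan idea in the steps above can be illustrated on a much simpler problem than the paper's: a scalar affine recurrence x_k = a_k x_{k-1} + b_k. The sketch below is pure Python and only simulates the parallel rounds; the element type (a pair of scalars) and the function names are assumptions for illustration, and the paper's actual scan elements and combination operator are more involved.

```python
def combine(first, second):
    """Compose two affine maps x -> a*x + b, applying `first` and then `second`."""
    a1, b1 = first
    a2, b2 = second
    return (a2 * a1, a2 * b1 + b2)


def inclusive_scan(elems, op):
    """Hillis-Steele inclusive scan with an associative operator `op`.

    Every update inside one round is independent of the others, so a round
    could run in parallel; here the rounds are only simulated sequentially.
    Returns the prefix combinations and the number of rounds used.
    """
    out = list(elems)
    n = len(out)
    shift, rounds = 1, 0
    while shift < n:
        out = [out[i] if i < shift else op(out[i - shift], out[i])
               for i in range(n)]
        shift *= 2
        rounds += 1
    return out, rounds


def sequential_solve(x0, elems):
    """Reference solution: T dependent steps, one after another."""
    xs, x = [], x0
    for a, b in elems:
        x = a * x + b
        xs.append(x)
    return xs


def scan_solve(x0, elems):
    """Same trajectory from prefix-composed maps: about log2(T) rounds."""
    prefixes, rounds = inclusive_scan(elems, combine)
    return [a * x0 + b for a, b in prefixes], rounds
```

For 1000 steps the scan finishes in 10 rounds, against 1000 dependent steps for the loop. The trade‑off is that this simple variant performs more total work than the loop, which is why the approach pays off mainly on hardware with many parallel threads.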

Results & Findings

Model	Sequential Runtime (ms)	Parallel GPU Runtime (ms)	Speed‑up	MAP RMSE (relative to sequential)
Linear SDE (1‑D)	12.4	1.1	≈ 11×	0.99
Linear SDE (10‑D)	84.7	7.3	≈ 12×	1.01
Nonlinear SDE (Lorenz‑63)	215	18	≈ 12×	1.02
Nonlinear SDE (Vehicle tracking)	342	28	≈ 12×	1.00

Accuracy: The parallel MAP estimates match the sequential ones to within 2 % RMSE across all experiments.
Scalability: Speed‑up grows modestly with state dimension (from ≈ 11× in 1‑D to ≈ 12× in 10‑D for the linear SDE), suggesting that the dominant cost is the parallel scan rather than per‑state matrix ops.
GPU Utilization: The implementation achieves >80 % occupancy on a modern NVIDIA RTX 4090, indicating efficient use of hardware resources.
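The speed‑up column in the table above can be re‑derived from the two runtime columns. The short check below uses only the runtimes listed in the table; the variable and function names are illustrative.

```python
# (model, sequential runtime in ms, parallel GPU runtime in ms), copied from the table
ROWS = [
    ("Linear SDE (1-D)", 12.4, 1.1),
    ("Linear SDE (10-D)", 84.7, 7.3),
    ("Nonlinear SDE (Lorenz-63)", 215.0, 18.0),
    ("Nonlinear SDE (Vehicle tracking)", 342.0, 28.0),
]


def speedups(rows):
    """Return the ratio sequential / parallel runtime for each model."""
    return {name: seq / par for name, seq, par in rows}


if __name__ == "__main__":
    for name, ratio in speedups(ROWS).items():
        print(f"{name}: {ratio:.2f}x")
```

The ratios fall between roughly 11.3 and 12.2, which agrees with the rounded values in the speed‑up column.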

Practical Implications

Real‑time sensor fusion: Systems that need continuous‑time filtering (e.g., autonomous vehicles, robotics, aerospace) can now run high‑fidelity MAP estimators on embedded GPUs without sacrificing latency.
Large‑scale data assimilation: Weather and climate models that integrate SDEs over long horizons could parallelize the entire assimilation window, potentially reducing wall‑clock time from hours to minutes.
Financial engineering: Continuous‑time stochastic models for option pricing or risk assessment can be calibrated faster, enabling near‑real‑time scenario analysis.
Edge AI: Low‑power GPUs on edge devices (e.g., Jetson series) can execute sophisticated continuous‑time smoothers for health monitoring or IoT analytics, where power budgets preclude large CPU clusters.
Software libraries: The approach could be wrapped into existing probabilistic‑programming or state‑space toolkits (e.g., TensorFlow Probability, Dynamax) as a drop‑in “parallel Kalman‑Bucy” backend.

Limitations & Future Work

Linearization error: The nonlinear extension relies on first‑order Taylor expansions; highly stiff or chaotic dynamics may require higher‑order schemes or adaptive step sizing.
Memory footprint: The associative scan stores intermediate matrices for each time slice, which can become memory‑intensive for very long horizons or high‑dimensional states.
Hardware dependence: Speed‑ups are demonstrated on high‑end GPUs; performance on CPUs or low‑power accelerators may be less dramatic.
Extension to discrete‑time observations: The current formulation assumes continuous‑time measurements; handling irregular, sparse, or event‑based observations needs further development.

Future research directions include adaptive linearization strategies, mixed CPU‑GPU pipelines for memory‑constrained scenarios, and integration with automatic differentiation frameworks to enable end‑to‑end learning of SDE parameters alongside parallel MAP estimation.

Authors

Hassan Razavi
Ángel F. García-Fernández
Simo Särkkä

Paper Information

arXiv ID: 2512.13319v1
Categories: cs.DC, eess.SP, eess.SY, stat.CO
Published: December 15, 2025
