[Paper] Accelerated Execution of Bayesian Neural Networks using a Single Probabilistic Forward Pass and Code Generation

Published: November 28, 2025 at 01:35 PM EST
4 min read
Source: arXiv 2511.23440v1

Overview

This paper tackles one of the biggest hurdles for Bayesian Neural Networks (BNNs): the heavy computational cost of propagating uncertainty. By introducing a Probabilistic Forward Pass (PFP) that replaces costly Monte‑Carlo sampling with a single deterministic pass, the authors show how to train, compile, and run BNNs efficiently on low‑power ARM CPUs. The result is a practical pipeline that brings trustworthy, uncertainty‑aware inference to embedded devices.

Key Contributions

  • Probabilistic Forward Pass (PFP): An analytic approximation of Stochastic Variational Inference (SVI) that models weights and activations as Gaussians, enabling single‑pass uncertainty propagation (the closed‑form moment updates are sketched just after this list).
  • End‑to‑end deployment pipeline: From training to code generation, the authors integrate PFP‑BNNs with the TVM compiler and a custom library of Gaussian‑propagating operators for MLPs and CNNs.
  • Aggressive optimization: Combining manual operator design, TVM’s auto‑tuning, and ARM‑specific code generation yields up to a 4200× speed‑up for small‑batch inference compared with traditional SVI.
  • Comprehensive evaluation: On the Dirty‑MNIST benchmark, PFP‑BNNs match SVI‑BNNs in classification accuracy, calibrated uncertainty, and out‑of‑domain (OOD) detection while drastically cutting compute time.
  • Open‑source artifacts: The paper provides the TVM operator library and tuning scripts, facilitating reproducibility and further research.
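
To make the single‑pass idea concrete: under the independence and Gaussianity assumptions above, the first two moments of a linear layer propagate in closed form. The identities below are the standard moment‑matching rules for sums of products of independent Gaussians; the paper’s exact formulation may differ in detail.

```latex
% Linear layer y_j = \sum_i w_{ji} x_i + b_j with mutually independent
% Gaussian weights, inputs, and biases:
\mu_{y_j} = \sum_i \mu_{w_{ji}} \, \mu_{x_i} + \mu_{b_j},
\qquad
\sigma^2_{y_j} = \sum_i \left( \sigma^2_{w_{ji}} \sigma^2_{x_i}
                             + \sigma^2_{w_{ji}} \mu^2_{x_i}
                             + \mu^2_{w_{ji}} \sigma^2_{x_i} \right)
                 + \sigma^2_{b_j}.
```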

Methodology

  1. Gaussian Assumption: Both weights and intermediate activations are modeled as independent Gaussian random variables, which permits closed‑form formulas for the mean and variance after each linear or convolutional layer (see the NumPy sketch after this list).
  2. Probabilistic Operators: Custom TVM operators compute the propagated mean and variance in a single forward pass, eliminating the need for Monte‑Carlo weight sampling.
  3. Training Pipeline: The network is trained with the usual variational objective, the evidence lower bound (expected log‑likelihood minus a KL term; see the formula after this list), using standard stochastic gradient descent; the forward pass during training already follows the PFP formulation.
  4. Code Generation & Tuning (a TVM sketch follows this list):
    • The high‑level PFP graph is lowered to TVM’s intermediate representation.
    • ARM‑specific schedules (vectorization, tiling, loop unrolling) are applied.
    • Auto‑tuning searches the schedule space to find the fastest kernel configuration for each operator.
  5. Deployment: The tuned kernels are compiled into a static library that can be linked to any ARM‑based runtime (e.g., Raspberry Pi, microcontrollers with Cortex‑M cores).
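
Steps 1–2 can be illustrated with a minimal NumPy sketch of one PFP layer (dense followed by ReLU). This demonstrates the general moment‑matching technique under the stated assumptions, not the authors’ operator library; all names and shapes are hypothetical.

```python
# Minimal single-pass moment propagation for one dense + ReLU layer.
# Illustrative only; not the paper's implementation.
import numpy as np
from scipy.stats import norm

def pfp_linear(mu_x, var_x, mu_w, var_w, mu_b, var_b):
    """Propagate mean/variance through y = W x + b with independent Gaussians.
    mu_x, var_x: (d_in,) input moments; mu_w, var_w: (d_out, d_in) weight moments.
    """
    mu_y = mu_w @ mu_x + mu_b
    # Var(w x) = var_w*var_x + var_w*mu_x^2 + mu_w^2*var_x under independence.
    var_y = var_w @ var_x + var_w @ (mu_x ** 2) + (mu_w ** 2) @ var_x + var_b
    return mu_y, var_y

def pfp_relu(mu, var):
    """Closed-form moments of ReLU applied to a Gaussian (rectified Gaussian)."""
    sigma = np.sqrt(var)
    alpha = mu / sigma
    cdf, pdf = norm.cdf(alpha), norm.pdf(alpha)
    mu_out = mu * cdf + sigma * pdf
    second = (mu ** 2 + var) * cdf + mu * sigma * pdf  # E[ReLU(x)^2]
    return mu_out, np.maximum(second - mu_out ** 2, 0.0)

# One deterministic pass -- no Monte-Carlo weight samples.
rng = np.random.default_rng(0)
mu_x, var_x = rng.normal(size=16), np.full(16, 0.1)
mu_w, var_w = rng.normal(size=(8, 16)) * 0.3, np.full((8, 16), 0.01)
mu_b, var_b = np.zeros(8), np.full(8, 0.01)
mu_h, var_h = pfp_relu(*pfp_linear(mu_x, var_x, mu_w, var_w, mu_b, var_b))
print(mu_h.shape, var_h.shape)  # (8,) (8,)
```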
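
The variational objective in step 3 is the usual evidence lower bound (ELBO). Writing $q_\theta(w)$ for the variational posterior and $p(w)$ for the prior:

```latex
\mathcal{L}(\theta) =
\underbrace{\mathbb{E}_{q_\theta(w)}\!\left[\log p(\mathcal{D} \mid w)\right]}_{\text{expected log-likelihood}}
\;-\;
\underbrace{\mathrm{KL}\!\left(q_\theta(w) \,\|\, p(w)\right)}_{\text{complexity penalty}}
```

Under PFP the expectation term is evaluated from the analytically propagated output moments instead of Monte‑Carlo weight samples, so training and inference share the same deterministic pass.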
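
For step 4, the sketch below shows how one such Gaussian‑propagating dense operator might be expressed and cross‑compiled with TVM’s tensor‑expression API. The operator definition, naive schedule, and target string are illustrative assumptions, not the authors’ library; a production kernel would also be auto‑tuned.

```python
# Hypothetical PFP dense operator in TVM's tensor-expression (TE) API.
import tvm
from tvm import te

d_in, d_out = 256, 128
mu_x  = te.placeholder((d_in,), name="mu_x", dtype="float32")
var_x = te.placeholder((d_in,), name="var_x", dtype="float32")
mu_w  = te.placeholder((d_out, d_in), name="mu_w", dtype="float32")
var_w = te.placeholder((d_out, d_in), name="var_w", dtype="float32")

# Mean: an ordinary matrix-vector product over the propagated means.
k = te.reduce_axis((0, d_in), name="k")
mu_y = te.compute((d_out,),
                  lambda j: te.sum(mu_w[j, k] * mu_x[k], axis=k),
                  name="mu_y")

# Variance: the closed-form term from the independence assumption.
r = te.reduce_axis((0, d_in), name="r")
var_y = te.compute(
    (d_out,),
    lambda j: te.sum(var_w[j, r] * var_x[r]
                     + var_w[j, r] * mu_x[r] * mu_x[r]
                     + mu_w[j, r] * mu_w[j, r] * var_x[r], axis=r),
    name="var_y")

s = te.create_schedule([mu_y.op, var_y.op])
s[mu_y].parallel(mu_y.op.axis[0])    # naive schedule; real kernels would be
s[var_y].parallel(var_y.op.axis[0])  # tiled/vectorized and auto-tuned

# LLVM cross-codegen for a NEON-capable ARM core (e.g. Cortex-A53);
# use plain "llvm" to build for the host instead.
target = "llvm -mtriple=aarch64-linux-gnu -mattr=+neon"
mod = tvm.build(s, [mu_x, var_x, mu_w, var_w, mu_y, var_y], target=target)
# Exporting a linkable artifact needs an aarch64 toolchain on the build host.
mod.export_library("pfp_dense.so")
```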

Results & Findings

| Metric | SVI‑BNN (baseline) | PFP‑BNN (this work) |
|---|---|---|
| Inference latency (batch = 1) | ~120 ms (ARM Cortex‑A53) | ≈0.03 ms (≈4200× faster) |
| Classification accuracy (Dirty‑MNIST) | 92.1 % | 92.0 % |
| Expected Calibration Error (ECE) | 0.045 | 0.047 |
| OOD detection AUROC | 0.89 | 0.88 |
| Memory footprint (model + buffers) | 12 MB | 9 MB (≈25 % reduction) |

The numbers show that PFP retains the predictive performance and uncertainty quality of full SVI while slashing both runtime and memory usage. Ablation studies confirm that the speed‑up stems primarily from the single‑pass formulation; the TVM optimizations add another 2–3× improvement on top of the raw analytic gains.

Practical Implications

  • Edge AI with safety guarantees: Developers can now embed Bayesian inference in devices like drones, wearables, or industrial sensors without sacrificing real‑time constraints.
  • Reduced power consumption: Fewer arithmetic operations and memory accesses translate directly into lower energy draw—critical for battery‑operated systems.
  • Simplified deployment workflow: By leveraging TVM, the same high‑level model definition can be compiled for a wide range of ARM targets, avoiding hand‑written assembly or vendor‑specific SDKs.
  • Better OOD handling in production: Applications that must reject anomalous inputs (e.g., medical imaging, autonomous navigation) can benefit from calibrated uncertainty without the latency penalty of traditional BNNs.
  • Foundation for further acceleration: The Gaussian‑propagation kernels could be integrated into hardware accelerators (e.g., FPGA or ASIC) that natively support mean/variance arithmetic, opening the door to even faster Bayesian inference.

Limitations & Future Work

  • Gaussian restriction: The analytic formulas rely on the assumption that weight and activation distributions remain Gaussian; this may limit expressiveness for highly non‑linear tasks.
  • Scalability to very deep networks: While the paper demonstrates MLPs and modest CNNs, extending PFP to very deep architectures (e.g., ResNets, Transformers) may encounter numerical stability issues.
  • Limited benchmark diversity: Evaluation is focused on Dirty‑MNIST; broader tests on vision (ImageNet), speech, or time‑series data would strengthen claims about generality.
  • Hardware scope: Experiments target ARM CPUs; exploring GPU, DSP, or dedicated AI accelerators could reveal additional performance gains or constraints.

Future research directions include relaxing the Gaussian assumption (e.g., mixture models), integrating PFP into mixed‑precision pipelines, and co‑designing custom silicon that accelerates mean‑variance arithmetic for Bayesian inference.

Authors

  • Bernhard Klein
  • Falk Selker
  • Hendrik Borras
  • Sophie Steger
  • Franz Pernkopf
  • Holger Fröning

Paper Information

  • arXiv ID: 2511.23440v1
  • Categories: cs.LG, cs.AR, cs.DC, stat.ML
  • Published: November 28, 2025
  • PDF: https://arxiv.org/pdf/2511.23440v1