[Paper] Quantifying Uncertainty in Machine Learning-Based Pervasive Systems: Application to Human Activity Recognition
Source: arXiv - 2512.09775v1
Overview
The paper “Quantifying Uncertainty in Machine Learning‑Based Pervasive Systems: Application to Human Activity Recognition” tackles a practical problem that many developers face today: how to know when a machine‑learning model is likely to be wrong in real‑time, embedded (pervasive) applications. By adapting a suite of uncertainty‑estimation techniques, the authors show how to surface confidence scores for activity‑recognition models and let systems react safely when confidence drops.
Key Contributions
- Unified uncertainty‑estimation pipeline that combines several state‑of‑the‑art methods (Monte‑Carlo dropout, deep ensembles, and predictive entropy) for on‑device inference.
- Runtime relevance assessment: a lightweight decision module that flags predictions whose uncertainty exceeds a configurable threshold.
- Empirical validation on Human Activity Recognition (HAR) datasets covering diverse sensors, activities, and users, demonstrating that uncertainty correlates with misclassifications.
- Tooling for domain experts: visual dashboards and APIs that expose confidence metrics, enabling iterative model improvement and safer deployment.
- Guidelines for integrating uncertainty quantification (UQ) into pervasive systems without breaking real‑time constraints.
Methodology
- Model selection – The authors start with a conventional deep neural network (CNN/LSTM hybrid) trained on raw sensor streams (accelerometer, gyroscope, etc.).
- Uncertainty techniques – Three complementary methods are applied:
  - Monte‑Carlo (MC) dropout: dropout layers stay active at inference, and the model is run multiple times to obtain a distribution of predictions.
  - Deep ensembles: several independently trained models vote, and the variance among their outputs serves as an uncertainty proxy.
  - Predictive entropy: the entropy of the softmax output is computed directly as a scalar confidence measure.
- Fusion & thresholding – The three signals are normalized and combined (weighted average) into a single “relevance score.” A simple rule‑based threshold then decides whether to accept or reject each prediction at runtime; minimal code sketches of the sampling and fusion steps follow this list.
- Evaluation protocol – Experiments use public HAR benchmarks (e.g., UCI HAR, PAMAP2) and a custom in‑the‑wild dataset collected from smartphones and wearables. The authors report standard classification metrics and uncertainty‑aware metrics such as coverage (fraction of predictions kept) vs. accuracy trade‑off.
- Tool support – An open‑source Python library wraps the pipeline, exposing a REST endpoint and a lightweight dashboard for visualizing confidence over time.
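The two sampling‑based techniques above (MC dropout and deep ensembles) boil down to running a classifier several times and collecting the softmax outputs. Below is a minimal PyTorch sketch under that assumption; the `HarNet` architecture, layer sizes, and function names are illustrative and not taken from the paper.

```python
# Minimal PyTorch sketch of the two sampling-based techniques described above.
# `HarNet`, the layer sizes, and the function names are illustrative assumptions,
# not the paper's actual code.
import torch
import torch.nn as nn


class HarNet(nn.Module):
    """Toy CNN classifier over windows of 6-channel inertial data (acc + gyro)."""

    def __init__(self, n_channels: int = 6, n_classes: int = 6) -> None:
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv1d(n_channels, 32, kernel_size=5, padding=2),
            nn.ReLU(),
            nn.Dropout(p=0.3),           # kept active at inference for MC dropout
            nn.AdaptiveAvgPool1d(1),
        )
        self.head = nn.Linear(32, n_classes)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.head(self.features(x).squeeze(-1))


@torch.no_grad()
def mc_dropout_probs(model: nn.Module, x: torch.Tensor, n_samples: int = 20) -> torch.Tensor:
    """Run T stochastic forward passes with dropout enabled; return (T, B, C) softmax samples."""
    model.train()                        # keeps dropout active (no BatchNorm in this toy net)
    samples = [torch.softmax(model(x), dim=-1) for _ in range(n_samples)]
    return torch.stack(samples)


@torch.no_grad()
def ensemble_probs(models: list[nn.Module], x: torch.Tensor) -> torch.Tensor:
    """Collect softmax outputs from independently trained ensemble members: (M, B, C)."""
    for m in models:
        m.eval()
    return torch.stack([torch.softmax(m(x), dim=-1) for m in models])


if __name__ == "__main__":
    x = torch.randn(8, 6, 128)                              # 8 windows, 6 channels, 128 samples
    mc = mc_dropout_probs(HarNet(), x)                      # (20, 8, 6)
    ens = ensemble_probs([HarNet() for _ in range(5)], x)   # (5, 8, 6)
    # Variance across samples/members is the uncertainty proxy used downstream.
    print(mc.var(dim=0).mean().item(), ens.var(dim=0).mean().item())
```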
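The fusion‑and‑thresholding step can then be sketched as follows. This is an assumption‑laden illustration: the per‑signal min‑max normalization, the weights, the 0.5 threshold, and the convention that a higher fused score means a less trustworthy prediction are placeholders, not values or definitions reported by the authors.

```python
# Sketch of the fusion-and-thresholding step: normalise the three uncertainty
# signals and combine them with a weighted average. Weights and threshold are
# illustrative placeholders; the paper selects them empirically.
import numpy as np


def predictive_entropy(probs: np.ndarray) -> np.ndarray:
    """Entropy of a softmax output, shape (B, C) -> (B,)."""
    return -(probs * np.log(probs + 1e-12)).sum(axis=-1)


def fused_uncertainty(mc_samples: np.ndarray,
                      ens_samples: np.ndarray,
                      weights: tuple = (0.3, 0.3, 0.4)) -> np.ndarray:
    """Combine MC-dropout variance, ensemble variance, and predictive entropy.

    mc_samples: (T, B, C) softmax samples; ens_samples: (M, B, C) softmax outputs.
    Returns one scalar per prediction; higher means less trustworthy.
    """
    mean_probs = np.concatenate([mc_samples, ens_samples]).mean(axis=0)   # (B, C)
    signals = np.stack([
        mc_samples.var(axis=0).mean(axis=-1),    # MC-dropout disagreement
        ens_samples.var(axis=0).mean(axis=-1),   # ensemble disagreement
        predictive_entropy(mean_probs),          # entropy of the averaged prediction
    ])                                           # (3, B)
    # Min-max normalise each signal over the batch before the weighted average.
    lo = signals.min(axis=1, keepdims=True)
    hi = signals.max(axis=1, keepdims=True)
    normalised = (signals - lo) / (hi - lo + 1e-12)
    return np.tensordot(np.asarray(weights), normalised, axes=1)          # (B,)


def accept_prediction(mc_samples, ens_samples, threshold: float = 0.5) -> np.ndarray:
    """Rule-based gate: keep a prediction only while its fused uncertainty stays below the threshold."""
    return fused_uncertainty(mc_samples, ens_samples) < threshold


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    mc = rng.dirichlet(np.ones(6), size=(20, 8))    # (20, 8, 6) fake MC-dropout samples
    ens = rng.dirichlet(np.ones(6), size=(5, 8))    # (5, 8, 6) fake ensemble outputs
    print(accept_prediction(mc, ens))
```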
Results & Findings
| Metric | Baseline (no UQ) | With MC‑Dropout | With Ensembles | Combined Approach |
|---|---|---|---|---|
| Overall accuracy | 92.3 % | 91.8 % | 92.0 % | 92.1 % |
| Coverage @ 95 % accuracy | 68 % | 74 % | 77 % | 81 % |
| Misclassification detection (AUROC) | 0.71 | 0.78 | 0.81 | 0.86 |
- Uncertainty correlates strongly with errors: predictions flagged as “high‑uncertainty” are wrong 68 % of the time versus 8 % for low‑uncertainty ones.
- Runtime overhead stays below 15 ms on a typical ARM Cortex‑A53, satisfying most real‑time HAR use‑cases.
- Domain experts using the dashboard could pinpoint sensor drift (e.g., a loose wristband) as the cause of rising uncertainty, prompting a quick recalibration.
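The two uncertainty‑aware metrics in the table, coverage at a target accuracy and AUROC for misclassification detection, can be computed from per‑prediction uncertainty scores as sketched below. These are plausible definitions for illustration; the exact formulations used by the authors may differ.

```python
# Sketch of the two uncertainty-aware evaluation views reported above. The data
# in the demo block is synthetic; only the metric definitions are of interest.
import numpy as np
from sklearn.metrics import roc_auc_score


def coverage_at_accuracy(y_true: np.ndarray,
                         y_pred: np.ndarray,
                         uncertainty: np.ndarray,
                         target_accuracy: float = 0.95) -> float:
    """Largest fraction of most-confident predictions whose accuracy still meets the target."""
    order = np.argsort(uncertainty)                       # most confident first
    correct = (y_true[order] == y_pred[order]).astype(float)
    running_acc = np.cumsum(correct) / np.arange(1, len(correct) + 1)
    keep = np.where(running_acc >= target_accuracy)[0]
    return float(keep[-1] + 1) / len(correct) if len(keep) else 0.0


def misclassification_auroc(y_true, y_pred, uncertainty) -> float:
    """AUROC of using the uncertainty score to flag wrong predictions (1 = error)."""
    errors = (y_true != y_pred).astype(int)
    return roc_auc_score(errors, uncertainty)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    y_true = rng.integers(0, 6, size=1000)
    y_pred = y_true.copy()
    flip = rng.random(1000) < 0.1                         # inject ~10 % errors
    y_pred[flip] = (y_pred[flip] + 1) % 6
    unc = rng.random(1000) + flip * 0.5                   # errors tend to be more uncertain
    print(coverage_at_accuracy(y_true, y_pred, unc),
          misclassification_auroc(y_true, y_pred, unc))
```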
Practical Implications
- Safer edge AI: Devices can automatically fall back to rule‑based heuristics or request user confirmation when confidence is low, reducing the risk of erroneous actions (e.g., false fall detection).
- Dynamic model management: Cloud services can schedule retraining only for data segments that consistently trigger high uncertainty, saving bandwidth and compute.
- Compliance & auditability: Providing a confidence score satisfies emerging regulations that demand explainability for AI‑driven decisions in health, automotive, and workplace safety.
- Developer ergonomics: The supplied library abstracts away the math, letting engineers add uncertainty checks with a single line of code (`model.predict_with_uncertainty(x)`); see the usage sketch after this list.
- Cross‑domain portability: Although evaluated on HAR, the same pipeline can be transplanted to other pervasive tasks such as gesture recognition, environmental monitoring, or on‑device speech commands.
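A usage sketch of the fallback pattern described above: the `predict_with_uncertainty` name echoes the paper's description of its library, but the return signature, threshold value, and helper functions here are illustrative stubs rather than the actual API.

```python
# Usage sketch of the low-confidence fallback pattern. MockHarModel, the
# threshold, and the heuristic fallback are stand-ins, not the paper's library.
import random

UNCERTAINTY_THRESHOLD = 0.5          # assumed value; tuned on held-out data in practice
ACTIVITIES = ["walking", "sitting", "standing", "running"]


class MockHarModel:
    """Stand-in for a classifier wrapped with uncertainty estimation."""

    def predict_with_uncertainty(self, window):
        # A real wrapper would run MC dropout / ensembles; here we fake it.
        return random.choice(ACTIVITIES), random.random()


def heuristic_classifier(window):
    """Conservative rule-based fallback (e.g. a variance-of-acceleration rule)."""
    return "unknown"


def handle_window(model, window):
    label, uncertainty = model.predict_with_uncertainty(window)
    if uncertainty < UNCERTAINTY_THRESHOLD:
        return label                                  # confident: act on the prediction
    # Low confidence: fall back and flag the segment for inspection / retraining.
    print(f"high uncertainty ({uncertainty:.2f}); falling back to heuristic")
    return heuristic_classifier(window)


print(handle_window(MockHarModel(), window=[0.0] * 128))
```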
Limitations & Future Work
- Scalability to larger models: MC‑dropout and ensembles multiply inference cost; the paper notes that for heavyweight CNNs (e.g., ResNet‑50) the latency may exceed acceptable bounds on low‑power chips.
- Threshold selection: The current rule‑based threshold is static; adaptive thresholds that consider context (battery level, user activity) remain unexplored.
- Dataset diversity: Experiments focus on a handful of public HAR datasets; broader validation on heterogeneous sensor setups (e.g., smart glasses, IoT hubs) is needed.
- Explainability beyond confidence: Future work could integrate feature‑level attribution (e.g., SHAP) to tell why a prediction is uncertain, further aiding debugging and user trust.
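As a purely hypothetical illustration of the context‑adaptive thresholding the authors leave to future work, a policy might tighten or relax the rejection threshold based on device state; all rules and constants below are invented for the sketch.

```python
# Hypothetical shape of a context-adaptive threshold: stricter when errors are
# costly, looser when the device must conserve resources. Constants are made up.
from dataclasses import dataclass


@dataclass
class DeviceContext:
    battery_level: float        # 0.0 - 1.0
    safety_critical: bool       # e.g. fall-detection mode


def adaptive_threshold(ctx: DeviceContext, base: float = 0.5) -> float:
    threshold = base
    if ctx.safety_critical:
        threshold -= 0.2        # accept only very confident predictions when errors are costly
    if ctx.battery_level < 0.2:
        threshold += 0.1        # accept more to avoid expensive fallbacks on low battery
    return min(max(threshold, 0.1), 0.9)


print(adaptive_threshold(DeviceContext(battery_level=0.15, safety_critical=True)))
```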
Bottom line: By quantifying uncertainty at runtime, developers can turn “black‑box” ML models into more predictable components of pervasive systems, unlocking safer, more maintainable AI‑enabled products. The authors provide both a solid experimental foundation and ready‑to‑use tooling—making it a compelling read for anyone building AI on the edge.
Authors
- Vladimir Balditsyn
- Philippe Lalanda
- German Vega
- Stéphanie Chollet
Paper Information
- arXiv ID: 2512.09775v1
- Categories: cs.SE, cs.AI
- Published: December 10, 2025