[Paper] Quantifying Uncertainty in Machine Learning-Based Pervasive Systems: Application to Human Activity Recognition
Source: arXiv - 2512.09775v1
Overview
The paper “Quantifying Uncertainty in Machine Learning‑Based Pervasive Systems: Application to Human Activity Recognition” tackles a practical problem that many developers face today: how to know when a machine‑learning model is likely to be wrong in real‑time, embedded (pervasive) applications. By adapting a suite of uncertainty‑estimation techniques, the authors show how to surface confidence scores for activity‑recognition models and let systems react safely when confidence drops.
Key Contributions
- Unified uncertainty‑estimation pipeline that combines several state‑of‑the‑art methods (Monte‑Carlo dropout, deep ensembles, and predictive entropy) for on‑device inference.
- Runtime relevance assessment: a lightweight decision module that flags predictions whose uncertainty exceeds a configurable threshold.
- Empirical validation on Human Activity Recognition (HAR) datasets covering diverse sensors, activities, and users, demonstrating that uncertainty correlates with misclassifications.
- Tooling for domain experts: visual dashboards and APIs that expose confidence metrics, enabling iterative model improvement and safer deployment.
- Guidelines for integrating uncertainty quantification (UQ) into pervasive systems without breaking real‑time constraints.
Methodology
- Model selection – The authors start with a conventional deep neural network (CNN/LSTM hybrid) trained on raw sensor streams (accelerometer, gyroscope, etc.).
- Uncertainty techniques – Three complementary methods are applied:
  - Monte‑Carlo (MC) dropout: dropout layers stay active at inference, and the model is run multiple times to obtain a distribution of predictions.
  - Deep ensembles: several independently trained models vote, and the variance among their outputs serves as an uncertainty proxy.
  - Predictive entropy: the entropy of the softmax output is computed directly as a scalar confidence measure.
- Fusion & thresholding – The three signals are normalized and combined (weighted average) into a single “relevance score.” A simple rule‑based threshold then decides whether to accept or reject each prediction at runtime; minimal code sketches of the sampling and fusion steps follow this list.
- Evaluation protocol – Experiments use public HAR benchmarks (e.g., UCI HAR, PAMAP2) and a custom in‑the‑wild dataset collected from smartphones and wearables. The authors report standard classification metrics and uncertainty‑aware metrics such as coverage (fraction of predictions kept) vs. accuracy trade‑off.
- Tool support – An open‑source Python library wraps the pipeline, exposing a REST endpoint and a lightweight dashboard for visualizing confidence over time.
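The two sampling‑based techniques above (MC dropout and deep ensembles) boil down to running a classifier several times and collecting the softmax outputs. Below is a minimal PyTorch sketch under that assumption; the `HarNet` architecture, layer sizes, and function names are illustrative and not taken from the paper.

```python
# Minimal PyTorch sketch of the two sampling-based techniques described above.
# `HarNet`, the layer sizes, and the function names are illustrative assumptions,
# not the paper's actual code.
import torch
import torch.nn as nn


class HarNet(nn.Module):
    """Toy CNN classifier over windows of 6-channel inertial data (acc + gyro)."""

    def __init__(self, n_channels: int = 6, n_classes: int = 6) -> None:
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv1d(n_channels, 32, kernel_size=5, padding=2),
            nn.ReLU(),
            nn.Dropout(p=0.3),           # kept active at inference for MC dropout
            nn.AdaptiveAvgPool1d(1),
        )
        self.head = nn.Linear(32, n_classes)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.head(self.features(x).squeeze(-1))


@torch.no_grad()
def mc_dropout_probs(model: nn.Module, x: torch.Tensor, n_samples: int = 20) -> torch.Tensor:
    """Run T stochastic forward passes with dropout enabled; return (T, B, C) softmax samples."""
    model.train()                        # keeps dropout active (no BatchNorm in this toy net)
    samples = [torch.softmax(model(x), dim=-1) for _ in range(n_samples)]
    return torch.stack(samples)


@torch.no_grad()
def ensemble_probs(models: list[nn.Module], x: torch.Tensor) -> torch.Tensor:
    """Collect softmax outputs from independently trained ensemble members: (M, B, C)."""
    for m in models:
        m.eval()
    return torch.stack([torch.softmax(m(x), dim=-1) for m in models])


if __name__ == "__main__":
    x = torch.randn(8, 6, 128)                              # 8 windows, 6 channels, 128 samples
    mc = mc_dropout_probs(HarNet(), x)                      # (20, 8, 6)
    ens = ensemble_probs([HarNet() for _ in range(5)], x)   # (5, 8, 6)
    # Variance across samples/members is the uncertainty proxy used downstream.
    print(mc.var(dim=0).mean().item(), ens.var(dim=0).mean().item())
```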
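The fusion‑and‑thresholding step can then be sketched as follows. This is an assumption‑laden illustration: the per‑signal min‑max normalization, the weights, the 0.5 threshold, and the convention that a higher fused score means a less trustworthy prediction are placeholders, not values or definitions reported by the authors.

```python
# Sketch of the fusion-and-thresholding step: normalise the three uncertainty
# signals and combine them with a weighted average. Weights and threshold are
# illustrative placeholders; the paper selects them empirically.
import numpy as np


def predictive_entropy(probs: np.ndarray) -> np.ndarray:
    """Entropy of a softmax output, shape (B, C) -> (B,)."""
    return -(probs * np.log(probs + 1e-12)).sum(axis=-1)


def fused_uncertainty(mc_samples: np.ndarray,
                      ens_samples: np.ndarray,
                      weights: tuple = (0.3, 0.3, 0.4)) -> np.ndarray:
    """Combine MC-dropout variance, ensemble variance, and predictive entropy.

    mc_samples: (T, B, C) softmax samples; ens_samples: (M, B, C) softmax outputs.
    Returns one scalar per prediction; higher means less trustworthy.
    """
    mean_probs = np.concatenate([mc_samples, ens_samples]).mean(axis=0)   # (B, C)
    signals = np.stack([
        mc_samples.var(axis=0).mean(axis=-1),    # MC-dropout disagreement
        ens_samples.var(axis=0).mean(axis=-1),   # ensemble disagreement
        predictive_entropy(mean_probs),          # entropy of the averaged prediction
    ])                                           # (3, B)
    # Min-max normalise each signal over the batch before the weighted average.
    lo = signals.min(axis=1, keepdims=True)
    hi = signals.max(axis=1, keepdims=True)
    normalised = (signals - lo) / (hi - lo + 1e-12)
    return np.tensordot(np.asarray(weights), normalised, axes=1)          # (B,)


def accept_prediction(mc_samples, ens_samples, threshold: float = 0.5) -> np.ndarray:
    """Rule-based gate: keep a prediction only while its fused uncertainty stays below the threshold."""
    return fused_uncertainty(mc_samples, ens_samples) < threshold


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    mc = rng.dirichlet(np.ones(6), size=(20, 8))    # (20, 8, 6) fake MC-dropout samples
    ens = rng.dirichlet(np.ones(6), size=(5, 8))    # (5, 8, 6) fake ensemble outputs
    print(accept_prediction(mc, ens))
```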
Results & Findings
| Metric | Baseline (no UQ) | With MC‑Dropout | With Ensembles | Combined Approach |
|---|---|---|---|---|
| Overall accuracy | 92.3 % | 91.8 % | 92.0 % | 92.1 % |
| Coverage @ 95 % accuracy | 68 % | 74 % | 77 % | 81 % |
| Misclassification detection (AUROC) | 0.71 | 0.78 | 0.81 | 0.86 |
- Uncertainty correlates strongly with errors: predictions flagged as “high‑uncertainty” are wrong 68 % of the time versus 8 % for low‑uncertainty ones.
- Runtime overhead stays below 15 ms on a typical ARM Cortex‑A53, satisfying most real‑time HAR use‑cases.
- Domain experts using the dashboard could pinpoint sensor drift (e.g., a loose wristband) as the cause of rising uncertainty, prompting a quick recalibration.
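The two uncertainty‑aware metrics in the table, coverage at a target accuracy and AUROC for misclassification detection, can be computed from per‑prediction uncertainty scores as sketched below. These are plausible definitions for illustration; the exact formulations used by the authors may differ.

```python
# Sketch of the two uncertainty-aware evaluation views reported above. The data
# in the demo block is synthetic; only the metric definitions are of interest.
import numpy as np
from sklearn.metrics import roc_auc_score


def coverage_at_accuracy(y_true: np.ndarray,
                         y_pred: np.ndarray,
                         uncertainty: np.ndarray,
                         target_accuracy: float = 0.95) -> float:
    """Largest fraction of most-confident predictions whose accuracy still meets the target."""
    order = np.argsort(uncertainty)                       # most confident first
    correct = (y_true[order] == y_pred[order]).astype(float)
    running_acc = np.cumsum(correct) / np.arange(1, len(correct) + 1)
    keep = np.where(running_acc >= target_accuracy)[0]
    return float(keep[-1] + 1) / len(correct) if len(keep) else 0.0


def misclassification_auroc(y_true, y_pred, uncertainty) -> float:
    """AUROC of using the uncertainty score to flag wrong predictions (1 = error)."""
    errors = (y_true != y_pred).astype(int)
    return roc_auc_score(errors, uncertainty)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    y_true = rng.integers(0, 6, size=1000)
    y_pred = y_true.copy()
    flip = rng.random(1000) < 0.1                         # inject ~10 % errors
    y_pred[flip] = (y_pred[flip] + 1) % 6
    unc = rng.random(1000) + flip * 0.5                   # errors tend to be more uncertain
    print(coverage_at_accuracy(y_true, y_pred, unc),
          misclassification_auroc(y_true, y_pred, unc))
```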
Practical Implications
- Safer edge AI: Devices can automatically fall back to rule‑based heuristics or request user confirmation when confidence is low, reducing the risk of erroneous actions (e.g., false fall detection).
- Dynamic model management: Cloud services can schedule retraining only for data segments that consistently trigger high uncertainty, saving bandwidth and compute.
- Compliance & auditability: Providing a confidence score satisfies emerging regulations that demand explainability for AI‑driven decisions in health, automotive, and workplace safety.
- Developer ergonomics: The supplied library abstracts away the math, letting engineers add uncertainty checks with a single line of code (`model.predict_with_uncertainty(x)`); see the usage sketch after this list.
- Cross‑domain portability: Although evaluated on HAR, the same pipeline can be transplanted to other pervasive tasks such as gesture recognition, environmental monitoring, or on‑device speech commands.
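A usage sketch of the fallback pattern described above: the `predict_with_uncertainty` name echoes the paper's description of its library, but the return signature, threshold value, and helper functions here are illustrative stubs rather than the actual API.

```python
# Usage sketch of the low-confidence fallback pattern. MockHarModel, the
# threshold, and the heuristic fallback are stand-ins, not the paper's library.
import random

UNCERTAINTY_THRESHOLD = 0.5          # assumed value; tuned on held-out data in practice
ACTIVITIES = ["walking", "sitting", "standing", "running"]


class MockHarModel:
    """Stand-in for a classifier wrapped with uncertainty estimation."""

    def predict_with_uncertainty(self, window):
        # A real wrapper would run MC dropout / ensembles; here we fake it.
        return random.choice(ACTIVITIES), random.random()


def heuristic_classifier(window):
    """Conservative rule-based fallback (e.g. a variance-of-acceleration rule)."""
    return "unknown"


def handle_window(model, window):
    label, uncertainty = model.predict_with_uncertainty(window)
    if uncertainty < UNCERTAINTY_THRESHOLD:
        return label                                  # confident: act on the prediction
    # Low confidence: fall back and flag the segment for inspection / retraining.
    print(f"high uncertainty ({uncertainty:.2f}); falling back to heuristic")
    return heuristic_classifier(window)


print(handle_window(MockHarModel(), window=[0.0] * 128))
```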
Limitations & Future Work
- Scalability to larger models: MC‑dropout and ensembles multiply inference cost; the paper notes that for heavyweight CNNs (e.g., ResNet‑50) the latency may exceed acceptable bounds on low‑power chips.
- Threshold selection: The current rule‑based threshold is static; adaptive thresholds that consider context (battery level, user activity) remain unexplored.
- Dataset diversity: Experiments focus on a handful of public HAR datasets; broader validation on heterogeneous sensor setups (e.g., smart glasses, IoT hubs) is needed.
- Explainability beyond confidence: Future work could integrate feature‑level attribution (e.g., SHAP) to tell why a prediction is uncertain, further aiding debugging and user trust.
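As a purely hypothetical illustration of the context‑adaptive thresholding the authors leave to future work, a policy might tighten or relax the rejection threshold based on device state; all rules and constants below are invented for the sketch.

```python
# Hypothetical shape of a context-adaptive threshold: stricter when errors are
# costly, looser when the device must conserve resources. Constants are made up.
from dataclasses import dataclass


@dataclass
class DeviceContext:
    battery_level: float        # 0.0 - 1.0
    safety_critical: bool       # e.g. fall-detection mode


def adaptive_threshold(ctx: DeviceContext, base: float = 0.5) -> float:
    threshold = base
    if ctx.safety_critical:
        threshold -= 0.2        # accept only very confident predictions when errors are costly
    if ctx.battery_level < 0.2:
        threshold += 0.1        # accept more to avoid expensive fallbacks on low battery
    return min(max(threshold, 0.1), 0.9)


print(adaptive_threshold(DeviceContext(battery_level=0.15, safety_critical=True)))
```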
Bottom line: By quantifying uncertainty at runtime, developers can turn “black‑box” ML models into more predictable components of pervasive systems, unlocking safer, more maintainable AI‑enabled products. The authors provide both a solid experimental foundation and ready‑to‑use tooling—making it a compelling read for anyone building AI on the edge.
Authors
- Vladimir Balditsyn
- Philippe Lalanda
- German Vega
- Stéphanie Chollet
Paper Information
- arXiv ID: 2512.09775v1
- Categories: cs.SE, cs.AI
- Published: December 10, 2025