[Paper] Quantifying Uncertainty in Machine Learning-Based Pervasive Systems: Application to Human Activity Recognition

Published: December 10, 2025 at 10:56 AM EST
4 min read
Source: arXiv - 2512.09775v1

Overview

The paper “Quantifying Uncertainty in Machine Learning‑Based Pervasive Systems: Application to Human Activity Recognition” tackles a practical problem that many developers face today: how to know when a machine‑learning model is likely to be wrong in real‑time, embedded (pervasive) applications. By adapting a suite of uncertainty‑estimation techniques, the authors show how to surface confidence scores for activity‑recognition models and let systems react safely when confidence drops.

Key Contributions

  • Unified uncertainty‑estimation pipeline that combines several state‑of‑the‑art methods (Monte‑Carlo dropout, deep ensembles, and predictive entropy) for on‑device inference.
  • Runtime relevance assessment: a lightweight decision module that flags predictions whose uncertainty exceeds a configurable threshold.
  • Empirical validation on Human Activity Recognition (HAR) datasets covering diverse sensors, activities, and users, demonstrating that uncertainty correlates with misclassifications.
  • Tooling for domain experts: visual dashboards and APIs that expose confidence metrics, enabling iterative model improvement and safer deployment.
  • Guidelines for integrating uncertainty quantification (UQ) into pervasive systems without breaking real‑time constraints.

Methodology

  1. Model selection – The authors start with a conventional deep neural network (CNN/LSTM hybrid) trained on raw sensor streams (accelerometer, gyroscope, etc.).
  2. Uncertainty techniques – Three complementary methods are applied:
    • Monte‑Carlo (MC) dropout: dropout layers stay active at inference, and the model is run multiple times to obtain a distribution of predictions.
    • Deep ensembles: several independently trained models vote, and variance among their outputs serves as an uncertainty proxy.
    • Predictive entropy: the entropy of the softmax output is computed directly as a scalar confidence measure.
  3. Fusion & thresholding – The three signals are normalized and combined (weighted average) to produce a single “relevance score.” A simple rule‑based threshold decides whether to accept or reject a prediction at runtime (steps 2–3 are sketched in code after this list).
  4. Evaluation protocol – Experiments use public HAR benchmarks (e.g., UCI HAR, PAMAP2) and a custom in‑the‑wild dataset collected from smartphones and wearables. The authors report standard classification metrics and uncertainty‑aware metrics such as coverage (fraction of predictions kept) vs. accuracy trade‑off.
  5. Tool support – An open‑source Python library wraps the pipeline, exposing a REST endpoint and a lightweight dashboard for visualizing confidence over time.
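
To make steps 2 and 3 concrete, here is a minimal PyTorch sketch of sampling, fusing, and thresholding. The stand-in model, number of MC samples, fusion weights, and the 0.5 threshold are all hypothetical choices for illustration; this is not the authors' library, only the underlying idea.

```python
import torch
import torch.nn.functional as F


def mc_dropout_predict(model, x, n_samples=20):
    """Monte-Carlo dropout: keep dropout active at inference and average the
    softmax outputs of repeated stochastic forward passes."""
    model.train()  # leaves dropout layers active for the stochastic passes
    with torch.no_grad():
        probs = torch.stack([F.softmax(model(x), dim=-1) for _ in range(n_samples)])
    return probs.mean(dim=0), probs.var(dim=0).mean(dim=-1)  # mean probs, per-sample variance


def ensemble_predict(models, x):
    """Deep ensemble: average independently trained models; the variance
    across members serves as an uncertainty proxy."""
    with torch.no_grad():
        probs = torch.stack([F.softmax(m(x), dim=-1) for m in models])
    return probs.mean(dim=0), probs.var(dim=0).mean(dim=-1)


def predictive_entropy(probs, eps=1e-12):
    """Entropy of the averaged softmax output, one scalar per sample."""
    return -(probs * (probs + eps).log()).sum(dim=-1)


def relevance_score(signals, weights):
    """Min-max normalise each uncertainty signal and combine them with a
    weighted average into a single fused score per sample."""
    fused = torch.zeros_like(signals[0])
    for s, w in zip(signals, weights):
        s = (s - s.min()) / (s.max() - s.min() + 1e-12)
        fused = fused + w * s
    return fused


if __name__ == "__main__":
    # Stand-in model, ensemble, and sensor batch purely for demonstration
    # (8 windows of 6 IMU features, 5 activity classes).
    model = torch.nn.Sequential(torch.nn.Linear(6, 32), torch.nn.ReLU(),
                                torch.nn.Dropout(0.5), torch.nn.Linear(32, 5))
    ensemble = [torch.nn.Sequential(torch.nn.Linear(6, 16), torch.nn.ReLU(),
                                    torch.nn.Linear(16, 5)) for _ in range(3)]
    x = torch.randn(8, 6)

    mean_probs, mc_var = mc_dropout_predict(model, x)
    _, ens_var = ensemble_predict(ensemble, x)
    entropy = predictive_entropy(mean_probs)

    score = relevance_score([mc_var, ens_var, entropy], weights=[0.3, 0.3, 0.4])
    accept = score < 0.5  # the threshold is configurable; 0.5 is purely illustrative
    print(accept)
```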

Results & Findings

| Metric | Baseline (no UQ) | With MC‑Dropout | With Ensembles | Combined Approach |
| --- | --- | --- | --- | --- |
| Overall accuracy | 92.3 % | 91.8 % | 92.0 % | 92.1 % |
| Coverage @ 95 % accuracy | 68 % | 74 % | 77 % | 81 % |
| Misclassification detection (AUROC) | 0.71 | 0.78 | 0.81 | 0.86 |
  • Uncertainty correlates strongly with errors: predictions flagged as “high‑uncertainty” are wrong 68 % of the time versus 8 % for low‑uncertainty ones.
  • Runtime overhead stays below 15 ms on a typical ARM Cortex‑A53, satisfying most real‑time HAR use‑cases.
  • Domain experts using the dashboard could pinpoint sensor drift (e.g., a loose wristband) as the cause of rising uncertainty, prompting a quick recalibration.
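
The “Coverage @ 95 % accuracy” row measures how many predictions can be kept, most confident first, while the retained subset stays at or above the target accuracy. A minimal NumPy sketch of that computation follows, with synthetic data standing in for real model outputs; it is not the authors' evaluation code.

```python
import numpy as np


def coverage_at_accuracy(uncertainty, correct, target=0.95):
    """Largest fraction of predictions that can be kept (lowest uncertainty
    first) while the accuracy of the kept subset stays >= target."""
    order = np.argsort(uncertainty)                    # most confident first
    correct = np.asarray(correct, dtype=float)[order]
    running_acc = np.cumsum(correct) / np.arange(1, len(correct) + 1)
    keep = np.nonzero(running_acc >= target)[0]
    return 0.0 if len(keep) == 0 else (keep[-1] + 1) / len(correct)


# Synthetic example: errors tend to carry higher uncertainty, mimicking the
# correlation the paper reports between uncertainty and misclassification.
rng = np.random.default_rng(0)
correct = rng.random(1000) > 0.08              # ~92 % baseline accuracy
uncertainty = rng.random(1000) + 0.5 * (~correct)

print(coverage_at_accuracy(uncertainty, correct))
```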

Practical Implications

  • Safer edge AI: Devices can automatically fall back to rule‑based heuristics or request user confirmation when confidence is low, reducing the risk of erroneous actions (e.g., false fall detection).
  • Dynamic model management: Cloud services can schedule retraining only for data segments that consistently trigger high uncertainty, saving bandwidth and compute.
  • Compliance & auditability: Providing a confidence score satisfies emerging regulations that demand explainability for AI‑driven decisions in health, automotive, and workplace safety.
  • Developer ergonomics: The supplied library abstracts away the math, letting engineers add uncertainty checks with a single line of code (model.predict_with_uncertainty(x)); a usage sketch follows this list.
  • Cross‑domain portability: Although evaluated on HAR, the same pipeline can be transplanted to other pervasive tasks—gesture recognition, environmental monitoring, or on‑device speech commands.
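
The summary does not give the signature of predict_with_uncertainty, so the sketch below assumes it returns a label together with a fused uncertainty score; the threshold and the rule‑based fallback are equally hypothetical. It only illustrates the low‑confidence fallback pattern described in the first bullet.

```python
UNCERTAINTY_THRESHOLD = 0.5  # would be tuned per deployment


def rule_based_guess(window):
    """Hypothetical fallback heuristic, e.g. thresholding accelerometer
    magnitude; a real system might instead ask the user to confirm."""
    return "unknown"


def classify_window(model, window):
    """Accept the model's activity label only when its fused uncertainty is
    low; otherwise fall back to the heuristic above."""
    # Assumed API shape: (predicted_label, fused_uncertainty_score).
    label, uncertainty = model.predict_with_uncertainty(window)
    if uncertainty < UNCERTAINTY_THRESHOLD:
        return label
    return rule_based_guess(window)
```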

Limitations & Future Work

  • Scalability to larger models: MC‑dropout and ensembles multiply inference cost; the paper notes that for heavyweight CNNs (e.g., ResNet‑50) the latency may exceed acceptable bounds on low‑power chips.
  • Threshold selection: The current rule‑based threshold is static; adaptive thresholds that consider context (battery level, user activity) remain unexplored.
  • Dataset diversity: Experiments focus on a handful of public HAR datasets; broader validation on heterogeneous sensor setups (e.g., smart glasses, IoT hubs) is needed.
  • Explainability beyond confidence: Future work could integrate feature‑level attribution (e.g., SHAP) to tell why a prediction is uncertain, further aiding debugging and user trust.

Bottom line: By quantifying uncertainty at runtime, developers can turn “black‑box” ML models into more predictable components of pervasive systems, unlocking safer, more maintainable AI‑enabled products. The authors provide both a solid experimental foundation and ready‑to‑use tooling—making it a compelling read for anyone building AI on the edge.

Authors

  • Vladimir Balditsyn
  • Philippe Lalanda
  • German Vega
  • Stéphanie Chollet

Paper Information

  • arXiv ID: 2512.09775v1
  • Categories: cs.SE, cs.AI
  • Published: December 10, 2025