[Paper] Improving Zero-shot ADL Recognition with Large Language Models through Event-based Context and Confidence
Source: arXiv - 2601.08241v1
Overview
The paper tackles a core challenge for smart‑home and IoT applications: recognizing Activities of Daily Living (ADLs) without the cost of manually labeling sensor data. By pairing large language models (LLMs) with a smarter way of slicing sensor streams—event‑based segmentation—the authors achieve zero‑shot ADL recognition that rivals (and sometimes beats) traditional supervised methods, while also providing a built‑in confidence score for each prediction.
Key Contributions
- Event‑based segmentation: Replaces the common fixed‑window (time‑based) approach with a segmentation that aligns with natural activity boundaries, better matching LLMs’ contextual reasoning.
- Confidence estimation: Introduces a lightweight metric that quantifies how trustworthy each LLM‑generated activity label is, enabling downstream systems to act only on high‑confidence predictions.
- Zero‑shot performance boost: Demonstrates that even relatively small LLMs (e.g., Gemma‑3 27B) outperform state‑of‑the‑art supervised classifiers on realistic, multi‑sensor datasets.
- Comprehensive evaluation: Benchmarks on complex, real‑world smart‑home recordings, showing consistent gains across different activity complexities and sensor setups.
Methodology
- Data collection – Sensor streams (motion, temperature, door contacts, etc.) from a smart home are treated as a continuous time series.
- Event‑based segmentation – Instead of chopping the stream into fixed‑size windows, the system detects change points (e.g., a door opening, a motion burst) and creates segments that correspond to actual events. This yields variable‑length chunks that more naturally describe a single activity.
- Prompt engineering – Each segment is transformed into a textual description (e.g., “motion detected in kitchen, fridge door opened”) and fed to an LLM along with a prompt that asks the model to label the ADL (e.g., “What activity is likely happening?”).
- Confidence measure – The authors extract the LLM’s internal token probabilities and compute a normalized score that reflects how decisively the model chose a label versus alternatives.
- Evaluation – The pipeline is compared against:
  - Traditional time‑window LLM baselines.
  - Supervised classifiers trained on the same sensor data (e.g., Random Forest, CNN‑LSTM).
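The segmentation and prompting steps above can be sketched in a few lines. This is an illustrative reconstruction, not the authors' code: the `SensorEvent` type, the inactivity-gap heuristic, and the `max_gap` value are all assumptions standing in for whatever change-point logic the paper actually uses.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class SensorEvent:
    t: float       # timestamp in seconds
    sensor: str    # e.g. "kitchen_motion", "fridge_door"
    value: str     # e.g. "ON", "OPEN"

def segment_by_events(events: List[SensorEvent],
                      max_gap: float = 120.0) -> List[List[SensorEvent]]:
    """Split a sensor stream into variable-length segments: a new segment
    starts whenever the inactivity gap between consecutive events exceeds
    `max_gap` seconds, approximating natural activity boundaries."""
    segments: List[List[SensorEvent]] = []
    current: List[SensorEvent] = []
    for ev in events:
        if current and ev.t - current[-1].t > max_gap:
            segments.append(current)
            current = []
        current.append(ev)
    if current:
        segments.append(current)
    return segments

def describe(segment: List[SensorEvent]) -> str:
    """Render one segment as the textual description fed to the LLM."""
    parts = [f"{ev.sensor} {ev.value}" for ev in segment]
    return ("Sensor events observed: " + ", ".join(parts) +
            ". What activity is likely happening?")

if __name__ == "__main__":
    stream = [
        SensorEvent(0, "kitchen_motion", "ON"),
        SensorEvent(15, "fridge_door", "OPEN"),
        SensorEvent(40, "fridge_door", "CLOSED"),
        SensorEvent(600, "bathroom_motion", "ON"),  # long gap -> new segment
    ]
    segments = segment_by_events(stream)
    print(len(segments))        # 2
    print(describe(segments[0]))
```

Because segment boundaries come from the data rather than a clock, each prompt describes one coherent activity, which is what gives the LLM usable context.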
Results & Findings
| Approach | F1‑score (average) | Confidence‑AUC |
|---|---|---|
| Time‑window LLM (Gemma‑3 27B) | 0.71 | 0.68 |
| Event‑based LLM (Gemma‑3 27B) | 0.84 | 0.89 |
| Supervised CNN‑LSTM (full labels) | 0.78 | N/A |
| Supervised Random Forest | 0.73 | N/A |
- Event‑based segmentation yields a 13‑percentage‑point absolute F1 improvement over the time‑window baseline (0.71 → 0.84) and outperforms the best supervised model despite using zero labeled ADL data.
- The confidence metric achieves an AUC of 0.89, meaning it reliably separates correct from incorrect predictions; developers can set a threshold to filter low‑confidence outputs.
- Even with a 27‑billion‑parameter LLM, the system runs comfortably on a single GPU, showing that the approach scales to modest hardware.
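The confidence-gating idea above can be sketched as follows. This is a minimal illustration, not the paper's exact metric: it assumes per-label log-probabilities are available from the LLM and normalizes them with a softmax; the 0.7 threshold is an arbitrary example value.

```python
import math
from typing import Dict, Optional, Tuple

def label_confidence(label_logprobs: Dict[str, float]) -> Dict[str, float]:
    """Turn per-label log-probabilities (e.g. the LLM's score for each
    candidate ADL label) into a normalized confidence distribution via a
    numerically stable softmax."""
    m = max(label_logprobs.values())
    exps = {k: math.exp(v - m) for k, v in label_logprobs.items()}
    z = sum(exps.values())
    return {k: e / z for k, e in exps.items()}

def gated_prediction(label_logprobs: Dict[str, float],
                     threshold: float = 0.7) -> Tuple[Optional[str], float]:
    """Return the top label only if its normalized confidence clears the
    threshold; otherwise abstain (None) so downstream systems can skip
    low-confidence outputs."""
    conf = label_confidence(label_logprobs)
    label, score = max(conf.items(), key=lambda kv: kv[1])
    return (label, score) if score >= threshold else (None, score)

if __name__ == "__main__":
    scores = {"cooking": -0.2, "sleeping": -3.0, "cleaning": -4.0}
    print(gated_prediction(scores))  # confident -> ("cooking", ~0.92)
```

Raising the threshold trades coverage for precision, which is exactly the knob a safety-critical application would tune.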
Practical Implications
- Rapid deployment: Smart‑home vendors can roll out activity‑aware services (e.g., fall detection, energy‑saving routines) without the months‑long data‑annotation phase.
- Edge‑friendly pipelines: Event‑based segmentation reduces the amount of data sent to the LLM, lowering bandwidth and latency—critical for on‑device or fog‑computing scenarios.
- Safety‑critical gating: The confidence score lets applications trigger alerts (e.g., medical emergency) only when the model is sufficiently sure, reducing false alarms.
- Cross‑domain portability: Because the method relies on generic sensor events and a language model, it can be adapted to other domains (industrial IoT, office occupancy monitoring) with minimal re‑engineering.
Limitations & Future Work
- Sensor diversity: The experiments focus on a specific smart‑home sensor suite; performance on highly heterogeneous or sparse sensor setups remains to be validated.
- LLM size vs. latency: While 27B‑parameter models are manageable on modern GPUs, ultra‑low‑power edge devices may still need smaller models or quantized variants.
- Confidence calibration: The proposed metric works well empirically, but a formal probabilistic calibration (e.g., temperature scaling) could further improve reliability.
- User privacy: Translating raw sensor data into textual prompts may expose sensitive patterns; future work should explore privacy‑preserving prompt encoding.
Bottom line: By aligning sensor segmentation with the way LLMs think—through events rather than arbitrary time windows—and adding a confidence filter, this research opens the door to truly plug‑and‑play activity recognition in smart environments, cutting out the data‑labeling bottleneck while keeping developers in control of reliability.
Authors
- Michele Fiori
- Gabriele Civitarese
- Marco Colussi
- Claudio Bettini
Paper Information
- arXiv ID: 2601.08241v1
- Categories: cs.CV, cs.DC
- Published: January 13, 2026