[Paper] Improving Zero-shot ADL Recognition with Large Language Models through Event-based Context and Confidence

Published: January 13, 2026 at 12:58 AM EST
4 min read
Source: arXiv


Overview

The paper tackles a core challenge for smart‑home and IoT applications: recognizing Activities of Daily Living (ADLs) without the costly need for manually labeled sensor data. By marrying large language models (LLMs) with a smarter way of slicing sensor streams—event‑based segmentation—the authors achieve zero‑shot ADL recognition that rivals (and sometimes beats) traditional supervised methods, while also providing a built‑in confidence score for each prediction.

Key Contributions

  • Event‑based segmentation: Replaces the common fixed‑window (time‑based) approach with a segmentation that aligns with natural activity boundaries, better matching LLMs’ contextual reasoning.
  • Confidence estimation: Introduces a lightweight metric that quantifies how trustworthy each LLM‑generated activity label is, enabling downstream systems to act only on high‑confidence predictions.
  • Zero‑shot performance boost: Demonstrates that even relatively small LLMs (e.g., Gemma‑3 27B) outperform state‑of‑the‑art supervised classifiers on realistic, multi‑sensor datasets.
  • Comprehensive evaluation: Benchmarks on complex, real‑world smart‑home recordings, showing consistent gains across different activity complexities and sensor setups.

Methodology

  1. Data collection – Sensor streams (motion, temperature, door contacts, etc.) from a smart home are treated as a continuous time series.
  2. Event‑based segmentation – Instead of chopping the stream into fixed‑size windows, the system detects change points (e.g., a door opening, a motion burst) and creates segments that correspond to actual events. This yields variable‑length chunks that more naturally describe a single activity.
  3. Prompt engineering – Each segment is transformed into a textual description (e.g., “motion detected in kitchen, fridge door opened”) and fed to an LLM along with a prompt that asks the model to label the ADL (e.g., “What activity is likely happening?”).
  4. Confidence measure – The authors extract the LLM’s internal token probabilities and compute a normalized score that reflects how decisively the model chose a label versus alternatives.
  5. Evaluation – The pipeline is compared against:
    • Traditional time‑window LLM baselines.
    • Supervised classifiers trained on the same sensor data (e.g., Random Forest, CNN‑LSTM).
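The paper's exact change-point detector and prompt template are not reproduced here, but steps 2 and 3 can be sketched with a simple idle-gap heuristic standing in for change-point detection. All names (`SensorEvent`, `max_gap`, the sensor labels) are illustrative assumptions, not the authors' implementation:

```python
from dataclasses import dataclass

@dataclass
class SensorEvent:
    timestamp: float  # seconds since start of recording
    sensor: str       # e.g. "kitchen_motion", "fridge_door"
    value: str        # e.g. "ON", "OPEN"

def segment_by_events(events, max_gap=30.0):
    """Group a sensor stream into variable-length segments.

    A new segment starts whenever the idle gap between consecutive
    events exceeds `max_gap` seconds -- a simple stand-in for the
    paper's change-point detection.
    """
    segments, current = [], []
    for ev in events:
        if current and ev.timestamp - current[-1].timestamp > max_gap:
            segments.append(current)
            current = []
        current.append(ev)
    if current:
        segments.append(current)
    return segments

def segment_to_prompt(segment):
    """Render one segment as a textual description for the LLM."""
    lines = [f"{ev.timestamp:.0f}s: {ev.sensor} -> {ev.value}" for ev in segment]
    return ("Sensor events:\n" + "\n".join(lines)
            + "\nWhat activity of daily living is likely happening?")

stream = [
    SensorEvent(0, "kitchen_motion", "ON"),
    SensorEvent(5, "fridge_door", "OPEN"),
    SensorEvent(90, "bedroom_motion", "ON"),  # 85 s idle gap -> new segment
]
segments = segment_by_events(stream, max_gap=30.0)
print(len(segments))  # 2
```

The key property is that segment boundaries follow the data (idle gaps, door events) rather than the clock, so each prompt tends to describe one coherent activity.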

Results & Findings

| Approach | F1‑score (average) | Confidence‑AUC |
| --- | --- | --- |
| Time‑window LLM (Gemma‑3 27B) | 0.71 | 0.68 |
| Event‑based LLM (Gemma‑3 27B) | 0.84 | 0.89 |
| Supervised CNN‑LSTM (full labels) | 0.78 | N/A |
| Supervised Random Forest | 0.73 | N/A |
  • Event‑based segmentation yields a 13‑point absolute F1 improvement over the time‑window baseline (0.71 → 0.84) and outperforms the best supervised model despite using zero labeled ADL data.
  • The confidence metric achieves an AUC of 0.89, meaning it reliably separates correct from incorrect predictions; developers can set a threshold to filter low‑confidence outputs.
  • Even with a 27‑billion‑parameter LLM, the system runs comfortably on a single GPU, showing that the approach scales to modest hardware.
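The paper derives confidence from the LLM's token probabilities; one plausible reading, sketched below under that assumption, is to softmax‑normalize the total log‑probability of each candidate label, so the score reflects how decisively the model preferred the winner over the alternatives (the label names and log‑probability values are made up for illustration):

```python
import math

def label_confidence(label_logprobs):
    """Normalized confidence for the chosen activity label.

    `label_logprobs` maps each candidate label to the total
    log-probability the LLM assigned to its tokens. Softmax
    normalization turns these into a distribution; the winner's
    share is the confidence score.
    """
    max_lp = max(label_logprobs.values())  # shift for numerical stability
    exps = {k: math.exp(lp - max_lp) for k, lp in label_logprobs.items()}
    total = sum(exps.values())
    best = max(exps, key=exps.get)
    return best, exps[best] / total

label, conf = label_confidence({"cooking": -0.4, "cleaning": -2.1, "sleeping": -6.0})
print(label, round(conf, 2))  # cooking 0.84
```

A score near 1.0 means the model strongly preferred one label; scores near 1/num_labels signal a toss‑up that downstream systems may want to discard.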

Practical Implications

  • Rapid deployment: Smart‑home vendors can roll out activity‑aware services (e.g., fall detection, energy‑saving routines) without the months‑long data‑annotation phase.
  • Edge‑friendly pipelines: Event‑based segmentation reduces the amount of data sent to the LLM, lowering bandwidth and latency—critical for on‑device or fog‑computing scenarios.
  • Safety‑critical gating: The confidence score lets applications trigger alerts (e.g., medical emergency) only when the model is sufficiently sure, reducing false alarms.
  • Cross‑domain portability: Because the method relies on generic sensor events and a language model, it can be adapted to other domains (industrial IoT, office occupancy monitoring) with minimal re‑engineering.
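The safety‑critical gating described above reduces to a simple threshold check; the 0.8 cutoff here is illustrative and would be tuned on validation data, not taken from the paper:

```python
ALERT_THRESHOLD = 0.8  # illustrative; tune on validation data

def act_on_prediction(label, confidence, threshold=ALERT_THRESHOLD):
    """Trigger an action only for high-confidence predictions.

    Low-confidence outputs are deferred (e.g. logged for later
    review) instead of raising a possibly false alarm.
    """
    if confidence >= threshold:
        return f"trigger:{label}"
    return "defer"

print(act_on_prediction("fall_detected", 0.93))  # trigger:fall_detected
print(act_on_prediction("fall_detected", 0.55))  # defer
```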

Limitations & Future Work

  • Sensor diversity: The experiments focus on a specific smart‑home sensor suite; performance on highly heterogeneous or sparse sensor setups remains to be validated.
  • LLM size vs. latency: While 27B‑parameter models are manageable on modern GPUs, ultra‑low‑power edge devices may still need smaller models or quantized variants.
  • Confidence calibration: The proposed metric works well empirically, but a formal probabilistic calibration (e.g., temperature scaling) could further improve reliability.
  • User privacy: Translating raw sensor data into textual prompts may expose sensitive patterns; future work should explore privacy‑preserving prompt encoding.
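The temperature scaling mentioned above is a standard post‑hoc calibration: a single scalar T is fitted on held‑out data to minimize negative log‑likelihood, softening overconfident probabilities. A minimal dependency‑free sketch (grid search instead of gradient descent; the validation data is synthetic):

```python
import math

def nll(logits, labels, T):
    """Average negative log-likelihood of softmax(logits / T)."""
    total = 0.0
    for row, y in zip(logits, labels):
        scaled = [z / T for z in row]
        m = max(scaled)  # shift for numerical stability
        log_z = m + math.log(sum(math.exp(z - m) for z in scaled))
        total += log_z - scaled[y]
    return total / len(labels)

def fit_temperature(logits, labels, grid=None):
    """Pick the temperature that minimizes validation NLL."""
    grid = grid or [0.5 + 0.1 * i for i in range(46)]  # 0.5 .. 5.0
    return min(grid, key=lambda T: nll(logits, labels, T))

# Synthetic validation set: the model is overconfident and wrong once,
# so the fitted temperature exceeds 1, softening the probabilities.
val_logits = [[4.0, 0.0], [4.0, 0.0], [0.0, 4.0], [4.0, 0.0]]
val_labels = [0, 0, 1, 1]  # last prediction is overconfidently wrong
T = fit_temperature(val_logits, val_labels)
```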

Bottom line: By aligning sensor segmentation with the way LLMs think—through events rather than arbitrary time windows—and adding a confidence filter, this research opens the door to truly plug‑and‑play activity recognition in smart environments, cutting out the data‑labeling bottleneck while keeping developers in control of reliability.

Authors

  • Michele Fiori
  • Gabriele Civitarese
  • Marco Colussi
  • Claudio Bettini

Paper Information

  • arXiv ID: 2601.08241v1
  • Categories: cs.CV, cs.DC
  • Published: January 13, 2026