[Paper] Improving Zero-shot ADL Recognition with Large Language Models through Event-based Context and Confidence
Source: arXiv - 2601.08241v1
Overview
The paper tackles a core challenge for smart‑home and IoT applications: recognizing Activities of Daily Living (ADLs) without the cost of manually labeling sensor data. By pairing large language models (LLMs) with a smarter way of slicing sensor streams—event‑based segmentation—the authors achieve zero‑shot ADL recognition that rivals (and sometimes beats) traditional supervised methods, while also providing a built‑in confidence score for each prediction.
Key Contributions
- Event‑based segmentation: Replaces the common fixed‑window (time‑based) approach with a segmentation that aligns with natural activity boundaries, better matching LLMs’ contextual reasoning.
- Confidence estimation: Introduces a lightweight metric that quantifies how trustworthy each LLM‑generated activity label is, enabling downstream systems to act only on high‑confidence predictions.
- Zero‑shot performance boost: Demonstrates that even relatively small LLMs (e.g., Gemma‑3 27B) outperform state‑of‑the‑art supervised classifiers on realistic, multi‑sensor datasets.
- Comprehensive evaluation: Benchmarks on complex, real‑world smart‑home recordings, showing consistent gains across different activity complexities and sensor setups.
Methodology
- Data collection – Sensor streams (motion, temperature, door contacts, etc.) from a smart home are treated as a continuous time series.
- Event‑based segmentation – Instead of chopping the stream into fixed‑size windows, the system detects change points (e.g., a door opening, a motion burst) and creates segments that correspond to actual events. This yields variable‑length chunks that more naturally describe a single activity.
- Prompt engineering – Each segment is transformed into a textual description (e.g., “motion detected in kitchen, fridge door opened”) and fed to an LLM along with a prompt that asks the model to label the ADL (e.g., “What activity is likely happening?”).
- Confidence measure – The authors extract the LLM’s internal token probabilities and compute a normalized score that reflects how decisively the model chose a label versus alternatives.
- Evaluation – The pipeline is compared against:
  - Traditional time‑window LLM baselines.
  - Supervised classifiers trained on the same sensor data (e.g., Random Forest, CNN‑LSTM).
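The segmentation and prompting steps above can be sketched in a few lines. This is an illustrative reconstruction, not the authors' code: the `SensorEvent` type, the inactivity-gap heuristic, and the `max_gap` value are all assumptions standing in for whatever change-point logic the paper actually uses.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class SensorEvent:
    t: float       # timestamp in seconds
    sensor: str    # e.g. "kitchen_motion", "fridge_door"
    value: str     # e.g. "ON", "OPEN"

def segment_by_events(events: List[SensorEvent],
                      max_gap: float = 120.0) -> List[List[SensorEvent]]:
    """Split a sensor stream into variable-length segments: a new segment
    starts whenever the inactivity gap between consecutive events exceeds
    `max_gap` seconds, approximating natural activity boundaries."""
    segments: List[List[SensorEvent]] = []
    current: List[SensorEvent] = []
    for ev in events:
        if current and ev.t - current[-1].t > max_gap:
            segments.append(current)
            current = []
        current.append(ev)
    if current:
        segments.append(current)
    return segments

def describe(segment: List[SensorEvent]) -> str:
    """Render one segment as the textual description fed to the LLM."""
    parts = [f"{ev.sensor} {ev.value}" for ev in segment]
    return ("Sensor events observed: " + ", ".join(parts) +
            ". What activity is likely happening?")

if __name__ == "__main__":
    stream = [
        SensorEvent(0, "kitchen_motion", "ON"),
        SensorEvent(15, "fridge_door", "OPEN"),
        SensorEvent(40, "fridge_door", "CLOSED"),
        SensorEvent(600, "bathroom_motion", "ON"),  # long gap -> new segment
    ]
    segments = segment_by_events(stream)
    print(len(segments))        # 2
    print(describe(segments[0]))
```

Because segment boundaries come from the data rather than a clock, each prompt describes one coherent activity, which is what gives the LLM usable context.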
Results & Findings
| Approach | F1‑score (average) | Confidence‑AUC |
|---|---|---|
| Time‑window LLM (Gemma‑3 27B) | 0.71 | 0.68 |
| Event‑based LLM (Gemma‑3 27B) | 0.84 | 0.89 |
| Supervised CNN‑LSTM (full labels) | 0.78 | N/A |
| Supervised Random Forest | 0.73 | N/A |
- Event‑based segmentation yields a 13‑percentage‑point absolute F1 improvement over the time‑window baseline (0.71 → 0.84) and outperforms the best supervised model despite using zero labeled ADL data.
- The confidence metric achieves an AUC of 0.89, meaning it reliably separates correct from incorrect predictions; developers can set a threshold to filter low‑confidence outputs.
- Even with a 27‑billion‑parameter LLM, the system runs comfortably on a single GPU, showing that the approach scales to modest hardware.
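The confidence-gating idea above can be sketched as follows. This is a minimal illustration, not the paper's exact metric: it assumes per-label log-probabilities are available from the LLM and normalizes them with a softmax; the 0.7 threshold is an arbitrary example value.

```python
import math
from typing import Dict, Optional, Tuple

def label_confidence(label_logprobs: Dict[str, float]) -> Dict[str, float]:
    """Turn per-label log-probabilities (e.g. the LLM's score for each
    candidate ADL label) into a normalized confidence distribution via a
    numerically stable softmax."""
    m = max(label_logprobs.values())
    exps = {k: math.exp(v - m) for k, v in label_logprobs.items()}
    z = sum(exps.values())
    return {k: e / z for k, e in exps.items()}

def gated_prediction(label_logprobs: Dict[str, float],
                     threshold: float = 0.7) -> Tuple[Optional[str], float]:
    """Return the top label only if its normalized confidence clears the
    threshold; otherwise abstain (None) so downstream systems can skip
    low-confidence outputs."""
    conf = label_confidence(label_logprobs)
    label, score = max(conf.items(), key=lambda kv: kv[1])
    return (label, score) if score >= threshold else (None, score)

if __name__ == "__main__":
    scores = {"cooking": -0.2, "sleeping": -3.0, "cleaning": -4.0}
    print(gated_prediction(scores))  # confident -> ("cooking", ~0.92)
```

Raising the threshold trades coverage for precision, which is exactly the knob a safety-critical application would tune.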
Practical Implications
- Rapid deployment: Smart‑home vendors can roll out activity‑aware services (e.g., fall detection, energy‑saving routines) without the months‑long data‑annotation phase.
- Edge‑friendly pipelines: Event‑based segmentation reduces the amount of data sent to the LLM, lowering bandwidth and latency—critical for on‑device or fog‑computing scenarios.
- Safety‑critical gating: The confidence score lets applications trigger alerts (e.g., medical emergency) only when the model is sufficiently sure, reducing false alarms.
- Cross‑domain portability: Because the method relies on generic sensor events and a language model, it can be adapted to other domains (industrial IoT, office occupancy monitoring) with minimal re‑engineering.
Limitations & Future Work
- Sensor diversity: The experiments focus on a specific smart‑home sensor suite; performance on highly heterogeneous or sparse sensor setups remains to be validated.
- LLM size vs. latency: While 27B‑parameter models are manageable on modern GPUs, ultra‑low‑power edge devices may still need smaller models or quantized variants.
- Confidence calibration: The proposed metric works well empirically, but a formal probabilistic calibration (e.g., temperature scaling) could further improve reliability.
- User privacy: Translating raw sensor data into textual prompts may expose sensitive patterns; future work should explore privacy‑preserving prompt encoding.
Bottom line: By aligning sensor segmentation with the way LLMs think—through events rather than arbitrary time windows—and adding a confidence filter, this research opens the door to truly plug‑and‑play activity recognition in smart environments, cutting out the data‑labeling bottleneck while keeping developers in control of reliability.
Authors
- Michele Fiori
- Gabriele Civitarese
- Marco Colussi
- Claudio Bettini
Paper Information
- arXiv ID: 2601.08241v1
- Categories: cs.CV, cs.DC
- Published: January 13, 2026