[Paper] Evaluating Zero-Shot and One-Shot Adaptation of Small Language Models in Leader-Follower Interaction

Published: February 26, 2026 at 01:20 PM EST
4 min read
Source: arXiv - 2602.23312v1

Overview

This paper investigates whether tiny language models—think “pocket‑sized” versions of GPT—can reliably tell who is the leader and who is the follower in a human‑robot dialogue. By benchmarking a 0.5 B‑parameter model (Qwen2.5‑0.5B) on a new leader‑follower dataset, the authors show that a modest amount of fine‑tuning can give 86 %+ accuracy with sub‑30 ms inference latency, making on‑device role assignment feasible for low‑power robots.

Key Contributions

  • New benchmark dataset for leader‑follower interaction, built from an existing HRI corpus and enriched with synthetic dialogues that capture realistic turn‑taking dynamics.
  • Systematic evaluation of small language models (SLMs) under four adaptation scenarios:
    1. Zero‑shot prompt engineering
    2. Zero‑shot fine‑tuning (single‑epoch on the whole dataset)
    3. One‑shot prompt engineering (one example in the prompt)
    4. One‑shot fine‑tuning (one example per class)
  • Empirical evidence that zero‑shot fine‑tuning outperforms all prompt‑based baselines, achieving 86.66 % accuracy while keeping inference latency at ~22 ms per utterance.
  • Analysis of context‑length limits showing that one‑shot setups degrade performance because the model’s limited context window can’t comfortably hold the extra example plus the dialogue history.
  • Open‑source release of the dataset, training scripts, and evaluation metrics to encourage reproducibility.

Methodology

  1. Data preparation – The authors start with a public HRI dialogue corpus, extract turns labeled as “leader” or “follower,” and generate additional synthetic exchanges using rule‑based templates (e.g., varying instruction phrasing, adding filler words). The final set contains ~12 k labeled utterances.
  2. Model selection – Qwen2.5‑0.5B is chosen as a representative SLM because it fits comfortably on a typical edge device (≈2 GB RAM).
  3. Adaptation strategies
    • Prompt engineering: a handcrafted prompt (“Is the speaker the leader or follower?”) is fed to the model with or without a single example (one‑shot).
    • Fine‑tuning: the model’s classification head is trained for a few epochs on the full dataset (zero‑shot) or on a single example per class (one‑shot).
  4. Evaluation – Accuracy, latency (ms per sample on a Raspberry Pi 4), and memory footprint are measured. A naïve “untrained baseline” that always predicts the majority class is included for reference.
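The synthetic-augmentation step (varied instruction phrasing plus filler words) can be sketched as a small template sampler. The templates, fillers, and objects below are hypothetical stand-ins; the paper's actual rules are in its released scripts:

```python
import random

# Hypothetical rule-based templates for each role.
LEADER_TEMPLATES = ["{filler}pick up the {obj}.", "{filler}bring me the {obj}, please."]
FOLLOWER_TEMPLATES = ["{filler}okay, picking up the {obj}.", "{filler}sure, on my way with the {obj}."]
FILLERS = ["", "um, ", "uh, ", "well, "]
OBJECTS = ["cup", "wrench", "box"]

def generate(n: int, seed: int = 0) -> list[tuple[str, str]]:
    """Generate n synthetic (utterance, label) pairs by varying
    template choice, filler words, and the referenced object."""
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        label = rng.choice(["leader", "follower"])
        tmpl = rng.choice(LEADER_TEMPLATES if label == "leader" else FOLLOWER_TEMPLATES)
        data.append((tmpl.format(filler=rng.choice(FILLERS), obj=rng.choice(OBJECTS)), label))
    return data
```

Sampling both the role and the surface form independently keeps the augmented set roughly balanced, which matters for the majority-class baseline used in evaluation.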

Results & Findings

| Adaptation | Accuracy | Latency (ms) | Notes |
|---|---|---|---|
| Untrained baseline | 52.3 % | 5 | Random guess on balanced data |
| Zero‑shot prompt | 68.1 % | 9 | Simple prompt works surprisingly well |
| One‑shot prompt | 71.4 % | 12 | Slight boost, but still limited |
| Zero‑shot fine‑tune | 86.7 % | 22 | Best trade‑off; stable across dialogue lengths |
| One‑shot fine‑tune | 62.9 % | 24 | Performance drops due to context overflow |

Key takeaways:

  • Fine‑tuning even a tiny model yields a large accuracy jump over pure prompting.
  • The latency remains well under 30 ms, suitable for real‑time robot control loops.
  • One‑shot approaches suffer because the extra example pushes the total token count beyond the model’s effective context window, causing the classifier to miss crucial cues.
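Per-utterance latency of the kind reported in the table can be measured with a simple timing harness. The classifier below is a trivial stand-in; a real run would wrap the fine-tuned SLM's forward pass on the target device:

```python
import statistics
import time

def mean_latency_ms(classify, utterances) -> float:
    """Time classify() once per utterance and return the mean latency in ms."""
    timings = []
    for utt in utterances:
        t0 = time.perf_counter()
        classify(utt)
        timings.append((time.perf_counter() - t0) * 1000.0)
    return statistics.mean(timings)

# Stand-in classifier for illustration only.
def dummy_classify(utt: str) -> str:
    return "leader" if utt.rstrip().endswith("!") else "follower"

mean_ms = mean_latency_ms(dummy_classify, ["Bring the cup!", "Okay."] * 50)
```

For a control loop, the mean is usually less important than the tail; swapping `statistics.mean` for a 95th-percentile estimate gives a stricter real-time budget.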

Practical Implications

  • Edge deployment: Developers can embed a 0.5 B‑parameter model on a robot’s onboard CPU/GPU and still achieve near‑real‑time role classification without relying on cloud APIs.
  • Simplified HRI pipelines: Instead of hand‑crafting rule‑based role detectors, a small fine‑tuned model can handle variations in phrasing, accents, and background noise, reducing engineering effort.
  • Scalable to other binary decisions: The same workflow (light fine‑tuning + minimal latency) can be repurposed for tasks like “command vs. feedback” or “intent vs. clarification” in voice‑controlled assistants.
  • Cost‑effective: Lower memory and compute requirements translate to cheaper hardware (e.g., Jetson Nano, Raspberry Pi) and longer battery life for mobile assistive robots.
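The "scalable to other binary decisions" point amounts to swapping the label pair and prompt question while reusing the rest of the pipeline. A hypothetical task definition (names and wording are mine, not from the paper) might look like:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class BinaryUtteranceTask:
    """A reusable binary utterance-classification task: only the label
    pair and the prompt question change between tasks; fine-tuning and
    evaluation code stay identical."""
    name: str
    labels: tuple[str, str]
    question: str

    def prompt(self, utterance: str) -> str:
        return f'{self.question}\nUtterance: "{utterance}"\nAnswer:'

role_task = BinaryUtteranceTask(
    "leader-follower", ("leader", "follower"),
    "Is the speaker the leader or the follower?")
cmd_task = BinaryUtteranceTask(
    "command-feedback", ("command", "feedback"),
    "Is this utterance a command or feedback?")
```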

Limitations & Future Work

  • Context window: The 0.5 B model’s 2 k‑token limit hampers one‑shot performance; larger SLMs with longer windows might close the gap.
  • Dataset scope: Synthetic augmentation, while helpful, may not capture all nuances of real‑world HRI (e.g., multimodal cues, noisy environments).
  • Generalization: Experiments focus on a single language (English) and a specific robot platform; cross‑lingual and cross‑hardware validation remain open.
  • Future directions suggested by the authors include exploring parameter‑efficient fine‑tuning (e.g., LoRA), multimodal extensions that fuse speech/audio features, and continual learning to adapt to new users on‑the‑fly.
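To see why parameter-efficient fine-tuning such as LoRA is attractive at this scale, consider the parameter count for one adapted weight matrix: LoRA trains two low-rank factors instead of the full matrix. The hidden size below is the value reported for Qwen2.5-0.5B; the rank is an illustrative choice:

```python
def lora_trainable_params(d_in: int, d_out: int, rank: int) -> int:
    """Trainable parameters for one LoRA-adapted matrix:
    factor A (d_in x rank) plus factor B (rank x d_out)."""
    return d_in * rank + rank * d_out

d = 896                          # reported hidden size of Qwen2.5-0.5B
full = d * d                     # fully fine-tuning one d x d projection
lora = lora_trainable_params(d, d, rank=8)
ratio = lora / full              # fraction of the full matrix that is trained
```

At rank 8 this trains under 2 % of the parameters of each adapted projection, which is why LoRA pairs naturally with on-device adaptation.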

Authors

  • Rafael R. Baptista
  • André de Lima Salgado
  • Ricardo V. Godoy
  • Marcelo Becker
  • Thiago Boaventura
  • Gustavo J. G. Lahr

Paper Information

  • arXiv ID: 2602.23312v1
  • Categories: cs.HC, cs.AI, cs.LG, cs.RO, eess.SY
  • Published: February 26, 2026