[Paper] Evaluating Zero-Shot and One-Shot Adaptation of Small Language Models in Leader-Follower Interaction
Source: arXiv - 2602.23312v1
Overview
This paper investigates whether tiny language models can reliably tell who is the leader and who is the follower in a human‑robot dialogue. By benchmarking a 0.5 B‑parameter model (Qwen2.5‑0.5B) on a new leader‑follower dataset, the authors show that a modest amount of fine‑tuning yields over 86 % accuracy with sub‑30 ms inference latency, making on‑device role assignment feasible for low‑power robots.
Key Contributions
- New benchmark dataset for leader‑follower interaction, built from an existing HRI corpus and enriched with synthetic dialogues that capture realistic turn‑taking dynamics.
- Systematic evaluation of small language models (SLMs) under four adaptation scenarios:
- Zero‑shot prompt engineering (instruction only, no example in the prompt)
- Zero‑shot fine‑tuning (single‑epoch on the whole dataset)
- One‑shot prompt engineering (one example in the prompt)
- One‑shot fine‑tuning (one example per class)
- Empirical evidence that zero‑shot fine‑tuning outperforms all prompt‑based baselines, achieving 86.66 % accuracy while keeping inference latency at ~22 ms per utterance.
- Analysis of context‑length limits showing that one‑shot setups degrade performance because the model’s limited context window can’t comfortably hold the extra example plus the dialogue history.
- Open‑source release of the dataset, training scripts, and evaluation metrics to encourage reproducibility.
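The four adaptation scenarios differ mainly in how the model input is assembled. A minimal sketch of the two prompt‑based setups, assuming a hypothetical `build_prompt` helper; the instruction wording and example dialogue are illustrative, not the paper's actual prompts:

```python
# Sketch of zero-shot vs. one-shot prompt assembly for role classification.
# INSTRUCTION and EXAMPLE are invented for illustration, not taken from
# the paper's prompt templates.

INSTRUCTION = "Is the speaker the leader or the follower? Answer with one word."

EXAMPLE = (
    "Utterance: Move the box to the left shelf, please.\n"
    "Role: leader"
)

def build_prompt(utterance: str, one_shot: bool = False) -> str:
    """Assemble a classification prompt, optionally prepending one example."""
    parts = [INSTRUCTION]
    if one_shot:
        parts.append(EXAMPLE)  # the single in-context example
    parts.append(f"Utterance: {utterance}\nRole:")
    return "\n\n".join(parts)

zero_shot = build_prompt("Okay, I will pick it up now.")
one_shot = build_prompt("Okay, I will pick it up now.", one_shot=True)
```

The one‑shot variant is identical except for the extra example block, which is exactly the token overhead the paper identifies as the source of one‑shot degradation.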
Methodology
- Data preparation – The authors start with a public HRI dialogue corpus, extract turns labeled as “leader” or “follower,” and generate additional synthetic exchanges using rule‑based templates (e.g., varying instruction phrasing, adding filler words). The final set contains ~12 k labeled utterances.
- Model selection – Qwen2.5‑0.5B is chosen as a representative SLM because it fits comfortably on a typical edge device (≈2 GB RAM).
- Adaptation strategies
- Prompt engineering: a handcrafted prompt (“Is the speaker the leader or follower?”) is fed to the model with or without a single example (one‑shot).
- Fine‑tuning: the model’s classification head is trained on the full dataset (zero‑shot) or on a single example per class (one‑shot).
- Evaluation – Accuracy, latency (ms per sample on a Raspberry Pi 4), and memory footprint are measured. A naïve “untrained baseline” that always predicts the majority class is included for reference.
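The rule‑based augmentation step in the data preparation can be approximated as follows; the templates, fillers, and vocabulary here are invented for illustration and are not the authors' actual rules:

```python
import random

# Illustrative templates: leader turns give instructions, follower turns
# acknowledge or ask for clarification. These are made-up examples, not
# the paper's actual rule set.
LEADER_TEMPLATES = [
    "Please {verb} the {obj}.",
    "{filler}can you {verb} the {obj}?",
    "Now {verb} the {obj} for me.",
]
FOLLOWER_TEMPLATES = [
    "{filler}okay, I can {verb} the {obj} now.",
    "Sure, I will {verb} the {obj}.",
    "Got it, should I {verb} the {obj} first?",
]
FILLERS = ["", "um, ", "uh, ", "well, "]
VERBS = ["lift", "move", "rotate", "grab"]
OBJECTS = ["box", "tray", "tool", "panel"]

def synth_utterance(role: str, rng: random.Random) -> tuple[str, str]:
    """Generate one labeled synthetic utterance for the given role."""
    templates = LEADER_TEMPLATES if role == "leader" else FOLLOWER_TEMPLATES
    text = templates[rng.randrange(len(templates))].format(
        filler=rng.choice(FILLERS), verb=rng.choice(VERBS), obj=rng.choice(OBJECTS)
    )
    return text, role

rng = random.Random(0)
dataset = [synth_utterance(rng.choice(["leader", "follower"]), rng) for _ in range(10)]
```

Varying the filler and phrasing slots is what lets a small template set expand into the ~12 k‑utterance scale the paper reports.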
Results & Findings
| Adaptation | Accuracy | Latency (ms) | Notes |
|---|---|---|---|
| Untrained baseline | 52.3 % | 5 | Majority‑class prediction; ≈ chance on near‑balanced data |
| Zero‑shot prompt | 68.1 % | 9 | Simple prompt works surprisingly well |
| One‑shot prompt | 71.4 % | 12 | Slight boost, but still limited |
| Zero‑shot fine‑tune | 86.7 % | 22 | Best trade‑off; stable across dialogue lengths |
| One‑shot fine‑tune | 62.9 % | 24 | Performance drops due to context overflow |
Key takeaways:
- Fine‑tuning even a tiny model yields a large accuracy jump over pure prompting.
- The latency remains well under 30 ms, suitable for real‑time robot control loops.
- One‑shot approaches suffer because the extra example pushes the total token count beyond the model’s effective context window, causing the classifier to miss crucial cues.
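The context‑overflow effect can be illustrated with a rough token budget. The whitespace split below is only a crude stand‑in for the model's real subword tokenizer, and the window size and segment lengths are assumed for the sketch:

```python
# Rough token-budget check for why one-shot prompts can overflow a small
# context window. Whitespace splitting underestimates real subword counts;
# all numbers here are illustrative only.

CONTEXT_WINDOW = 2048  # assumed effective limit for this sketch

def approx_tokens(text: str) -> int:
    return len(text.split())

def fits(instruction: str, history: str, example: str = "") -> bool:
    """True if instruction + dialogue history (+ optional example) fit."""
    total = approx_tokens(instruction) + approx_tokens(history) + approx_tokens(example)
    return total <= CONTEXT_WINDOW

instruction = "Is the speaker the leader or the follower?"
history = "turn " * 2000                          # a long dialogue history
example = "Utterance: ... Role: leader " * 20     # the extra in-context example

print(fits(instruction, history))           # True: zero-shot fits
print(fits(instruction, history, example))  # False: one-shot overflows
```

The zero‑shot input sits just under the budget, and the single added example tips it over, which mirrors the accuracy drop the table shows for the one‑shot setups.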
Practical Implications
- Edge deployment: Developers can embed a 0.5 B‑parameter model on a robot’s onboard CPU/GPU and still achieve near‑real‑time role classification without relying on cloud APIs.
- Simplified HRI pipelines: Instead of hand‑crafting rule‑based role detectors, a small fine‑tuned model can handle variations in phrasing, accents, and background noise, reducing engineering effort.
- Scalable to other binary decisions: The same workflow (light fine‑tuning + minimal latency) can be repurposed for tasks like “command vs. feedback” or “intent vs. clarification” in voice‑controlled assistants.
- Cost‑effective: Lower memory and compute requirements translate to cheaper hardware (e.g., Jetson Nano, Raspberry Pi) and longer battery life for mobile assistive robots.
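Whether a given model meets a real‑time control loop can be verified with a simple latency harness. The stub classifier below is a placeholder for the actual fine‑tuned SLM call, and the 30 ms budget is the paper's real‑time target:

```python
import time

def measure_latency_ms(classify, utterances, warmup=3):
    """Median per-utterance latency in milliseconds for a classifier callable."""
    for u in utterances[:warmup]:  # warm-up calls, excluded from timing
        classify(u)
    samples = []
    for u in utterances:
        t0 = time.perf_counter()
        classify(u)
        samples.append((time.perf_counter() - t0) * 1000.0)
    samples.sort()
    return samples[len(samples) // 2]

# Stub standing in for the real model; swap in the fine-tuned SLM here.
def stub_classify(utterance: str) -> str:
    return "leader" if "please" in utterance.lower() else "follower"

budget_ms = 30.0  # real-time target from the paper
latency = measure_latency_ms(stub_classify, ["Please lift the box."] * 50)
print(latency < budget_ms)
```

Using the median rather than the mean keeps occasional scheduler hiccups on an edge board from distorting the reported figure.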
Limitations & Future Work
- Context window: The 0.5 B model’s 2 k‑token limit hampers one‑shot performance; larger SLMs with longer windows might close the gap.
- Dataset scope: Synthetic augmentation, while helpful, may not capture all nuances of real‑world HRI (e.g., multimodal cues, noisy environments).
- Generalization: Experiments focus on a single language (English) and a specific robot platform; cross‑lingual and cross‑hardware validation remain open.
- Future directions suggested by the authors include exploring parameter‑efficient fine‑tuning (e.g., LoRA), multimodal extensions that fuse speech/audio features, and continual learning to adapt to new users on the fly.
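As a sense check on the parameter‑efficient direction: LoRA replaces a full weight update (d × k parameters) with a low‑rank product of a d × r and an r × k matrix, i.e. r · (d + k) trainable parameters. The layer shape below is illustrative, not Qwen2.5‑0.5B's actual dimensions:

```python
def full_params(d: int, k: int) -> int:
    """Trainable parameters for a full fine-tune of a d x k weight matrix."""
    return d * k

def lora_params(d: int, k: int, r: int) -> int:
    """Trainable parameters for a rank-r LoRA adapter on the same weight."""
    return r * (d + k)

# Illustrative projection size, not the model's actual shapes.
d, k, r = 1024, 1024, 8
full = full_params(d, k)     # 1,048,576 trainable weights
lora = lora_params(d, k, r)  # 16,384 trainable weights
print(f"LoRA trains {lora / full:.2%} of the full update")  # 1.56%
```

For small ranks the adapter is well under 2 % of the full update per layer, which is why LoRA is attractive when fine‑tuning must happen on the robot itself.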
Authors
- Rafael R. Baptista
- André de Lima Salgado
- Ricardo V. Godoy
- Marcelo Becker
- Thiago Boaventura
- Gustavo J. G. Lahr
Paper Information
- arXiv ID: 2602.23312v1
- Categories: cs.HC, cs.AI, cs.LG, cs.RO, eess.SY
- Published: February 26, 2026