[Paper] Evaluating Zero-Shot and One-Shot Adaptation of Small Language Models in Leader-Follower Interaction
Source: arXiv - 2602.23312v1
Overview
This paper investigates whether tiny language models can reliably tell who is the leader and who is the follower in a human‑robot dialogue. By benchmarking a 0.5 B‑parameter model (Qwen2.5‑0.5B) on a new leader‑follower dataset, the authors show that a modest amount of fine‑tuning yields over 86 % accuracy with sub‑30 ms inference latency, making on‑device role assignment feasible for low‑power robots.
Key Contributions
- New benchmark dataset for leader‑follower interaction, built from an existing HRI corpus and enriched with synthetic dialogues that capture realistic turn‑taking dynamics.
- Systematic evaluation of small language models (SLMs) under four adaptation scenarios:
- Zero‑shot prompt engineering (instruction only, no example in the prompt)
- Zero‑shot fine‑tuning (single‑epoch on the whole dataset)
- One‑shot prompt engineering (one example in the prompt)
- One‑shot fine‑tuning (one example per class)
- Empirical evidence that zero‑shot fine‑tuning outperforms all prompt‑based baselines, achieving 86.66 % accuracy while keeping inference latency at ~22 ms per utterance.
- Analysis of context‑length limits showing that one‑shot setups degrade performance because the model’s limited context window can’t comfortably hold the extra example plus the dialogue history.
- Open‑source release of the dataset, training scripts, and evaluation metrics to encourage reproducibility.
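The four adaptation scenarios differ mainly in how the model input is assembled. A minimal sketch of the two prompt‑based setups, assuming a hypothetical `build_prompt` helper; the instruction wording and example dialogue are illustrative, not the paper's actual prompts:

```python
# Sketch of zero-shot vs. one-shot prompt assembly for role classification.
# INSTRUCTION and EXAMPLE are invented for illustration, not taken from
# the paper's prompt templates.

INSTRUCTION = "Is the speaker the leader or the follower? Answer with one word."

EXAMPLE = (
    "Utterance: Move the box to the left shelf, please.\n"
    "Role: leader"
)

def build_prompt(utterance: str, one_shot: bool = False) -> str:
    """Assemble a classification prompt, optionally prepending one example."""
    parts = [INSTRUCTION]
    if one_shot:
        parts.append(EXAMPLE)  # the single in-context example
    parts.append(f"Utterance: {utterance}\nRole:")
    return "\n\n".join(parts)

zero_shot = build_prompt("Okay, I will pick it up now.")
one_shot = build_prompt("Okay, I will pick it up now.", one_shot=True)
```

The one‑shot variant is identical except for the extra example block, which is exactly the token overhead the paper identifies as the source of one‑shot degradation.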
Methodology
- Data preparation – The authors start with a public HRI dialogue corpus, extract turns labeled as “leader” or “follower,” and generate additional synthetic exchanges using rule‑based templates (e.g., varying instruction phrasing, adding filler words). The final set contains ~12 k labeled utterances.
- Model selection – Qwen2.5‑0.5B is chosen as a representative SLM because it fits comfortably on a typical edge device (≈2 GB RAM).
- Adaptation strategies
- Prompt engineering: a handcrafted prompt (“Is the speaker the leader or follower?”) is fed to the model with or without a single example (one‑shot).
- Fine‑tuning: the model’s classification head is trained on the full dataset (zero‑shot) or on a single example per class (one‑shot).
- Evaluation – Accuracy, latency (ms per sample on a Raspberry Pi 4), and memory footprint are measured. A naïve “untrained baseline” that always predicts the majority class is included for reference.
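The rule‑based augmentation step in the data preparation can be approximated as follows; the templates, fillers, and vocabulary here are invented for illustration and are not the authors' actual rules:

```python
import random

# Illustrative templates: leader turns give instructions, follower turns
# acknowledge or ask for clarification. These are made-up examples, not
# the paper's actual rule set.
LEADER_TEMPLATES = [
    "Please {verb} the {obj}.",
    "{filler}can you {verb} the {obj}?",
    "Now {verb} the {obj} for me.",
]
FOLLOWER_TEMPLATES = [
    "{filler}okay, I can {verb} the {obj} now.",
    "Sure, I will {verb} the {obj}.",
    "Got it, should I {verb} the {obj} first?",
]
FILLERS = ["", "um, ", "uh, ", "well, "]
VERBS = ["lift", "move", "rotate", "grab"]
OBJECTS = ["box", "tray", "tool", "panel"]

def synth_utterance(role: str, rng: random.Random) -> tuple[str, str]:
    """Generate one labeled synthetic utterance for the given role."""
    templates = LEADER_TEMPLATES if role == "leader" else FOLLOWER_TEMPLATES
    text = templates[rng.randrange(len(templates))].format(
        filler=rng.choice(FILLERS), verb=rng.choice(VERBS), obj=rng.choice(OBJECTS)
    )
    return text, role

rng = random.Random(0)
dataset = [synth_utterance(rng.choice(["leader", "follower"]), rng) for _ in range(10)]
```

Varying the filler and phrasing slots is what lets a small template set expand into the ~12 k‑utterance scale the paper reports.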
Results & Findings
| Adaptation | Accuracy | Latency (ms) | Notes |
|---|---|---|---|
| Untrained baseline | 52.3 % | 5 | Majority‑class prediction; ≈ chance on near‑balanced data |
| Zero‑shot prompt | 68.1 % | 9 | Simple prompt works surprisingly well |
| One‑shot prompt | 71.4 % | 12 | Slight boost, but still limited |
| Zero‑shot fine‑tune | 86.7 % | 22 | Best trade‑off; stable across dialogue lengths |
| One‑shot fine‑tune | 62.9 % | 24 | Performance drops due to context overflow |
Key takeaways:
- Fine‑tuning even a tiny model yields a large accuracy jump over pure prompting.
- The latency remains well under 30 ms, suitable for real‑time robot control loops.
- One‑shot approaches suffer because the extra example pushes the total token count beyond the model’s effective context window, causing the classifier to miss crucial cues.
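The context‑overflow effect can be illustrated with a rough token budget. The whitespace split below is only a crude stand‑in for the model's real subword tokenizer, and the window size and segment lengths are assumed for the sketch:

```python
# Rough token-budget check for why one-shot prompts can overflow a small
# context window. Whitespace splitting underestimates real subword counts;
# all numbers here are illustrative only.

CONTEXT_WINDOW = 2048  # assumed effective limit for this sketch

def approx_tokens(text: str) -> int:
    return len(text.split())

def fits(instruction: str, history: str, example: str = "") -> bool:
    """True if instruction + dialogue history (+ optional example) fit."""
    total = approx_tokens(instruction) + approx_tokens(history) + approx_tokens(example)
    return total <= CONTEXT_WINDOW

instruction = "Is the speaker the leader or the follower?"
history = "turn " * 2000                          # a long dialogue history
example = "Utterance: ... Role: leader " * 20     # the extra in-context example

print(fits(instruction, history))           # True: zero-shot fits
print(fits(instruction, history, example))  # False: one-shot overflows
```

The zero‑shot input sits just under the budget, and the single added example tips it over, which mirrors the accuracy drop the table shows for the one‑shot setups.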
Practical Implications
- Edge deployment: Developers can embed a 0.5 B‑parameter model on a robot’s onboard CPU/GPU and still achieve near‑real‑time role classification without relying on cloud APIs.
- Simplified HRI pipelines: Instead of hand‑crafting rule‑based role detectors, a small fine‑tuned model can handle variations in phrasing, accents, and background noise, reducing engineering effort.
- Scalable to other binary decisions: The same workflow (light fine‑tuning + minimal latency) can be repurposed for tasks like “command vs. feedback” or “intent vs. clarification” in voice‑controlled assistants.
- Cost‑effective: Lower memory and compute requirements translate to cheaper hardware (e.g., Jetson Nano, Raspberry Pi) and longer battery life for mobile assistive robots.
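Whether a given model meets a real‑time control loop can be verified with a simple latency harness. The stub classifier below is a placeholder for the actual fine‑tuned SLM call, and the 30 ms budget is the paper's real‑time target:

```python
import time

def measure_latency_ms(classify, utterances, warmup=3):
    """Median per-utterance latency in milliseconds for a classifier callable."""
    for u in utterances[:warmup]:  # warm-up calls, excluded from timing
        classify(u)
    samples = []
    for u in utterances:
        t0 = time.perf_counter()
        classify(u)
        samples.append((time.perf_counter() - t0) * 1000.0)
    samples.sort()
    return samples[len(samples) // 2]

# Stub standing in for the real model; swap in the fine-tuned SLM here.
def stub_classify(utterance: str) -> str:
    return "leader" if "please" in utterance.lower() else "follower"

budget_ms = 30.0  # real-time target from the paper
latency = measure_latency_ms(stub_classify, ["Please lift the box."] * 50)
print(latency < budget_ms)
```

Using the median rather than the mean keeps occasional scheduler hiccups on an edge board from distorting the reported figure.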
Limitations & Future Work
- Context window: The 0.5 B model’s 2 k‑token limit hampers one‑shot performance; larger SLMs with longer windows might close the gap.
- Dataset scope: Synthetic augmentation, while helpful, may not capture all nuances of real‑world HRI (e.g., multimodal cues, noisy environments).
- Generalization: Experiments focus on a single language (English) and a specific robot platform; cross‑lingual and cross‑hardware validation remain open.
- Future directions suggested by the authors include exploring parameter‑efficient fine‑tuning (e.g., LoRA), multimodal extensions that fuse speech/audio features, and continual learning to adapt to new users on the fly.
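As a sense check on the parameter‑efficient direction: LoRA replaces a full weight update (d × k parameters) with a low‑rank product of a d × r and an r × k matrix, i.e. r · (d + k) trainable parameters. The layer shape below is illustrative, not Qwen2.5‑0.5B's actual dimensions:

```python
def full_params(d: int, k: int) -> int:
    """Trainable parameters for a full fine-tune of a d x k weight matrix."""
    return d * k

def lora_params(d: int, k: int, r: int) -> int:
    """Trainable parameters for a rank-r LoRA adapter on the same weight."""
    return r * (d + k)

# Illustrative projection size, not the model's actual shapes.
d, k, r = 1024, 1024, 8
full = full_params(d, k)     # 1,048,576 trainable weights
lora = lora_params(d, k, r)  # 16,384 trainable weights
print(f"LoRA trains {lora / full:.2%} of the full update")  # 1.56%
```

For small ranks the adapter is well under 2 % of the full update per layer, which is why LoRA is attractive when fine‑tuning must happen on the robot itself.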
Authors
- Rafael R. Baptista
- André de Lima Salgado
- Ricardo V. Godoy
- Marcelo Becker
- Thiago Boaventura
- Gustavo J. G. Lahr
Paper Information
- arXiv ID: 2602.23312v1
- Categories: cs.HC, cs.AI, cs.LG, cs.RO, eess.SY
- Published: February 26, 2026