[Paper] MT-PingEval: Evaluating Multi-Turn Collaboration with Private Information Games

Published: February 27, 2026
Source: arXiv (2602.24188v1)

Overview

The paper introduces MT‑PingEval, a new framework for testing how well large language models (LLMs) collaborate over multiple conversational turns when each participant holds private information. By turning the evaluation into a set of “private‑information games,” the authors can measure whether models actually use dialogue to plan, share, and act more efficiently than a one‑shot summarization baseline.

Key Contributions

  • A scalable multi‑turn evaluation suite: a collection of collaborative games that mimic real‑world scenarios where agents must exchange hidden facts before reaching a joint decision.
  • Interactive token‑budget analysis: the same total number of tokens is allocated across varying numbers of turns, letting researchers see how token efficiency changes with dialogue length.
  • Empirical benchmark across several state‑of‑the‑art LLMs (e.g., GPT‑4, Claude, Llama‑2), revealing a consistent gap between interactive and non‑interactive performance.
  • Linguistic diagnostics: systematic probing of dialogue traits such as sycophancy, information density, and discourse coherence to explain why models struggle.
  • Open‑source release: the MT‑PingEval code, game definitions, and evaluation scripts are publicly available for reproducibility and community extensions.

Methodology

  1. Game Design – Each game defines a hidden “private” state for two agents (e.g., a map location, a secret number, or a set of constraints). The goal is for the agents to cooperate and produce a correct joint answer.
  2. Turn‑Based Interaction – Agents exchange messages for a configurable number of turns; in a final “action” turn, one agent commits to the joint decision based on the information shared so far.
  3. Token Budgeting – A fixed token budget (e.g., 500 tokens) is split across the dialogue turns. This forces models to balance brevity with completeness.
  4. Baseline Comparison – The non‑interactive baseline lets the “information‑holder” compress its private data into a single summary that the partner consumes immediately.
  5. Metrics – Success rate (correct answer), token efficiency (success per token), and linguistic scores (coherence, redundancy, sycophancy) are recorded.
  6. Model Variants – The authors test zero‑shot prompting, few‑shot exemplars, and chain‑of‑thought style prompts to see which prompting tricks help.
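The interaction protocol in steps 1–3 can be sketched as a simple loop. The sketch below is illustrative only: the `Agent` class and the `agent_reply` stub are hypothetical stand-ins for real LLM calls, not the authors' released code.

```python
from dataclasses import dataclass, field

@dataclass
class Agent:
    name: str
    private_info: str                      # hidden state only this agent sees
    transcript: list = field(default_factory=list)

def agent_reply(agent: Agent, max_tokens: int) -> str:
    # Placeholder for an LLM call conditioned on the agent's private
    # info and the shared transcript, capped at max_tokens.
    return f"[{agent.name}: message of up to {max_tokens} tokens]"

def play_game(a: Agent, b: Agent, total_budget: int = 500, n_turns: int = 4) -> str:
    per_turn = total_budget // n_turns     # split the fixed budget evenly
    for turn in range(n_turns):
        speaker = (a, b)[turn % 2]         # agents alternate
        msg = agent_reply(speaker, per_turn)
        a.transcript.append(msg)           # dialogue is shared: both see it
        b.transcript.append(msg)
    # Final "action" turn: one agent commits to the joint answer.
    return agent_reply(a, per_turn)
```

Splitting a fixed budget across more turns shrinks each message, which is exactly the brevity-vs-completeness trade-off the benchmark probes.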

Results & Findings

  • Interactive underperforms baseline – Across all tested LLMs, the multi‑turn version rarely beats the one‑shot summary, even when given the same total token budget.
  • Headroom exists – Human participants achieve ~30 % higher success with far fewer tokens, indicating that the task is not inherently impossible for LLMs.
  • Coherence matters – Dialogues that maintain a clear discourse structure (topic continuity, explicit references) correlate strongly with higher success rates.
  • Sycophancy is a double‑edged sword – Models often produce overly agreeable responses that repeat the partner’s statements without adding new information, hurting token efficiency.
  • Prompting helps modestly – Few‑shot examples and chain‑of‑thought prompts improve information density by ~5–7 % but do not close the gap to human performance.
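To make the token-efficiency comparison concrete, one plausible formulation is successes per 100 tokens spent. The function below is a hypothetical illustration of such a metric, not the paper's exact definition.

```python
def token_efficiency(successes: list, tokens_used: list) -> float:
    """Successes per 100 tokens: lets interactive and one-shot settings
    be compared even when their total budgets differ (illustrative metric)."""
    assert len(successes) == len(tokens_used)
    total_tokens = sum(tokens_used)
    if total_tokens == 0:
        return 0.0
    return 100.0 * sum(successes) / total_tokens

# e.g. 3 wins out of 4 games for 1,800 tokens total
print(token_efficiency([True, True, True, False], [500, 400, 500, 400]))
```

Under this reading, a chatty dialogue that wins slightly more often can still score worse than a terse one-shot summary, which matches the headline finding above.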

Practical Implications

  • Chat‑based assistants – Current assistants (customer support bots, collaborative coding partners) may waste bandwidth by failing to distill private context efficiently. MT‑PingEval highlights the need for better planning modules that decide what to say before how to say it.
  • Multi‑agent systems – In robotics or distributed AI, agents often need to negotiate hidden constraints. The benchmark suggests that naïve LLM‑driven coordination will be brittle without explicit dialogue management strategies.
  • Token‑cost optimization – For developers paying per‑token (e.g., OpenAI API), the findings warn that longer back‑and‑forth conversations can be more expensive than a well‑crafted single summary.
  • Prompt engineering – The diagnostic tools (coherence scoring, sycophancy detection) can be incorporated into automated prompt‑tuning pipelines to improve collaborative behavior.
  • Evaluation standards – MT‑PingEval offers a reproducible, task‑oriented alternative to static QA benchmarks, encouraging the community to measure interactive intelligence rather than just single‑turn accuracy.
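A crude stand-in for the sycophancy/redundancy diagnostics mentioned above (not the released implementation) is an n-gram echo check: flag replies whose word n-grams mostly already appear in the partner's last message.

```python
def echo_ratio(partner_msg: str, reply: str, n: int = 3) -> float:
    """Fraction of the reply's word n-grams already present in the
    partner's message -- a rough proxy for sycophantic repetition."""
    def ngrams(text: str) -> set:
        words = text.lower().split()
        return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}
    reply_grams = ngrams(reply)
    if not reply_grams:
        return 0.0
    return len(reply_grams & ngrams(partner_msg)) / len(reply_grams)

# A high ratio means the reply mostly echoes the partner instead of
# contributing new information.
print(echo_ratio("the treasure is in the north cave",
                 "yes the treasure is in the north cave indeed"))
```

A check like this could gate responses in a prompt-tuning loop, penalizing turns that burn budget without adding information.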

Limitations & Future Work

  • Game scope – The current suite focuses on relatively abstract puzzles; extending to domain‑specific tasks (e.g., medical triage, software debugging) will test models under more realistic constraints.
  • Model size bias – Larger models tend to generate more fluent but not necessarily more informative dialogue; the study does not isolate scaling effects beyond the tested families.
  • Human‑in‑the‑loop – All evaluations are fully automated; incorporating real users could uncover additional failure modes such as misunderstandings or pragmatic nuances.
  • Planning mechanisms – The authors note that integrating explicit planning or memory modules (e.g., retrieval‑augmented generation) could bridge the interactive gap, a promising direction for follow‑up research.

MT‑PingEval opens a clear path toward LLMs that can truly collaborate, not just answer. As developers start building multi‑agent applications, keeping an eye on these interactive benchmarks will be key to delivering efficient, trustworthy AI partners.

Authors

  • Jacob Eisenstein
  • Fantine Huot
  • Adam Fisch
  • Jonathan Berant
  • Mirella Lapata

Paper Information

  • arXiv ID: 2602.24188v1
  • Categories: cs.CL, cs.LG
  • Published: February 27, 2026