[Paper] Distributed Partial Information Puzzles: Examining Common Ground Construction Under Epistemic Asymmetry

Published: March 5, 2026, 1:22 PM EST
5 min read
Source: arXiv

Overview

The paper introduces Distributed Partial Information Puzzles (DPIP) – a collaborative task designed to surface how people (and AI agents) build common ground when each participant holds only a piece of the overall information. By collecting multimodal recordings (speech, gestures, and on‑screen actions) of groups solving these puzzles, the authors create a new benchmark for studying belief tracking and coordination in AI systems that must reason across modalities and across multiple agents.

Key Contributions

  • DPIP task & dataset – a novel, multimodal collaboration scenario with fine‑grained, temporally aligned annotations of propositional content, belief updates, and grounding cues.
  • Dual modeling framework – a head‑to‑head comparison of (1) large language models (LLMs) prompted to infer shared beliefs from multimodal transcripts, and (2) an explicit Dynamic Epistemic Logic (DEL) pipeline that updates belief states step‑by‑step.
  • Empirical findings – systematic evaluation showing that current LLMs struggle to maintain accurate belief states in the face of epistemic asymmetry, while the DEL‑based system, though more brittle, better captures the logical progression of common ground.
  • Open resources – the dataset, annotation schema, and code for both modeling approaches are released for the community.
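On the LLM side of the dual framework, the multimodal record is first linearized into text before the model is queried for the shared belief set. A minimal sketch of that linearization step (the function and field names here are illustrative, not the paper's actual schema or prompt):

```python
# Hypothetical sketch: rendering (speaker, utterance, gesture) turns as a
# textual narrative, then appending a belief-tracking query for the LLM.
# All names are illustrative assumptions, not the authors' code.

def build_prompt(turns):
    """Linearize multimodal turns into a narrative and ask for the
    current shared belief set."""
    lines = []
    for speaker, utterance, gesture in turns:
        line = f'{speaker} says: "{utterance}"'
        if gesture:
            # Auto-generated gesture descriptions are inlined as bracketed notes.
            line += f" [{speaker} {gesture}]"
        lines.append(line)
    lines.append("Question: after these turns, which propositions are "
                 "common ground between the participants?")
    return "\n".join(lines)

prompt = build_prompt([
    ("Distributor", "The red block goes on the left.", "points left"),
    ("Assembler", "Got it, red on the left.", None),
])
```

The resulting prompt would then be sent to a GPT-4-style model after each turn, with its answer compared against the gold belief annotations.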

Methodology

  1. Task design – Participants are split into “distributors” (who see a hidden puzzle layout) and “assemblers” (who receive incremental clues). The goal is to reconstruct the full puzzle collaboratively.
  2. Data collection – Sessions are recorded with microphones, depth cameras, and screen capture. Transcriptions are aligned with gesture labels (pointing, nodding) and action logs (drag‑and‑drop moves).
  3. Annotation – Human annotators label each turn with:
    • Propositional content (e.g., “the red block belongs on the left”).
    • Belief state for each participant (what they know, what they assume the other knows).
    • Grounding cues (eye‑contact, pointing) that signal belief updates.
  4. Modeling approaches
    • LLM pipeline: Multimodal inputs are converted to a textual narrative (speech + auto‑generated gesture descriptions). A GPT‑4‑style model is prompted to output the current shared belief set after each turn.
    • DEL pipeline: Formal epistemic actions (public announcements, private observations) are extracted from the annotations and fed into a Dynamic Epistemic Logic engine that updates a Kripke model of agents’ knowledge.
  5. Evaluation – Accuracy of predicted belief sets is measured against the gold annotations, and error analyses focus on missed updates, false positives, and propagation delays.
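The DEL pipeline's core operation can be illustrated with a toy public-announcement update on a Kripke model. This is a minimal sketch in the spirit of the paper's approach, not the authors' actual engine: worlds are sets of true atomic propositions, each agent has an accessibility relation over worlds, and a public announcement deletes the worlds where the announced proposition is false.

```python
# Illustrative Dynamic Epistemic Logic update (assumed encoding, not the
# paper's implementation). Worlds are frozensets of true propositions.

def announce(worlds, relations, prop):
    """Public announcement of `prop`: keep only worlds where it holds,
    and restrict every agent's accessibility relation to surviving worlds."""
    kept = {w for w in worlds if prop in w}
    new_rel = {a: {(u, v) for (u, v) in r if u in kept and v in kept}
               for a, r in relations.items()}
    return kept, new_rel

def knows(relations, agent, world, prop):
    """An agent knows `prop` at `world` iff it holds in every world the
    agent considers possible from there."""
    accessible = {v for (u, v) in relations[agent] if u == world}
    return all(prop in v for v in accessible)

# Two worlds: the red block is on the left (w1) or not (w2); the
# assembler initially cannot distinguish them.
w1, w2 = frozenset({"red_left"}), frozenset()
worlds = {w1, w2}
relations = {"assembler": {(w1, w1), (w1, w2), (w2, w1), (w2, w2)}}

assert not knows(relations, "assembler", w1, "red_left")  # uncertain before
worlds, relations = announce(worlds, relations, "red_left")
assert knows(relations, "assembler", w1, "red_left")      # certain after
```

In the paper's pipeline, the "announcements" are epistemic actions extracted from the annotations (public announcements and private observations), which is exactly the manual step the authors flag as a scalability bottleneck.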

Results & Findings

| Model | Belief‑state accuracy (overall) | Task‑progress tracking | Notable failure modes |
|---|---|---|---|
| GPT‑4 (LLM) | 58 % | Often loses track after 4–5 dialogue turns | Over‑generalizes from ambiguous gestures; conflates “I think” with “I know”. |
| DEL pipeline | 71 % | Maintains logical consistency across turns | Sensitive to noisy gesture‑to‑action mapping; struggles with informal language. |
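One plausible way to score a predicted belief set against the gold annotation, consistent with the set-valued predictions described above, is an F1 over propositions. This is an assumed metric for illustration; the paper's exact scoring may differ.

```python
# Hypothetical scoring sketch: F1 between a predicted and a gold belief
# set, treating each proposition as a retrieval item.

def belief_f1(predicted, gold):
    """F1 overlap between predicted and gold proposition sets."""
    predicted, gold = set(predicted), set(gold)
    if not predicted or not gold:
        return 0.0
    tp = len(predicted & gold)          # correctly predicted propositions
    precision = tp / len(predicted)
    recall = tp / len(gold)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

score = belief_f1({"red_left", "blue_top"}, {"red_left", "green_right"})
# One of two predictions correct, one of two gold beliefs recovered: F1 = 0.5
```

The error analyses mentioned above (missed updates, false positives) fall directly out of the false-negative and false-positive counts in such a scheme.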

Key takeaways:

  • Epistemic asymmetry is hard for LLMs – even with sophisticated prompting, they miss subtle belief revisions signaled by non‑verbal cues.
  • Explicit logical reasoning helps – the DEL system, despite requiring hand‑crafted action extraction, better respects the formal structure of common‑ground updates.
  • Multimodal grounding matters – removing gesture annotations drops both models’ performance by ~10 %, underscoring the importance of non‑verbal signals.

Practical Implications

  • Collaborative AI assistants – Voice‑first agents that need to coordinate with humans (e.g., remote troubleshooting, design co‑creation) must incorporate a belief‑tracking layer beyond raw language modeling.
  • Mixed‑reality teamwork – In AR/VR environments where users point, gesture, and manipulate objects, the DPIP dataset offers a testbed for developing agents that can infer intent from combined speech‑gesture streams.
  • Human‑robot interaction (HRI) – Robots operating in shared workspaces can benefit from a DEL‑style module to maintain a formal model of what teammates know, reducing mis‑coordination errors.
  • Evaluation benchmark – The DPIP suite can become a standard benchmark for “common‑ground” capabilities, encouraging the community to move past single‑turn QA toward sustained, multi‑agent dialogue.

Limitations & Future Work

  • Scalability of DEL – The logical pipeline requires manual extraction of epistemic actions; automating this step (e.g., via multimodal parsers) is an open challenge.
  • Domain specificity – Puzzles are abstract and may not capture domain‑specific jargon or high‑stakes decision making found in real‑world collaborations.
  • LLM prompting ceiling – The study uses zero‑shot prompting; fine‑tuning or retrieval‑augmented approaches might close the performance gap, a direction the authors suggest exploring.
  • Richness of non‑verbal cues – Current annotations focus on coarse gestures; finer facial expression or eye‑gaze data could further illuminate belief dynamics.

Overall, the paper shines a light on a critical blind spot in today’s AI—maintaining a shared, evolving understanding across multiple agents—and provides both data and initial modeling baselines to spur practical advances.

Authors

  • Yifan Zhu
  • Mariah Bradford
  • Kenneth Lai
  • Timothy Obiso
  • Videep Venkatesha
  • James Pustejovsky
  • Nikhil Krishnaswamy

Paper Information

  • arXiv ID: 2603.05450v1
  • Categories: cs.AI, cs.CL
  • Published: March 5, 2026