[Paper] Distributed Partial Information Puzzles: Examining Common Ground Construction Under Epistemic Asymmetry

Published: March 5, 2026, 1:22 PM EST
5 min read
Source: arXiv

Overview

The paper introduces Distributed Partial Information Puzzles (DPIP) – a collaborative task designed to surface how people (and AI agents) build common ground when each participant holds only a piece of the overall information. By collecting multimodal recordings (speech, gestures, and on‑screen actions) of groups solving these puzzles, the authors create a new benchmark for studying belief tracking and coordination in AI systems that must reason across modalities and across multiple agents.

Key Contributions

  • DPIP task & dataset – a novel, multimodal collaboration scenario with fine‑grained, temporally aligned annotations of propositional content, belief updates, and grounding cues.
  • Dual modeling framework – a head‑to‑head comparison of (1) large language models (LLMs) prompted to infer shared beliefs from multimodal transcripts, and (2) an explicit Dynamic Epistemic Logic (DEL) pipeline that updates belief states step‑by‑step.
  • Empirical findings – systematic evaluation showing that current LLMs struggle to maintain accurate belief states in the face of epistemic asymmetry, while the DEL‑based system, though more brittle, better captures the logical progression of common ground.
  • Open resources – the dataset, annotation schema, and code for both modeling approaches are released for the community.
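On the LLM side of the dual framework, the multimodal record is first linearized into text before the model is queried for the shared belief set. A minimal sketch of that linearization step (the function and field names here are illustrative, not the paper's actual schema or prompt):

```python
# Hypothetical sketch: rendering (speaker, utterance, gesture) turns as a
# textual narrative, then appending a belief-tracking query for the LLM.
# All names are illustrative assumptions, not the authors' code.

def build_prompt(turns):
    """Linearize multimodal turns into a narrative and ask for the
    current shared belief set."""
    lines = []
    for speaker, utterance, gesture in turns:
        line = f'{speaker} says: "{utterance}"'
        if gesture:
            # Auto-generated gesture descriptions are inlined as bracketed notes.
            line += f" [{speaker} {gesture}]"
        lines.append(line)
    lines.append("Question: after these turns, which propositions are "
                 "common ground between the participants?")
    return "\n".join(lines)

prompt = build_prompt([
    ("Distributor", "The red block goes on the left.", "points left"),
    ("Assembler", "Got it, red on the left.", None),
])
```

The resulting prompt would then be sent to a GPT-4-style model after each turn, with its answer compared against the gold belief annotations.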

Methodology

  1. Task design – Participants are split into “distributors” (who see a hidden puzzle layout) and “assemblers” (who receive incremental clues). The goal is to reconstruct the full puzzle collaboratively.
  2. Data collection – Sessions are recorded with microphones, depth cameras, and screen capture. Transcriptions are aligned with gesture labels (pointing, nodding) and action logs (drag‑and‑drop moves).
  3. Annotation – Human annotators label each turn with:
    • Propositional content (e.g., “the red block belongs on the left”).
    • Belief state for each participant (what they know, what they assume the other knows).
    • Grounding cues (eye‑contact, pointing) that signal belief updates.
  4. Modeling approaches
    • LLM pipeline: Multimodal inputs are converted to a textual narrative (speech + auto‑generated gesture descriptions). A GPT‑4‑style model is prompted to output the current shared belief set after each turn.
    • DEL pipeline: Formal epistemic actions (public announcements, private observations) are extracted from the annotations and fed into a Dynamic Epistemic Logic engine that updates a Kripke model of agents’ knowledge.
  5. Evaluation – Accuracy of predicted belief sets is measured against the gold annotations, and error analyses focus on missed updates, false positives, and propagation delays.
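The DEL pipeline's core operation can be illustrated with a toy public-announcement update on a Kripke model. This is a minimal sketch in the spirit of the paper's approach, not the authors' actual engine: worlds are sets of true atomic propositions, each agent has an accessibility relation over worlds, and a public announcement deletes the worlds where the announced proposition is false.

```python
# Illustrative Dynamic Epistemic Logic update (assumed encoding, not the
# paper's implementation). Worlds are frozensets of true propositions.

def announce(worlds, relations, prop):
    """Public announcement of `prop`: keep only worlds where it holds,
    and restrict every agent's accessibility relation to surviving worlds."""
    kept = {w for w in worlds if prop in w}
    new_rel = {a: {(u, v) for (u, v) in r if u in kept and v in kept}
               for a, r in relations.items()}
    return kept, new_rel

def knows(relations, agent, world, prop):
    """An agent knows `prop` at `world` iff it holds in every world the
    agent considers possible from there."""
    accessible = {v for (u, v) in relations[agent] if u == world}
    return all(prop in v for v in accessible)

# Two worlds: the red block is on the left (w1) or not (w2); the
# assembler initially cannot distinguish them.
w1, w2 = frozenset({"red_left"}), frozenset()
worlds = {w1, w2}
relations = {"assembler": {(w1, w1), (w1, w2), (w2, w1), (w2, w2)}}

assert not knows(relations, "assembler", w1, "red_left")  # uncertain before
worlds, relations = announce(worlds, relations, "red_left")
assert knows(relations, "assembler", w1, "red_left")      # certain after
```

In the paper's pipeline, the "announcements" are epistemic actions extracted from the annotations (public announcements and private observations), which is exactly the manual step the authors flag as a scalability bottleneck.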

Results & Findings

| Model | Belief‑state accuracy (overall) | Task‑progress tracking | Notable failure modes |
|---|---|---|---|
| GPT‑4 (LLM) | 58 % | Often loses track after 4–5 dialogue turns | Over‑generalizes from ambiguous gestures; conflates “I think” with “I know”. |
| DEL pipeline | 71 % | Maintains logical consistency across turns | Sensitive to noisy gesture‑to‑action mapping; struggles with informal language. |
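One plausible way to score a predicted belief set against the gold annotation, consistent with the set-valued predictions described above, is an F1 over propositions. This is an assumed metric for illustration; the paper's exact scoring may differ.

```python
# Hypothetical scoring sketch: F1 between a predicted and a gold belief
# set, treating each proposition as a retrieval item.

def belief_f1(predicted, gold):
    """F1 overlap between predicted and gold proposition sets."""
    predicted, gold = set(predicted), set(gold)
    if not predicted or not gold:
        return 0.0
    tp = len(predicted & gold)          # correctly predicted propositions
    precision = tp / len(predicted)
    recall = tp / len(gold)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

score = belief_f1({"red_left", "blue_top"}, {"red_left", "green_right"})
# One of two predictions correct, one of two gold beliefs recovered: F1 = 0.5
```

The error analyses mentioned above (missed updates, false positives) fall directly out of the false-negative and false-positive counts in such a scheme.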

Key takeaways:

  • Epistemic asymmetry is hard for LLMs – even with sophisticated prompting, they miss subtle belief revisions signaled by non‑verbal cues.
  • Explicit logical reasoning helps – the DEL system, despite requiring hand‑crafted action extraction, better respects the formal structure of common‑ground updates.
  • Multimodal grounding matters – removing gesture annotations drops both models’ performance by ~10 %, underscoring the importance of non‑verbal signals.

Practical Implications

  • Collaborative AI assistants – Voice‑first agents that need to coordinate with humans (e.g., remote troubleshooting, design co‑creation) must incorporate a belief‑tracking layer beyond raw language modeling.
  • Mixed‑reality teamwork – In AR/VR environments where users point, gesture, and manipulate objects, the DPIP dataset offers a testbed for developing agents that can infer intent from combined speech‑gesture streams.
  • Human‑robot interaction (HRI) – Robots operating in shared workspaces can benefit from a DEL‑style module to maintain a formal model of what teammates know, reducing mis‑coordination errors.
  • Evaluation benchmark – The DPIP suite can become a standard benchmark for “common‑ground” capabilities, encouraging the community to move past single‑turn QA toward sustained, multi‑agent dialogue.

Limitations & Future Work

  • Scalability of DEL – The logical pipeline requires manual extraction of epistemic actions; automating this step (e.g., via multimodal parsers) is an open challenge.
  • Domain specificity – Puzzles are abstract and may not capture domain‑specific jargon or high‑stakes decision making found in real‑world collaborations.
  • LLM prompting ceiling – The study uses zero‑shot prompting; fine‑tuning or retrieval‑augmented approaches might close the performance gap, a direction the authors suggest exploring.
  • Richness of non‑verbal cues – Current annotations focus on coarse gestures; finer facial expression or eye‑gaze data could further illuminate belief dynamics.

Overall, the paper shines a light on a critical blind spot in today’s AI—maintaining a shared, evolving understanding across multiple agents—and provides both data and initial modeling baselines to spur practical advances.

Authors

  • Yifan Zhu
  • Mariah Bradford
  • Kenneth Lai
  • Timothy Obiso
  • Videep Venkatesha
  • James Pustejovsky
  • Nikhil Krishnaswamy

Paper Information

  • arXiv ID: 2603.05450v1
  • Categories: cs.AI, cs.CL
  • Published: March 5, 2026