[Paper] An Agentic AI System for Multi-Framework Communication Coding
Source: arXiv - 2512.08659v1
Overview
The paper presents MOSAIC, a modular AI system that can automatically annotate clinical conversations using multiple communication frameworks. By chaining specialized agents in a LangGraph workflow, MOSAIC achieves near‑human accuracy while remaining adaptable to different medical specialties and coding schemes.
Key Contributions
- Agentic Architecture: Introduces a LangGraph‑based pipeline with four cooperating agents (Plan, Update, Annotation, Verification) that together handle codebook selection, data retrieval, generation, and consistency checking.
- Multi‑Framework Support: Works across several established communication codebooks (e.g., patient behavior, provider empathy) without retraining a monolithic model for each; a hypothetical codebook sketch follows this list.
- Retrieval‑Augmented Generation (RAG) + Dynamic Few‑Shot Prompting: Combines up‑to‑date domain literature with on‑the‑fly prompt construction to keep the system both current and context‑aware.
- High Empirical Performance: Reaches an overall F1 of 0.928 on a held‑out test set of 50 transcripts, with a peak F1 of 0.962 in rheumatology.
- Open‑Source‑Ready Design: Built on LangGraph, a Python framework for agent workflows that developers can extend or embed in existing health‑tech pipelines.
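To make "plug in a new codebook" concrete, here is a minimal sketch of one plausible codebook representation. The `Codebook` dataclass, the registry contents, and `select_codebook` are hypothetical illustrations under stated assumptions, not the paper's actual data model.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Codebook:
    """A hypothetical communication codebook: a named set of label definitions."""
    name: str
    labels: dict[str, str]  # label -> definition shown to the LLM


# Illustrative entries only; a real codebook carries full coding-manual definitions.
CODEBOOKS = {
    "patient_behavior": Codebook(
        name="Patient Behavior",
        labels={
            "question": "Patient asks the provider for information.",
            "preference": "Patient states a treatment or care preference.",
            "assertiveness": "Patient advocates for themselves or pushes back.",
        },
    ),
    "provider_empathy": Codebook(
        name="Provider Empathy",
        labels={
            "acknowledge": "Provider names the patient's emotion.",
            "support": "Provider offers reassurance or partnership.",
        },
    ),
}


def select_codebook(framework: str) -> Codebook:
    """What a Plan Agent might do first: map the requested framework to a codebook."""
    return CODEBOOKS[framework]
```

With a registry like this, adding a framework is a data change rather than a model change, which is the property the Plan Agent exploits.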
Methodology
- Plan Agent – Takes a user‑specified communication framework (e.g., “Patient Behavior”) and selects the appropriate codebook, then outlines a step‑by‑step workflow for the downstream agents.
- Update Agent – Periodically refreshes a vector store of clinical literature, guidelines, and previously annotated transcripts, ensuring the retrieval component always draws from the latest evidence.
- Annotation Agents – For each segment of a conversation, they perform retrieval‑augmented generation (see the sketch after this list):
  - Retrieve the top‑k relevant passages from the vector store.
  - Build a dynamic few‑shot prompt that combines the codebook definitions with the retrieved snippets.
  - Generate a label (or set of labels) for the segment with a large language model (LLM).
- Verification Agent – Runs a consistency check across the whole transcript (e.g., no contradictory labels, adherence to codebook constraints) and feeds corrective feedback back to the Annotation Agents.
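As referenced above, here is a minimal sketch of one annotation step. `retrieve_top_k` is a stand-in for the vector-store search, `llm` is any prompt-completion callable, and the prompt template is invented here (the paper does not publish its prompts); `Codebook` reuses the hypothetical dataclass from the earlier sketch.

```python
def retrieve_top_k(segment: str, k: int = 3) -> list[str]:
    """Stand-in for a vector-store similarity search over literature and past transcripts."""
    return [f"[retrieved passage {i + 1}]" for i in range(k)]


def build_prompt(codebook: Codebook, segment: str, snippets: list[str]) -> str:
    """Assemble a dynamic few-shot prompt: codebook definitions plus retrieved context."""
    definitions = "\n".join(f"- {label}: {defn}" for label, defn in codebook.labels.items())
    context = "\n".join(snippets)
    return (
        f"You are coding clinical conversations with the {codebook.name} codebook.\n"
        f"Label definitions:\n{definitions}\n\n"
        f"Relevant context:\n{context}\n\n"
        f"Segment: {segment}\n"
        "Respond with one or more labels, comma-separated."
    )


def annotate_segment(codebook: Codebook, segment: str, llm) -> list[str]:
    """One Annotation Agent step: retrieve, build the prompt, generate, parse labels."""
    prompt = build_prompt(codebook, segment, retrieve_top_k(segment))
    return [label.strip() for label in llm(prompt).split(",")]
```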
The whole pipeline is orchestrated by LangGraph, which treats each agent as a node in a directed graph, allowing easy debugging, parallel execution, and plug‑and‑play replacement of components.
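A minimal LangGraph skeleton of that graph, assuming a recent langgraph release: the state schema and node bodies are placeholder assumptions rather than the authors' implementation, but the wiring mirrors the described flow, including the verification feedback loop.

```python
from typing import TypedDict

from langgraph.graph import END, START, StateGraph


class CodingState(TypedDict):
    framework: str
    segments: list[str]
    labels: list[list[str]]
    consistent: bool


def plan_node(state: CodingState) -> dict:
    # Select the codebook for state["framework"] and lay out the workflow.
    return {"labels": []}


def update_node(state: CodingState) -> dict:
    # Refresh the vector store with current guidelines and past annotations.
    return {}


def annotate_node(state: CodingState) -> dict:
    # RAG + dynamic few-shot labeling per segment (see the earlier sketch).
    return {"labels": [["question"] for _ in state["segments"]]}


def verify_node(state: CodingState) -> dict:
    # Transcript-wide consistency check against codebook constraints.
    return {"consistent": True}


graph = StateGraph(CodingState)
for name, fn in [
    ("plan", plan_node),
    ("update", update_node),
    ("annotate", annotate_node),
    ("verify", verify_node),
]:
    graph.add_node(name, fn)
graph.add_edge(START, "plan")
graph.add_edge("plan", "update")
graph.add_edge("update", "annotate")
graph.add_edge("annotate", "verify")
# Feedback loop: inconsistent transcripts are sent back to the Annotation Agent.
graph.add_conditional_edges(
    "verify",
    lambda s: "done" if s["consistent"] else "redo",
    {"done": END, "redo": "annotate"},
)
app = graph.compile()

result = app.invoke(
    {"framework": "patient_behavior", "segments": ["Doctor: how are you feeling?"],
     "labels": [], "consistent": False}
)
```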
Results & Findings
| Domain / Subset | F1 Score | Notable Strength |
|---|---|---|
| Overall Test Set | 0.928 | Consistently high across frameworks |
| Rheumatology | 0.962 | Best performance, likely due to richer training data |
| OB/GYN | ~0.89 | Slightly lower but still strong |
| Patient Behavior Labels | n/a | Highest precision/recall among label sets; captures questions, preferences, and assertiveness well |
Ablation studies showed that removing any of the four agents drops performance by 3–7 percentage points, confirming that planning, up‑to‑date retrieval, and verification are all essential. Compared to a single‑task LLM baseline, MOSAIC improves F1 by roughly 0.12 on average.
Practical Implications
- Scalable Annotation: Health‑tech platforms can automatically code large volumes of provider‑patient dialogs for quality‑improvement dashboards, compliance monitoring, or research datasets without hiring a team of annotators.
- Rapid Adaptation: Want to add a new communication framework (e.g., shared decision‑making)? Just plug in a new codebook and let the Plan Agent handle the workflow—no full‑model retraining needed.
- Continuous Learning: The Update Agent’s retrieval database can be refreshed daily with the latest clinical guidelines, ensuring the system stays aligned with evolving best practices.
- Developer Friendly: Because the system is built on LangGraph, developers can replace the underlying LLM (e.g., switch from OpenAI GPT‑4 to a locally hosted Llama 2) or swap out the vector store (FAISS, Milvus, etc.) with minimal code changes; see the interface sketch after this list.
- Regulatory & Auditable: The Verification Agent provides a traceable consistency check, which can be logged for compliance audits or for generating human‑readable explanations of AI decisions.
Limitations & Future Work
- Training Data Size: The model was trained on only 26 gold‑standard transcripts; while performance is impressive, broader validation on larger, more diverse datasets is needed.
- Domain Transfer: Slight performance dip in OB/GYN suggests that additional domain‑specific fine‑tuning or richer retrieval corpora could improve generalization.
- Explainability: Although the Verification Agent logs inconsistencies, the system does not yet produce natural‑language rationales for each label—an area the authors plan to explore.
- Real‑World Deployment: Handling noisy audio transcriptions, multilingual conversations, and privacy‑preserving retrieval (e.g., on‑device embeddings) remain open challenges for production use.
Bottom line: MOSAIC demonstrates that an agentic, retrieval‑augmented approach can bring near‑human annotation quality to clinical communication coding—opening the door for scalable, adaptable AI tools in health‑tech ecosystems.
Authors
- Bohao Yang
- Rui Yang
- Joshua M. Biro
- Haoyuan Wang
- Jessica L. Handley
- Brianna Richardson
- Sophia Bessias
- Nicoleta Economou‑Zavlanos
- Armando D. Bedoya
- Monica Agrawal
- Michael M. Zavlanos
- Anand Chowdhury
- Raj M. Ratwani
- Kai Sun
- Kathryn I. Pollak
- Michael J. Pencina
- Chuan Hong
Paper Information
- arXiv ID: 2512.08659v1
- Categories: cs.CL, cs.LG
- Published: December 9, 2025