[Paper] An Agentic AI System for Multi-Framework Communication Coding
Source: arXiv - 2512.08659v1
Overview
The paper presents MOSAIC, a modular AI system that can automatically annotate clinical conversations using multiple communication frameworks. By chaining specialized agents in a LangGraph workflow, MOSAIC achieves near‑human accuracy while remaining adaptable to different medical specialties and coding schemes.
Key Contributions
- Agentic Architecture: Introduces a LangGraph‑based pipeline with four cooperating agents (Plan, Update, Annotation, Verification) that together handle codebook selection, data retrieval, generation, and consistency checking.
- Multi‑Framework Support: Works across several established communication codebooks (e.g., patient behavior, provider empathy) without retraining a monolithic model for each; a hypothetical codebook sketch follows this list.
- Retrieval‑Augmented Generation (RAG) + Dynamic Few‑Shot Prompting: Combines up‑to‑date domain literature with on‑the‑fly prompt construction to keep the system both current and context‑aware.
- High Empirical Performance: Reaches an overall F1 of 0.928 on a held‑out test set of 50 transcripts, with a peak F1 of 0.962 in rheumatology.
- Open‑Source‑Ready Design: Built on LangGraph, a Python framework for agent workflows that developers can extend or embed in existing health‑tech pipelines.
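To make "plug in a new codebook" concrete, here is a minimal sketch of one plausible codebook representation. The `Codebook` dataclass, the registry contents, and `select_codebook` are hypothetical illustrations under stated assumptions, not the paper's actual data model.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Codebook:
    """A hypothetical communication codebook: a named set of label definitions."""
    name: str
    labels: dict[str, str]  # label -> definition shown to the LLM


# Illustrative entries only; a real codebook carries full coding-manual definitions.
CODEBOOKS = {
    "patient_behavior": Codebook(
        name="Patient Behavior",
        labels={
            "question": "Patient asks the provider for information.",
            "preference": "Patient states a treatment or care preference.",
            "assertiveness": "Patient advocates for themselves or pushes back.",
        },
    ),
    "provider_empathy": Codebook(
        name="Provider Empathy",
        labels={
            "acknowledge": "Provider names the patient's emotion.",
            "support": "Provider offers reassurance or partnership.",
        },
    ),
}


def select_codebook(framework: str) -> Codebook:
    """What a Plan Agent might do first: map the requested framework to a codebook."""
    return CODEBOOKS[framework]
```

With a registry like this, adding a framework is a data change rather than a model change, which is the property the Plan Agent exploits.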
Methodology
- Plan Agent – Takes a user‑specified communication framework (e.g., “Patient Behavior”) and selects the appropriate codebook, then outlines a step‑by‑step workflow for the downstream agents.
- Update Agent – Periodically refreshes a vector store of clinical literature, guidelines, and previously annotated transcripts, ensuring the retrieval component always draws from the latest evidence.
- Annotation Agents – For each segment of a conversation, they perform retrieval‑augmented generation (see the sketch after this list):
  - Retrieve the top‑k relevant passages from the vector store.
  - Build a dynamic few‑shot prompt that combines the codebook definitions with the retrieved snippets.
  - Generate a label (or set of labels) for the segment with a large language model (LLM).
- Verification Agent – Runs a consistency check across the whole transcript (e.g., no contradictory labels, adherence to codebook constraints) and feeds corrective feedback back to the Annotation Agents.
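As referenced above, here is a minimal sketch of one annotation step. `retrieve_top_k` is a stand-in for the vector-store search, `llm` is any prompt-completion callable, and the prompt template is invented here (the paper does not publish its prompts); `Codebook` reuses the hypothetical dataclass from the earlier sketch.

```python
def retrieve_top_k(segment: str, k: int = 3) -> list[str]:
    """Stand-in for a vector-store similarity search over literature and past transcripts."""
    return [f"[retrieved passage {i + 1}]" for i in range(k)]


def build_prompt(codebook: Codebook, segment: str, snippets: list[str]) -> str:
    """Assemble a dynamic few-shot prompt: codebook definitions plus retrieved context."""
    definitions = "\n".join(f"- {label}: {defn}" for label, defn in codebook.labels.items())
    context = "\n".join(snippets)
    return (
        f"You are coding clinical conversations with the {codebook.name} codebook.\n"
        f"Label definitions:\n{definitions}\n\n"
        f"Relevant context:\n{context}\n\n"
        f"Segment: {segment}\n"
        "Respond with one or more labels, comma-separated."
    )


def annotate_segment(codebook: Codebook, segment: str, llm) -> list[str]:
    """One Annotation Agent step: retrieve, build the prompt, generate, parse labels."""
    prompt = build_prompt(codebook, segment, retrieve_top_k(segment))
    return [label.strip() for label in llm(prompt).split(",")]
```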
The whole pipeline is orchestrated by LangGraph, which treats each agent as a node in a directed graph, allowing easy debugging, parallel execution, and plug‑and‑play replacement of components.
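A minimal LangGraph skeleton of that graph, assuming a recent langgraph release: the state schema and node bodies are placeholder assumptions rather than the authors' implementation, but the wiring mirrors the described flow, including the verification feedback loop.

```python
from typing import TypedDict

from langgraph.graph import END, START, StateGraph


class CodingState(TypedDict):
    framework: str
    segments: list[str]
    labels: list[list[str]]
    consistent: bool


def plan_node(state: CodingState) -> dict:
    # Select the codebook for state["framework"] and lay out the workflow.
    return {"labels": []}


def update_node(state: CodingState) -> dict:
    # Refresh the vector store with current guidelines and past annotations.
    return {}


def annotate_node(state: CodingState) -> dict:
    # RAG + dynamic few-shot labeling per segment (see the earlier sketch).
    return {"labels": [["question"] for _ in state["segments"]]}


def verify_node(state: CodingState) -> dict:
    # Transcript-wide consistency check against codebook constraints.
    return {"consistent": True}


graph = StateGraph(CodingState)
for name, fn in [
    ("plan", plan_node),
    ("update", update_node),
    ("annotate", annotate_node),
    ("verify", verify_node),
]:
    graph.add_node(name, fn)
graph.add_edge(START, "plan")
graph.add_edge("plan", "update")
graph.add_edge("update", "annotate")
graph.add_edge("annotate", "verify")
# Feedback loop: inconsistent transcripts are sent back to the Annotation Agent.
graph.add_conditional_edges(
    "verify",
    lambda s: "done" if s["consistent"] else "redo",
    {"done": END, "redo": "annotate"},
)
app = graph.compile()

result = app.invoke(
    {"framework": "patient_behavior", "segments": ["Doctor: how are you feeling?"],
     "labels": [], "consistent": False}
)
```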
Results & Findings
| Domain / Subset | F1 Score | Notable Strength |
|---|---|---|
| Overall Test Set | 0.928 | Consistently high across frameworks |
| Rheumatology | 0.962 | Best performance, likely due to richer training data |
| OB/GYN | ~0.89 | Slightly lower but still strong |
| Patient Behavior Labels | n/a | Highest precision/recall among label sets; captures questions, preferences, and assertiveness well |
Ablation studies showed that removing any of the four agents drops performance by 3–7 percentage points, confirming that planning, up‑to‑date retrieval, and verification are all essential. Compared to a single‑task LLM baseline, MOSAIC improves F1 by roughly 0.12 on average.
Practical Implications
- Scalable Annotation: Health‑tech platforms can automatically code large volumes of provider‑patient dialogs for quality‑improvement dashboards, compliance monitoring, or research datasets without hiring a team of annotators.
- Rapid Adaptation: Want to add a new communication framework (e.g., shared decision‑making)? Just plug in a new codebook and let the Plan Agent handle the workflow—no full‑model retraining needed.
- Continuous Learning: The Update Agent’s retrieval database can be refreshed daily with the latest clinical guidelines, ensuring the system stays aligned with evolving best practices.
- Developer Friendly: Because the system is built on LangGraph, developers can replace the underlying LLM (e.g., switch from OpenAI GPT‑4 to a locally hosted Llama 2) or swap out the vector store (FAISS, Milvus, etc.) with minimal code changes; see the interface sketch after this list.
- Regulatory & Auditable: The Verification Agent provides a traceable consistency check, which can be logged for compliance audits or for generating human‑readable explanations of AI decisions.
Limitations & Future Work
- Training Data Size: The model was trained on only 26 gold‑standard transcripts; while performance is impressive, broader validation on larger, more diverse datasets is needed.
- Domain Transfer: Slight performance dip in OB/GYN suggests that additional domain‑specific fine‑tuning or richer retrieval corpora could improve generalization.
- Explainability: Although the Verification Agent logs inconsistencies, the system does not yet produce natural‑language rationales for each label—an area the authors plan to explore.
- Real‑World Deployment: Handling noisy audio transcriptions, multilingual conversations, and privacy‑preserving retrieval (e.g., on‑device embeddings) remain open challenges for production use.
Bottom line: MOSAIC demonstrates that an agentic, retrieval‑augmented approach can bring near‑human annotation quality to clinical communication coding—opening the door for scalable, adaptable AI tools in health‑tech ecosystems.
Authors
- Bohao Yang
- Rui Yang
- Joshua M. Biro
- Haoyuan Wang
- Jessica L. Handley
- Brianna Richardson
- Sophia Bessias
- Nicoleta Economou‑Zavlanos
- Armando D. Bedoya
- Monica Agrawal
- Michael M. Zavlanos
- Anand Chowdhury
- Raj M. Ratwani
- Kai Sun
- Kathryn I. Pollak
- Michael J. Pencina
- Chuan Hong
Paper Information
- arXiv ID: 2512.08659v1
- Categories: cs.CL, cs.LG
- Published: December 9, 2025