[Paper] PSI-Bench: Towards Clinically Grounded and Interpretable Evaluation of Depression Patient Simulators

Published: (April 28, 2026 at 12:46 PM EDT)

Source: arXiv - 2604.25840v1

Overview

The paper introduces PSI‑Bench, a systematic, clinically grounded benchmark for evaluating AI‑driven simulators of depressed patients. Moving beyond coarse LLM‑as‑judge scoring, the authors provide interpretable diagnostics that reveal how well these simulators capture realistic, diverse, and therapeutically appropriate dialogue behavior, an essential step toward safe, scalable mental‑health training tools.

Key Contributions

  • PSI‑Bench framework: a multi‑level (turn, dialogue, population) evaluation suite that maps simulator outputs to clinically meaningful dimensions (e.g., emotional trajectory, lexical diversity, response length).
  • Interpretability: each metric is tied to a concrete therapeutic concept, allowing developers to see why a simulator succeeds or fails.
  • Extensive benchmarking: seven large language models (LLMs) are tested across two popular depression‑patient simulator architectures, exposing systematic shortcomings.
  • Human validation: expert clinicians rate a subset of simulated conversations, showing strong correlation with PSI‑Bench scores and confirming the benchmark’s real‑world relevance.
  • Open‑source release: the authors provide code, prompts, and evaluation scripts, enabling the community to extend the benchmark to other mental‑health conditions or simulation frameworks.

Methodology

  1. Define clinically relevant axes – The authors consulted mental‑health professionals to identify three layers of behavior to assess:
    • Turn‑level: length, lexical richness, sentiment polarity.
    • Dialogue‑level: emotional progression (negative → positive), resolution speed, consistency.
    • Population‑level: variability across simulated “patients” (e.g., different symptom profiles).
  2. Metric construction – For each axis, they built automatic measures (e.g., token count, type‑token ratio, sentiment scores from a validated affect classifier) and mapped them to clinical interpretations; a rough sketch of such measures appears after this list.
  3. Simulator setups – Two open‑source frameworks for depression patient simulation were used, each paired with seven LLM back‑ends ranging from 7B to 175B parameters.
  4. Benchmark execution – Hundreds of simulated dialogues were generated, metrics computed, and results aggregated into a concise diagnostic report per model‑framework pair.
  5. Human study – A panel of licensed therapists evaluated a random sample of dialogues, rating realism, therapeutic usefulness, and safety. Correlations between these human scores and PSI‑Bench metrics were calculated to validate the benchmark.
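
The paper's exact metric implementations live in the released evaluation scripts; the snippet below is only a hypothetical sketch of how the turn‑level and dialogue‑level measures described in steps 1–2 could be computed, using whitespace tokenization and NLTK's VADER model as stand‑ins for the tokenizer and the validated affect classifier the authors mention (all function names are illustrative, not the paper's API):

```python
# Rough sketch only -- not the paper's code. VADER stands in for the validated
# affect classifier; whitespace tokenization stands in for the real tokenizer.
# Requires: pip install nltk, then nltk.download("vader_lexicon") once.
from nltk.sentiment import SentimentIntensityAnalyzer

_sentiment = SentimentIntensityAnalyzer()

def turn_level_metrics(utterance: str) -> dict:
    """Length, lexical richness, and sentiment polarity for a single patient turn."""
    tokens = utterance.lower().split()
    length = len(tokens)
    # Type-token ratio: unique tokens / total tokens, a simple lexical-diversity proxy.
    ttr = len(set(tokens)) / length if length else 0.0
    # Compound polarity in [-1, 1]; strongly negative values suggest depressive affect.
    polarity = _sentiment.polarity_scores(utterance)["compound"]
    return {"length": length, "type_token_ratio": ttr, "sentiment": polarity}

def emotional_trajectory(patient_turns: list[str]) -> list[float]:
    """Dialogue-level view: per-turn sentiment, used to inspect the emotional arc."""
    return [turn_level_metrics(turn)["sentiment"] for turn in patient_turns]

if __name__ == "__main__":
    dialogue = [
        "I just feel tired all the time, nothing really helps.",
        "Talking about it today made things feel a little lighter.",
    ]
    print([turn_level_metrics(t) for t in dialogue])
    print(emotional_trajectory(dialogue))  # e.g., a negative-to-positive arc
```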

Results & Findings

  • Response length: Simulators tend to produce overly long replies, potentially overwhelming trainees.
  • Lexical diversity: High type‑token ratios indicate “wordy” outputs that lack the concise phrasing typical of real patients.
  • Emotional trajectory: Most dialogues follow a uniform negative‑to‑positive arc, ignoring the non‑linear mood swings seen in clinical practice.
  • Resolution speed: Simulated patients often “resolve” their distress within a few turns, under‑representing chronic or relapsing patterns.
  • Variability: Population‑level diversity is low; different simulated patients behave similarly, limiting exposure to the full spectrum of depressive presentations.
  • Framework impact: The choice of simulation framework influences fidelity more than raw model size; smaller models can outperform larger ones if the framework encodes better clinical priors.
  • Human alignment: Pearson correlation > 0.78 between PSI‑Bench scores and expert ratings, confirming that the automatic diagnostics reflect genuine clinical judgments.
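
For the human‑alignment finding, the validation step is a straightforward correlation between expert ratings and aggregated automatic scores. The snippet below illustrates only that procedure; the numbers are invented placeholders, not data from the paper:

```python
# Illustration of the validation step only: ratings and scores below are
# made-up placeholders, not the paper's data. Only the Pearson procedure is real.
from scipy.stats import pearsonr

expert_realism = [4.0, 2.5, 3.5, 4.5, 2.0]        # hypothetical clinician ratings per dialogue
psi_bench_score = [0.82, 0.40, 0.70, 0.90, 0.35]  # hypothetical aggregated automatic scores

r, p_value = pearsonr(expert_realism, psi_bench_score)
print(f"Pearson r = {r:.2f} (p = {p_value:.3f})")  # the paper reports r > 0.78
```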

Practical Implications

  • Training platforms: Developers of mental‑health chatbots or VR role‑play systems can plug PSI‑Bench into their CI pipelines to catch unrealistic patient behavior early, reducing the risk of training on misleading scenarios (a hypothetical gate script is sketched after this list).
  • Model selection: The benchmark shows that a well‑designed simulation scaffold can outweigh sheer model scale, guiding teams to invest in domain‑specific prompts or rule‑based scaffolding rather than only chasing larger LLMs.
  • Safety & compliance: By flagging overly optimistic emotional trajectories or rapid “recovery” signals, PSI‑Bench helps ensure that simulated patients do not inadvertently teach harmful therapeutic shortcuts.
  • Extensibility: Because the metrics are modular, product teams can add condition‑specific dimensions (e.g., anxiety, PTSD) or integrate custom sentiment classifiers, making PSI‑Bench a reusable evaluation backbone for broader mental‑health AI.
  • Regulatory readiness: Transparent, clinically anchored metrics can support documentation required for medical device or AI‑in‑healthcare certifications, easing the path to market for simulation‑based training tools.
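
As a purely hypothetical example of the CI‑pipeline idea above, a gate script might run the released evaluation scripts on a nightly simulator build and fail the job when diagnostics fall outside agreed ranges; `run_psi_bench`, the metric names, and the thresholds below are illustrative assumptions, not part of the paper's release:

```python
# Hypothetical CI gate. `run_psi_bench` is a stand-in wrapper around the
# open-source evaluation scripts; metric names and thresholds are illustrative.
import sys

def run_psi_bench(simulator_build: str) -> dict:
    # Placeholder: in practice this would invoke the released evaluation scripts
    # and return the aggregated diagnostic report for the given build.
    return {"mean_turn_length": 48.0, "population_diversity": 0.62, "trajectory_realism": 0.71}

# Acceptable (min, max) ranges chosen by the team, not prescribed by the paper.
THRESHOLDS = {
    "mean_turn_length": (0.0, 60.0),
    "population_diversity": (0.5, 1.0),
    "trajectory_realism": (0.6, 1.0),
}

report = run_psi_bench("depression-sim-nightly")
failures = [name for name, (lo, hi) in THRESHOLDS.items() if not lo <= report[name] <= hi]

if failures:
    print(f"PSI-Bench gate failed on: {failures}")
    sys.exit(1)  # non-zero exit fails the CI job, catching unrealistic behavior early
print("PSI-Bench gate passed")
```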

Limitations & Future Work

  • Scope limited to depression: While the framework is designed to be extensible, the current validation only covers depressive symptomatology; other disorders may need new clinical axes.
  • Reliance on automatic sentiment tools: The affect classifiers themselves inherit biases and may misinterpret nuanced language, potentially skewing some metrics.
  • Static prompts: The benchmark evaluates static LLM outputs; future work could incorporate adaptive prompting or reinforcement‑learning‑based simulators that evolve during a session.
  • Human study size: The expert validation involved a modest number of clinicians; larger, more diverse panels would strengthen generalizability.
  • Real‑world deployment testing: The authors plan to integrate PSI‑Bench into live training curricula to measure downstream effects on trainee competence and patient outcomes.

PSI‑Bench marks a decisive step toward trustworthy, interpretable, and clinically useful AI patient simulators—tools that could democratize high‑quality mental‑health training while keeping safety front‑and‑center.

Authors

  • Nguyen Khoi Hoang
  • Shuhaib Mehri
  • Tse-An Hsu
  • Yi-Jyun Sun
  • Quynh Xuan Nguyen Truong
  • Khoa D Doan
  • Dilek Hakkani‑Tür

Paper Information

  • arXiv ID: 2604.25840v1
  • Categories: cs.CL, cs.AI
  • Published: April 28, 2026
