[Paper] PSI-Bench: Towards Clinically Grounded and Interpretable Evaluation of Depression Patient Simulators

Published: (April 28, 2026 at 12:46 PM EDT)

Source: arXiv - 2604.25840v1

Overview

The paper introduces PSI‑Bench, a systematic, clinically grounded benchmark for evaluating AI‑driven simulators of depressed patients. Moving beyond coarse LLM‑as‑judge scoring, the authors provide interpretable diagnostics that reveal how well these simulators capture realistic, diverse, and therapeutically appropriate dialogue behavior, an essential step toward safe, scalable mental‑health training tools.

Key Contributions

  • PSI‑Bench framework: a multi‑level (turn, dialogue, population) evaluation suite that maps simulator outputs to clinically meaningful dimensions (e.g., emotional trajectory, lexical diversity, response length).
  • Interpretability: each metric is tied to a concrete therapeutic concept, allowing developers to see why a simulator succeeds or fails.
  • Extensive benchmarking: seven large language models (LLMs) are tested across two popular depression‑patient simulator architectures, exposing systematic shortcomings.
  • Human validation: expert clinicians rate a subset of simulated conversations, showing strong correlation with PSI‑Bench scores and confirming the benchmark’s real‑world relevance.
  • Open‑source release: the authors provide code, prompts, and evaluation scripts, enabling the community to extend the benchmark to other mental‑health conditions or simulation frameworks.

Methodology

  1. Define clinically relevant axes – The authors consulted mental‑health professionals to identify three layers of behavior to assess:
    • Turn‑level: length, lexical richness, sentiment polarity.
    • Dialogue‑level: emotional progression (negative → positive), resolution speed, consistency.
    • Population‑level: variability across simulated “patients” (e.g., different symptom profiles).
  2. Metric construction – For each axis, they built automatic measures (e.g., token count, type‑token ratio, sentiment scores from a validated affect classifier) and mapped them to clinical interpretations; a rough sketch of such measures appears after this list.
  3. Simulator setups – Two open‑source frameworks for depression patient simulation were used, each paired with seven LLM back‑ends ranging from 7B to 175B parameters.
  4. Benchmark execution – Hundreds of simulated dialogues were generated, metrics computed, and results aggregated into a concise diagnostic report per model‑framework pair.
  5. Human study – A panel of licensed therapists evaluated a random sample of dialogues, rating realism, therapeutic usefulness, and safety. Correlations between these human scores and PSI‑Bench metrics were calculated to validate the benchmark.
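
The paper's exact metric implementations live in the released evaluation scripts; the snippet below is only a hypothetical sketch of how the turn‑level and dialogue‑level measures described in steps 1–2 could be computed, using whitespace tokenization and NLTK's VADER model as stand‑ins for the tokenizer and the validated affect classifier the authors mention (all function names are illustrative, not the paper's API):

```python
# Rough sketch only -- not the paper's code. VADER stands in for the validated
# affect classifier; whitespace tokenization stands in for the real tokenizer.
# Requires: pip install nltk, then nltk.download("vader_lexicon") once.
from nltk.sentiment import SentimentIntensityAnalyzer

_sentiment = SentimentIntensityAnalyzer()

def turn_level_metrics(utterance: str) -> dict:
    """Length, lexical richness, and sentiment polarity for a single patient turn."""
    tokens = utterance.lower().split()
    length = len(tokens)
    # Type-token ratio: unique tokens / total tokens, a simple lexical-diversity proxy.
    ttr = len(set(tokens)) / length if length else 0.0
    # Compound polarity in [-1, 1]; strongly negative values suggest depressive affect.
    polarity = _sentiment.polarity_scores(utterance)["compound"]
    return {"length": length, "type_token_ratio": ttr, "sentiment": polarity}

def emotional_trajectory(patient_turns: list[str]) -> list[float]:
    """Dialogue-level view: per-turn sentiment, used to inspect the emotional arc."""
    return [turn_level_metrics(turn)["sentiment"] for turn in patient_turns]

if __name__ == "__main__":
    dialogue = [
        "I just feel tired all the time, nothing really helps.",
        "Talking about it today made things feel a little lighter.",
    ]
    print([turn_level_metrics(t) for t in dialogue])
    print(emotional_trajectory(dialogue))  # e.g., a negative-to-positive arc
```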

Results & Findings

  • Response length: Simulators tend to produce overly long replies, potentially overwhelming trainees.
  • Lexical diversity: High type‑token ratios indicate “wordy” outputs that lack the concise phrasing typical of real patients.
  • Emotional trajectory: Most dialogues follow a uniform negative‑to‑positive arc, ignoring the non‑linear mood swings seen in clinical practice.
  • Resolution speed: Simulated patients often “resolve” their distress within a few turns, under‑representing chronic or relapsing patterns.
  • Variability: Population‑level diversity is low; different simulated patients behave similarly, limiting exposure to the full spectrum of depressive presentations.
  • Framework impact: The choice of simulation framework influences fidelity more than raw model size; smaller models can outperform larger ones if the framework encodes better clinical priors.
  • Human alignment: Pearson correlation > 0.78 between PSI‑Bench scores and expert ratings, confirming that the automatic diagnostics reflect genuine clinical judgments.
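
For the human‑alignment finding, the validation step is a straightforward correlation between expert ratings and aggregated automatic scores. The snippet below illustrates only that procedure; the numbers are invented placeholders, not data from the paper:

```python
# Illustration of the validation step only: ratings and scores below are
# made-up placeholders, not the paper's data. Only the Pearson procedure is real.
from scipy.stats import pearsonr

expert_realism = [4.0, 2.5, 3.5, 4.5, 2.0]        # hypothetical clinician ratings per dialogue
psi_bench_score = [0.82, 0.40, 0.70, 0.90, 0.35]  # hypothetical aggregated automatic scores

r, p_value = pearsonr(expert_realism, psi_bench_score)
print(f"Pearson r = {r:.2f} (p = {p_value:.3f})")  # the paper reports r > 0.78
```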

Practical Implications

  • Training platforms: Developers of mental‑health chatbots or VR role‑play systems can plug PSI‑Bench into their CI pipelines to catch unrealistic patient behavior early, reducing the risk of training on misleading scenarios (a hypothetical gate script is sketched after this list).
  • Model selection: The benchmark shows that a well‑designed simulation scaffold can outweigh sheer model scale, guiding teams to invest in domain‑specific prompts or rule‑based scaffolding rather than only chasing larger LLMs.
  • Safety & compliance: By flagging overly optimistic emotional trajectories or rapid “recovery” signals, PSI‑Bench helps ensure that simulated patients do not inadvertently teach harmful therapeutic shortcuts.
  • Extensibility: Because the metrics are modular, product teams can add condition‑specific dimensions (e.g., anxiety, PTSD) or integrate custom sentiment classifiers, making PSI‑Bench a reusable evaluation backbone for broader mental‑health AI.
  • Regulatory readiness: Transparent, clinically anchored metrics can support documentation required for medical device or AI‑in‑healthcare certifications, easing the path to market for simulation‑based training tools.
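
As a purely hypothetical example of the CI‑pipeline idea above, a gate script might run the released evaluation scripts on a nightly simulator build and fail the job when diagnostics fall outside agreed ranges; `run_psi_bench`, the metric names, and the thresholds below are illustrative assumptions, not part of the paper's release:

```python
# Hypothetical CI gate. `run_psi_bench` is a stand-in wrapper around the
# open-source evaluation scripts; metric names and thresholds are illustrative.
import sys

def run_psi_bench(simulator_build: str) -> dict:
    # Placeholder: in practice this would invoke the released evaluation scripts
    # and return the aggregated diagnostic report for the given build.
    return {"mean_turn_length": 48.0, "population_diversity": 0.62, "trajectory_realism": 0.71}

# Acceptable (min, max) ranges chosen by the team, not prescribed by the paper.
THRESHOLDS = {
    "mean_turn_length": (0.0, 60.0),
    "population_diversity": (0.5, 1.0),
    "trajectory_realism": (0.6, 1.0),
}

report = run_psi_bench("depression-sim-nightly")
failures = [name for name, (lo, hi) in THRESHOLDS.items() if not lo <= report[name] <= hi]

if failures:
    print(f"PSI-Bench gate failed on: {failures}")
    sys.exit(1)  # non-zero exit fails the CI job, catching unrealistic behavior early
print("PSI-Bench gate passed")
```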

Limitations & Future Work

  • Scope limited to depression: While the framework is designed to be extensible, the current validation only covers depressive symptomatology; other disorders may need new clinical axes.
  • Reliance on automatic sentiment tools: The affect classifiers themselves inherit biases and may misinterpret nuanced language, potentially skewing some metrics.
  • Static prompts: The benchmark evaluates static LLM outputs; future work could incorporate adaptive prompting or reinforcement‑learning‑based simulators that evolve during a session.
  • Human study size: The expert validation involved a modest number of clinicians; larger, more diverse panels would strengthen generalizability.
  • Real‑world deployment testing: The authors plan to integrate PSI‑Bench into live training curricula to measure downstream effects on trainee competence and patient outcomes.

PSI‑Bench marks a decisive step toward trustworthy, interpretable, and clinically useful AI patient simulators—tools that could democratize high‑quality mental‑health training while keeping safety front‑and‑center.

Authors

  • Nguyen Khoi Hoang
  • Shuhaib Mehri
  • Tse-An Hsu
  • Yi-Jyun Sun
  • Quynh Xuan Nguyen Truong
  • Khoa D Doan
  • Dilek Hakkani‑Tür

Paper Information

  • arXiv ID: 2604.25840v1
  • Categories: cs.CL, cs.AI
  • Published: April 28, 2026
