[Paper] A Systematic Evaluation of Large Language Models for PTSD Severity Estimation: The Role of Contextual Knowledge and Modeling Strategies
Source: arXiv - 2602.06015v1
Overview
This paper investigates how well modern large language models (LLMs) can estimate the severity of post‑traumatic stress disorder (PTSD) from raw narrative text. Through a systematic benchmark of 11 state‑of‑the‑art models on over a thousand real‑world clinical entries, the authors identify which prompting choices, reasoning strategies, and model sizes actually improve prediction accuracy.
Key Contributions
- Comprehensive benchmark of 11 leading LLMs (both open‑weight and closed‑weight) on a PTSD severity task with 1,437 participants.
- Systematic prompt engineering study that varies contextual knowledge (subscale definitions, summary statistics, interview questions) to quantify its impact on performance.
- Comparison of modeling strategies: zero‑shot vs. few‑shot, chain‑of‑thought reasoning depth, direct scalar vs. structured subscale prediction, output rescaling, and nine different ensembling techniques.
- Empirical scaling insights: open‑weight models plateau after ~70 B parameters, while newer closed‑weight models (e.g., GPT‑4‑turbo, GPT‑5) keep improving.
- Best‑in‑class recipe: an ensemble that mixes a supervised baseline with zero‑shot LLM outputs yields the highest correlation with ground‑truth PTSD scores.
Methodology
- Data – The authors use a clinical corpus containing free‑form trauma narratives and self‑reported PTSD severity scores (derived from the standard PCL‑5 questionnaire).
- Prompt families – for each model, several prompt templates are crafted:
  - Minimal: just the raw narrative.
  - Context‑rich: the narrative plus explicit definitions of each PTSD subscale and a brief statistical summary of the dataset.
  - Interview‑style: the narrative plus the exact interview questions that generated the self‑report.
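The three prompt families could be assembled roughly as follows. This is a hedged sketch: the subscale definitions, interview questions, and exact instruction wording below are placeholders, not the paper's actual templates.

```python
# Illustrative prompt-family templates; all text constants are
# placeholders, not the paper's real definitions or questions.

SUBSCALE_DEFINITIONS = (
    "Intrusion: unwanted, distressing trauma memories...\n"
    "Avoidance: effortful avoidance of trauma reminders...\n"
)  # placeholder definitions

INTERVIEW_QUESTIONS = "Q1: ...\nQ2: ..."  # placeholder questions

def minimal_prompt(narrative: str) -> str:
    """Minimal family: the raw narrative only."""
    return f"Narrative:\n{narrative}\n\nEstimate PTSD severity (0-100):"

def context_rich_prompt(narrative: str, dataset_stats: str) -> str:
    """Context-rich family: add subscale definitions and dataset stats."""
    return (
        f"PTSD subscale definitions:\n{SUBSCALE_DEFINITIONS}\n"
        f"Dataset summary statistics:\n{dataset_stats}\n\n"
        f"Narrative:\n{narrative}\n\nEstimate PTSD severity (0-100):"
    )

def interview_style_prompt(narrative: str) -> str:
    """Interview-style family: add the questions behind the self-report."""
    return (
        f"Interview questions:\n{INTERVIEW_QUESTIONS}\n\n"
        f"Narrative:\n{narrative}\n\nEstimate PTSD severity (0-100):"
    )
```

In a real pipeline the same narrative would be routed through each template so that only the contextual knowledge varies between conditions.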
- Model configurations:
  - Zero‑shot: the model receives only the prompt.
  - Few‑shot: up to 5 exemplars of narrative–score pairs are added.
  - Reasoning depth: a plain answer vs. chain‑of‑thought (CoT) prompting that forces the model to “think step by step”.
  - Output format: direct scalar prediction (0–100) vs. predicting each subscale separately and aggregating.
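The structured output format above can be sketched as a sum-and-rescale step. The cluster names and the per-cluster score range below are illustrative assumptions, not the paper's specification:

```python
# Sketch of structured subscale prediction: the model emits one score
# per PTSD symptom cluster, and the scores are aggregated and rescaled
# to the same 0-100 range as direct scalar prediction.
# ASSUMPTIONS: four clusters, each scored 0-20 (illustrative only).

CLUSTERS = ["intrusion", "avoidance", "cognition_mood", "arousal"]
PER_CLUSTER_MAX = 20.0  # assumed per-subscale maximum

def aggregate(subscale_scores: dict[str, float]) -> float:
    """Sum the subscale predictions and rescale the total to 0-100."""
    total = sum(subscale_scores[c] for c in CLUSTERS)
    max_total = PER_CLUSTER_MAX * len(CLUSTERS)
    return 100.0 * total / max_total

print(aggregate({"intrusion": 10, "avoidance": 5,
                 "cognition_mood": 12, "arousal": 8}))  # → 43.75
```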
- Ensembling – Nine strategies (simple averaging, weighted voting, stacking with a linear regressor, etc.) combine predictions from multiple LLMs and a supervised baseline (e.g., a fine‑tuned BERT).
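Two of the simpler strategies (simple averaging and a fixed-weight combination) can be sketched in a few lines; the stacking variant would instead learn such weights with a linear regressor on held-out predictions. The toy numbers are illustrative, not the paper's data:

```python
# Minimal sketch of two ensembling strategies over per-participant
# severity predictions from several models. Toy values are illustrative.

def average_ensemble(model_preds: list[list[float]]) -> list[float]:
    """Simple averaging: mean of the models' predictions per participant."""
    return [sum(vals) / len(vals) for vals in zip(*model_preds)]

def weighted_ensemble(model_preds: list[list[float]],
                      weights: list[float]) -> list[float]:
    """Weighted combination; weights are normalized to sum to 1."""
    total = sum(weights)
    return [sum(w * v for w, v in zip(weights, vals)) / total
            for vals in zip(*model_preds)]

llm_a = [40.0, 55.0, 70.0]  # toy per-participant predictions
llm_b = [50.0, 45.0, 80.0]
bert  = [45.0, 50.0, 75.0]  # supervised-baseline predictions

print(average_ensemble([llm_a, llm_b, bert]))                 # → [45.0, 50.0, 75.0]
print(weighted_ensemble([llm_a, llm_b, bert], [0.5, 0.25, 0.25]))
```

Stacking replaces the hand-set weights with coefficients fit on a validation split, which is what lets the supervised baseline and the LLMs complement each other.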
- Evaluation – Pearson/Spearman correlation and mean absolute error (MAE) against the gold‑standard PTSD scores.
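The evaluation metrics can be written out directly (in practice one would likely reach for `scipy.stats.pearsonr`/`spearmanr`); a pure-Python sketch with toy numbers:

```python
# Pearson correlation and mean absolute error, written out explicitly.
import math

def pearson_r(x: list[float], y: list[float]) -> float:
    """Pearson correlation: covariance over the product of std. deviations."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def mae(pred: list[float], gold: list[float]) -> float:
    """Mean absolute error between predictions and gold-standard scores."""
    return sum(abs(p - g) for p, g in zip(pred, gold)) / len(pred)

preds = [42.0, 55.0, 68.0, 30.0]  # toy predicted severities
gold  = [40.0, 60.0, 65.0, 35.0]  # toy gold-standard scores
print(round(pearson_r(preds, gold), 3))
print(mae(preds, gold))  # → 3.75
```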
Results & Findings
| Factor | Effect on Accuracy |
|---|---|
| Contextual knowledge (subscale definitions + interview Qs) | ↑ Correlation by ~0.08 (≈10 % relative gain) |
| Chain‑of‑thought reasoning | Consistently lower MAE (≈15 % improvement) |
| Model size – Open‑weight (LLaMA, DeepSeek) | Plateau after ~70 B parameters; larger models give diminishing returns |
| Model size – Closed‑weight (GPT‑3.5‑mini → GPT‑5) | Steady gains; GPT‑5 outperforms all others by a noticeable margin |
| Zero‑shot vs. Few‑shot | Few‑shot offers marginal benefit (≈2‑3 % boost) but adds prompt complexity |
| Structured subscale prediction | Slightly better calibration than direct scalar output |
| Best ensemble | Stacking a supervised BERT‑based regressor with top 3 zero‑shot LLMs yields the highest Pearson r (≈0.78) and lowest MAE (≈4.2 points on a 0‑100 scale) |
In short, the “right” prompt + a bit of reasoning beats raw model size, and smart ensembling tops everything.
Practical Implications
- Clinical decision support – Deploying a context‑rich prompt with CoT reasoning can turn an off‑the‑shelf LLM into a reliable triage tool for mental‑health professionals, flagging high‑severity cases for follow‑up.
- Product design – SaaS platforms that ingest user‑generated health narratives (e.g., tele‑therapy apps) can improve risk scoring without costly model fine‑tuning, simply by adding structured definitions and a few exemplars.
- Cost‑effective scaling – Since open‑weight models stop improving after ~70 B, companies can opt for smaller, open models plus a lightweight ensemble rather than paying for the latest closed‑weight API.
- Regulatory compliance – The study highlights the importance of transparent prompting; audit logs can capture the exact prompt template used, aiding explainability requirements.
- Rapid prototyping – The few‑shot and CoT techniques are easy to implement in existing LLM SDKs (OpenAI, Anthropic, Cohere), allowing developers to experiment with mental‑health scoring in days rather than weeks.
Limitations & Future Work
- Dataset bias – The narratives come from a single clinical study; generalization to other languages, cultures, or trauma types remains untested.
- Ground‑truth reliability – Self‑reported PTSD scores can be noisy; incorporating clinician‑rated labels could sharpen evaluation.
- Safety & ethics – The paper does not explore potential harms of mis‑estimation (e.g., false reassurance), a critical next step before production deployment.
- Model diversity – Only 11 LLMs were examined; newer multimodal or instruction‑tuned models might behave differently.
- Longitudinal prediction – Future work could assess whether LLMs can track severity changes over time, opening doors to continuous monitoring tools.
Authors
- Panagiotis Kaliosis
- Adithya V Ganesan
- Oscar N. E. Kjell
- Whitney Ringwald
- Scott Feltman
- Melissa A. Carr
- Dimitris Samaras
- Camilo Ruggero
- Benjamin J. Luft
- Roman Kotov
- Andrew H. Schwartz
Paper Information
- arXiv ID: 2602.06015v1
- Categories: cs.CL
- Published: February 5, 2026