[Paper] A Systematic Evaluation of Large Language Models for PTSD Severity Estimation: The Role of Contextual Knowledge and Modeling Strategies
Source: arXiv - 2602.06015v1
Overview
This paper investigates how well modern large language models (LLMs) can estimate the severity of post‑traumatic stress disorder (PTSD) from raw narrative text. Through a systematic benchmark of 11 state‑of‑the‑art models on over a thousand real‑world clinical entries, the authors identify which prompting choices, reasoning strategies, and model sizes actually improve prediction accuracy.
Key Contributions
- Comprehensive benchmark of 11 leading LLMs (both open‑weight and closed‑weight) on a PTSD severity task with 1,437 participants.
- Systematic prompt engineering study that varies contextual knowledge (subscale definitions, summary statistics, interview questions) to quantify its impact on performance.
- Comparison of modeling strategies: zero‑shot vs. few‑shot, chain‑of‑thought reasoning depth, direct scalar vs. structured subscale prediction, output rescaling, and nine different ensembling techniques.
- Empirical scaling insights: open‑weight models plateau after ~70 B parameters, while newer closed‑weight models (e.g., GPT‑4‑turbo, GPT‑5) keep improving.
- Best‑in‑class recipe: an ensemble that mixes a supervised baseline with zero‑shot LLM outputs yields the highest correlation with ground‑truth PTSD scores.
Methodology
- Data – The authors use a clinical corpus containing free‑form trauma narratives and self‑reported PTSD severity scores (derived from the standard PCL‑5 questionnaire).
- Prompt families – for each model, several prompt templates are crafted:
  - Minimal: just the raw narrative.
  - Context‑rich: the narrative plus explicit definitions of each PTSD subscale and a brief statistical summary of the dataset.
  - Interview‑style: the narrative plus the exact interview questions that generated the self‑report.
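The three prompt families could be assembled roughly as follows. This is a hedged sketch: the subscale definitions, interview questions, and exact instruction wording below are placeholders, not the paper's actual templates.

```python
# Illustrative prompt-family templates; all text constants are
# placeholders, not the paper's real definitions or questions.

SUBSCALE_DEFINITIONS = (
    "Intrusion: unwanted, distressing trauma memories...\n"
    "Avoidance: effortful avoidance of trauma reminders...\n"
)  # placeholder definitions

INTERVIEW_QUESTIONS = "Q1: ...\nQ2: ..."  # placeholder questions

def minimal_prompt(narrative: str) -> str:
    """Minimal family: the raw narrative only."""
    return f"Narrative:\n{narrative}\n\nEstimate PTSD severity (0-100):"

def context_rich_prompt(narrative: str, dataset_stats: str) -> str:
    """Context-rich family: add subscale definitions and dataset stats."""
    return (
        f"PTSD subscale definitions:\n{SUBSCALE_DEFINITIONS}\n"
        f"Dataset summary statistics:\n{dataset_stats}\n\n"
        f"Narrative:\n{narrative}\n\nEstimate PTSD severity (0-100):"
    )

def interview_style_prompt(narrative: str) -> str:
    """Interview-style family: add the questions behind the self-report."""
    return (
        f"Interview questions:\n{INTERVIEW_QUESTIONS}\n\n"
        f"Narrative:\n{narrative}\n\nEstimate PTSD severity (0-100):"
    )
```

In a real pipeline the same narrative would be routed through each template so that only the contextual knowledge varies between conditions.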
- Model configurations:
  - Zero‑shot: the model receives only the prompt.
  - Few‑shot: up to 5 exemplars of narrative–score pairs are added.
  - Reasoning depth: a plain answer vs. chain‑of‑thought (CoT) prompting that forces the model to “think step by step”.
  - Output format: direct scalar prediction (0–100) vs. predicting each subscale separately and aggregating.
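The structured output format above can be sketched as a sum-and-rescale step. The cluster names and the per-cluster score range below are illustrative assumptions, not the paper's specification:

```python
# Sketch of structured subscale prediction: the model emits one score
# per PTSD symptom cluster, and the scores are aggregated and rescaled
# to the same 0-100 range as direct scalar prediction.
# ASSUMPTIONS: four clusters, each scored 0-20 (illustrative only).

CLUSTERS = ["intrusion", "avoidance", "cognition_mood", "arousal"]
PER_CLUSTER_MAX = 20.0  # assumed per-subscale maximum

def aggregate(subscale_scores: dict[str, float]) -> float:
    """Sum the subscale predictions and rescale the total to 0-100."""
    total = sum(subscale_scores[c] for c in CLUSTERS)
    max_total = PER_CLUSTER_MAX * len(CLUSTERS)
    return 100.0 * total / max_total

print(aggregate({"intrusion": 10, "avoidance": 5,
                 "cognition_mood": 12, "arousal": 8}))  # → 43.75
```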
- Ensembling – Nine strategies (simple averaging, weighted voting, stacking with a linear regressor, etc.) combine predictions from multiple LLMs and a supervised baseline (e.g., a fine‑tuned BERT).
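Two of the simpler strategies (simple averaging and a fixed-weight combination) can be sketched in a few lines; the stacking variant would instead learn such weights with a linear regressor on held-out predictions. The toy numbers are illustrative, not the paper's data:

```python
# Minimal sketch of two ensembling strategies over per-participant
# severity predictions from several models. Toy values are illustrative.

def average_ensemble(model_preds: list[list[float]]) -> list[float]:
    """Simple averaging: mean of the models' predictions per participant."""
    return [sum(vals) / len(vals) for vals in zip(*model_preds)]

def weighted_ensemble(model_preds: list[list[float]],
                      weights: list[float]) -> list[float]:
    """Weighted combination; weights are normalized to sum to 1."""
    total = sum(weights)
    return [sum(w * v for w, v in zip(weights, vals)) / total
            for vals in zip(*model_preds)]

llm_a = [40.0, 55.0, 70.0]  # toy per-participant predictions
llm_b = [50.0, 45.0, 80.0]
bert  = [45.0, 50.0, 75.0]  # supervised-baseline predictions

print(average_ensemble([llm_a, llm_b, bert]))                 # → [45.0, 50.0, 75.0]
print(weighted_ensemble([llm_a, llm_b, bert], [0.5, 0.25, 0.25]))
```

Stacking replaces the hand-set weights with coefficients fit on a validation split, which is what lets the supervised baseline and the LLMs complement each other.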
- Evaluation – Pearson/Spearman correlation and mean absolute error (MAE) against the gold‑standard PTSD scores.
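The evaluation metrics can be written out directly (in practice one would likely reach for `scipy.stats.pearsonr`/`spearmanr`); a pure-Python sketch with toy numbers:

```python
# Pearson correlation and mean absolute error, written out explicitly.
import math

def pearson_r(x: list[float], y: list[float]) -> float:
    """Pearson correlation: covariance over the product of std. deviations."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def mae(pred: list[float], gold: list[float]) -> float:
    """Mean absolute error between predictions and gold-standard scores."""
    return sum(abs(p - g) for p, g in zip(pred, gold)) / len(pred)

preds = [42.0, 55.0, 68.0, 30.0]  # toy predicted severities
gold  = [40.0, 60.0, 65.0, 35.0]  # toy gold-standard scores
print(round(pearson_r(preds, gold), 3))
print(mae(preds, gold))  # → 3.75
```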
Results & Findings
| Factor | Effect on Accuracy |
|---|---|
| Contextual knowledge (subscale definitions + interview Qs) | ↑ Correlation by ~0.08 (≈10 % relative gain) |
| Chain‑of‑thought reasoning | Consistently lower MAE (≈15 % improvement) |
| Model size – Open‑weight (LLaMA, DeepSeek) | Plateau after ~70 B parameters; larger models give diminishing returns |
| Model size – Closed‑weight (GPT‑3.5‑mini → GPT‑5) | Steady gains; GPT‑5 outperforms all others by a noticeable margin |
| Zero‑shot vs. Few‑shot | Few‑shot offers marginal benefit (≈2‑3 % boost) but adds prompt complexity |
| Structured subscale prediction | Slightly better calibration than direct scalar output |
| Best ensemble | Stacking a supervised BERT‑based regressor with top 3 zero‑shot LLMs yields the highest Pearson r (≈0.78) and lowest MAE (≈4.2 points on a 0‑100 scale) |
In short, the “right” prompt + a bit of reasoning beats raw model size, and smart ensembling tops everything.
Practical Implications
- Clinical decision support – Deploying a context‑rich prompt with CoT reasoning can turn an off‑the‑shelf LLM into a reliable triage tool for mental‑health professionals, flagging high‑severity cases for follow‑up.
- Product design – SaaS platforms that ingest user‑generated health narratives (e.g., tele‑therapy apps) can improve risk scoring without costly model fine‑tuning, simply by adding structured definitions and a few exemplars.
- Cost‑effective scaling – Since open‑weight models stop improving after ~70 B, companies can opt for smaller, open models plus a lightweight ensemble rather than paying for the latest closed‑weight API.
- Regulatory compliance – The study highlights the importance of transparent prompting; audit logs can capture the exact prompt template used, aiding explainability requirements.
- Rapid prototyping – The few‑shot and CoT techniques are easy to implement in existing LLM SDKs (OpenAI, Anthropic, Cohere), allowing developers to experiment with mental‑health scoring in days rather than weeks.
Limitations & Future Work
- Dataset bias – The narratives come from a single clinical study; generalization to other languages, cultures, or trauma types remains untested.
- Ground‑truth reliability – Self‑reported PTSD scores can be noisy; incorporating clinician‑rated labels could sharpen evaluation.
- Safety & ethics – The paper does not explore potential harms of mis‑estimation (e.g., false reassurance), a critical next step before production deployment.
- Model diversity – Only 11 LLMs were examined; newer multimodal or instruction‑tuned models might behave differently.
- Longitudinal prediction – Future work could assess whether LLMs can track severity changes over time, opening doors to continuous monitoring tools.
Authors
- Panagiotis Kaliosis
- Adithya V Ganesan
- Oscar N. E. Kjell
- Whitney Ringwald
- Scott Feltman
- Melissa A. Carr
- Dimitris Samaras
- Camilo Ruggero
- Benjamin J. Luft
- Roman Kotov
- Andrew H. Schwartz
Paper Information
- arXiv ID: 2602.06015v1
- Categories: cs.CL
- Published: February 5, 2026