[Paper] A Systematic Evaluation of Large Language Models for PTSD Severity Estimation: The Role of Contextual Knowledge and Modeling Strategies

Published: February 5, 2026
Source: arXiv - 2602.06015v1

Overview

This paper investigates how well modern large language models (LLMs) can estimate the severity of post‑traumatic stress disorder (PTSD) from raw narrative text. By systematically benchmarking 11 state‑of‑the‑art models on clinical narratives from 1,437 participants, the authors identify which prompts, reasoning strategies, and model‑size choices actually move the needle on prediction accuracy.

Key Contributions

  • Comprehensive benchmark of 11 leading LLMs (both open‑weight and closed‑weight) on a PTSD severity task with 1,437 participants.
  • Systematic prompt engineering study that varies contextual knowledge (subscale definitions, summary statistics, interview questions) to quantify its impact on performance.
  • Comparison of modeling strategies: zero‑shot vs. few‑shot, chain‑of‑thought reasoning depth, direct scalar vs. structured subscale prediction, output rescaling, and nine different ensembling techniques.
  • Empirical scaling insights: open‑weight models plateau after ~70 B parameters, while newer closed‑weight models (e.g., GPT‑4‑turbo, GPT‑5) keep improving.
  • Best‑in‑class recipe: an ensemble that mixes a supervised baseline with zero‑shot LLM outputs yields the highest correlation with ground‑truth PTSD scores.

Methodology

  1. Data – The authors use a clinical corpus containing free‑form trauma narratives and self‑reported PTSD severity scores (derived from the standard PCL‑5 questionnaire).
  2. Prompt families – For each model, the authors craft several prompt templates (sketched in code after this list):
    • Minimal: just the raw narrative.
    • Context‑rich: narrative + explicit definitions of each PTSD subscale, plus a brief statistical summary of the dataset.
    • Interview‑style: narrative + the exact interview questions that generated the self‑report.
  3. Model configurations
    • Zero‑shot: model receives only the prompt.
    • Few‑shot: up to 5 exemplars of narrative‑score pairs are added.
    • Reasoning depth: plain answer vs. chain‑of‑thought (CoT) prompting that forces the model to “think step‑by‑step”.
    • Output format: direct scalar prediction (0‑100) vs. predicting each subscale separately and aggregating.
  4. Ensembling – Nine strategies (simple averaging, weighted voting, stacking with a linear regressor, etc.) combine predictions from multiple LLMs and a supervised baseline (e.g., a fine‑tuned BERT).
  5. Evaluation – Pearson/Spearman correlation and mean absolute error (MAE) against the gold‑standard PTSD scores.
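To make the setup concrete, here is a minimal sketch of how the prompt families and model configurations could be wired together. The template wording, the `query_model` stub, and the PCL‑5 cluster names in `SUBSCALES` are illustrative assumptions, not the authors' exact prompts.

```python
# Illustrative sketch only: prompt wording, the query_model stub, and the
# subscale (PCL-5 cluster) names are assumptions, not the paper's exact setup.
import re

# For the structured-output variant, the prompt asks for one score per cluster
# in SUBSCALES and the per-cluster scores are aggregated into a total.
SUBSCALES = ["intrusion", "avoidance", "negative_mood_cognition", "arousal"]

def build_prompt(narrative, family="context_rich", exemplars=None, cot=False,
                 definitions="", interview_questions=""):
    """Assemble one of the three prompt families described above."""
    parts = []
    if family == "context_rich":
        parts.append("PTSD subscale definitions:\n" + definitions)
    elif family == "interview":
        parts.append("Interview questions behind the self-report:\n" + interview_questions)
    # Few-shot: prepend up to 5 narrative -> score exemplars.
    for ex_narrative, ex_score in (exemplars or [])[:5]:
        parts.append(f"Narrative: {ex_narrative}\nSeverity (0-100): {ex_score}")
    parts.append(f"Narrative: {narrative}")
    if cot:
        parts.append("Think step by step about each symptom cluster, then give a final score.")
    parts.append("Answer with 'Severity: <number between 0 and 100>'.")
    return "\n\n".join(parts)

def parse_scalar(response):
    """Pull the 0-100 severity estimate out of the model's text response."""
    match = re.search(r"Severity:\s*([0-9]+(?:\.[0-9]+)?)", response)
    return float(match.group(1)) if match else None

def query_model(prompt):
    """Placeholder for any chat-completion backend (OpenAI, Anthropic, a local LLaMA, ...)."""
    raise NotImplementedError

# Usage: zero-shot, context-rich, chain-of-thought.
# prompt = build_prompt(narrative_text, family="context_rich", cot=True, definitions=pcl5_definitions)
# score = parse_scalar(query_model(prompt))
```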

Results & Findings

How each factor affects accuracy:

  • Contextual knowledge (subscale definitions + interview questions): raises correlation by ~0.08 (≈10 % relative gain).
  • Chain‑of‑thought reasoning: consistently lower MAE (≈15 % improvement).
  • Model size, open‑weight (LLaMA, DeepSeek): plateaus after ~70 B parameters; larger models give diminishing returns.
  • Model size, closed‑weight (GPT‑3.5‑mini → GPT‑5): steady gains; GPT‑5 outperforms all others by a noticeable margin.
  • Zero‑shot vs. few‑shot: few‑shot offers a marginal benefit (≈2‑3 % boost) but adds prompt complexity.
  • Structured subscale prediction: slightly better calibration than direct scalar output.
  • Best ensemble: stacking a supervised BERT‑based regressor with the top 3 zero‑shot LLMs yields the highest Pearson r (≈0.78) and lowest MAE (≈4.2 points on a 0‑100 scale).

In short, the “right” prompt + a bit of reasoning beats raw model size, and smart ensembling tops everything.
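As a rough illustration of that winning recipe (a supervised baseline stacked with zero‑shot LLM outputs, scored with the paper's metrics), the sketch below uses a linear meta‑learner over pre‑collected predictions. The scikit‑learn/scipy calls and the feature layout are my assumptions, not the authors' released code.

```python
# Sketch of the stacking ensemble and evaluation, assuming per-model severity
# predictions are already collected; the linear meta-learner is an assumption
# consistent with "stacking with a linear regressor" described above.
import numpy as np
from scipy.stats import pearsonr, spearmanr
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error

def stack_predictions(train_preds, train_labels, test_preds):
    """
    train_preds / test_preds: arrays of shape (n_samples, n_predictors), where
    each column holds one predictor's severity estimates (e.g. a fine-tuned
    BERT baseline plus the top zero-shot LLMs).
    """
    meta = LinearRegression()
    meta.fit(train_preds, train_labels)
    return meta.predict(test_preds)

def evaluate(predictions, gold):
    """Report the metrics used in the paper: Pearson r, Spearman rho, and MAE."""
    return {
        "pearson_r": pearsonr(predictions, gold)[0],
        "spearman_rho": spearmanr(predictions, gold)[0],
        "mae": mean_absolute_error(gold, predictions),
    }

# Example with random stand-in data (real inputs would be model predictions).
rng = np.random.default_rng(0)
train_preds = rng.uniform(0, 100, size=(200, 4))   # baseline + 3 LLMs
train_labels = rng.uniform(0, 100, size=200)
test_preds = rng.uniform(0, 100, size=(50, 4))
test_labels = rng.uniform(0, 100, size=50)

ensemble = stack_predictions(train_preds, train_labels, test_preds)
print(evaluate(ensemble, test_labels))
```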

Practical Implications

  • Clinical decision support – Deploying a context‑rich prompt with CoT reasoning can turn an off‑the‑shelf LLM into a reliable triage tool for mental‑health professionals, flagging high‑severity cases for follow‑up.
  • Product design – SaaS platforms that ingest user‑generated health narratives (e.g., tele‑therapy apps) can improve risk scoring without costly model fine‑tuning, simply by adding structured definitions and a few exemplars.
  • Cost‑effective scaling – Since open‑weight models stop improving after ~70 B, companies can opt for smaller, open models plus a lightweight ensemble rather than paying for the latest closed‑weight API.
  • Regulatory compliance – The study highlights the importance of transparent prompting; audit logs can capture the exact prompt template used, aiding explainability requirements.
  • Rapid prototyping – The few‑shot and CoT techniques are easy to implement in existing LLM SDKs (OpenAI, Anthropic, Cohere), allowing developers to experiment with mental‑health scoring in days rather than weeks.

Limitations & Future Work

  • Dataset bias – The narratives come from a single clinical study; generalization to other languages, cultures, or trauma types remains untested.
  • Ground‑truth reliability – Self‑reported PTSD scores can be noisy; incorporating clinician‑rated labels could sharpen evaluation.
  • Safety & ethics – The paper does not explore potential harms of mis‑estimation (e.g., false reassurance), a critical next step before production deployment.
  • Model diversity – Only 11 LLMs were examined; newer multimodal or instruction‑tuned models might behave differently.
  • Longitudinal prediction – Future work could assess whether LLMs can track severity changes over time, opening doors to continuous monitoring tools.

Authors

  • Panagiotis Kaliosis
  • Adithya V Ganesan
  • Oscar N. E. Kjell
  • Whitney Ringwald
  • Scott Feltman
  • Melissa A. Carr
  • Dimitris Samaras
  • Camilo Ruggero
  • Benjamin J. Luft
  • Roman Kotov
  • Andrew H. Schwartz

Paper Information

  • arXiv ID: 2602.06015v1
  • Categories: cs.CL
  • Published: February 5, 2026