[Paper] PsihoRo: Depression and Anxiety Romanian Text Corpus

Published: February 20, 2026 at 11:24 AM EST

Source: arXiv - 2602.18324v1

Overview

The paper introduces PsihoRo, the first open‑source Romanian text corpus focused on depression and anxiety. By pairing short, open‑ended responses with scores from the clinically validated PHQ‑9 and GAD‑7 questionnaires, the authors provide a resource for mental‑health NLP in a language that has been largely overlooked.

Key Contributions

  • First Romanian mental‑health corpus (205 participants) annotated with PHQ‑9 (depression) and GAD‑7 (anxiety) scores.
  • Data collection pipeline that combines open‑ended questionnaire items with standardized self‑report scales, so that every text is paired with a depression score and an anxiety score.
  • Baseline analyses using Romanian LIWC, emotion detection, and topic modeling to surface linguistic markers of distress.
  • Public release of the raw texts, questionnaire responses, and derived linguistic features under an open‑source license.
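
For readers unfamiliar with the two scales, the sketch below shows how PHQ‑9 and GAD‑7 totals are conventionally computed: each item is answered on a 0 to 3 scale and the items are summed, giving a range of 0 to 27 for PHQ‑9 and 0 to 21 for GAD‑7. The severity bands are the commonly published cut‑offs for these instruments; this summary does not say whether the authors bin scores this way, so treat the banding as an assumption. The function names and the sample answers are made up for illustration.

```python
from typing import List, Sequence, Tuple

# Commonly published severity cut-offs as (lower bound, label), highest first.
# Assumption: the summary does not state that the paper uses these bands.
PHQ9_BANDS: List[Tuple[int, str]] = [
    (20, "severe"), (15, "moderately severe"), (10, "moderate"),
    (5, "mild"), (0, "minimal"),
]
GAD7_BANDS: List[Tuple[int, str]] = [
    (15, "severe"), (10, "moderate"), (5, "mild"), (0, "minimal"),
]


def total_score(items: Sequence[int], n_items: int) -> int:
    """Sum questionnaire items after checking count and the 0-3 answer range."""
    if len(items) != n_items:
        raise ValueError(f"expected {n_items} items, got {len(items)}")
    if any(v not in (0, 1, 2, 3) for v in items):
        raise ValueError("each item must be an integer from 0 to 3")
    return sum(items)


def severity(total: int, bands: List[Tuple[int, str]]) -> str:
    """Return the label of the first band whose lower bound the total reaches."""
    for lower, label in bands:
        if total >= lower:
            return label
    raise ValueError("total must be non-negative")


# Made-up answers for one hypothetical participant.
phq9_total = total_score([1, 2, 0, 3, 1, 0, 2, 1, 0], n_items=9)
gad7_total = total_score([0, 1, 1, 0, 2, 0, 0], n_items=7)
phq9_label = severity(phq9_total, PHQ9_BANDS)
gad7_label = severity(gad7_total, GAD7_BANDS)
```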

Methodology

  1. Survey Design – Participants completed a short form containing six open‑ended prompts (e.g., “Describe a recent situation that made you feel sad”) followed by the PHQ‑9 and GAD‑7 questionnaires.
  2. Recruitment & Ethics – 205 Romanian‑speaking volunteers were recruited online, gave informed consent, and were assured anonymity.
  3. Pre‑processing – Texts were tokenized, lemmatized, and cleaned of personally identifying information.
  4. Linguistic Annotation – The authors applied a Romanian version of the Linguistic Inquiry and Word Count (LIWC) dictionary to extract psychological categories (e.g., affect, cognition, social).
  5. Emotion & Topic Modeling – A pre‑trained multilingual emotion classifier provided fine‑grained emotion scores, while Latent Dirichlet Allocation (LDA) uncovered dominant discussion topics.
  6. Statistical Linking – Correlations between LIWC/emotion features and PHQ‑9/GAD‑7 scores were computed to check whether the corpus captures mental‑health signals.
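
Steps 4 and 6 can be illustrated with a minimal sketch: count how often words from a dictionary category appear in each response, then correlate that rate with the questionnaire totals. This is not the authors' code. The tiny lexicon is a hypothetical stand‑in for the Romanian LIWC dictionary, the responses and scores are made up, and Pearson correlation is an assumption, since the summary does not name the coefficient used.

```python
import math
import re
from typing import List, Sequence, Set

# Hypothetical mini-lexicon standing in for one dictionary category.
NEGATIVE_EMOTION: Set[str] = {"trist", "tristă", "singur", "singură", "frică"}


def tokenize(text: str) -> List[str]:
    """Lowercase and split on word characters (Unicode-aware, so ă, ș, ț survive)."""
    return re.findall(r"\w+", text.lower())


def category_rate(text: str, words: Set[str]) -> float:
    """Share of tokens in the text that belong to the given category."""
    tokens = tokenize(text)
    if not tokens:
        return 0.0
    return sum(t in words for t in tokens) / len(tokens)


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equal-length samples."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two samples of equal length >= 2")
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        raise ValueError("correlation is undefined for a constant sample")
    return cov / math.sqrt(vx * vy)


# Made-up responses and PHQ-9 totals, for illustration only.
responses = [
    "Eu sunt trist și mă simt singur",
    "Am fost la munte cu prietenii",
    "Mi este frică și eu nu dorm",
]
phq9_totals = [14, 2, 17]

rates = [category_rate(r, NEGATIVE_EMOTION) for r in responses]
r_value = pearson(rates, phq9_totals)
```

With only three toy samples the coefficient means nothing statistically; on the real corpus one would also report a significance test and correct for the number of categories compared.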

Results & Findings

  • Strong linguistic signals: Higher depression scores correlated with increased use of first‑person singular pronouns, negative emotion words, and cognitive‑process terms (e.g., “think”, “know”).
  • Anxiety markers: Elevated GAD‑7 scores were linked to more frequent use of uncertainty words (e.g., “maybe”, “perhaps”) and fewer positive emotion terms.
  • Emotion classifier: The multilingual model reliably distinguished sadness, anxiety, and neutral states, achieving an average F1‑score of ~0.78 on a held‑out subset.
  • Topic insights: LDA revealed recurring themes such as “family relationships,” “work stress,” and “health concerns,” aligning with known risk factors for depression and anxiety in Romanian populations.
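
The summary reports an "average F1‑score" without saying how the average is taken. One common reading is the macro average: compute F1 per class, then take the unweighted mean. The sketch below shows that calculation on made‑up labels; the three class names follow the bullet above, and both the averaging scheme and the data are assumptions rather than details from the paper.

```python
from typing import List, Sequence


def f1_for_label(y_true: Sequence[str], y_pred: Sequence[str], label: str) -> float:
    """F1 for one class, treating that class as the positive label."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == label and p == label for t, p in pairs)
    fp = sum(t != label and p == label for t, p in pairs)
    fn = sum(t == label and p != label for t, p in pairs)
    if tp == 0:
        return 0.0  # no true positives, so precision or recall is zero
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def macro_f1(y_true: Sequence[str], y_pred: Sequence[str], labels: List[str]) -> float:
    """Unweighted mean of the per-class F1 scores."""
    return sum(f1_for_label(y_true, y_pred, lab) for lab in labels) / len(labels)


# Made-up gold labels and predictions, for illustration only.
labels = ["sadness", "anxiety", "neutral"]
y_true = ["sadness", "sadness", "anxiety", "anxiety", "neutral", "neutral"]
y_pred = ["sadness", "anxiety", "anxiety", "anxiety", "neutral", "sadness"]

per_class = {lab: f1_for_label(y_true, y_pred, lab) for lab in labels}
score = macro_f1(y_true, y_pred, labels)
```

A weighted or micro average would give a different number on imbalanced data, which is why the averaging scheme matters when comparing against the reported figure.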

Practical Implications

  • Clinical decision support: Developers can fine‑tune sentiment or mental‑health classifiers on PsihoRo to build tools that flag at‑risk users in Romanian‑language mental‑health apps, forums, or tele‑therapy platforms.
  • Cross‑lingual research: The corpus enables transfer‑learning experiments, helping researchers evaluate how models trained on English mental‑health data perform on Romanian text.
  • Public‑health monitoring: Aggregated linguistic trends from PsihoRo can inform policymakers about prevalent stressors (e.g., economic uncertainty) within specific Romanian demographics.
  • Educational resources: Language‑learning platforms can incorporate mental‑health awareness modules, using authentic Romanian expressions of distress identified in the dataset.

Limitations & Future Work

  • Size & Diversity: With 205 respondents, the corpus is modest and may not capture the full sociolinguistic variation across Romania (e.g., regional dialects, age groups).
  • Self‑Report Bias: PHQ‑9 and GAD‑7 rely on participants’ willingness to disclose symptoms, which can introduce under‑reporting.
  • Domain Scope: The open‑ended prompts are limited to six topics; broader conversational data (e.g., social‑media posts) could enrich the linguistic landscape.
  • Future Directions: The authors plan to expand the dataset, incorporate multimodal signals (audio, facial expressions), and explore longitudinal tracking to study symptom trajectories over time.

Authors

  • Alexandra Ciobotaru
  • Ana‑Maria Bucur
  • Liviu P. Dinu

Paper Information

  • arXiv ID: 2602.18324v1
  • Categories: cs.CL
  • Published: February 20, 2026