[Paper] Ask, Answer, and Detect: Role-Playing LLMs for Personality Detection with Question-Conditioned Mixture-of-Experts

Published: December 9, 2025 at 12:07 PM EST
4 min read
Source: arXiv (2512.08814v1)

Overview

The paper introduces ROME, a new framework that uses large language models (LLMs) to “role‑play” as a test‑taker on classic personality questionnaires (e.g., MBTI, Big‑5). By turning a user’s raw social‑media posts into simulated answers to validated psychometric items, ROME creates a transparent bridge between noisy text and abstract personality labels, dramatically improving detection accuracy while keeping the reasoning process interpretable for developers.

Key Contributions

  • Psychology‑aware prompting: Leverages LLMs’ ability to answer questionnaire items, injecting domain‑specific knowledge directly into the model.
  • Question‑conditioned Mixture‑of‑Experts (MoE): A lightweight routing module that jointly processes the original post and the generated question context, learning to predict questionnaire answers as an auxiliary task.
  • Answer‑vector supervision: Converts the LLM‑generated answers into a structured “answer vector” that serves as rich intermediate supervision, alleviating the scarcity of labeled personality data.
  • Multi‑task learning pipeline: Simultaneously trains on answer prediction and final personality classification, yielding a more robust end‑to‑end system.
  • Strong empirical gains: Shows up to 15.4 % relative improvement over the best prior methods on a public Kaggle personality dataset and consistent gains on a second benchmark.

Methodology

  1. Data Preparation – Each user’s collection of posts is paired with a set of psychometric questions (e.g., “I enjoy social gatherings”).

  2. Role‑Playing LLM – A pre‑trained LLM (e.g., GPT‑3.5) is prompted to answer each question as if it were the user, using the user’s posts as context. This yields a question‑level answer (typically a Likert‑scale score).

  3. Question‑Conditioned MoE

    • The post text is encoded with a transformer encoder.
    • Each question is also embedded (via the same LLM or a smaller encoder).
    • A gating network decides which expert (a small feed‑forward sub‑network) should handle the interaction between a specific post and question pair.
  4. Answer Vector Construction – All predicted answers are concatenated into a fixed‑size vector that directly mirrors the structure of the original questionnaire.

  5. Multi‑Task Objective – Two losses are optimized together:

    • Answer prediction loss (supervised by the LLM‑generated answer vectors, which serve as intermediate labels and reduce reliance on scarce human annotations).
    • Personality classification loss (standard cross‑entropy on MBTI/Big‑5 labels).

    The auxiliary answer task forces the model to learn psychologically meaningful representations, which in turn improve the final label prediction; a minimal sketch of this pipeline follows below.
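Below is a minimal sketch of the role-playing step (items 1, 2, and 4 above), assuming an OpenAI-style chat client and a 1–5 Likert scale; the model name, prompt wording, and questionnaire items are illustrative placeholders rather than the paper's exact setup.

```python
# Minimal sketch of the role-playing step, not the authors' code.
# Model name, prompt wording, and questionnaire items are illustrative assumptions.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

QUESTIONNAIRE = [
    "I enjoy social gatherings.",
    "I often worry about things that might go wrong.",
    # ... remaining psychometric items
]

def role_play_answer(posts: list[str], item: str) -> int:
    """Ask the LLM to answer one questionnaire item as if it were the user (1-5 Likert)."""
    prompt = (
        "You are role-playing as the author of the following social media posts.\n"
        "Posts:\n" + "\n".join(f"- {p}" for p in posts) + "\n\n"
        f'Questionnaire item: "{item}"\n'
        "Reply with a single number from 1 (strongly disagree) to 5 (strongly agree)."
    )
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
        temperature=0.0,
    )
    reply = response.choices[0].message.content.strip()
    digits = [c for c in reply if c.isdigit()]
    return int(digits[0]) if digits else 3  # fall back to the neutral midpoint

def build_answer_vector(posts: list[str]) -> list[int]:
    """Concatenate per-item answers into a fixed-size vector mirroring the questionnaire."""
    return [role_play_answer(posts, item) for item in QUESTIONNAIRE]
```

And a minimal PyTorch sketch of what the question-conditioned MoE head and the joint objective (items 3 and 5) could look like; the hidden size, number of experts, item count, label set, and loss weighting are assumptions, not values reported in the paper.

```python
# Sketch of a question-conditioned MoE head with the two losses, under assumed
# dimensions; the paper's actual architecture and loss weighting may differ.
import torch
import torch.nn as nn
import torch.nn.functional as F

class QuestionConditionedMoE(nn.Module):
    def __init__(self, hidden=768, n_experts=4, n_items=60, n_likert=5, n_labels=16):
        super().__init__()
        # One small feed-forward expert per routing slot.
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(2 * hidden, hidden), nn.GELU(), nn.Linear(hidden, hidden))
            for _ in range(n_experts)
        ])
        # The gate sees the concatenated (post, question) representation and weights the experts.
        self.gate = nn.Linear(2 * hidden, n_experts)
        self.answer_head = nn.Linear(hidden, n_likert)            # auxiliary: per-item answer
        self.label_head = nn.Linear(hidden * n_items, n_labels)   # final: personality label

    def forward(self, post_emb, question_embs):
        # post_emb: (B, H) pooled post encoding; question_embs: (Q, H) item encodings, Q == n_items.
        B, H = post_emb.shape
        Q = question_embs.shape[0]
        pairs = torch.cat(
            [post_emb.unsqueeze(1).expand(B, Q, H),
             question_embs.unsqueeze(0).expand(B, Q, H)], dim=-1)           # (B, Q, 2H)
        weights = F.softmax(self.gate(pairs), dim=-1)                        # (B, Q, E)
        expert_out = torch.stack([e(pairs) for e in self.experts], dim=-2)   # (B, Q, E, H)
        mixed = (weights.unsqueeze(-1) * expert_out).sum(dim=-2)             # (B, Q, H)
        answer_logits = self.answer_head(mixed)                              # (B, Q, n_likert)
        label_logits = self.label_head(mixed.flatten(1))                     # (B, n_labels)
        return answer_logits, label_logits

def multitask_loss(answer_logits, label_logits, answer_targets, label_targets, alpha=0.5):
    # answer_targets: (B, Q) Likert class indices in 0..n_likert-1; label_targets: (B,) class indices.
    ans_loss = F.cross_entropy(answer_logits.flatten(0, 1), answer_targets.flatten())
    cls_loss = F.cross_entropy(label_logits, label_targets)
    return cls_loss + alpha * ans_loss  # joint objective: classification + auxiliary answers
```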

Results & Findings

Dataset                        Baseline (SOTA)     ROME (ours)   Relative ↑
Kaggle MBTI (≈ 10k users)      71.2 % accuracy     81.9 %        +15.4 %
Reddit Big‑5 (≈ 5k users)      63.5 % F1           74.1 %        +16.8 %

  • Interpretability: The answer vectors expose which questionnaire items drove a particular personality prediction, a feature absent in black‑box text‑only models.
  • Data Efficiency: With only 5 % of the training data labeled, ROME still outperforms baselines trained on the full set, confirming the power of the auxiliary supervision.
  • Ablation: Removing the MoE routing or the answer‑prediction task drops performance by ~7 %, highlighting their complementary roles.

Practical Implications

  • Personalized UX: Developers can integrate ROME into recommendation engines, chatbots, or adaptive UI systems to infer user traits from existing interaction logs without needing explicit questionnaire responses.
  • Mental‑Health Tools: Clinicians can use the answer vectors as a first‑line screening aid, flagging users whose generated answers suggest risk factors (e.g., high neuroticism).
  • Compliance & Transparency: Because the model outputs human‑readable questionnaire scores, it satisfies emerging AI‑explainability regulations better than opaque embeddings.
  • Low‑Label Scenarios: Start‑ups with limited annotated personality data can bootstrap a high‑performing detector by fine‑tuning a generic LLM with ROME’s multi‑task setup.
  • Plug‑and‑Play: The MoE component is lightweight (≈ 2 M parameters) and can be attached to any existing transformer‑based text encoder, making migration to production straightforward (see the usage sketch below).
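As a hypothetical illustration of the plug-and-play point, the following sketch attaches the QuestionConditionedMoE head from the earlier snippet to an off-the-shelf Hugging Face encoder; the encoder choice (bert-base-uncased), the mean pooling, and the example posts are assumptions, not the paper's configuration.

```python
# Hypothetical usage sketch: reuses QuestionConditionedMoE and QUESTIONNAIRE from the
# sketches above with a generic pretrained encoder; everything here is illustrative.
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
encoder = AutoModel.from_pretrained("bert-base-uncased")
moe_head = QuestionConditionedMoE(hidden=768, n_items=len(QUESTIONNAIRE))

def encode(texts):
    """Mean-pool the encoder's last hidden state into one vector per text."""
    batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        hidden = encoder(**batch).last_hidden_state           # (N, T, 768)
    mask = batch["attention_mask"].unsqueeze(-1).float()       # (N, T, 1)
    return (hidden * mask).sum(1) / mask.sum(1)                # (N, 768)

posts = ["Had a great time at the meetup tonight!", "Quiet weekend with a good book."]
post_emb = encode([" ".join(posts)])           # (1, 768) pooled user representation
question_embs = encode(QUESTIONNAIRE)          # (Q, 768) one embedding per item
answer_logits, label_logits = moe_head(post_emb, question_embs)
predicted_label = label_logits.argmax(dim=-1)  # index into the personality label set
```

If, as the multi-task setup suggests, the LLM-generated answers are needed only as training-time supervision, serving reduces to the text encoder plus the small MoE head.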

Limitations & Future Work

  • Prompt Sensitivity: The quality of generated answers depends on prompt engineering; poorly crafted prompts can introduce bias.
  • Questionnaire Coverage: ROME currently assumes a fixed set of psychometric items; extending to other personality models (e.g., HEXACO) requires additional prompt design and modest retraining.
  • Scalability of LLM Inference: Real‑time role‑playing with large LLMs may be costly; future work could explore distilled or adapter‑based LLMs to reduce latency.
  • Cross‑Cultural Validity: The psychometric questions are primarily English‑centric; evaluating ROME on multilingual or culturally diverse corpora is an open direction.

Bottom line: ROME demonstrates that marrying LLMs’ generative strengths with classic psychological assessments yields a more data‑efficient, interpretable, and accurate personality detection pipeline—an approach that developers can adopt today to build smarter, user‑centric applications.

Authors

  • Yifan Lyu
  • Liang Zhang

Paper Information

  • arXiv ID: 2512.08814v1
  • Categories: cs.CL
  • Published: December 9, 2025
  • PDF: https://arxiv.org/pdf/2512.08814v1