[Paper] Ask, Answer, and Detect: Role-Playing LLMs for Personality Detection with Question-Conditioned Mixture-of-Experts
Source: arXiv - 2512.08814v1
Overview
The paper introduces ROME, a new framework that uses large language models (LLMs) to “role‑play” as a test‑taker on classic personality questionnaires (e.g., MBTI, Big‑5). By turning a user’s raw social‑media posts into simulated answers to validated psychometric items, ROME creates a transparent bridge between noisy text and abstract personality labels, dramatically improving detection accuracy while keeping the reasoning process interpretable for developers.
Key Contributions
- Psychology‑aware prompting: Leverages LLMs’ ability to answer questionnaire items, injecting domain‑specific knowledge directly into the model.
- Question‑conditioned Mixture‑of‑Experts (MoE): A lightweight routing module that jointly processes the original post and the generated question context, learning to predict questionnaire answers as an auxiliary task.
- Answer‑vector supervision: Converts the LLM‑generated answers into a structured “answer vector” that serves as rich intermediate supervision, alleviating the scarcity of labeled personality data.
- Multi‑task learning pipeline: Simultaneously trains on answer prediction and final personality classification, yielding a more robust end‑to‑end system.
- Strong empirical gains: Achieves up to 15.4 % relative improvement over the best prior methods on the public Kaggle MBTI dataset, with consistent gains on a second (Reddit Big‑5) benchmark.
Methodology
1. Data Preparation – Each user's collection of posts is paired with a set of psychometric questions (e.g., "I enjoy social gatherings").
2. Role‑Playing LLM – A pre‑trained LLM (e.g., GPT‑3.5) is prompted to answer each question as if it were the user, using the user's posts as context. This yields a question‑level answer (typically a Likert‑scale score).
3. Question‑Conditioned MoE –
   - The post text is encoded with a transformer encoder.
   - Each question is embedded as well (via the same LLM or a smaller encoder).
   - A gating network decides which expert (a small feed‑forward sub‑network) handles the interaction between a specific post–question pair.
4. Answer Vector Construction – All predicted answers are concatenated into a fixed‑size vector that directly mirrors the structure of the original questionnaire.
5. Multi‑Task Objective – Two losses are optimized jointly:
   - an answer‑prediction loss (supervised by the questionnaire's ground‑truth answers on a small labeled subset), and
   - a personality‑classification loss (standard cross‑entropy on MBTI/Big‑5 labels).
The auxiliary answer task forces the model to learn psychologically meaningful representations, which in turn improves the final label prediction.
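The role‑playing step can be sketched as a prompt template plus a tiny response parser. Everything here is illustrative, not the paper's actual prompt: `build_roleplay_prompt` and `parse_likert` are hypothetical helpers, and the mock reply stands in for a real call to whatever LLM client is in use.

```python
def build_roleplay_prompt(posts, item):
    """Assemble a role-play prompt asking an LLM to answer one
    questionnaire item (1-5 Likert) as if it were the post author."""
    context = "\n".join(f"- {p}" for p in posts)
    return (
        "You are role-playing as the author of these social-media posts:\n"
        f"{context}\n\n"
        f'Questionnaire item: "{item}"\n'
        "As this person, answer on a 1-5 Likert scale "
        "(1 = strongly disagree, 5 = strongly agree). "
        "Reply with a single digit."
    )

def parse_likert(reply):
    """Extract the first digit 1-5 from the LLM reply; None if absent."""
    for ch in reply:
        if ch in "12345":
            return int(ch)
    return None

posts = ["Spent the whole weekend reading alone, bliss.",
         "Big parties drain me completely."]
prompt = build_roleplay_prompt(posts, "I enjoy social gatherings")
score = parse_likert("2 (mostly disagree)")  # mock LLM reply
```

In the real pipeline, `prompt` would be sent to the LLM once per questionnaire item, and the parsed scores would feed the answer vector described below.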
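The question‑conditioned routing can be sketched in NumPy with made‑up dimensions and randomly initialized weights standing in for trained parameters (the paper's exact architecture and sizes may differ): the gate scores each expert from the concatenated post and question embeddings, and each item's predicted answer is a gate‑weighted mix of expert outputs.

```python
import numpy as np

rng = np.random.default_rng(0)
D, K, H = 16, 4, 8          # embed dim, num experts, expert hidden dim (assumed)

W_gate = rng.normal(size=(2 * D, K)) * 0.1            # gating network
experts = [(rng.normal(size=(2 * D, H)) * 0.1,        # per-expert two-layer FFN
            rng.normal(size=(H,)) * 0.1) for _ in range(K)]

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def predict_answer(post_emb, q_emb):
    """Route one (post, question) pair through the experts."""
    x = np.concatenate([post_emb, q_emb])             # question-conditioned input
    gate = softmax(x @ W_gate)                        # expert weights, sum to 1
    outs = np.array([np.tanh(x @ W1) @ W2 for W1, W2 in experts])
    return float(gate @ outs), gate

post_emb = rng.normal(size=D)
questions = [rng.normal(size=D) for _ in range(5)]    # 5 questionnaire items

# Answer vector: one predicted score per questionnaire item.
answer_vector = np.array([predict_answer(post_emb, q)[0] for q in questions])
```

Because the gate sees the question embedding, different items can be routed to different experts, which is what lets a small set of experts specialize across questionnaire dimensions.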
Results & Findings
| Dataset | Baseline (SOTA) | ROME (ours) | Relative ↑ |
|---|---|---|---|
| Kaggle MBTI (≈ 10k users) | 71.2 % accuracy | 81.9 % | 15.4 % |
| Reddit Big‑5 (≈ 5k users) | 63.5 % F1 | 74.1 % | 16.8 % |
- Interpretability: The answer vectors expose which questionnaire items drove a particular personality prediction, a feature absent in black‑box text‑only models.
- Data Efficiency: With only 5 % of the training data labeled, ROME still outperforms baselines trained on the full set, confirming the power of the auxiliary supervision.
- Ablation: Removing the MoE routing or the answer‑prediction task drops performance by ~7 %, highlighting their complementary roles.
Practical Implications
- Personalized UX: Developers can integrate ROME into recommendation engines, chatbots, or adaptive UI systems to infer user traits from existing interaction logs without needing explicit questionnaire responses.
- Mental‑Health Tools: Clinicians can use the answer vectors as a first‑line screening aid, flagging users whose generated answers suggest risk factors (e.g., high neuroticism).
- Compliance & Transparency: Because the model outputs human‑readable questionnaire scores, it satisfies emerging AI‑explainability regulations better than opaque embeddings.
- Low‑Label Scenarios: Start‑ups with limited annotated personality data can bootstrap a high‑performing detector by fine‑tuning a generic LLM with ROME’s multi‑task setup.
- Plug‑and‑Play: The MoE component is lightweight (≈ 2 M parameters) and can be attached to any existing transformer‑based text encoder, making migration to production straightforward.
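As a back‑of‑the‑envelope sanity check on the quoted size, one assumed configuration (768‑d encoder outputs, a linear gate over the concatenated post+question vector, four experts with a 256‑unit hidden layer; none of these dimensions are from the paper) lands in the same low‑millions range as the reported ≈ 2 M figure:

```python
d, num_experts, hidden = 768, 4, 256   # assumed dims, not from the paper

gate_params = (2 * d) * num_experts + num_experts          # linear gate + bias
per_expert = (2 * d) * hidden + hidden + hidden * 1 + 1    # two-layer FFN
total = gate_params + num_experts * per_expert
print(f"{total:,} parameters")
```

The point is only that a gate plus a few small feed‑forward experts is tiny next to the hundred‑million‑parameter encoders it attaches to.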
Limitations & Future Work
- Prompt Sensitivity: The quality of generated answers depends on prompt engineering; poorly crafted prompts can introduce bias.
- Questionnaire Coverage: ROME currently assumes a fixed set of psychometric items; extending to other personality models (e.g., HEXACO) requires additional prompt design and modest retraining.
- Scalability of LLM Inference: Real‑time role‑playing with large LLMs may be costly; future work could explore distilled or adapter‑based LLMs to reduce latency.
- Cross‑Cultural Validity: The psychometric questions are primarily English‑centric; evaluating ROME on multilingual or culturally diverse corpora is an open direction.
Bottom line: ROME demonstrates that marrying LLMs’ generative strengths with classic psychological assessments yields a more data‑efficient, interpretable, and accurate personality detection pipeline—an approach that developers can adopt today to build smarter, user‑centric applications.
Authors
- Yifan Lyu
- Liang Zhang
Paper Information
- arXiv ID: 2512.08814v1
- Categories: cs.CL
- Published: December 9, 2025