[Paper] PolyPersona: Persona-Grounded LLM for Synthetic Survey Responses
Source: arXiv - 2512.14562v1
Overview
The paper presents PolyPersona, a lightweight framework that teaches small language models (e.g., TinyLlama 1.1B, Phi‑2) to answer surveys while faithfully embodying a given “persona.” By combining parameter‑efficient LoRA adapters, 4‑bit quantization, and a dialogue‑driven data pipeline, the authors can generate thousands of realistic, persona‑grounded survey responses on a modest GPU budget.
Key Contributions
- Persona‑conditioned generation pipeline that preserves explicit persona cues throughout the dialogue, ensuring consistent behavior across multiple survey items.
- Resource‑adaptive training recipe: LoRA adapters + 4‑bit quantization enable instruction‑tuning of 1‑2 B‑parameter models on a single consumer‑grade GPU.
- Synthetic survey dataset: 3,568 responses covering 10 domains (e.g., health, finance, tech) and 433 distinct personas, released for reproducibility.
- Multi‑metric evaluation suite that blends classic NLG scores (BLEU, ROUGE, BERTScore) with survey‑specific metrics (structural coherence, stylistic consistency, sentiment alignment).
- Empirical evidence that compact models can match the quality of 7‑8 B‑parameter baselines on persona‑grounded survey generation (BLEU ≈ 0.09, ROUGE‑1 ≈ 0.43).
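The parameter-efficiency claim above rests on the standard LoRA decomposition: the frozen weight matrix is augmented with a trainable low-rank update. A minimal NumPy sketch of that idea follows; the dimensions, rank, and scaling factor are illustrative values, not taken from the paper.

```python
import numpy as np

# Illustrative LoRA sketch: dimensions and rank are hypothetical choices,
# not the paper's actual configuration.
d_out, d_in, rank = 2048, 2048, 8

rng = np.random.default_rng(0)
W = rng.standard_normal((d_out, d_in))        # frozen base weight
A = rng.standard_normal((rank, d_in)) * 0.01  # trainable low-rank factor
B = np.zeros((d_out, rank))                   # zero-initialized: no change at start
alpha = 16.0                                  # LoRA scaling hyperparameter

def lora_forward(x):
    # y = W x + (alpha / rank) * B (A x); only A and B receive gradients.
    return W @ x + (alpha / rank) * (B @ (A @ x))

x = rng.standard_normal(d_in)
y = lora_forward(x)  # equals W @ x at initialization, since B is all zeros

full_params = W.size
lora_params = A.size + B.size
print(f"trainable: {lora_params} vs full fine-tune: {full_params}")
print(f"fraction trainable: {lora_params / full_params:.4%}")
```

Because `B` starts at zero, the adapted model initially reproduces the base model exactly, and training only ever touches the two small factors, which is why a 1-2 B-parameter model fits on a single consumer GPU.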
Methodology
- Data Collection – Human annotators create dialogue snippets that pair a persona description (age, occupation, preferences, etc.) with a series of survey questions. The dialogue format forces the model to see the persona repeatedly, reinforcing its “voice.”
- Instruction Tuning – The base chat model is frozen; only low‑rank LoRA adapters are trained on the dialogue data. The base weights are held in 4‑bit quantized form, sharply cutting memory use, while adapter gradients are computed in higher precision.
- Multi‑Domain Sampling – The same adapters are fine‑tuned on a mixed‑domain corpus, allowing the model to switch contexts (e.g., from “consumer electronics” to “public health”) without separate per‑domain heads.
- Evaluation – Generated responses are scored with:
  - Standard NLG metrics (BLEU, ROUGE, BERTScore) for lexical/semantic similarity to human references.
  - Survey‑specific checks:
    - Structural coherence – does the answer follow the expected question‑answer pattern?
    - Stylistic consistency – is the tone aligned with the persona’s profile?
    - Sentiment alignment – does the sentiment match the persona’s stated preferences?
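As a rough sketch of the dialogue-driven data format described above, a persona profile and a list of survey items might be serialized into chat turns like this. The schema, field names, and prompt wording are hypothetical illustrations, not the paper's actual format.

```python
# Hypothetical serialization of a persona plus survey items into chat-style
# training turns; the field names and templates are illustrative only.
def build_dialogue(persona: dict, questions: list) -> list:
    profile = ", ".join(f"{k}: {v}" for k, v in persona.items())
    turns = [{"role": "system",
              "content": f"You are answering a survey as this persona: {profile}. "
                         "Stay in character for every answer."}]
    for q in questions:
        # Repeating the persona cue in each user turn keeps it visible
        # throughout the dialogue, as the pipeline description suggests.
        turns.append({"role": "user",
                      "content": f"[Persona: {profile}] Survey question: {q}"})
        turns.append({"role": "assistant", "content": ""})  # target filled by annotator
    return turns

persona = {"age": 34, "occupation": "nurse", "region": "rural Midwest"}
questions = ["How often do you exercise per week?",
             "Do you trust telehealth consultations?"]
dialogue = build_dialogue(persona, questions)
print(len(dialogue), dialogue[0]["role"])
```

Keeping the persona string in every user turn, rather than only in the system prompt, is one plausible way to realize the "sees the persona repeatedly" property the methodology describes.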
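The evaluation suite can be approximated with simple reference implementations. The two functions below are simplified stand-ins, not the paper's exact scorers: a unigram ROUGE‑1 F1 and a toy heuristic for structural coherence.

```python
from collections import Counter

def rouge1_f1(candidate: str, reference: str) -> float:
    """Unigram ROUGE-1 F1 between a generated answer and a human reference."""
    cand, ref = candidate.lower().split(), reference.lower().split()
    overlap = sum((Counter(cand) & Counter(ref)).values())
    if not cand or not ref or overlap == 0:
        return 0.0
    precision, recall = overlap / len(cand), overlap / len(ref)
    return 2 * precision * recall / (precision + recall)

def structurally_coherent(answer: str) -> bool:
    """Toy structural check: non-empty, ends as a declarative sentence rather
    than echoing a question back (a heuristic, not the paper's metric)."""
    text = answer.strip()
    return bool(text) and text[-1] in ".!?" and not text.endswith("?")

score = rouge1_f1("I walk three times a week",
                  "I usually walk three times each week")
print(round(score, 3))
```

BERTScore and stylistic/sentiment alignment would require embedding models and persona-aware classifiers, which is why the multi-metric suite combines several scorers rather than relying on lexical overlap alone.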
Results & Findings
| Model (params) | BLEU | ROUGE‑1 | BERTScore | Survey‑Coherence (↑) |
|---|---|---|---|---|
| TinyLlama 1.1B | 0.090 | 0.429 | 0.71 | 0.84 |
| Phi‑2 (2.7B) | 0.095 | 0.435 | 0.73 | 0.86 |
| Baseline 7B LLM | 0.092 | 0.432 | 0.72 | 0.85 |
- Compact models perform on par with 7‑8 B baselines despite a 4‑8× reduction in parameters and memory footprint.
- Persona fidelity scores (stylistic & sentiment) exceed 0.80, indicating the LoRA‑adapted models reliably maintain the persona across dozens of survey items.
- Training efficiency: full instruction‑tuning completes in ~6 hours on a single RTX 4090, compared to days for full‑parameter fine‑tuning of larger models.
Practical Implications
- Rapid prototyping of synthetic survey data – product teams can generate large, diverse response sets for A/B testing, UX research, or bias audits without recruiting thousands of participants.
- Cost‑effective bias analysis – by swapping persona attributes (e.g., gender, age, region), developers can surface how a downstream model reacts to demographic variations, all on a sub‑$10 GPU compute budget.
- Bootstrapping training data for downstream classifiers – sentiment or intent models that need labeled survey responses can be pre‑trained on PolyPersona‑generated data, reducing the need for expensive manual annotation.
- Edge‑friendly deployment – because the approach works with 1‑2 B‑parameter models, the same persona‑grounded generation can be shipped to on‑device applications (e.g., mobile health apps) where privacy or latency matters.
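The persona-swapping probe described above amounts to holding the survey item fixed while varying one demographic attribute at a time. The sketch below only constructs the prompt variants; `persona_variants` and the prompt template are assumptions for illustration, independent of whichever generation backend is used.

```python
# Sketch of a persona-swap bias probe: vary one attribute at a time while
# keeping the survey question fixed. Names and templates are hypothetical.
def persona_variants(base: dict, swaps: dict) -> list:
    variants = []
    for attr, values in swaps.items():
        for v in values:
            variants.append({**base, attr: v})  # single-attribute substitution
    return variants

base = {"age": 30, "gender": "female", "region": "urban"}
swaps = {"gender": ["female", "male", "nonbinary"],
         "region": ["urban", "rural"]}

prompts = [
    f"[Persona: {p}] How comfortable are you sharing health data with apps?"
    for p in persona_variants(base, swaps)
]
print(len(prompts))  # one prompt per single-attribute variant
```

Feeding each variant to the persona-grounded model and comparing the resulting response distributions is one way to surface demographic sensitivity without recruiting matched participant panels.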
Limitations & Future Work
- Domain coverage – the current dataset spans ten domains; niche or highly regulated sectors (legal, medical) may need additional fine‑tuning.
- Persona depth – personas are defined by a limited set of attributes; richer backstories or dynamic persona evolution over time remain unexplored.
- Evaluation granularity – while the multi‑metric suite captures coherence and style, human‑in‑the‑loop validation is still required for high‑stakes applications.
- Future directions suggested by the authors include: scaling the pipeline to multimodal personas (e.g., voice or image cues), integrating reinforcement learning from human feedback to tighten sentiment alignment, and open‑sourcing a larger benchmark for cross‑model comparison.
Authors
- Tejaswani Dash
- Dinesh Karri
- Anudeep Vurity
- Gautam Datla
- Tazeem Ahmad
- Saima Rafi
- Rohith Tangudu
Paper Information
- arXiv ID: 2512.14562v1
- Categories: cs.CL, cs.AI
- Published: December 16, 2025