[Paper] Prompting Science Report 4: Playing Pretend: Expert Personas Don't Improve Factual Accuracy

Published: December 5, 2025, 11:35 AM EST
3 min read
Source: arXiv (2512.05858v1)

Overview

The fourth Prompting Science Report investigates a seemingly intuitive trick: giving large language models (LLMs) a “persona” (e.g., “you are a physics expert”) before asking them tough, graduate‑level multiple‑choice questions. Across six popular models and two high‑stakes benchmarks (GPQA‑Diamond and MMLU‑Pro), the authors find that expert personas do not boost factual accuracy, and low‑knowledge personas (layperson, child, toddler) actually hurt performance.

Key Contributions

  • Systematic persona evaluation: Tested three persona strategies—in‑domain expert, off‑domain expert, and low‑knowledge—on six state‑of‑the‑art LLMs.
  • Robust benchmark selection: Used GPQA‑Diamond (hard science questions) and MMLU‑Pro (broad graduate‑level topics) to ensure results generalize across domains.
  • Empirical finding: Expert personas provide no consistent accuracy gain; only Gemini 2.0 Flash showed a modest improvement.
  • Negative impact of low‑knowledge personas: Assigning “layperson” or “toddler” prompts reliably reduced scores.
  • Clear guidance for practitioners: Demonstrates that persona prompting is not a shortcut for improving factual correctness.

Methodology

  1. Models – Six widely used LLMs (including Gemini 2.0 Flash and models from the GPT, Claude, and Llama families) were accessed via their standard APIs.
  2. Benchmarks
    • GPQA‑Diamond: 198 graduate‑level, “Google‑proof” multiple‑choice science questions (biology, chemistry, and physics) that are difficult even for skilled non‑experts.
    • MMLU‑Pro: A harder, reasoning‑focused extension of the Massive Multitask Language Understanding benchmark, with ten answer options per question and graduate‑level coverage of fields such as science, engineering, and law.
  3. Prompt designs – For each question, three persona prompt families were generated, plus a no‑persona baseline (a minimal sketch appears after this list):
    • In‑Domain Expert: “You are a physics expert. Answer the following question …” (matched to the question’s field).
    • Off‑Domain Expert: The same expert framing, but mismatched to the question (e.g., a physics expert for a law question).
    • Low‑Knowledge: “You are a layperson/young child/toddler. Answer …”.
    • No‑Persona Baseline: The plain question with no persona prefix, run for comparison.
  4. Evaluation – Accuracy was measured by exact match to the correct multiple‑choice option. Statistical significance was assessed with paired t‑tests across the full test sets.
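
To make the prompt conditions concrete, here is a minimal sketch of how such persona prefixes could be assembled. The template wording, the condition names, and the build_prompt helper are illustrative assumptions, not the authors' exact prompts.

```python
# Illustrative persona templates for the four prompt conditions described above.
# Wording and names are assumptions for illustration, not the paper's prompts.

PERSONA_TEMPLATES = {
    "baseline": "{question}",
    "in_domain_expert": "You are a {field} expert. Answer the following question.\n{question}",
    "off_domain_expert": "You are a {field} expert. Answer the following question.\n{question}",
    "low_knowledge": "You are a {role}. Answer the following question.\n{question}",
}

LOW_KNOWLEDGE_ROLES = ["layperson", "young child", "toddler"]


def build_prompt(condition: str, question: str, field: str = "", role: str = "layperson") -> str:
    """Fill the chosen template; for 'off_domain_expert', pass a field mismatched to the question."""
    return PERSONA_TEMPLATES[condition].format(question=question, field=field, role=role)


# Example: an in-domain expert prompt for a physics question.
print(build_prompt(
    "in_domain_expert",
    "Which quantity is conserved in a perfectly elastic collision? (A) ... (D) ...",
    field="physics",
))
```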

The pipeline was fully automated, ensuring reproducibility and eliminating human bias in answer selection.
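
As a rough illustration of the scoring step described above, the sketch below computes per‑question exact‑match scores for one persona condition and the baseline, then compares them with a paired t‑test via scipy.stats.ttest_rel. The function names and the 0.05 significance threshold are assumptions; this is not the authors' harness.

```python
from scipy.stats import ttest_rel  # paired t-test, as named in the evaluation step


def exact_match(pred: str, gold: str) -> int:
    """1 if the predicted option letter equals the gold answer, else 0."""
    return int(pred.strip().upper() == gold.strip().upper())


def compare_to_baseline(persona_preds, baseline_preds, gold_answers, alpha=0.05):
    """Accuracy under both conditions plus a paired t-test over per-question scores.

    All three lists are aligned by question index. The 0.05 threshold is an
    assumption for illustration; the summary does not state the paper's alpha.
    """
    persona_scores = [exact_match(p, g) for p, g in zip(persona_preds, gold_answers)]
    baseline_scores = [exact_match(p, g) for p, g in zip(baseline_preds, gold_answers)]

    result = ttest_rel(persona_scores, baseline_scores)
    return {
        "accuracy_persona": sum(persona_scores) / len(persona_scores),
        "accuracy_baseline": sum(baseline_scores) / len(baseline_scores),
        "t_statistic": result.statistic,
        "p_value": result.pvalue,
        "significant": result.pvalue < alpha,
    }
```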

Results & Findings

Across models, the general trend for each persona type (and any notable exception) was:

  • In‑Domain Expert – No significant accuracy lift vs. the no‑persona baseline. Notable exception: Gemini 2.0 Flash (roughly +2 percentage points).
  • Off‑Domain Expert – Neutral to slightly negative impact; sometimes a small drop. No notable exception.
  • Low‑Knowledge – Consistently lower accuracy (−3 % to −7 % on average).

Key takeaways

  • The “expert” cue does not make the model retrieve more correct facts.
  • Mismatched expertise can even confuse the model, leading to marginally worse answers.
  • Pretending to be a child or layperson degrades performance, likely because the model adopts a less precise reasoning style.

Practical Implications

  • No accuracy shortcut: Developers should not rely on persona prefixes to improve factual correctness in high-stakes QA or decision-support systems.
  • Use personas for style, not substance: If the goal is to adjust tone, formality, or audience framing, persona prompts remain useful—but they won’t replace rigorous retrieval or chain‑of‑thought prompting for accuracy.
  • Model selection matters: The modest gain for Gemini 2.0 Flash suggests that some models may be more “persona‑sensitive.” Teams should test on their target model before adopting persona tricks.
  • Testing pipelines: The study’s automated benchmark harness can be repurposed to evaluate other prompt tricks (e.g., “think step‑by‑step”, “cite sources”) across multiple models, as sketched below.
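
For instance, with a template dictionary like the one sketched under Methodology, a new prompt trick is just another entry fed through the same scoring loop. The wording below is a placeholder, not a tested prompt.

```python
# Hypothetical extra conditions added to the PERSONA_TEMPLATES dict from the earlier sketch.
PERSONA_TEMPLATES["step_by_step"] = (
    "Think step by step, then answer the following question.\n{question}"
)
PERSONA_TEMPLATES["cite_sources"] = (
    "Cite the sources you rely on, then answer the following question.\n{question}"
)
# Each new condition is scored with the same exact-match / paired t-test routine
# (compare_to_baseline) against the plain no-persona baseline.
```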

Limitations & Future Work

  • Scope of models: Only six models were examined; newer or open‑source LLMs could behave differently.
  • Single‑turn prompts: The study used a one‑shot question format. Multi‑turn dialogues or retrieval‑augmented pipelines might interact with personas in unforeseen ways.
  • Accuracy‑only metric: The authors measured exact‑match correctness; they did not assess answer confidence, calibration, or downstream utility.
  • Potential domain‑specific benefits: While no overall gain was seen, niche domains (e.g., medical diagnostics) might still benefit from carefully crafted expert personas combined with external knowledge bases.

Future research could explore persona‑aware retrieval, dynamic persona switching, or fine‑tuning models with persona‑labeled data to see whether deeper integration—not just a prompt prefix—can meaningfully improve factual performance.

Authors

  • Savir Basil
  • Ina Shapiro
  • Dan Shapiro
  • Ethan Mollick
  • Lilach Mollick
  • Lennart Meincke

Paper Information

  • arXiv ID: 2512.05858v1
  • Categories: cs.CL
  • Published: December 5, 2025