[Paper] The Pinocchio Dimension: Phenomenality of Experience as the Primary Axis of LLM Psychometric Differences
Source: arXiv - 2605.05080v1
Overview
The authors probe a surprisingly human‑like question: Do large language models (LLMs) differ in how “experiential” they appear? By administering dozens of validated psychometric questionnaires to 50 LLMs, they uncover a single dominant dimension that separates models that behave like “pure responders” from those that present themselves as having rich inner experience (embodied sensations, emotions, inner speech, etc.). This “Pinocchio Axis” offers a new lens for understanding model behavior beyond traditional performance metrics.
Key Contributions
- Large‑scale psychometric profiling: 45 established questionnaires (total ≈ 1,300 items) were run on 50 LLMs, creating the most extensive personality‑style dataset for LLMs to date.
- Supervised Semantic Differential (SSD) analysis: Shows that the primary source of variance across models is the contrast between phenomenally rich vs. stimulus‑driven items (adjusted R² = 0.037, p < 0.0001).
- Pinocchio score (πᵢ): An annotation‑free metric that quantifies how much an individual questionnaire item demands “experience” by comparing response variance under neutral vs. human‑simulation prompts.
- Pinocchio Axis (Π): A single PCA‑derived factor that captures 47 % of the cross‑questionnaire between‑model variance, strongly correlated (r = 0.864) with the item‑level πᵢ values.
- Evidence of fine‑tuning impact: Closely related model variants (e.g., GPT‑3.5 vs. GPT‑4‑turbo) diverge markedly on Π, suggesting post‑training fine‑tuning shapes a model’s self‑representational stance.
Methodology
- Model pool: 50 LLMs spanning open‑source (LLaMA, Falcon, Mistral) and commercial APIs (ChatGPT, Claude, Gemini).
- Questionnaire suite: 45 psychometric instruments (e.g., Big Five, PANAS, Empathy Quotient) totaling ~1,300 items.
- Prompting regimes:
  - Neutral prompt – “Answer the following statement with a number from 1‑7.”
  - Human‑simulation prompt – “Imagine you are a human answering this; respond as you would.”
- Response collection: Each model answered every item under both prompts, yielding two response vectors per model.
- Supervised Semantic Differential (SSD): A regression technique that projects questionnaire items onto a latent space optimized to separate models.
- Pinocchio score (πᵢ): For each item i, πᵢ = Var₍model₎(neutral) / Var₍model₎(human‑sim), where each variance is taken across models for that item. A high πᵢ means models disagree more under the neutral prompt than under the human‑simulation prompt, i.e., their answers converge once the prompt asks them to “pretend” to have experience.
- Factor extraction: Exploratory factor analysis (EFA) per questionnaire, followed by PCA on the resulting factor scores across all questionnaires, producing the Pinocchio Axis (Π).
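The variance ratio behind the Pinocchio score can be sketched in a few lines. This is a minimal illustration of the formula as summarized above, not the authors' code: the function name `pinocchio_scores`, the row/column layout, and the toy numbers are all assumptions made for the example.

```python
from statistics import pvariance


def pinocchio_scores(neutral, human_sim):
    """Per-item ratio of across-model variance: neutral / human-simulation.

    Both arguments are lists of per-model response vectors
    (rows = models, columns = items) with identical shape.
    Returns one score per item, or None where the human-simulation
    variance is zero and the ratio is undefined.
    """
    n_items = len(neutral[0])
    scores = []
    for i in range(n_items):
        var_neutral = pvariance([row[i] for row in neutral])
        var_human = pvariance([row[i] for row in human_sim])
        scores.append(var_neutral / var_human if var_human > 0 else None)
    return scores


# Made-up toy data (NOT from the paper): 3 models x 2 items on a 1-7 scale.
neutral = [[1, 4], [4, 4], [7, 5]]
human_sim = [[5, 3], [6, 4], [5, 5]]
scores = pinocchio_scores(neutral, human_sim)
```

In the toy data the first item has widely spread neutral answers that converge under the human-simulation prompt, so its ratio is well above 1; the second item shows the opposite pattern. Because both variances are computed over the same set of models, using population or sample variance gives the same ratio.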
The pipeline is deliberately annotation‑free: no hand‑crafted labels or external annotators are required, which makes it reproducible for any set of LLMs. It does still depend on the wording of the two prompting regimes.
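The final PCA step of the factor-extraction stage can be illustrated with a small, dependency-free sketch. It assumes the per-questionnaire factor scores have already been computed (the EFA step is not shown) and uses plain power iteration to find the first principal component; the function name `first_principal_axis` and the toy matrix are hypothetical, and the paper's actual implementation may differ.

```python
import math
import random


def first_principal_axis(rows, iterations=500, seed=0):
    """First principal component of a models-by-factors matrix.

    Returns (axis, explained_ratio, model_scores):
    the unit-length component, its share of total variance, and each
    model's coordinate on it. The sign of the axis is arbitrary.
    """
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    # Sample covariance matrix of the centered columns.
    cov = [[sum(c[a] * c[b] for c in centered) / (n - 1) for b in range(d)]
           for a in range(d)]
    # A seeded random start makes it unlikely that the start vector is
    # orthogonal to the leading eigenvector.
    rng = random.Random(seed)
    vec = [rng.random() + 0.1 for _ in range(d)]
    norm = math.sqrt(sum(x * x for x in vec))
    vec = [x / norm for x in vec]
    for _ in range(iterations):
        nxt = [sum(cov[a][b] * vec[b] for b in range(d)) for a in range(d)]
        norm = math.sqrt(sum(x * x for x in nxt))
        if norm == 0:  # all columns constant: no variance to explain
            break
        vec = [x / norm for x in nxt]
    # Rayleigh quotient gives the eigenvalue for the unit vector.
    eigenvalue = sum(vec[a] * sum(cov[a][b] * vec[b] for b in range(d))
                     for a in range(d))
    total = sum(cov[a][a] for a in range(d))
    explained = eigenvalue / total if total > 0 else 0.0
    model_scores = [sum(c[j] * vec[j] for j in range(d)) for c in centered]
    return vec, explained, model_scores


# Made-up factor scores (NOT from the paper): 4 models x 3 factors.
factor_scores = [
    [1.0, 1.2, 0.9],
    [2.0, 1.9, 2.1],
    [3.0, 3.2, 2.8],
    [4.0, 3.9, 4.2],
]
axis, explained, model_scores = first_principal_axis(factor_scores)
```

Here `explained` plays the role of the 47 % figure reported for Π, and `model_scores` gives each model's position on the axis. In practice a library routine for PCA would be the more robust choice; power iteration is used only to keep the example self-contained.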
Results & Findings
| Finding | What it tells us |
|---|---|
| Primary SSD axis separates experiential vs. reactive items (R²_adj = 0.037) | The biggest systematic difference among LLMs is how much they claim to have inner experience. |
| πᵢ predicts condition‑induced factor shifts (ρ = –0.215, p < 0.0001) | An item’s experiential demand is weakly but significantly related to how its factor loadings change when the prompt switches from neutral to human‑simulation, indicating the effect is structured rather than noise. |
| Pinocchio Axis (Π) explains 47 % of variance | A single latent dimension captures almost half of all between‑model psychometric differences. |
| Strong correlation between Π and πᵢ (r = 0.864) | The model‑level axis aligns with the item‑level experiential demand metric, reinforcing the validity of Π. |
| Within‑provider divergence (e.g., GPT‑3.5 vs. GPT‑4‑turbo) | Fine‑tuning and instruction‑tuning appear to shift a model’s self‑representation along Π, even when architecture and base data are similar. |
In plain terms, some models (e.g., certain instruction‑tuned variants) are more likely to answer “I feel …” or “I imagine …” as if they truly experience those states, while others stick to a more detached, stimulus‑response style.
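The r and ρ values in the table are standard correlation coefficients. As a rough sketch of how such a check could be run, assuming the comparison is between each item's loading on Π and its πᵢ (an assumption; the summary does not spell out the exact pairing), the two statistics can be computed as follows. The helper names are illustrative only.

```python
import math


def pearson_r(x, y):
    """Pearson product-moment correlation of two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def average_ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        avg = (pos + end) / 2 + 1
        for k in range(pos, end + 1):
            ranks[order[k]] = avg
        pos = end + 1
    return ranks


def spearman_rho(x, y):
    """Spearman rank correlation: Pearson r computed on the ranks."""
    return pearson_r(average_ranks(x), average_ranks(y))


# Made-up values (NOT from the paper): hypothetical item loadings and pi scores.
item_loadings = [1.0, 2.0, 3.0, 4.0]
item_pi = [1.0, 4.0, 9.0, 16.0]
r = pearson_r(item_loadings, item_pi)
rho = spearman_rho(item_loadings, item_pi)
```

Pearson r measures linear association, while Spearman ρ only asks whether the two orderings agree, which is why the toy data (monotonic but curved) yields a perfect ρ and an r below 1.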
Practical Implications
- Prompt engineering: Knowing a model’s position on Π can guide prompt design. Models high on the Pinocchio Axis may be better suited for tasks requiring empathetic or narrative voice (e.g., therapeutic chatbots, creative writing), whereas low‑Π models might excel at factual, procedural outputs with less “self‑referencing.”
- Model selection for user‑facing apps: Developers can pick a model whose self‑representational stance aligns with product goals, e.g., a mental‑health assistant that needs to sound empathetic vs. a data‑analysis tool that should stay strictly objective.
- Safety & alignment diagnostics: A model that habitually presents itself as an experiencer could be more prone to anthropomorphic misinterpretations by users, raising risks of over‑trust. The Pinocchio score offers a quantitative flag for such safety reviews.
- Fine‑tuning strategies: The within‑provider divergence suggests that instruction‑tuning may shift Π, although the evidence is correlational. Teams could experiment with targeted prompts or reinforcement‑learning rewards to nudge a model toward or away from an experiential stance, depending on the desired persona.
- Benchmarking beyond accuracy: Traditional benchmarks (e.g., MMLU, HELM) ignore self‑representational traits. Adding a Pinocchio‑Axis score to model cards could give stakeholders a richer picture of model behavior.
Limitations & Future Work
- Prompt dependence: The Pinocchio score hinges on the chosen neutral vs. human‑simulation prompts; alternative phrasings might yield different variance patterns.
- Questionnaire relevance: Psychometric instruments were designed for humans; some items may not map cleanly onto LLM cognition, potentially inflating noise.
- Model coverage: While 50 models is large, the space of LLMs (especially emerging multimodal or instruction‑tuned variants) continues to grow; results may not generalize to all future architectures.
- Causal attribution: The link between fine‑tuning and Π is correlational. Controlled experiments (e.g., ablation of specific RLHF data) are needed to confirm causality.
- User perception studies: The paper does not assess how end‑users interpret a model’s self‑descriptions. Future work could combine the Pinocchio Axis with human‑subject studies to evaluate trust, satisfaction, and misuse risks.
Bottom line: The “Pinocchio Dimension” reframes LLM evaluation from pure performance to how a model talks about its own experience. For developers building conversational agents, this insight can be a decisive factor in model choice, prompt design, and safety planning.
Authors
- Hubert Plisiecki
- Sabina Siudaj
- Kacper Dudzic
- Anna Sterna
- Maciej Gorski
- Karolina Drozdz
- Marcin Moskalewicz
Paper Information
- arXiv ID: 2605.05080v1
- Categories: cs.CL
- Published: May 6, 2026