[Paper] Martingale Score: An Unsupervised Metric for Bayesian Rationality in LLM Reasoning

Published: December 2, 2025 at 11:34 AM EST

Source: arXiv - 2512.02914v1

Overview

The paper introduces Martingale Score, an unsupervised metric that gauges how faithfully large language models (LLMs) update their beliefs during multi‑step reasoning. By borrowing the martingale property from Bayesian statistics, the authors expose a systematic tendency of LLMs to become more entrenched in their initial guesses rather than genuinely revising them in light of new evidence.

Key Contributions

  • Martingale Score: A regression‑based, unsupervised metric that quantifies violations of the Bayesian martingale property in LLM belief updates.
  • Empirical Survey: Large‑scale evaluation across three open‑ended domains (event forecasting, value‑laden questions, academic paper review) showing that belief entrenchment is pervasive across model families and prompting techniques.
  • Model & Technique Diagnosis: Identification of which model sizes, prompting styles (e.g., chain‑of‑thought, self‑consistency), and problem domains are most prone to entrenchment.
  • Ground‑Truth Correlation: Demonstration that higher Martingale Scores (i.e., larger violations) predict lower accuracy on tasks where gold labels exist, validating the metric as a proxy for truth‑seeking ability.
  • Open‑Source Toolkit: Release of code and evaluation scripts, enabling practitioners to compute Martingale Scores on their own LLM pipelines.

Methodology

  1. Belief Representation – For each reasoning step, the LLM outputs a probability distribution (or a confidence score) over possible answers. This is treated as the model’s belief.
  2. Martingale Property – In a rational Bayesian updater, the expected future belief, conditioned on the current belief, equals the current belief. In other words, the current belief should not systematically predict the direction of the next update.
  3. Score Computation – The authors fit a simple linear regression that predicts the next‑step belief update (the change from the current belief to the next belief) from the current belief, pooled across many reasoning trajectories. The slope measures predictability: a slope ≈ 0 indicates martingale behavior, while a positive slope signals entrenchment. The absolute deviation of the slope from zero, normalized across tasks, is the Martingale Score (a minimal sketch follows this list).
  4. Evaluation Protocol – They run multiple LLMs (GPT‑3.5, GPT‑4, LLaMA variants) on three benchmark suites, collecting belief trajectories under different prompting strategies (zero‑shot, chain‑of‑thought, self‑consistency).
  5. Validation – For tasks with known answers (e.g., forecasting questions with later‑revealed outcomes), they correlate Martingale Scores with actual accuracy to test predictive power.
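
To make the score computation in step 3 concrete, here is a minimal sketch assuming each belief is a scalar confidence in [0, 1]. The helper name `martingale_score`, the pooling of all trajectories into a single regression, and the omission of the paper's cross‑task normalization are simplifications for illustration, not the authors' released implementation.

```python
import numpy as np

def martingale_score(trajectories):
    """Estimate a Martingale-style score from belief trajectories.

    trajectories: list of lists, each inner list holding the model's
    confidence (a probability in [0, 1]) at successive reasoning steps.
    Returns the absolute slope of a regression of the belief update
    (next belief minus current belief) on the current belief; under the
    martingale property this slope should be ~0. The paper's per-task
    normalization is omitted here for brevity.
    """
    current, update = [], []
    for traj in trajectories:
        for b_t, b_next in zip(traj[:-1], traj[1:]):
            current.append(b_t)
            update.append(b_next - b_t)
    current, update = np.asarray(current), np.asarray(update)
    # Ordinary least squares with an intercept: update = intercept + slope * current
    slope, _intercept = np.polyfit(current, update, deg=1)
    return abs(slope)

# Toy example: beliefs that drift further in the direction of the initial
# guess (entrenchment), which yields a clearly nonzero slope.
trajs = [[0.6, 0.65, 0.72, 0.80], [0.3, 0.27, 0.20, 0.15], [0.5, 0.50, 0.52, 0.50]]
print(f"Martingale Score ≈ {martingale_score(trajs):.3f}")
```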

Results & Findings

| Model / Prompt | Average Martingale Score | Correlation with Accuracy |
| --- | --- | --- |
| GPT‑4 (CoT) | 0.12 | −0.48 |
| GPT‑3.5 (Zero‑shot) | 0.21 | −0.62 |
| LLaMA‑13B (Self‑Consistency) | 0.34 | −0.71 |
  • Widespread Entrenchment: Across all setups, the current belief positively predicts the next belief update, indicating that models tend to double down on early guesses.
  • Prompt Sensitivity: Chain‑of‑thought (CoT) prompting reduces entrenchment compared to plain zero‑shot, but does not eliminate it. Self‑consistency sometimes amplifies the effect for smaller models.
  • Domain Differences: Value‑laden questions (e.g., ethical dilemmas) show the highest scores, while factual forecasting tasks are slightly less prone.
  • Predictive Validity: Higher Martingale Scores consistently align with lower downstream accuracy, confirming the metric’s usefulness as an unsupervised quality indicator.

Practical Implications

  • Debugging Reasoning Pipelines: Developers can run the Martingale Score on any multi‑step LLM workflow (e.g., tool‑use agents, iterative summarization) to spot when the model is “stuck” on an early hypothesis.
  • Prompt Engineering: The metric offers a quantitative way to compare prompting strategies; lower scores suggest more truth‑seeking behavior, guiding the design of better CoT or verification prompts.
  • Model Selection: When choosing a backbone for reasoning‑heavy applications (e.g., legal analysis, scientific literature review), Martingale Scores can serve as a model‑agnostic benchmark, especially when labeled data are scarce.
  • Safety & Alignment: Entrenched beliefs are a red flag for confirmation bias, which can amplify misinformation. Integrating Martingale‑based monitoring into LLM‑driven assistants could trigger fallback mechanisms (e.g., external fact‑checking) before the system confidently commits to a wrong answer.
  • Continuous Evaluation: Because the metric is unsupervised, it can be computed on the fly during production runs, enabling real‑time health checks without ground‑truth labels (see the monitoring sketch after this list).
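
As a rough illustration of the monitoring and fallback ideas above, the sketch below reuses the `martingale_score` helper from the earlier snippet. The `llm_step` callable, the `max_steps` default, and the 0.2 threshold are assumptions made for the example, not values from the paper.

```python
# Illustrative production-time health check; reuses martingale_score defined above.
ENTRENCHMENT_THRESHOLD = 0.2  # tune per task; not a value reported in the paper

def monitored_reasoning(llm_step, question, max_steps=6):
    """Run a multi-step reasoning loop while tracking per-step confidence.

    llm_step(question, history) -> (answer, confidence in [0, 1])
    Returns (final answer, score, needs_review flag); a True flag could
    trigger a fallback such as external fact-checking.
    """
    history, beliefs = [], []
    answer = None
    for _ in range(max_steps):
        answer, confidence = llm_step(question, history)
        history.append(answer)
        beliefs.append(confidence)
    # Need at least three confidence values (two updates) to fit a slope.
    score = martingale_score([beliefs]) if len(beliefs) >= 3 else 0.0
    needs_review = score > ENTRENCHMENT_THRESHOLD
    return answer, score, needs_review
```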

Limitations & Future Work

  • Score Sensitivity to Calibration: The metric assumes that the model’s confidence scores are well‑calibrated; miscalibrated probabilities could inflate or deflate the Martingale Score.
  • Scope of Tasks: The study focuses on open‑ended reasoning; it remains unclear how the metric behaves on tightly constrained tasks (e.g., code generation) where belief updates are less explicit.
  • Causal Interpretation: While a high score correlates with lower accuracy, the causal link between entrenchment and error is not fully established.
  • Future Directions: The authors suggest extending the framework to multi‑modal models, exploring interventions (e.g., stochastic belief perturbations) to break entrenchment, and integrating the score into reinforcement‑learning‑from‑human‑feedback loops for better alignment.

Authors

  • Zhonghao He
  • Tianyi Qiu
  • Hirokazu Shirado
  • Maarten Sap

Paper Information

  • arXiv ID: 2512.02914v1
  • Categories: cs.AI, cs.CL, cs.LG
  • Published: December 2, 2025