[Paper] How reliable are LLMs when it comes to playing dice?
Source: arXiv - 2606.07515v1
Overview
We investigate the probabilistic reasoning capabilities of large language models through a controlled benchmarking study on discrete probability problems. We constructed two datasets: a set of standard exercises and a set of counterintuitive exercises designed to trigger heuristic reasoning. We evaluated 8 state-of-the-art models, each tested with and without Chain-of-Thought prompting. Models achieve an average accuracy of 0.96 on standard problems but only 0.59 on counterintuitive ones. We further provide empirical evidence of token bias: performance drops by over 20% when canonical formulations are replaced by disguised variants. Embedding misleading suggestions in the prompt reduces performance by up to 34%, with no model proving immune. Taken together, these findings suggest that current LLMs are not yet genuine probabilistic reasoners, despite their success on advanced mathematical problems.
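To make "counterintuitive" concrete, here is a minimal sketch of one classic conditional-probability dice exercise, solved by exhaustive enumeration. The exercise and the helper name `conditional_probability` are illustrative assumptions and are not taken from the paper's datasets.

```python
from fractions import Fraction
from itertools import product


def conditional_probability(event, given, faces=6, dice=2):
    """Exact P(event | given) for fair dice, by enumerating every outcome."""
    outcomes = list(product(range(1, faces + 1), repeat=dice))
    conditioned = [roll for roll in outcomes if given(roll)]
    if not conditioned:
        raise ValueError("conditioning event has probability zero")
    favourable = [roll for roll in conditioned if event(roll)]
    return Fraction(len(favourable), len(conditioned))


def both_six(roll):
    return all(d == 6 for d in roll)


def at_least_one_six(roll):
    return any(d == 6 for d in roll)


def first_is_six(roll):
    return roll[0] == 6


# "At least one die shows a six": 11 of the 36 outcomes qualify, only one is (6, 6).
print(conditional_probability(both_six, at_least_one_six))  # 1/11
# "The first die shows a six": 6 outcomes qualify, so the answer differs.
print(conditional_probability(both_six, first_is_six))  # 1/6
```

The two conditioning statements sound nearly identical, yet they give different answers. A heuristic reading tends to return 1/6 for both.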
Key Contributions
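According to the abstract, the paper contributes:
- Two datasets of discrete probability problems: standard exercises and counterintuitive exercises designed to trigger heuristic reasoning.
- Empirical evidence of token bias, observed when canonical formulations are replaced by disguised variants.
- A measurement of how misleading suggestions embedded in the prompt affect performance.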
Methodology
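The study evaluates 8 state-of-the-art models on both datasets, each with and without Chain-of-Thought prompting, and reports accuracy. Please refer to the full paper for the detailed methodology.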
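A comparison across datasets and prompting modes comes down to grouping graded answers by condition and averaging. The following is a minimal sketch, assuming a hypothetical record layout `(model, dataset, cot, correct)`. The function name and the toy records are placeholders for illustration, not the paper's code or results.

```python
from collections import defaultdict


def accuracy_by_condition(records):
    """Return {(dataset, cot): accuracy} from graded answer records."""
    totals = defaultdict(lambda: [0, 0])  # key -> [num_correct, num_total]
    for _model, dataset, cot, correct in records:
        key = (dataset, cot)
        totals[key][0] += int(correct)
        totals[key][1] += 1
    return {key: hits / count for key, (hits, count) in totals.items()}


# Toy placeholder records (hypothetical, not data from the paper).
toy_records = [
    ("model_a", "standard", True, True),
    ("model_a", "standard", False, True),
    ("model_a", "counterintuitive", True, True),
    ("model_a", "counterintuitive", False, False),
    ("model_b", "standard", True, True),
    ("model_b", "standard", False, True),
    ("model_b", "counterintuitive", True, False),
    ("model_b", "counterintuitive", False, False),
]

scores = accuracy_by_condition(toy_records)
for (dataset, cot), acc in sorted(scores.items()):
    print(f"{dataset:17s} cot={cot!s:5s} accuracy={acc:.2f}")
```

Keying on `(dataset, cot)` pools all models together. Adding the model name to the key would give per-model scores instead.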
Practical Implications
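The results suggest caution when relying on current LLMs for probabilistic reasoning: accuracy falls on counterintuitive problems, on disguised variants of canonical problems, and when the prompt contains misleading suggestions.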
Authors
- Luca Avena
- Gianmarco Bet
- Bernardo Busoni
Paper Information
- arXiv ID: 2606.07515v1
- Categories: cs.CL, cs.AI, cs.HC, math.PR
- Published: June 5, 2026