Thinking Tokens Are Not Created Equal: Why Benchmarks Can't Distinguish Between 'Search' and 'Insight' (A PCP Experiment)
Source: Dev.to
Experiment Overview
I’ve been running experiments to understand how different “reasoning” models actually spend their thinking budget. The results suggest that the models differ in kind rather than in degree: they behave like completely different cognitive species.
Post Correspondence Problem (PCP)
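The PCP is undecidable in the general case: no algorithm can decide, for every possible set of dominoes, whether a matching sequence exists. However, looking for a match of bounded length in one specific instance is a finite constraint-satisfaction problem, because only finitely many sequences need to be checked.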
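To make the bounded version concrete, here is a minimal sketch of a breadth-first search over domino sequences. It is not the code any of the models produced. The function name `bounded_pcp_search`, the `(top, bottom)` tuple representation, and the two-domino toy instance are all assumptions made for illustration.

```python
from collections import deque
from typing import List, Optional, Sequence, Tuple

Domino = Tuple[str, str]  # (top string, bottom string); representation is an assumption


def bounded_pcp_search(dominoes: Sequence[Domino], max_len: int) -> Optional[List[int]]:
    """Return the shortest index sequence (at most max_len long) whose top and
    bottom strings match, or None if no such sequence exists within the bound."""
    # Each state keeps the chosen indices plus the unmatched overhang of the
    # top row and of the bottom row (at most one of the two is non-empty).
    queue = deque([((), "", "")])
    while queue:
        seq, top, bottom = queue.popleft()
        if len(seq) == max_len:
            continue  # bound reached, do not extend further
        for i, (t, b) in enumerate(dominoes):
            new_top, new_bottom = top + t, bottom + b
            n = min(len(new_top), len(new_bottom))
            if new_top[:n] != new_bottom[:n]:
                continue  # rows already disagree, prune this branch
            new_top, new_bottom = new_top[n:], new_bottom[n:]
            new_seq = seq + (i,)
            if not new_top and not new_bottom:
                return list(new_seq)  # rows are identical: a match
            queue.append((new_seq, new_top, new_bottom))
    return None


# Made-up toy instance: "a" + "ba" on top equals "ab" + "a" on the bottom.
toy = [("a", "ab"), ("ba", "a")]
print(bounded_pcp_search(toy, max_len=4))  # → [0, 1]
```

A `None` result only says that no match exists within `max_len`; it says nothing about longer sequences, which is exactly where the undecidability of the general problem shows up.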
Domino Set Used in the Experiment
Type A: a / ab
Type B: b / ca
Type C: ca / a
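A cheap sanity check on any domino set is to compare letter counts: in a match, the top and bottom rows must contain each letter the same number of times. The sketch below is not part of the original experiment. It reads each entry above as top / bottom (an assumption about the notation), and the helper name `balanced_usage_counts` is hypothetical. It enumerates how many times each type could be used and keeps the combinations whose letter counts balance.

```python
from collections import Counter
from itertools import product
from typing import Dict, List, Tuple

# The three types as listed above, read as (top, bottom). This reading is an assumption.
DOMINOES: Dict[str, Tuple[str, str]] = {
    "A": ("a", "ab"),
    "B": ("b", "ca"),
    "C": ("ca", "a"),
}


def balanced_usage_counts(dominoes: Dict[str, Tuple[str, str]],
                          max_uses: int) -> List[Dict[str, int]]:
    """List every non-empty choice of usage counts (0..max_uses per type) for
    which the top and bottom rows contain the same letters in the same
    quantities. Balance is necessary for a match, but not sufficient."""
    names = list(dominoes)
    balanced = []
    for uses in product(range(max_uses + 1), repeat=len(names)):
        if not any(uses):
            continue  # skip the empty selection
        top, bottom = Counter(), Counter()
        for name, n in zip(names, uses):
            if n == 0:
                continue  # avoid storing zero-count keys
            t, b = dominoes[name]
            for ch in t:
                top[ch] += n
            for ch in b:
                bottom[ch] += n
        if top == bottom:
            balanced.append(dict(zip(names, uses)))
    return balanced


print(balanced_usage_counts(DOMINOES, max_uses=4))  # → []
```

For the three types exactly as listed, no non-empty combination balances: the count of b forces as many A as B, the count of c forces as many C as B, and the count of a then forces B to be zero. A match would therefore need additional dominoes beyond these three.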
Prompt Given to the Models
The models were asked to both design the dominoes and solve the puzzle based on the set above.
Observed Strategies
- Simulation
- Reverse Engineering
- Pattern Matching
- Inefficient Brute Force
- Inefficient Brute Force but with maths
Conclusion
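This experiment suggests that “reasoning” is a misleading umbrella term: the same thinking budget can be spent on very different kinds of work. If most real-world problems resemble the undecidable general case more than a bounded puzzle, then the Architect approach (designing for safety up front) is likely to be more reliable than the Brute Force approach (writing code and fuzz-testing it until it works).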