Thinking Tokens Are Not Created Equal: Why Benchmarks Can't Distinguish Between 'Search' and 'Insight' (A PCP Experiment)

Published: December 11, 2025 at 08:44 AM EST

Source: Dev.to

Experiment Overview

I’ve been running experiments to understand how different “reasoning” models actually spend their thinking budget. The results suggest that we are looking at completely different cognitive species.

Post Correspondence Problem (PCP)

The PCP is undecidable in the general case: no single algorithm can decide, for every possible set of dominoes, whether a matching sequence exists. However, searching a specific instance for a match of bounded length is a finite constraint‑satisfaction problem.
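
To make the "bounded length" point concrete, here is a minimal sketch of the naive search: with k dominoes and a fixed length n there are only k^n sequences to try. The function name `find_match` and the two-domino `toy` instance are assumptions made up for illustration; they are not from the experiment.

```python
from itertools import product


def find_match(dominoes, length):
    """Try every sequence of exactly `length` dominoes (repeats allowed).

    Each domino is a (top, bottom) pair of strings. Returns the first
    sequence of indices whose concatenated tops equal its concatenated
    bottoms, or None if no sequence of that length matches.
    """
    for seq in product(range(len(dominoes)), repeat=length):
        top = "".join(dominoes[i][0] for i in seq)
        bottom = "".join(dominoes[i][1] for i in seq)
        if top == bottom:
            return list(seq)
    return None


# Hypothetical toy instance, chosen only to show the mechanics.
toy = [("a", "ab"), ("ba", "a")]
print(find_match(toy, 1))  # None: no single domino matches on its own
print(find_match(toy, 2))  # [0, 1]: top "a"+"ba" == bottom "ab"+"a"
```

The search always terminates because the length is fixed. What stays undecidable is the general question of whether any length works at all.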

Domino Set Used in the Experiment

Type A: a  / ab
Type B: b  / ca
Type C: ca / a
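
As a rough illustration of how such a set can be searched, the sketch below runs a breadth-first search that tracks only the "overhang" (the part of one row that the other row has not yet matched) and prunes any placement where the rows disagree. It assumes `x / y` means top string `x` over bottom string `y`. The function name `solve_pcp`, the `max_len` bound, and the extra domino `abc / c` in the second call are assumptions added here for illustration and are not part of the set above.

```python
from collections import deque


def solve_pcp(dominoes, max_len):
    """Bounded breadth-first search for a PCP match.

    State = (top_excess, bottom_excess): the unmatched suffix of whichever
    row is currently ahead (at most one of the two is non-empty).
    Returns a shortest list of domino indices, or None within the bound.
    """
    start = ("", "")
    queue = deque([(start, [])])
    seen = {start}
    while queue:
        (top_ex, bot_ex), path = queue.popleft()
        if len(path) == max_len:
            continue  # do not extend sequences beyond the bound
        for i, (t, b) in enumerate(dominoes):
            new_top, new_bot = top_ex + t, bot_ex + b
            n = min(len(new_top), len(new_bot))
            if new_top[:n] != new_bot[:n]:
                continue  # rows disagree on the overlap: prune
            state = (new_top[n:], new_bot[n:])
            if state == ("", ""):
                return path + [i]  # both rows are equal: match found
            if state not in seen:
                seen.add(state)
                queue.append((state, path + [i]))
    return None


listed = [("a", "ab"), ("b", "ca"), ("ca", "a")]  # Types A, B, C as listed
print(solve_pcp(listed, max_len=10))  # None within this bound

# Assumption: add a hypothetical fourth domino abc / c for comparison.
extended = listed + [("abc", "c")]
print(solve_pcp(extended, max_len=10))  # [0, 1, 2, 0, 3], i.e. A, B, C, A, D
```

A `None` result only says that no match exists within `max_len` dominoes. On its own it does not prove that the instance has no solution, which is exactly where the undecidability of the general problem shows up.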

Prompt Given to the Models

The models were asked to both design the dominoes and solve the puzzle based on the set above.

Observed Strategies

Simulation
Reverse Engineering
Pattern Matching
Inefficient Brute Force
Inefficient Brute Force but with maths

Conclusion

This experiment suggests that “reasoning” is a misleading umbrella term. If the real world is mostly “undecidable,” then the Architect approach (designing for safety) is fundamentally superior to the Brute Force approach (writing code and fuzz‑testing it until it works).