Thinking Tokens Are Not Created Equal: Why Benchmarks Can't Distinguish Between 'Search' and 'Insight' (A PCP Experiment)

Published: December 11, 2025 at 08:44 AM EST
Source: Dev.to

Experiment Overview

I’ve been running experiments to understand how different “reasoning” models actually spend their thinking budget. The results suggest that we are looking at completely different cognitive species.

Post Correspondence Problem (PCP)

The PCP is undecidable in the general case: no algorithm can decide, for every possible set of dominoes, whether a matching sequence exists. However, searching for a match of bounded length within one specific instance is a finite constraint‑satisfaction problem.
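The bounded version of the problem can be sketched as a breadth‑first search over tile sequences. This is a minimal illustration, not the method any model in the experiment used; the function name `find_match`, the `max_len` parameter, and the fourth tile in the usage lines are assumptions introduced here.

```python
from collections import deque


def find_match(tiles, max_len):
    """Search for a PCP match that uses at most max_len tiles.

    tiles is a list of (top, bottom) string pairs. Returns a list of tile
    indices, or None if no match exists within the bound.
    """
    # Each state holds the sequence so far plus the unmatched overhang of
    # each row; at most one of the two overhangs is non-empty.
    queue = deque([([], "", "")])
    while queue:
        seq, top_rest, bottom_rest = queue.popleft()
        if len(seq) == max_len:
            continue  # bound reached; do not extend this sequence
        for i, (top, bottom) in enumerate(tiles):
            t, b = top_rest + top, bottom_rest + bottom
            n = min(len(t), len(b))
            if t[:n] != b[:n]:
                continue  # rows diverge, so this branch can never match
            t, b = t[n:], b[n:]
            if not t and not b:
                return seq + [i]  # both rows spell the same string
            queue.append((seq + [i], t, b))
    return None


# The three domino types listed in this write-up, taken as the whole set.
three = [("a", "ab"), ("b", "ca"), ("ca", "a")]
print(find_match(three, 8))  # None

# Hypothetical fourth tile, added only to show a successful search.
four = three + [("abc", "c")]
print(find_match(four, 8))  # [0, 1, 2, 0, 3]
```

The bound is what makes the search terminate. Without `max_len`, an instance with no match could keep the queue growing forever, which is where undecidability shows up in practice.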

Domino Set Used in the Experiment

Type A: a  / ab
Type B: b  / ca
Type C: ca / a
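
To make the notation concrete, the three types can be encoded as string pairs and a candidate sequence checked by concatenation. This is a small sketch: `TILES`, `rows`, and `is_match` are illustrative names, and reading the left side of each slash as the top row is an assumption.

```python
# Each type maps to (top, bottom); the left-of-slash-is-top reading is assumed.
TILES = {
    "A": ("a", "ab"),
    "B": ("b", "ca"),
    "C": ("ca", "a"),
}


def rows(sequence):
    """Concatenate the top and bottom strings for a sequence of type names."""
    top = "".join(TILES[name][0] for name in sequence)
    bottom = "".join(TILES[name][1] for name in sequence)
    return top, bottom


def is_match(sequence):
    """A non-empty sequence is a match when both rows spell the same string."""
    top, bottom = rows(sequence)
    return len(sequence) > 0 and top == bottom


print(rows("ABC"))      # ('abca', 'abcaa')
print(is_match("ABC"))  # False
```

Checking a proposed sequence costs only a pass over the tiles, whereas finding one requires search. That asymmetry is why a claimed solution is easy to verify even when producing it is expensive.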

Prompt Given to the Models

The models were asked to both design the dominoes and solve the puzzle based on the set above.

Observed Strategies

  • Simulation
  • Reverse Engineering
  • Pattern Matching
  • Inefficient Brute Force
  • Inefficient Brute Force but with maths

Conclusion

This experiment suggests that “reasoning” is a misleading umbrella term. If the real world is mostly “undecidable,” then the Architect approach (designing for safety) is fundamentally superior to the Brute Force approach (writing code and fuzz‑testing it until it works).
