[Paper] Correct Prediction, Wrong Steps? Consensus Reasoning Knowledge Graph for Robust Chain-of-Thought Synthesis

Published: April 15, 2026 at 01:43 PM EDT

Source: arXiv - 2604.14121v1

Overview

Large language models (LLMs) can generate impressive answers, but the step‑by‑step “chain‑of‑thought” (CoT) they produce often contains hidden errors, even when the final answer is correct. The paper “Correct Prediction, Wrong Steps? Consensus Reasoning Knowledge Graph for Robust Chain-of-Thought Synthesis” shows that simply giving LLMs the correct final answer does not fix these reasoning flaws. Instead, the authors introduce CRAFT, a framework that builds a Reasoning Knowledge Graph from the consensus portions of many candidate CoT traces and then synthesizes a cleaner, more reliable reasoning trace.

Key Contributions

- Identification of two flaw categories in LLM reasoning traces:
  1. Step Internal Flaws (logical mistakes, hallucinations within a step)
  2. Step‑wise Flaws (over‑ or under‑thinking across steps)
- Empirical evidence that providing ground‑truth answer labels to LLMs does not improve CoT quality.
- CRAFT framework that:
  - Generates multiple candidate CoT traces per query.
  - Constructs a Reasoning Knowledge Graph (RKG) that captures the common sub‑steps shared across candidates.
  - Performs topological generation to stitch together the consensus sub‑steps into a single, high‑quality trace.
- Performance boost of +10 % average label‑prediction accuracy on both logical and mathematical benchmarks, surpassing all strong baselines.
- Comprehensive evaluation showing improvements in trace coherence, correctness, and reduced hallucination rates.

Methodology

1. Sample Generation – For each problem, the LLM is prompted to produce N diverse CoT traces (e.g., via temperature sampling or varied prompts).
2. Graph Construction – Each trace is parsed into atomic reasoning steps (e.g., “apply distributive law”, “compute 7 × 8”). Nodes represent steps; directed edges encode the order. Identical or semantically equivalent steps from different traces are merged, forming the Reasoning Knowledge Graph.
3. Consensus Extraction – Nodes with high support (appearing in many traces) are considered reliable. Low‑support nodes are flagged as potential internal flaws.
4. Topological Synthesis – Starting from the graph’s source nodes, a new trace is generated by traversing the graph in topological order, preferentially selecting high‑support nodes while preserving logical dependencies.
5. Verification – The synthesized trace is optionally re‑fed to the LLM for a final answer check, ensuring the end result matches the original prediction.
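
The graph construction, consensus extraction, and topological synthesis steps can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names, the `min_support` threshold, and the case/whitespace folding used in place of real semantic-equivalence detection are all assumptions made for illustration. The sketch also hard-filters low-support steps, whereas the description above only says high-support nodes are preferred.

```python
from collections import Counter
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def normalize(step: str) -> str:
    """Hypothetical stand-in for semantic merging: fold case and whitespace."""
    return " ".join(step.lower().split())


def synthesize(traces, min_support=0.5):
    """Return (consensus_trace, support) for a list of traces.

    Each trace is a list of step strings. A step's support is the number
    of traces that contain it (after normalization).
    """
    # Normalize steps and drop repeats inside a trace, keeping first-seen order.
    normalized = [list(dict.fromkeys(normalize(s) for s in t)) for t in traces]
    support = Counter(step for steps in normalized for step in steps)

    # Consensus extraction: keep steps seen in at least min_support of traces.
    keep = {s for s, c in support.items() if c / len(traces) >= min_support}

    # Graph construction over the kept steps: an edge prev -> nxt is added
    # whenever prev directly precedes nxt in some trace (after filtering).
    sorter = TopologicalSorter()
    for steps in normalized:
        kept = [s for s in steps if s in keep]
        for step in kept:
            sorter.add(step)
        for prev, nxt in zip(kept, kept[1:]):
            sorter.add(nxt, prev)  # nxt depends on prev

    # Topological synthesis. static_order() raises graphlib.CycleError if
    # two traces order the same pair of steps in opposite ways.
    return list(sorter.static_order()), support


if __name__ == "__main__":
    demo = [
        ["Let x = 7 * 8", "Compute 7 * 8 = 56", "Answer: 56"],
        ["let x = 7 * 8", "Compute 7 * 8 = 56", "Check parity", "Answer: 56"],
        ["Let x = 7 * 8", "compute 7 * 8 = 56", "Answer: 56"],
    ]
    trace, support = synthesize(demo)
    print(trace)
    print(support["check parity"])
```

In the demo input, the "Check parity" step appears in only one of three traces, so it falls below the assumed 0.5 threshold and is left out of the synthesized trace.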

The whole pipeline is model‑agnostic and can be wrapped around any existing CoT‑capable LLM.
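
One way to picture the model-agnostic claim is a wrapper that only sees callables, never a specific model API. The sketch below is an assumption about how such a wrapper could look, not code from the paper: `run_pipeline`, its parameters, and the stub functions are all hypothetical names, and the stub synthesis step simply picks the most frequent trace rather than building a graph.

```python
from collections import Counter
from itertools import cycle
from typing import Callable, List, Tuple

Trace = List[str]


def run_pipeline(
    question: str,
    sample_trace: Callable[[str], Trace],            # wraps any CoT-capable model
    synthesize: Callable[[List[Trace]], Trace],      # consensus step, supplied by caller
    answer_from_trace: Callable[[str, Trace], str],  # final answer check
    original_prediction: str,
    n: int = 5,
) -> Tuple[Trace, bool]:
    """Sample n traces, synthesize one, and verify its final answer."""
    candidates = [sample_trace(question) for _ in range(n)]
    trace = synthesize(candidates)
    verified = answer_from_trace(question, trace) == original_prediction
    return trace, verified


# --- Hypothetical stubs so the sketch runs without any model service ---

def make_stub_model() -> Callable[[str], Trace]:
    """Return a fake sampler that cycles through canned traces."""
    canned = cycle([
        ["compute 7 * 8 = 56", "answer: 56"],
        ["compute 7 * 8 = 54", "answer: 54"],
        ["compute 7 * 8 = 56", "answer: 56"],
    ])
    return lambda question: list(next(canned))


def most_common_trace(traces: List[Trace]) -> Trace:
    """Placeholder synthesis: return the trace that occurs most often."""
    return list(Counter(map(tuple, traces)).most_common(1)[0][0])


def last_step_answer(question: str, trace: Trace) -> str:
    """Placeholder answer check: read the answer off the last step."""
    return trace[-1].removeprefix("answer: ")


if __name__ == "__main__":
    print(run_pipeline("What is 7 * 8?", make_stub_model(),
                       most_common_trace, last_step_answer, "56", n=3))
```

Swapping the stubs for calls to a real model only changes the three callables passed in; `run_pipeline` itself stays the same.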

Results & Findings

Benchmark	Baseline CoT (e.g., GPT‑4)	CRAFT‑enhanced	Absolute Gain
Logical Reasoning (e.g., LSAT)	71.2 %	82.5 %	+11.3 pp
Math Reasoning (e.g., GSM8K)	64.8 %	76.1 %	+11.3 pp
Trace Quality (BLEU‑like metric)	0.58	0.71	+0.13
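
The gain column in the table is the difference between the two accuracy columns, that is, a gain in percentage points rather than a percentage of the baseline. A short arithmetic check makes the distinction concrete; the `gains` helper is a hypothetical name, and the figures are the ones from the table.

```python
def gains(baseline: float, enhanced: float):
    """Return (absolute gain in points, gain relative to baseline in %)."""
    absolute = round(enhanced - baseline, 1)
    relative = round(100 * (enhanced - baseline) / baseline, 1)
    return absolute, relative


rows = {
    "Logical Reasoning": (71.2, 82.5),
    "Math Reasoning": (64.8, 76.1),
}

for name, (baseline, enhanced) in rows.items():
    absolute, relative = gains(baseline, enhanced)
    print(f"{name}: +{absolute} points, +{relative} % relative to baseline")
# Logical Reasoning: +11.3 points, +15.9 % relative to baseline
# Math Reasoning: +11.3 points, +17.4 % relative to baseline
```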

Error type reduction: Step Internal Flaws dropped by ~35 %; Step‑wise Flaws (over‑thinking) fell by ~28 %.
Trace diversity remained high, meaning CRAFT does not collapse all reasoning into a single “template” but preserves useful alternative reasoning paths.
CRAFT consistently outperformed every evaluated baseline (self‑consistency, majority‑vote CoT, verification‑prompting), indicating robustness to prompt design and model size.

Practical Implications

More trustworthy AI assistants – Developers can embed CRAFT in chat‑bots or code‑assistants to surface clearer, error‑free reasoning, which is crucial for debugging or compliance‑heavy domains.
Reduced post‑processing – Instead of manually inspecting CoT logs for hallucinations, the graph‑based consensus automatically filters out dubious steps.
Improved few‑shot prompting – By generating multiple traces and synthesizing them, CRAFT mitigates the brittleness of a single prompt, making LLMs more reliable in production pipelines (e.g., automated report generation, data‑analysis notebooks).
Model‑agnostic plug‑in – Since CRAFT works on the output traces, it can be added on top of any existing LLM service (OpenAI, Anthropic, LLaMA, etc.) without retraining.
Potential for debugging – The RKG can be visualized, giving engineers a graph view of where the model diverges, aiding model‑level diagnostics and dataset curation.

Limitations & Future Work

Scalability – Building and traversing the RKG for very long reasoning tasks (e.g., multi‑page proofs) can become computationally expensive; optimizations or hierarchical graph construction are needed.
Semantic equivalence detection – Merging steps relies on heuristics (string similarity, simple paraphrase models). More sophisticated semantic parsers could improve consensus detection.
Dependence on diversity – If the initial set of candidate traces lacks sufficient variation, the consensus graph may miss alternative correct reasoning paths. Future work could explore active sampling strategies to maximize useful diversity.
Human‑in‑the‑loop evaluation – The paper’s metrics are largely automated; user studies assessing perceived trustworthiness of CRAFT‑generated traces would strengthen claims for real‑world deployment.
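
The semantic equivalence limitation is easy to reproduce with a string-similarity heuristic. The sketch below uses `difflib.SequenceMatcher` from the Python standard library as one possible similarity measure; the `similar` and `merge_steps` names and the 0.85 threshold are assumptions for illustration, and the paper's actual merging heuristics may differ.

```python
from difflib import SequenceMatcher


def similar(a: str, b: str, threshold: float = 0.85) -> bool:
    """Case-insensitive character-level similarity check."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio() >= threshold


def merge_steps(steps, threshold: float = 0.85):
    """Greedily cluster steps; each cluster's first item is its representative."""
    clusters = []
    for step in steps:
        for cluster in clusters:
            if similar(step, cluster[0], threshold):
                cluster.append(step)
                break
        else:  # no existing cluster was close enough
            clusters.append([step])
    return clusters


if __name__ == "__main__":
    steps = [
        "Compute 7 * 8 = 56",
        "compute 7 * 8 = 56.",
        "Apply distributive law",
    ]
    for cluster in merge_steps(steps):
        print(cluster)
    # A one-character arithmetic error is still "similar" at this threshold:
    print(similar("compute 7 * 8 = 56", "compute 7 * 8 = 54"))
```

The last line shows the weakness: a step with a wrong result differs from the correct one by a single character, so a purely string-based merge would treat the two as the same step.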

CRAFT opens a promising direction: treating LLM reasoning as a collaborative, consensus‑building process rather than a single monologue. For developers building AI‑driven tools, it offers a practical recipe to turn “correct answers with wrong steps” into truly reliable, explainable outputs.

Authors

Zipeng Ling
Shuliang Liu
Shenghong Fu
Yuehao Tang
Seonil Son
Yao Wan
Xuming Hu

Paper Information

arXiv ID: 2604.14121v1
Categories: cs.CL
Published: April 15, 2026