[Paper] Beyond Correctness: Exposing LLM-generated Logical Flaws in Reasoning via Multi-step Automated Theorem Proving
Source: arXiv - 2512.23511v1
Overview
Large Language Models (LLMs) can now produce multi‑step arguments that look convincing, but hidden logical mistakes can still slip through—an unacceptable risk in domains like healthcare or law. The paper introduces MATP, a framework that automatically translates an LLM’s natural‑language reasoning into formal First‑Order Logic (FOL) and then runs a theorem prover to check each inference step for validity. By exposing logical flaws that surface‑level checks miss, MATP pushes LLM evaluation beyond “does the answer look right?” toward “is the reasoning sound?”
Key Contributions
- MATP pipeline: End‑to‑end system that (1) parses natural‑language reasoning into FOL, (2) feeds each step to a state‑of‑the‑art automated theorem prover, and (3) returns fine‑grained correctness labels per step.
- Benchmark suite: Curated 10,830 reasoning instances from three diverse datasets (PrOntoQA‑OOD, ProofWriter, FOLIO) covering deduction, induction, and commonsense reasoning, generated by 10 different LLMs.
- Empirical leap: MATP outperforms the strongest prompting‑based baselines by more than 42 percentage points on step‑level verification accuracy.
- Model‑level insights: Shows that purpose‑built “reasoning” models (e.g., GPT‑4‑Reasoning) produce fewer logical violations than generic chat models, highlighting the need for specialized training.
- Error taxonomy: Provides a taxonomy of logical flaws (e.g., invalid inference, missing premises, contradictory assumptions) that can be automatically reported to developers.
Methodology
1. Natural‑Language to Logic Translation
- A lightweight prompt‑engineered LLM (or a fine‑tuned parser) converts each sentence of the LLM‑generated proof into a FOL clause.
- Ambiguities are resolved by grounding entities to a shared ontology extracted from the prompt context.
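To make the translation step concrete, here is a minimal sketch of how a prompt‑engineered LLM could be asked to emit one FOL clause per sentence. It assumes an OpenAI‑style chat API; the model name, prompt wording, and the `sentence_to_fol` helper are illustrative placeholders, not the authors' implementation.

```python
# Minimal sketch of the NL-to-FOL translation step (illustrative, not the
# authors' exact prompts or code). Assumes an OpenAI-style chat API; the
# model name and prompt wording are placeholders.
from openai import OpenAI

client = OpenAI()

TRANSLATION_PROMPT = """Translate the sentence below into a single
First-Order Logic clause in TPTP syntax. Ground every entity to the names
used in the shared ontology.

Ontology: {ontology}
Sentence: {sentence}
FOL clause:"""

def sentence_to_fol(sentence: str, ontology: str, model: str = "gpt-4o") -> str:
    """Convert one natural-language sentence into a FOL clause string."""
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user",
                   "content": TRANSLATION_PROMPT.format(ontology=ontology,
                                                        sentence=sentence)}],
        temperature=0,  # deterministic parsing
    )
    return response.choices[0].message.content.strip()
```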
2. Step‑wise Theorem Proving
- For each inference step k, MATP constructs a proof obligation: given the premises and conclusions established up to step k‑1, does step k logically follow?
- The obligation is handed to an off‑the‑shelf automated theorem prover (e.g., E‑Prover, Vampire). If the prover finds a proof, the step is valid; if it instead reports a counter‑model or cannot establish entailment, the step is flagged as invalid.
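The sketch below shows one way such a proof obligation could be encoded: the premises become TPTP axioms, the candidate step becomes the conjecture, and the file is handed to E. It assumes the `eprover` binary is on the PATH and relies on E's SZS status lines; the `check_step` helper is hypothetical, not MATP's actual interface.

```python
# Sketch of a per-step proof obligation in TPTP FOF syntax, checked with the
# E prover. Assumes `eprover` is on PATH and prints SZS status lines when
# given TPTP input; premises and the candidate step are FOL clause strings
# produced by the translation stage.
import subprocess
import tempfile

def check_step(premises: list[str], step: str, timeout: int = 10):
    """Return (valid, prover_output) for one proof obligation."""
    lines = [f"fof(premise_{i}, axiom, {p})." for i, p in enumerate(premises)]
    lines.append(f"fof(step, conjecture, {step}).")
    with tempfile.NamedTemporaryFile("w", suffix=".p", delete=False) as f:
        f.write("\n".join(lines))
        problem = f.name
    result = subprocess.run(
        ["eprover", "--auto", "--tstp-format", f"--cpu-limit={timeout}", problem],
        capture_output=True, text=True,
    )
    out = result.stdout
    return ("SZS status Theorem" in out, out)

# Example obligation (modus ponens):
# check_step(["p(a)", "![X]: (p(X) => q(X))"], "q(a)")  ->  (True, ...)
```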
3. Result Aggregation & Classification
- Validity outcomes are aggregated into a per‑step report.
- Invalid steps are classified using a rule‑based matcher that maps counterexample patterns to the error taxonomy (e.g., “missing premise,” “universal instantiation error”).
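The following sketch illustrates the aggregation and classification stage. The taxonomy labels are the ones named in this summary; the pattern‑matching rules themselves are placeholder heuristics, not the paper's actual rule set.

```python
# Illustrative sketch of result aggregation and rule-based error
# classification. The label strings come from this summary; the SZS-status
# matching rules below are assumptions, not the authors' matcher.
from dataclasses import dataclass

@dataclass
class StepReport:
    index: int
    valid: bool
    error_label: str | None = None

def classify_invalid_step(prover_output: str) -> str:
    """Map prover output patterns to a coarse error label (hypothetical rules)."""
    if "SZS status CounterSatisfiable" in prover_output:
        # Step is consistent with, but not entailed by, the premises.
        return "missing premise"
    if "SZS status ContradictoryAxioms" in prover_output:
        # The premises themselves are inconsistent.
        return "contradictory assumptions"
    return "invalid inference rule"

def aggregate(results: list[tuple[bool, str]]) -> list[StepReport]:
    """Turn per-step prover outcomes into a per-step validity report."""
    return [
        StepReport(i, ok, None if ok else classify_invalid_step(out))
        for i, (ok, out) in enumerate(results, start=1)
    ]
```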
4. Evaluation Protocol
- The pipeline is run on the benchmark, and its step‑verification accuracy is compared against baselines that rely on self‑consistency checks, fact‑checking APIs, or simple syntactic validators.
The whole process is fully automated, requires no human annotation after the initial dataset creation, and can be plugged into existing LLM inference pipelines.
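Tying the hypothetical helpers above together, a plug‑in check after generation could look like this sketch: each step is translated, verified against everything established so far, and summarized in a per‑step report. The policy of keeping every translated step in the premise set is one possible choice, not necessarily the paper's.

```python
# End-to-end sketch reusing the hypothetical helpers sketched above
# (sentence_to_fol, check_step, aggregate).

def verify_reasoning(steps_nl: list[str], context_fol: list[str],
                     ontology: str) -> list[StepReport]:
    """Translate each reasoning step to FOL and verify it against the
    premises established so far; return a per-step validity report."""
    premises = list(context_fol)
    outcomes = []
    for sentence in steps_nl:
        clause = sentence_to_fol(sentence, ontology)
        valid, prover_out = check_step(premises, clause)
        outcomes.append((valid, prover_out))
        premises.append(clause)  # later steps may rely on this conclusion
    return aggregate(outcomes)
```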
Results & Findings
| Metric | MATP | Best Prompt‑Based Baseline |
|---|---|---|
| Step‑level verification accuracy | 78.4 % | 35.9 % |
| Recall of logical errors (any step) | 81.2 % | 38.5 % |
| Precision of error classification | 74.6 % | 30.1 % |
- Model ranking: Reasoning‑oriented models (e.g., GPT‑4‑Reasoning, Claude‑2‑Reason) achieved >85 % step accuracy, while generic chat models hovered around 60 %.
- Error distribution: The most common flaw was missing premise (≈42 % of errors), followed by invalid inference rule (≈28 %).
- Scalability: The average verification time per proof (≈12 steps) was ~1.8 seconds on a single CPU core, making MATP viable for batch evaluation or on‑the‑fly checking in low‑latency settings.
Practical Implications
- Safety nets for high‑stakes AI: Integrating MATP into LLM‑driven decision support (e.g., clinical decision aids, legal brief drafting) can automatically flag reasoning that looks plausible but is logically unsound, prompting human review before deployment.
- Developer tooling: MATP can be exposed as a REST API or CLI utility that developers call after generating a chain‑of‑thought answer, receiving a step‑by‑step validity report that can be displayed in IDE extensions or CI pipelines (a minimal CLI sketch follows this list).
- Model fine‑tuning feedback: The fine‑grained error taxonomy gives concrete signals for reinforcement learning from human feedback (RLHF) loops—e.g., penalize missing‑premise patterns during training.
- Benchmarking & model selection: Companies can benchmark their proprietary LLMs against MATP to choose the most logically reliable model for downstream products, rather than relying on surface‑level metrics like BLEU or ROUGE.
- Regulatory compliance: For sectors where explainability and auditability are mandated (e.g., finance), MATP provides a provable log of logical correctness that can be attached to AI‑generated reports.
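For the developer‑tooling point above, a minimal CLI wrapper could look like the following sketch. The flag names, file formats, and JSON report layout are invented for illustration; the nonzero exit code is what a CI pipeline would key on.

```python
# Hypothetical CLI wrapper around the verification sketch above; not an
# actual MATP interface. Reads one reasoning step per line, prints a JSON
# report, and exits nonzero if any step fails verification (for CI use).
import argparse
import json
import sys
from dataclasses import asdict

def main() -> int:
    parser = argparse.ArgumentParser(
        description="Check an LLM chain-of-thought for logical flaws.")
    parser.add_argument("proof_file",
                        help="text file with one reasoning step per line")
    parser.add_argument("--context", default=None,
                        help="optional file with FOL premises, one per line")
    args = parser.parse_args()

    steps = [line.strip() for line in open(args.proof_file) if line.strip()]
    context = ([line.strip() for line in open(args.context) if line.strip()]
               if args.context else [])
    report = verify_reasoning(steps, context, ontology="")
    json.dump([asdict(r) for r in report], sys.stdout, indent=2)
    return 0 if all(r.valid for r in report) else 1

if __name__ == "__main__":
    sys.exit(main())
```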
Limitations & Future Work
- Translation bottleneck: Converting natural language to FOL still depends on an LLM, and errors in this step can propagate to the theorem prover, potentially under‑reporting flaws.
- Expressivity constraints: First‑Order Logic cannot capture certain probabilistic or temporal reasoning patterns common in real‑world tasks, limiting MATP’s coverage.
- Scalability to very long proofs: While per‑step verification is fast, extremely long chains (hundreds of steps) increase cumulative latency; incremental caching or parallel proof obligations are needed.
- Domain‑specific ontologies: MATP assumes a reasonably well‑defined ontology; extending it to open‑domain or highly specialized vocabularies (e.g., biomedical terminology) will require richer grounding mechanisms.
Future research directions include:
- Training dedicated NL‑to‑FOL parsers to reduce translation noise.
- Exploring higher‑order logics or hybrid symbolic‑neural frameworks for richer reasoning.
- Integrating MATP into interactive LLM assistants that can automatically suggest missing premises or corrected inference steps in real time.
Authors
- Xinyi Zheng
- Ningke Li
- Xiaokun Luan
- Kailong Wang
- Ling Shi
- Meng Sun
- Haoyu Wang
Paper Information
- arXiv ID: 2512.23511v1
- Categories: cs.SE, cs.FL
- Published: December 29, 2025