[Paper] Fine-Tuned Large Language Models for Logical Translation: Reducing Hallucinations with Lang2Logic
Source: arXiv - 2512.02987v1
Overview
The paper “Fine‑Tuned Large Language Models for Logical Translation: Reducing Hallucinations with Lang2Logic” tackles a practical problem many developers face when turning natural‑language specifications into machine‑checkable logic: large language models (LLMs) often “hallucinate”, producing syntactically plausible but semantically wrong logical formulas. By fine‑tuning an LLM on data generated from a carefully crafted grammar, and pairing it with a pipeline that converts English statements into Conjunctive Normal Form (CNF), the authors demonstrate a concrete way to curb these errors and generate reliable inputs for SAT solvers.
Key Contributions
- Lang2Logic framework – an end‑to‑end pipeline that (1) parses English sentences, (2) maps them to first‑order logical expressions, and (3) converts those expressions to CNF for downstream satisfiability checking.
- Self‑defined grammar for logical translation – a lightweight, rule‑based grammar that guides the model toward syntactically correct logical forms, reducing the search space for the LLM.
- Fine‑tuning strategy – the authors fine‑tune a pre‑trained LLM on a dataset annotated with the custom grammar, showing that the model learns to avoid specific hallucination patterns seen in the base model.
- Empirical evidence of hallucination correction – experiments reveal that the fine‑tuned model systematically fixes the same classes of errors (e.g., misplaced quantifiers, missing parentheses) that the vanilla model makes.
- Open‑source tooling – the paper releases the grammar definitions, data generation scripts, and a Python library that wraps symbolic computation (SymPy) and SAT‑solver interfaces, making it easy for practitioners to adopt the approach.
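To make the grammar‑driven data generation concrete, here is a minimal sketch of how paired (English, formula) training examples could be produced from templated rules. The templates, noun list, and `generate_pairs` helper are invented for illustration; the paper's actual grammar is richer than this.

```python
import random

# Hypothetical sentence templates paired with gold logical formulas.
# Each template yields an (English sentence, formula) training pair.
TEMPLATES = [
    ("Every {a} is a {b}.", "forall x. {A}(x) -> {B}(x)"),
    ("Some {a} is a {b}.",  "exists x. {A}(x) & {B}(x)"),
    ("No {a} is a {b}.",    "forall x. {A}(x) -> ~{B}(x)"),
]

NOUNS = ["request", "response", "server", "client"]

def generate_pairs(n, seed=0):
    """Generate n (sentence, formula) pairs from the templates above."""
    rng = random.Random(seed)  # seeded for reproducible datasets
    pairs = []
    for _ in range(n):
        sent_t, form_t = rng.choice(TEMPLATES)
        a, b = rng.sample(NOUNS, 2)  # two distinct nouns per sentence
        pairs.append((
            sent_t.format(a=a, b=b),
            form_t.format(A=a.capitalize(), B=b.capitalize()),
        ))
    return pairs

for sentence, formula in generate_pairs(3):
    print(sentence, "=>", formula)
```

Because the gold formula is produced mechanically alongside the sentence, no human annotation is needed, which is what makes the fine‑tuning dataset cheap to scale.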
Methodology
- Data Generation – The authors start with a corpus of English specifications (e.g., “Every request must eventually receive a response”). Using a handcrafted grammar, they automatically generate paired logical formulas in a normalized intermediate representation.
- Grammar‑Guided Tokenization – Tokens that correspond to logical operators, quantifiers, and parentheses are treated as special symbols, ensuring the model learns their exact placement.
- Fine‑Tuning – A base LLM (e.g., GPT‑2‑medium) is fine‑tuned on the generated pairs for several epochs, with a loss that heavily penalizes mismatched logical tokens.
- Post‑Processing Pipeline – The model’s raw output is fed into a symbolic computation library (SymPy) to validate syntactic correctness, then a deterministic CNF conversion routine produces the final SAT‑solver input.
- Evaluation – The authors compare the vanilla LLM, the fine‑tuned model, and a rule‑based baseline on two metrics: (a) Logical Accuracy (exact match to the gold formula) and (b) SAT‑Solver Success Rate (whether the generated CNF leads to the same satisfiability outcome as the gold CNF).
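The post‑processing step above can be sketched with SymPy's public logic API. The paper wraps SymPy in its own library, whose exact function names are not given here, so this shows only the generic SymPy calls; the example formula is invented.

```python
from sympy.logic.boolalg import to_cnf
from sympy.logic.inference import satisfiable
from sympy.parsing.sympy_parser import parse_expr

def postprocess(raw_formula: str):
    """Validate a model-emitted formula, convert it to CNF, and SAT-check it."""
    expr = parse_expr(raw_formula)       # raises if syntactically malformed
    cnf = to_cnf(expr, simplify=False)   # deterministic CNF conversion
    model = satisfiable(cnf)             # dict model if SAT, False if UNSAT
    return cnf, model

# "If a request is sent, a response follows; a request was sent; but no
# response followed" -- a contradictory specification.
cnf, model = postprocess("Implies(req, resp) & req & ~resp")
print(cnf)
print(model)  # False => the spec is contradictory
```

A malformed model output (e.g. unbalanced parentheses) fails at the `parse_expr` stage, which is exactly the hallucination class the fine‑tuning targets.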
Results & Findings
| Model | Logical Accuracy | SAT‑Solver Success |
|---|---|---|
| Base LLM (no fine‑tune) | 68 % | 61 % |
| Rule‑based baseline | 74 % | 70 % |
| Fine‑tuned LLM | 88 % | 84 % |
- The fine‑tuned model eliminates the most common hallucination types: missing quantifiers (reduced from 22 % to 4 %) and malformed parentheses (from 18 % to 3 %).
- When fed into a standard SAT solver (MiniSat), the CNFs generated by the fine‑tuned model yield the correct satisfiability result 84 % of the time, a 23‑point jump over the unmodified LLM.
- Qualitative analysis shows that the model learns to respect the grammar’s precedence rules, producing logically equivalent but syntactically cleaner formulas.
Practical Implications
- Automated Specification Checking – Teams can embed Lang2Logic into CI pipelines to automatically translate natural‑language requirements into CNF and run SAT checks, catching contradictory specs early.
- Debugging & Invariant Generation – Developers writing loop invariants or pre/post‑conditions can get instant, formally verified logical forms, reducing manual translation errors.
- Safety‑Critical Systems – In domains like aerospace or medical devices, where formal verification is mandatory, the approach offers a low‑effort bridge from stakeholder language to provable models.
- Tooling Integration – The released Python library can be wrapped around IDE plugins (e.g., VS Code extensions) to provide real‑time feedback on the logical soundness of comments or docstrings.
- Cost‑Effective Fine‑Tuning – Because the grammar‑driven dataset is synthetically generated, organizations can fine‑tune their own LLMs on domain‑specific vocabularies without massive annotation effort.
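As a sketch of the CI use case: a team could encode each requirement as a formula and assert joint satisfiability in its test suite. The requirement names and formulas below are invented; in practice they would come from a Lang2Logic‑style translation of the spec text.

```python
from sympy import And
from sympy.logic.inference import satisfiable
from sympy.parsing.sympy_parser import parse_expr

# Hypothetical requirements, already translated into logic.
# A CI job would regenerate these from the natural-language spec.
REQUIREMENTS = {
    "R1": "Implies(log_enabled, audit)",  # logging implies an audit trail
    "R2": "Implies(audit, storage)",      # an audit trail requires storage
    "R3": "log_enabled",                  # logging is enabled
}

def check_consistency(requirements):
    """Conjoin all requirement formulas and SAT-check the conjunction.
    Returns a satisfying assignment (dict) if consistent, False otherwise."""
    conjunction = And(*(parse_expr(f) for f in requirements.values()))
    return satisfiable(conjunction)

model = check_consistency(REQUIREMENTS)
print("consistent" if model else "CONTRADICTORY SPEC")
```

Adding a conflicting requirement (say, `~audit`) would make the conjunction unsatisfiable and fail the build, surfacing the contradiction before any code is written.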
Limitations & Future Work
- Domain Coverage – The current grammar handles a subset of first‑order logic (e.g., conjunction, disjunction, universal/existential quantifiers) but does not yet support higher‑order constructs, temporal operators, or arithmetic constraints.
- Scalability of Fine‑Tuning – Experiments used a medium‑sized LLM; scaling to larger models (e.g., GPT‑3‑class) may require more compute and careful regularization to avoid overfitting to the synthetic grammar.
- Error Propagation – While the fine‑tuned model reduces hallucinations, any remaining syntax error still causes the downstream CNF conversion to fail; a fallback rule‑based validator is needed for production robustness.
- User Study – The paper does not include a usability study with software engineers; future work could measure how the tool affects developer productivity and error rates in real projects.
- Cross‑Language Support – Extending the pipeline to handle specifications written in languages other than English (or multilingual corpora) remains an open challenge.
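The fallback validator mentioned in the limitations above could take a shape like the following minimal sketch. The `translate_with_fallback` wrapper and the trivial rule‑based stand‑in are invented placeholders, not part of the released tooling.

```python
from sympy.logic.boolalg import to_cnf
from sympy.parsing.sympy_parser import parse_expr

def translate_with_fallback(raw_output: str, rule_based_translate):
    """Try the LLM's formula first; if it fails to parse, fall back to a
    deterministic rule-based translator so the pipeline never crashes."""
    try:
        return to_cnf(parse_expr(raw_output)), "llm"
    except Exception:
        # Any residual hallucination (bad parentheses, stray tokens)
        # lands here instead of breaking the downstream SAT step.
        return to_cnf(parse_expr(rule_based_translate(raw_output))), "fallback"

# Invented stand-in for a rule-based translator: it ignores the broken
# output and returns a trivially valid formula, so downstream stages
# still receive well-formed CNF.
trivial = lambda _raw: "true"
cnf, source = translate_with_fallback("p & (q | ", trivial)
print(cnf, source)
```

In production, the fallback branch would also log the failing output so the error class can be folded into the next fine‑tuning round.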
Overall, “Lang2Logic” demonstrates that a modest amount of targeted fine‑tuning, combined with a well‑designed grammar, can dramatically improve the reliability of LLM‑driven logical translation—opening the door for more trustworthy AI‑assisted formal methods in everyday software development.
Authors
- Muyu Pan
- Dheeraj Kodakandla
- Mahfuza Farooque
Paper Information
- arXiv ID: 2512.02987v1
- Categories: cs.CL, cs.AI
- Published: December 2, 2025