[Paper] Fine-Tuned Large Language Models for Logical Translation: Reducing Hallucinations with Lang2Logic
Source: arXiv - 2512.02987v1
Overview
The paper “Fine‑Tuned Large Language Models for Logical Translation: Reducing Hallucinations with Lang2Logic” tackles a practical problem many developers face when turning natural‑language specifications into machine‑checkable logic: large language models (LLMs) often “hallucinate”, producing syntactically plausible but semantically wrong logical formulas. By fine‑tuning an LLM on data generated from a carefully crafted grammar, and pairing it with a pipeline that converts English statements into Conjunctive Normal Form (CNF), the authors demonstrate a concrete way to curb these errors and generate reliable inputs for SAT solvers.
Key Contributions
- Lang2Logic framework – an end‑to‑end pipeline that (1) parses English sentences, (2) maps them to first‑order logical expressions, and (3) converts those expressions to CNF for downstream satisfiability checking.
- Self‑defined grammar for logical translation – a lightweight, rule‑based grammar that guides the model toward syntactically correct logical forms, reducing the search space for the LLM.
- Fine‑tuning strategy – the authors fine‑tune a pre‑trained LLM on a dataset annotated with the custom grammar, showing that the model learns to avoid specific hallucination patterns seen in the base model.
- Empirical evidence of hallucination correction – experiments reveal that the fine‑tuned model systematically fixes the same classes of errors (e.g., misplaced quantifiers, missing parentheses) that the vanilla model makes.
- Open‑source tooling – the paper releases the grammar definitions, data generation scripts, and a Python library that wraps symbolic computation (SymPy) and SAT‑solver interfaces, making it easy for practitioners to adopt the approach.
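To make the grammar‑driven data generation concrete, here is a minimal sketch of how paired (English, formula) training examples could be produced from templated rules. The templates, noun list, and `generate_pairs` helper are invented for illustration; the paper's actual grammar is richer than this.

```python
import random

# Hypothetical sentence templates paired with gold logical formulas.
# Each template yields an (English sentence, formula) training pair.
TEMPLATES = [
    ("Every {a} is a {b}.", "forall x. {A}(x) -> {B}(x)"),
    ("Some {a} is a {b}.",  "exists x. {A}(x) & {B}(x)"),
    ("No {a} is a {b}.",    "forall x. {A}(x) -> ~{B}(x)"),
]

NOUNS = ["request", "response", "server", "client"]

def generate_pairs(n, seed=0):
    """Generate n (sentence, formula) pairs from the templates above."""
    rng = random.Random(seed)  # seeded for reproducible datasets
    pairs = []
    for _ in range(n):
        sent_t, form_t = rng.choice(TEMPLATES)
        a, b = rng.sample(NOUNS, 2)  # two distinct nouns per sentence
        pairs.append((
            sent_t.format(a=a, b=b),
            form_t.format(A=a.capitalize(), B=b.capitalize()),
        ))
    return pairs

for sentence, formula in generate_pairs(3):
    print(sentence, "=>", formula)
```

Because the gold formula is produced mechanically alongside the sentence, no human annotation is needed, which is what makes the fine‑tuning dataset cheap to scale.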
Methodology
- Data Generation – The authors start with a corpus of English specifications (e.g., “Every request must eventually receive a response”). Using a handcrafted grammar, they automatically generate paired logical formulas in a normalized intermediate representation.
- Grammar‑Guided Tokenization – Tokens that correspond to logical operators, quantifiers, and parentheses are treated as special symbols, ensuring the model learns their exact placement.
- Fine‑Tuning – A base LLM (e.g., GPT‑2‑medium) is fine‑tuned on the generated pairs for several epochs, with a loss that heavily penalizes mismatched logical tokens.
- Post‑Processing Pipeline – The model’s raw output is fed into a symbolic computation library (SymPy) to validate syntactic correctness, then a deterministic CNF conversion routine produces the final SAT‑solver input.
- Evaluation – The authors compare the vanilla LLM, the fine‑tuned model, and a rule‑based baseline on two metrics: (a) Logical Accuracy (exact match to the gold formula) and (b) SAT‑Solver Success Rate (whether the generated CNF leads to the same satisfiability outcome as the gold CNF).
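The post‑processing step above can be sketched with SymPy's public logic API. The paper wraps SymPy in its own library, whose exact function names are not given here, so this shows only the generic SymPy calls; the example formula is invented.

```python
from sympy.logic.boolalg import to_cnf
from sympy.logic.inference import satisfiable
from sympy.parsing.sympy_parser import parse_expr

def postprocess(raw_formula: str):
    """Validate a model-emitted formula, convert it to CNF, and SAT-check it."""
    expr = parse_expr(raw_formula)       # raises if syntactically malformed
    cnf = to_cnf(expr, simplify=False)   # deterministic CNF conversion
    model = satisfiable(cnf)             # dict model if SAT, False if UNSAT
    return cnf, model

# "If a request is sent, a response follows; a request was sent; but no
# response followed" -- a contradictory specification.
cnf, model = postprocess("Implies(req, resp) & req & ~resp")
print(cnf)
print(model)  # False => the spec is contradictory
```

A malformed model output (e.g. unbalanced parentheses) fails at the `parse_expr` stage, which is exactly the hallucination class the fine‑tuning targets.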
Results & Findings
| Model | Logical Accuracy | SAT‑Solver Success |
|---|---|---|
| Base LLM (no fine‑tune) | 68 % | 61 % |
| Rule‑based baseline | 74 % | 70 % |
| Fine‑tuned LLM | 88 % | 84 % |
- The fine‑tuned model eliminates the most common hallucination types: missing quantifiers (reduced from 22 % to 4 %) and malformed parentheses (from 18 % to 3 %).
- When fed into a standard SAT solver (MiniSat), the CNFs generated by the fine‑tuned model yield the correct satisfiability result 84 % of the time, a 23‑point jump over the unmodified LLM.
- Qualitative analysis shows that the model learns to respect the grammar’s precedence rules, producing logically equivalent but syntactically cleaner formulas.
Practical Implications
- Automated Specification Checking – Teams can embed Lang2Logic into CI pipelines to automatically translate natural‑language requirements into CNF and run SAT checks, catching contradictory specs early.
- Debugging & Invariant Generation – Developers writing loop invariants or pre/post‑conditions can get instant, formally verified logical forms, reducing manual translation errors.
- Safety‑Critical Systems – In domains like aerospace or medical devices, where formal verification is mandatory, the approach offers a low‑effort bridge from stakeholder language to provable models.
- Tooling Integration – The released Python library can be wrapped around IDE plugins (e.g., VS Code extensions) to provide real‑time feedback on the logical soundness of comments or docstrings.
- Cost‑Effective Fine‑Tuning – Because the grammar‑driven dataset is synthetically generated, organizations can fine‑tune their own LLMs on domain‑specific vocabularies without massive annotation effort.
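As a sketch of the CI use case: a team could encode each requirement as a formula and assert joint satisfiability in its test suite. The requirement names and formulas below are invented; in practice they would come from a Lang2Logic‑style translation of the spec text.

```python
from sympy import And
from sympy.logic.inference import satisfiable
from sympy.parsing.sympy_parser import parse_expr

# Hypothetical requirements, already translated into logic.
# A CI job would regenerate these from the natural-language spec.
REQUIREMENTS = {
    "R1": "Implies(log_enabled, audit)",  # logging implies an audit trail
    "R2": "Implies(audit, storage)",      # an audit trail requires storage
    "R3": "log_enabled",                  # logging is enabled
}

def check_consistency(requirements):
    """Conjoin all requirement formulas and SAT-check the conjunction.
    Returns a satisfying assignment (dict) if consistent, False otherwise."""
    conjunction = And(*(parse_expr(f) for f in requirements.values()))
    return satisfiable(conjunction)

model = check_consistency(REQUIREMENTS)
print("consistent" if model else "CONTRADICTORY SPEC")
```

Adding a conflicting requirement (say, `~audit`) would make the conjunction unsatisfiable and fail the build, surfacing the contradiction before any code is written.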
Limitations & Future Work
- Domain Coverage – The current grammar handles a subset of first‑order logic (e.g., conjunction, disjunction, universal/existential quantifiers) but does not yet support higher‑order constructs, temporal operators, or arithmetic constraints.
- Scalability of Fine‑Tuning – Experiments used a medium‑sized LLM; scaling to larger models (e.g., GPT‑3‑class) may require more compute and careful regularization to avoid overfitting to the synthetic grammar.
- Error Propagation – While the fine‑tuned model reduces hallucinations, any remaining syntax error still causes the downstream CNF conversion to fail; a fallback rule‑based validator is needed for production robustness.
- User Study – The paper does not include a usability study with software engineers; future work could measure how the tool affects developer productivity and error rates in real projects.
- Cross‑Language Support – Extending the pipeline to handle specifications written in languages other than English (or multilingual corpora) remains an open challenge.
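The fallback validator mentioned in the limitations above could take a shape like the following minimal sketch. The `translate_with_fallback` wrapper and the trivial rule‑based stand‑in are invented placeholders, not part of the released tooling.

```python
from sympy.logic.boolalg import to_cnf
from sympy.parsing.sympy_parser import parse_expr

def translate_with_fallback(raw_output: str, rule_based_translate):
    """Try the LLM's formula first; if it fails to parse, fall back to a
    deterministic rule-based translator so the pipeline never crashes."""
    try:
        return to_cnf(parse_expr(raw_output)), "llm"
    except Exception:
        # Any residual hallucination (bad parentheses, stray tokens)
        # lands here instead of breaking the downstream SAT step.
        return to_cnf(parse_expr(rule_based_translate(raw_output))), "fallback"

# Invented stand-in for a rule-based translator: it ignores the broken
# output and returns a trivially valid formula, so downstream stages
# still receive well-formed CNF.
trivial = lambda _raw: "true"
cnf, source = translate_with_fallback("p & (q | ", trivial)
print(cnf, source)
```

In production, the fallback branch would also log the failing output so the error class can be folded into the next fine‑tuning round.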
Overall, “Lang2Logic” demonstrates that a modest amount of targeted fine‑tuning, combined with a well‑designed grammar, can dramatically improve the reliability of LLM‑driven logical translation—opening the door for more trustworthy AI‑assisted formal methods in everyday software development.
Authors
- Muyu Pan
- Dheeraj Kodakandla
- Mahfuza Farooque
Paper Information
- arXiv ID: 2512.02987v1
- Categories: cs.CL, cs.AI
- Published: December 2, 2025