[Paper] Fine-Tuned Large Language Models for Logical Translation: Reducing Hallucinations with Lang2Logic

Published: December 2, 2025 at 01:03 PM EST
4 min read

Source: arXiv - 2512.02987v1

Overview

The paper “Fine‑Tuned Large Language Models for Logical Translation: Reducing Hallucinations with Lang2Logic” tackles a practical problem that many developers face when they try to turn natural‑language specifications into machine‑checkable logic: large language models (LLMs) often “hallucinate” – they produce syntactically plausible but semantically wrong logical formulas. By fine‑tuning an LLM on a carefully crafted grammar and a pipeline that converts English statements into Conjunctive Normal Form (CNF), the authors demonstrate a concrete way to curb these errors and generate reliable inputs for SAT solvers.

Key Contributions

  • Lang2Logic framework – an end‑to‑end pipeline that (1) parses English sentences, (2) maps them to first‑order logical expressions, and (3) converts those expressions to CNF for downstream satisfiability checking.
  • Self‑defined grammar for logical translation – a lightweight, rule‑based grammar that guides the model toward syntactically correct logical forms, reducing the search space for the LLM.
  • Fine‑tuning strategy – the authors fine‑tune a pre‑trained LLM on a dataset annotated with the custom grammar, showing that the model learns to avoid specific hallucination patterns seen in the base model.
  • Empirical evidence of hallucination correction – experiments reveal that the fine‑tuned model systematically fixes the same classes of errors (e.g., misplaced quantifiers, missing parentheses) that the vanilla model makes.
  • Open‑source tooling – the paper releases the grammar definitions, data generation scripts, and a Python library that wraps symbolic computation (SymPy) and SAT‑solver interfaces, making it easy for practitioners to adopt the approach.
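
The released library itself is not reproduced here; as a rough illustration of the kind of wrapper the authors describe, the sketch below uses SymPy's logic module to parse a formula, normalize it to CNF, and check satisfiability. The helper name `translate_and_check` and the example formula are assumptions for illustration only.

```python
# Minimal sketch (not the paper's released library): parse a propositional
# formula, normalize it to CNF with SymPy, and check satisfiability.
from sympy import sympify
from sympy.logic.boolalg import to_cnf
from sympy.logic.inference import satisfiable

def translate_and_check(formula_str: str):
    """Hypothetical helper: parse a logical formula string, convert it to
    CNF, and report whether the result is satisfiable."""
    expr = sympify(formula_str)        # e.g. "(Request >> Response) & Request"
    cnf = to_cnf(expr)                 # deterministic CNF conversion
    model = satisfiable(cnf)           # a satisfying assignment, or False
    return cnf, model

cnf, model = translate_and_check("(Request >> Response) & Request")
print(cnf)    # a CNF-equivalent form, e.g. Request & (Response | ~Request)
print(model)  # e.g. {Request: True, Response: True}
```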

Methodology

  1. Data Generation – The authors start with a corpus of English specifications (e.g., “Every request must eventually receive a response”). Using a handcrafted grammar, they automatically generate paired logical formulas in a normalized intermediate representation.
  2. Grammar‑Guided Tokenization – Tokens that correspond to logical operators, quantifiers, and parentheses are treated as special symbols, ensuring the model learns their exact placement.
  3. Fine‑Tuning – A base LLM (e.g., GPT‑2‑medium) is fine‑tuned on the generated pairs for several epochs, with a loss that heavily penalizes mismatched logical tokens.
  4. Post‑Processing Pipeline – The model’s raw output is fed into a symbolic computation library (SymPy) to validate syntactic correctness, then a deterministic CNF conversion routine produces the final SAT‑solver input (a minimal sketch of these steps follows this list).
  5. Evaluation – The authors compare the vanilla LLM, the fine‑tuned model, and a rule‑based baseline on two metrics: (a) Logical Accuracy (exact match to the gold formula) and (b) SAT‑Solver Success Rate (whether the generated CNF leads to the same satisfiability outcome as the gold CNF).
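
As referenced in step 4, the snippet below is a hedged sketch of what the post‑processing and evaluation steps could look like; the function names `postprocess` and `sat_outcome_matches` and the example formulas are hypothetical, not the authors’ code.

```python
# Sketch of the post-processing and evaluation steps (hypothetical function
# names, not the authors' exact code): validate the model's raw output with
# SymPy, convert it to CNF, and compare satisfiability with the gold formula.
from sympy import sympify, SympifyError
from sympy.logic.boolalg import to_cnf
from sympy.logic.inference import satisfiable

def postprocess(raw_output: str):
    """Return the CNF of the model's output, or None if it is malformed."""
    try:
        expr = sympify(raw_output)     # syntactic validation
    except (SympifyError, SyntaxError):
        return None                    # hallucinated / malformed formula
    return to_cnf(expr)

def sat_outcome_matches(predicted_cnf, gold_cnf) -> bool:
    """SAT-solver success: do both CNFs have the same satisfiability outcome?"""
    return bool(satisfiable(predicted_cnf)) == bool(satisfiable(gold_cnf))

predicted = postprocess("(P >> Q) & P")      # model output
gold = to_cnf(sympify("(Q | ~P) & P"))       # reference formula
print(predicted is not None and sat_outcome_matches(predicted, gold))  # True
```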

Results & Findings

Model                      Logical Accuracy   SAT‑Solver Success
Base LLM (no fine‑tune)    68 %               61 %
Rule‑based baseline        74 %               70 %
Fine‑tuned LLM             88 %               84 %

  • The fine‑tuned model eliminates the most common hallucination types: missing quantifiers (reduced from 22 % to 4 %) and malformed parentheses (from 18 % to 3 %).
  • When fed into a standard SAT solver (MiniSat), the CNFs generated by the fine‑tuned model yield the correct satisfiability result 84 % of the time, a 23‑point jump over the unmodified LLM (see the illustrative snippet after this list).
  • Qualitative analysis shows that the model learns to respect the grammar’s precedence rules, producing logically equivalent but syntactically cleaner formulas.
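
The authors’ evaluation harness is not shown in this summary; the snippet below is only an illustration of how a small CNF like the one discussed above can be checked with the MiniSat backend of the python‑sat package. The clause encoding and variable numbering are assumptions.

```python
# Illustration only (not the authors' evaluation harness): checking a small
# DIMACS-style CNF with the MiniSat backend of the python-sat package.
from pysat.solvers import Minisat22

# Variable numbering is an assumption: 1 = Request, 2 = Response.
# The clauses encode (~Request | Response) & Request.
clauses = [[-1, 2], [1]]

with Minisat22(bootstrap_with=clauses) as solver:
    print(solver.solve())      # True  -> the specification is satisfiable
    print(solver.get_model())  # e.g. [1, 2], i.e. Request=True, Response=True
```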

Practical Implications

  • Automated Specification Checking – Teams can embed Lang2Logic into CI pipelines to automatically translate natural‑language requirements into CNF and run SAT checks, catching contradictory specs early (a sketch of such a check follows this list).
  • Debugging & Invariant Generation – Developers writing loop invariants or pre/post‑conditions can get instant, formally verified logical forms, reducing manual translation errors.
  • Safety‑Critical Systems – In domains like aerospace or medical devices, where formal verification is mandatory, the approach offers a low‑effort bridge from stakeholder language to provable models.
  • Tooling Integration – The released Python library can be wrapped around IDE plugins (e.g., VS Code extensions) to provide real‑time feedback on the logical soundness of comments or docstrings.
  • Cost‑Effective Fine‑Tuning – Because the grammar‑driven dataset is synthetically generated, organizations can fine‑tune their own LLMs on domain‑specific vocabularies without massive annotation effort.
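
As a hedged sketch of the CI‑style consistency check mentioned in the first item above (not the released tooling), one could conjoin all translated requirements and flag the specification set as contradictory when the conjunction is unsatisfiable. The helper name `specs_are_consistent` and the example requirements are hypothetical.

```python
# Hedged sketch of a CI-style consistency check (hypothetical helper, not the
# released tooling): conjoin all translated requirements and flag the spec
# set as contradictory when the conjunction is unsatisfiable.
from sympy import And, sympify
from sympy.logic.boolalg import to_cnf
from sympy.logic.inference import satisfiable

def specs_are_consistent(translated_requirements: list[str]) -> bool:
    """True if all requirements can hold at once, False if they contradict."""
    conjunction = And(*[sympify(r) for r in translated_requirements])
    return bool(satisfiable(to_cnf(conjunction)))

requirements = [
    "Login >> Audit",   # every login is audited
    "Login",            # logins do occur
    "~Audit",           # auditing is disabled
]
print(specs_are_consistent(requirements))  # False -> contradictory specs
```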

Limitations & Future Work

  • Domain Coverage – The current grammar handles a subset of first‑order logic (e.g., conjunction, disjunction, universal/existential quantifiers) but does not yet support higher‑order constructs, temporal operators, or arithmetic constraints.
  • Scalability of Fine‑Tuning – Experiments used a medium‑sized LLM; scaling to larger models (e.g., GPT‑3‑class) may require more compute and careful regularization to avoid overfitting to the synthetic grammar.
  • Error Propagation – While the fine‑tuned model reduces hallucinations, any remaining syntax error still causes the downstream CNF conversion to fail; a fallback rule‑based validator is needed for production robustness.
  • User Study – The paper does not include a usability study with software engineers; future work could measure how the tool affects developer productivity and error rates in real projects.
  • Cross‑Language Support – Extending the pipeline to handle specifications written in languages other than English (or multilingual corpora) remains an open challenge.

Overall, “Lang2Logic” demonstrates that a modest amount of targeted fine‑tuning, combined with a well‑designed grammar, can dramatically improve the reliability of LLM‑driven logical translation—opening the door for more trustworthy AI‑assisted formal methods in everyday software development.

Authors

  • Muyu Pan
  • Dheeraj Kodakandla
  • Mahfuza Farooque

Paper Information

  • arXiv ID: 2512.02987v1
  • Categories: cs.CL, cs.AI
  • Published: December 2, 2025
  • PDF: Download PDF