[Paper] Algorithm-Based Pipeline for Reliable and Intent-Preserving Code Translation with LLMs
Source: arXiv - 2602.16106v1
Overview
The paper presents an algorithm‑based pipeline that dramatically improves the reliability of code translation performed by Large Language Models (LLMs). By inserting a language‑neutral intermediate specification before generating the target code, the authors show that translations between Python and Java become far more likely to compile, run, and pass the original test suite.
Key Contributions
- Intermediate Specification Layer – Introduces a language‑agnostic, algorithmic description of program intent (control flow, data types, I/O) that guides the final code generation step.
- Comprehensive Empirical Study – Evaluates five popular LLMs (e.g., GPT‑4, Claude, Llama 2) on two large benchmark suites (Avatar and CodeNet) in a paired “direct vs. algorithm‑based” experiment.
- Robust Evaluation Protocol – Every translated snippet is compiled, executed, and tested; failures are categorized using a unified, language‑aware taxonomy.
- Significant Accuracy Gains – Micro‑average correctness rises from 67.7 % (direct) to 78.5 % (algorithm‑based), a gain of 10.8 percentage points.
- Error‑type Reduction – Lexical/token errors disappear completely; incomplete constructs drop 72.7 %, structural/declaration issues 61.1 %, and runtime dependency/entry‑point failures 78.4 %.
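To make the intermediate-specification idea concrete, here is a hypothetical illustration (not taken from the paper) of a small Python routine paired with the kind of language-neutral description of its intent that the pipeline would produce; the authors' actual specification format may differ.

```python
# Hypothetical example: a Python source snippet and a language-neutral
# intermediate specification an LLM might emit for it before generating Java.

def count_evens(numbers):
    """Source routine to be translated."""
    total = 0
    for n in numbers:
        if n % 2 == 0:
            total += 1
    return total

# A language-agnostic description of intent (control flow, data types, I/O),
# expressed as structured text rather than target-language syntax.
INTENT_SPEC = """
ALGORITHM count_evens
  INPUT:  numbers : sequence of integers
  OUTPUT: total   : integer
  STEPS:
    1. initialize total to 0
    2. for each n in numbers:
         if n modulo 2 equals 0, increment total
    3. return total
"""
```

The specification deliberately avoids Python idioms (comprehensions, `enumerate`) so that the generation step can choose idiomatic constructs in the target language instead of transliterating syntax.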
Methodology
- Dataset Preparation – The authors selected 5 k+ code snippets from Avatar (Python↔Java) and CodeNet (various algorithmic problems) that already include unit tests.
- Two Translation Paths
- Direct: Prompt the LLM to translate the source file in a single step.
- Algorithm‑Based:
- Extract Intent – The LLM first produces a language‑neutral pseudo‑algorithm (e.g., a flow‑chart‑style description of loops, conditionals, variable types, and I/O).
- Validate Intent – Simple static checks ensure the specification is well‑formed.
- Generate Target Code – A second LLM call consumes the specification and emits the target language source.
- Automated Build & Test – Each output is compiled (Java) or type‑checked (Python), then run against the original test suite. Results are logged as: compile success/failure, runtime exception, timeout, or test pass/fail.
- Error Taxonomy – Failures are mapped to categories such as Lexical/Token, Incomplete Construct, Structural/Declaration, Runtime Dependency, and Entry‑Point.
- Metrics – Accuracy is defined strictly: a translation is correct only if it compiles, executes without error, and passes all tests. Micro‑averages across models, datasets, and directions are reported.
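The algorithm-based path above can be sketched as a short pipeline. This is a minimal illustration, assuming a hypothetical `llm(prompt)` helper that wraps whatever model API is in use (stubbed here with canned responses so the sketch is self-contained); the validation step mirrors the paper's simple well-formedness checks, not its exact rules.

```python
# Sketch of the algorithm-based translation path: extract intent, validate
# the specification, then generate target code in a second model call.

def llm(prompt: str) -> str:
    # Stand-in for a real model call; returns canned text for this sketch.
    if "Describe the algorithm" in prompt:
        return "ALGORITHM demo\n  INPUT: x : integer\n  OUTPUT: x + 1"
    return "def demo(x):\n    return x + 1"

def extract_intent(source: str) -> str:
    # Step 1: produce a language-neutral pseudo-algorithm.
    return llm(f"Describe the algorithm of this code, language-neutrally:\n{source}")

def validate_intent(spec: str) -> bool:
    # Step 2: minimal static check that the spec names inputs and outputs.
    return "INPUT" in spec and "OUTPUT" in spec

def generate_target(spec: str, target_lang: str) -> str:
    # Step 3: a second LLM call consumes the spec and emits target code.
    return llm(f"Write {target_lang} code implementing this spec:\n{spec}")

def translate(source: str, target_lang: str) -> str:
    spec = extract_intent(source)
    if not validate_intent(spec):
        raise ValueError("intermediate specification is malformed")
    return generate_target(spec, target_lang)
```

In the paper's setup this output would then be handed to the automated build-and-test stage rather than returned directly.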
Results & Findings
| Metric | Direct Translation | Algorithm‑Based Pipeline |
|---|---|---|
| Overall micro‑average accuracy | 67.7 % | 78.5 % |
| Lexical/Token errors | ~12 % | 0 % |
| Incomplete constructs | 18 % | 5 % (‑72.7 %) |
| Structural/Declaration issues | 22 % | 8.5 % (‑61.1 %) |
| Runtime dependency / entry‑point failures | 15 % | 3.2 % (‑78.4 %) |
| Average test‑suite pass rate (among compilable) | 84 % | 92 % |
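The headline numbers rest on the strict correctness definition from the Methodology section: a translation counts only if it compiles, runs without error, and passes every test, and the micro-average pools all attempts before dividing. A minimal sketch of that metric, with illustrative field names:

```python
# Strict correctness and micro-averaging, as defined in the paper's protocol.

from dataclasses import dataclass

@dataclass
class Outcome:
    compiled: bool       # compile (Java) or type-check (Python) succeeded
    ran: bool            # executed without exception or timeout
    tests_passed: int
    tests_total: int

def is_correct(o: Outcome) -> bool:
    # A translation is correct only if all three gates are cleared.
    return o.compiled and o.ran and o.tests_passed == o.tests_total

def micro_average(outcomes: list[Outcome]) -> float:
    # Pool every translation attempt into a single ratio (micro-average),
    # rather than averaging per-model accuracies (macro-average).
    return sum(is_correct(o) for o in outcomes) / len(outcomes)
```

Micro-averaging means models or datasets with more samples weigh proportionally more, which is why the paper reports it across models, datasets, and translation directions together.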
Interpretation:
- The intermediate specification acts as a contract that forces the model to reason about program semantics before emitting concrete syntax.
- Eliminating lexical/token mistakes suggests that many direct‑translation errors stem from the model’s “copy‑paste” behavior rather than genuine misunderstanding.
- The biggest win is in runtime stability, with fewer infinite loops or missing `main` methods, indicating that the pipeline helps preserve execution entry points and dependency ordering.
Practical Implications
- More Trustworthy Multilingual IDE Assistants – Integrating an intent‑extraction step can turn a “best‑effort” translator into a production‑grade tool that developers can rely on for code migration projects.
- Reduced Debugging Overhead – By cutting lexical and structural errors, developers spend less time fixing syntactic glitches and more time reviewing true semantic differences.
- Facilitates Legacy Modernization – Enterprises looking to port legacy Java services to Python (or vice‑versa) can automate bulk migration with higher confidence, accelerating cloud‑native refactors.
- Template for Other Domains – The same pipeline could be applied to SQL ↔ NoSQL translation, API client generation, or even hardware description languages, wherever preserving intent is critical.
- Better Benchmarking for LLMs – The taxonomy and strict “compile‑run‑test” metric provide a reproducible yardstick for future code‑translation research and for evaluating commercial LLM APIs.
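Reusing the taxonomy as a benchmarking yardstick requires mapping raw toolchain output onto the five categories. The sketch below is hypothetical: the pattern lists are illustrative stand-ins, not the authors' actual classification rules.

```python
# Hypothetical mapping from compiler/runtime messages to the paper's
# error taxonomy. Patterns shown are illustrative examples only.

TAXONOMY = {
    "Lexical/Token": ["illegal character", "unexpected token", "invalid syntax"],
    "Incomplete Construct": ["reached end of file while parsing", "unexpected EOF"],
    "Structural/Declaration": ["cannot find symbol", "is not defined", "duplicate class"],
    "Runtime Dependency": ["ClassNotFoundException", "ModuleNotFoundError"],
    "Entry-Point": ["Main method not found", "no module named __main__"],
}

def classify_failure(message: str) -> str:
    # Return the first taxonomy category whose patterns match the message.
    for category, patterns in TAXONOMY.items():
        if any(p.lower() in message.lower() for p in patterns):
            return category
    return "Uncategorized"
```

A classifier like this is what makes the per-category reduction figures in the results table reproducible across toolchains and languages.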
Limitations & Future Work
- Scope of Languages – The study focuses on Python ↔ Java; extending to languages with markedly different paradigms (e.g., Rust, JavaScript, or functional languages) may surface new challenges.
- Specification Expressiveness – The intermediate pseudo‑algorithm is still textual; richer representations (ASTs, graph‑based models) could capture more nuanced semantics.
- Model Dependency – Gains vary across LLMs; smaller or less‑instruction‑tuned models may not benefit as much from the two‑step prompting.
- Performance Overhead – The pipeline doubles the number of LLM calls, increasing latency and cost—optimizations (e.g., caching intent or using a lighter “extractor” model) are needed for real‑time IDE use.
- Human‑in‑the‑Loop Validation – The current workflow is fully automated; incorporating developer review of the intermediate specification could further improve correctness, especially for ambiguous code.
Overall, the paper demonstrates that a modest algorithmic scaffold can turn LLM‑driven code translation from a risky experiment into a dependable engineering tool.
Authors
- Shahriar Rumi Dipto
- Saikat Mondal
- Chanchal K. Roy
Paper Information
- arXiv ID: 2602.16106v1
- Categories: cs.SE
- Published: February 18, 2026