[Paper] Algorithm-Based Pipeline for Reliable and Intent-Preserving Code Translation with LLMs
Source: arXiv - 2602.16106v1
Overview
The paper presents an algorithm‑based pipeline that dramatically improves the reliability of code translation performed by Large Language Models (LLMs). By inserting a language‑neutral intermediate specification before generating the target code, the authors show that translations between Python and Java become far more likely to compile, run, and pass the original test suite.
Key Contributions
- Intermediate Specification Layer – Introduces a language‑agnostic, algorithmic description of program intent (control flow, data types, I/O) that guides the final code generation step.
- Comprehensive Empirical Study – Evaluates five popular LLMs (e.g., GPT‑4, Claude, Llama 2) on two large benchmark suites (Avatar and CodeNet) in a paired “direct vs. algorithm‑based” experiment.
- Robust Evaluation Protocol – Every translated snippet is compiled, executed, and tested; failures are categorized using a unified, language‑aware taxonomy.
- Significant Accuracy Gains – Micro‑average correctness rises from 67.7 % (direct) to 78.5 % (algorithm‑based), a gain of 10.8 percentage points.
- Error‑type Reduction – Lexical/token errors disappear completely; incomplete constructs drop 72.7 %, structural/declaration issues 61.1 %, and runtime dependency/entry‑point failures 78.4 %.
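To make the intermediate-specification idea concrete, here is a hypothetical illustration (not taken from the paper) of a small Python routine paired with the kind of language-neutral description of its intent that the pipeline would produce; the authors' actual specification format may differ.

```python
# Hypothetical example: a Python source snippet and a language-neutral
# intermediate specification an LLM might emit for it before generating Java.

def count_evens(numbers):
    """Source routine to be translated."""
    total = 0
    for n in numbers:
        if n % 2 == 0:
            total += 1
    return total

# A language-agnostic description of intent (control flow, data types, I/O),
# expressed as structured text rather than target-language syntax.
INTENT_SPEC = """
ALGORITHM count_evens
  INPUT:  numbers : sequence of integers
  OUTPUT: total   : integer
  STEPS:
    1. initialize total to 0
    2. for each n in numbers:
         if n modulo 2 equals 0, increment total
    3. return total
"""
```

The specification deliberately avoids Python idioms (comprehensions, `enumerate`) so that the generation step can choose idiomatic constructs in the target language instead of transliterating syntax.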
Methodology
- Dataset Preparation – The authors selected 5 k+ code snippets from Avatar (Python↔Java) and CodeNet (various algorithmic problems) that already include unit tests.
- Two Translation Paths
- Direct: Prompt the LLM to translate the source file in a single step.
- Algorithm‑Based:
- Extract Intent – The LLM first produces a language‑neutral pseudo‑algorithm (e.g., a flow‑chart‑style description of loops, conditionals, variable types, and I/O).
- Validate Intent – Simple static checks ensure the specification is well‑formed.
- Generate Target Code – A second LLM call consumes the specification and emits the target language source.
- Automated Build & Test – Each output is compiled (Java) or type‑checked (Python), then run against the original test suite. Results are logged as: compile success/failure, runtime exception, timeout, or test pass/fail.
- Error Taxonomy – Failures are mapped to categories such as Lexical/Token, Incomplete Construct, Structural/Declaration, Runtime Dependency, and Entry‑Point.
- Metrics – Accuracy is defined strictly: a translation is correct only if it compiles, executes without error, and passes all tests. Micro‑averages across models, datasets, and directions are reported.
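The algorithm-based path above can be sketched as a short pipeline. This is a minimal illustration, assuming a hypothetical `llm(prompt)` helper that wraps whatever model API is in use (stubbed here with canned responses so the sketch is self-contained); the validation step mirrors the paper's simple well-formedness checks, not its exact rules.

```python
# Sketch of the algorithm-based translation path: extract intent, validate
# the specification, then generate target code in a second model call.

def llm(prompt: str) -> str:
    # Stand-in for a real model call; returns canned text for this sketch.
    if "Describe the algorithm" in prompt:
        return "ALGORITHM demo\n  INPUT: x : integer\n  OUTPUT: x + 1"
    return "def demo(x):\n    return x + 1"

def extract_intent(source: str) -> str:
    # Step 1: produce a language-neutral pseudo-algorithm.
    return llm(f"Describe the algorithm of this code, language-neutrally:\n{source}")

def validate_intent(spec: str) -> bool:
    # Step 2: minimal static check that the spec names inputs and outputs.
    return "INPUT" in spec and "OUTPUT" in spec

def generate_target(spec: str, target_lang: str) -> str:
    # Step 3: a second LLM call consumes the spec and emits target code.
    return llm(f"Write {target_lang} code implementing this spec:\n{spec}")

def translate(source: str, target_lang: str) -> str:
    spec = extract_intent(source)
    if not validate_intent(spec):
        raise ValueError("intermediate specification is malformed")
    return generate_target(spec, target_lang)
```

In the paper's setup this output would then be handed to the automated build-and-test stage rather than returned directly.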
Results & Findings
| Metric | Direct Translation | Algorithm‑Based Pipeline |
|---|---|---|
| Overall micro‑average accuracy | 67.7 % | 78.5 % |
| Lexical/Token errors | ~12 % | 0 % |
| Incomplete constructs | 18 % | 5 % (‑72.7 %) |
| Structural/Declaration issues | 22 % | 8.5 % (‑61.1 %) |
| Runtime dependency / entry‑point failures | 15 % | 3.2 % (‑78.4 %) |
| Average test‑suite pass rate (among compilable) | 84 % | 92 % |
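The headline numbers rest on the strict correctness definition from the Methodology section: a translation counts only if it compiles, runs without error, and passes every test, and the micro-average pools all attempts before dividing. A minimal sketch of that metric, with illustrative field names:

```python
# Strict correctness and micro-averaging, as defined in the paper's protocol.

from dataclasses import dataclass

@dataclass
class Outcome:
    compiled: bool       # compile (Java) or type-check (Python) succeeded
    ran: bool            # executed without exception or timeout
    tests_passed: int
    tests_total: int

def is_correct(o: Outcome) -> bool:
    # A translation is correct only if all three gates are cleared.
    return o.compiled and o.ran and o.tests_passed == o.tests_total

def micro_average(outcomes: list[Outcome]) -> float:
    # Pool every translation attempt into a single ratio (micro-average),
    # rather than averaging per-model accuracies (macro-average).
    return sum(is_correct(o) for o in outcomes) / len(outcomes)
```

Micro-averaging means models or datasets with more samples weigh proportionally more, which is why the paper reports it across models, datasets, and translation directions together.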
Interpretation:
- The intermediate specification acts as a contract that forces the model to reason about program semantics before emitting concrete syntax.
- Eliminating lexical/token mistakes suggests that many direct‑translation errors stem from the model’s “copy‑paste” behavior rather than genuine misunderstanding.
- The biggest win is in runtime stability, with fewer infinite loops or missing `main` methods, indicating that the pipeline helps preserve execution entry points and dependency ordering.
Practical Implications
- More Trustworthy Multilingual IDE Assistants – Integrating an intent‑extraction step can turn a “best‑effort” translator into a production‑grade tool that developers can rely on for code migration projects.
- Reduced Debugging Overhead – By cutting lexical and structural errors, developers spend less time fixing syntactic glitches and more time reviewing true semantic differences.
- Facilitates Legacy Modernization – Enterprises looking to port legacy Java services to Python (or vice‑versa) can automate bulk migration with higher confidence, accelerating cloud‑native refactors.
- Template for Other Domains – The same pipeline could be applied to SQL ↔ NoSQL translation, API client generation, or even hardware description languages, wherever preserving intent is critical.
- Better Benchmarking for LLMs – The taxonomy and strict “compile‑run‑test” metric provide a reproducible yardstick for future code‑translation research and for evaluating commercial LLM APIs.
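Reusing the taxonomy as a benchmarking yardstick requires mapping raw toolchain output onto the five categories. The sketch below is hypothetical: the pattern lists are illustrative stand-ins, not the authors' actual classification rules.

```python
# Hypothetical mapping from compiler/runtime messages to the paper's
# error taxonomy. Patterns shown are illustrative examples only.

TAXONOMY = {
    "Lexical/Token": ["illegal character", "unexpected token", "invalid syntax"],
    "Incomplete Construct": ["reached end of file while parsing", "unexpected EOF"],
    "Structural/Declaration": ["cannot find symbol", "is not defined", "duplicate class"],
    "Runtime Dependency": ["ClassNotFoundException", "ModuleNotFoundError"],
    "Entry-Point": ["Main method not found", "no module named __main__"],
}

def classify_failure(message: str) -> str:
    # Return the first taxonomy category whose patterns match the message.
    for category, patterns in TAXONOMY.items():
        if any(p.lower() in message.lower() for p in patterns):
            return category
    return "Uncategorized"
```

A classifier like this is what makes the per-category reduction figures in the results table reproducible across toolchains and languages.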
Limitations & Future Work
- Scope of Languages – The study focuses on Python ↔ Java; extending to languages with markedly different paradigms (e.g., Rust, JavaScript, or functional languages) may surface new challenges.
- Specification Expressiveness – The intermediate pseudo‑algorithm is still textual; richer representations (ASTs, graph‑based models) could capture more nuanced semantics.
- Model Dependency – Gains vary across LLMs; smaller or less‑instruction‑tuned models may not benefit as much from the two‑step prompting.
- Performance Overhead – The pipeline doubles the number of LLM calls, increasing latency and cost—optimizations (e.g., caching intent or using a lighter “extractor” model) are needed for real‑time IDE use.
- Human‑in‑the‑Loop Validation – The current workflow is fully automated; incorporating developer review of the intermediate specification could further improve correctness, especially for ambiguous code.
Overall, the paper demonstrates that a modest algorithmic scaffold can turn LLM‑driven code translation from a risky experiment into a dependable engineering tool.
Authors
- Shahriar Rumi Dipto
- Saikat Mondal
- Chanchal K. Roy
Paper Information
- arXiv ID: 2602.16106v1
- Categories: cs.SE
- Published: February 18, 2026