[Paper] REMODEL-LLM: Transforming C code to Java using LLMs
Source: arXiv - 2512.11402v1
Overview
The paper REMODEL‑LLM evaluates how well compact, quantized large language models (LLMs) can automatically translate C programs into Java. By combining AST‑based code analysis with tightly‑controlled prompting, the authors expose a stark performance hierarchy among 19 sub‑20‑billion‑parameter models, highlighting both the promise and the current limits of using lightweight LLMs for cross‑language code migration.
Key Contributions
- Comprehensive benchmark of 19 quantized LLMs (≤ 20 B parameters) on a curated C‑to‑Java translation suite.
- Hybrid translation pipeline that first decomposes C source into an Abstract Syntax Tree (AST) for semantic grounding, then feeds a rule‑based prompt to the LLM for code generation.
- Tiered performance taxonomy (Tier 1, 2, 3) that clearly separates models capable of producing runnable Java from those that cannot.
- Empirical insight into specific C constructs (function pointers, `sizeof`, enums) that consistently break current quantized models.
- Open‑source artifacts (datasets, prompts, evaluation scripts) to enable reproducibility and further research.
Methodology
- Dataset construction – A set of 200 short C snippets covering a wide range of language features (control flow, memory management, data structures) was assembled, each paired with a hand‑written reference Java translation.
- AST extraction – For every C snippet, the authors generated an AST using the Clang front‑end. The AST served two purposes:
- It provided a language‑agnostic, structural representation that the prompt could reference.
- It allowed automatic detection of unsupported constructs (e.g., pointer arithmetic) before invoking the model.
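The pre-screening step above can be sketched as a simple scan of Clang's textual AST dump (`clang -Xclang -ast-dump -fsyntax-only file.c`). The node-kind markers below are real Clang AST node names, but this pre-check is an illustrative assumption, not the authors' actual implementation:

```java
import java.util.ArrayList;
import java.util.List;

public class AstGuard {
    // Maps a Clang AST-dump marker to a human-readable construct name.
    // UnaryExprOrTypeTraitExpr is Clang's node for sizeof/alignof;
    // "(*)(" appears in function-pointer type spellings.
    static final String[][] MARKERS = {
        {"UnaryExprOrTypeTraitExpr", "sizeof/alignof expression"},
        {"(*)(", "function-pointer type"},
        {"EnumDecl", "enum declaration"},
    };

    // Returns the problematic constructs found in an AST-dump string.
    static List<String> flagUnsupported(String astDump) {
        List<String> hits = new ArrayList<>();
        for (String[] m : MARKERS) {
            if (astDump.contains(m[0])) hits.add(m[1]);
        }
        return hits;
    }

    public static void main(String[] args) {
        // Fragment resembling clang's -ast-dump output for a C snippet.
        String dump = "`-VarDecl ... cmp 'int (*)(int, int)'\n"
                    + "`-UnaryExprOrTypeTraitExpr 'unsigned long' sizeof 'int'";
        System.out.println(flagUnsupported(dump));
        // prints [sizeof/alignof expression, function-pointer type]
    }
}
```

A real pipeline would walk Clang's JSON AST rather than grep its text dump, but the gate is the same: flag the snippet before spending an LLM call on it.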
- Prompt design – A highly constrained, rule‑based prompt was crafted:
- “You are a code translator. Use the following AST nodes to produce equivalent Java code. Do not add any imports beyond `java.util.*`. Do not use unsafe casts.”
- The prompt also included a short “translation checklist” to steer the model toward correct memory‑model handling (e.g., replace manual `malloc` with `new`).
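The checklist's `malloc`‑to‑`new` rule amounts to the kind of rewrite below (a hand‑written target example, not model output): the C allocation `int *buf = malloc(n * sizeof(int));` collapses into a typed Java array.

```java
public class MallocToNew {
    // C original:
    //   int *buf = malloc(n * sizeof(int));
    //   for (int i = 0; i < n; i++) buf[i] = i * i;
    //   ... free(buf);
    static int[] squares(int n) {
        int[] buf = new int[n];              // malloc + sizeof become one typed allocation
        for (int i = 0; i < n; i++) buf[i] = i * i;
        return buf;                          // no free(): the GC reclaims the array
    }

    public static void main(String[] args) {
        System.out.println(squares(10)[3]);  // prints 9
    }
}
```

Note that `sizeof` disappears entirely on the Java side, which is one reason literal‑minded translations of it (see the failure patterns below) go wrong.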
- Model inference – Each of the 19 quantized LLMs was run in a zero‑shot setting (no fine‑tuning) on the same prompt‑AST pair. The generated Java code was then compiled and executed against a suite of unit tests derived from the original C behavior.
- Evaluation metrics –
- Compilation success (runnable vs. non‑runnable).
- Functional correctness (pass rate of unit tests).
- Semantic fidelity (manual inspection for subtle logic errors).
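The compilation‑success metric can be checked entirely in‑process with the standard `javax.tools` API; the harness below is a minimal sketch of that check, an assumption about the paper's setup rather than its actual evaluation script:

```java
import javax.tools.JavaCompiler;
import javax.tools.ToolProvider;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

public class CompileCheck {
    // Returns true when `source` compiles cleanly as class `className`.
    static boolean compiles(String className, String source) throws IOException {
        Path dir = Files.createTempDirectory("remodel");
        Path file = dir.resolve(className + ".java");
        Files.writeString(file, source);
        JavaCompiler javac = ToolProvider.getSystemJavaCompiler();
        // run() returns 0 on success; diagnostics go to stderr.
        return javac.run(null, null, null, file.toString()) == 0;
    }

    public static void main(String[] args) throws IOException {
        System.out.println(compiles("Ok", "public class Ok {}"));            // true
        System.out.println(compiles("Bad", "public class Bad { int x = }")); // false
    }
}
```

Functional correctness then only needs to run on the translations that survive this gate, which is exactly the runnable/non‑runnable split the tier table below reports.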
Results & Findings
| Tier | Representative Models | Compilation Success | Functional Pass Rate |
|---|---|---|---|
| Tier 3 | llama3.1, gemma3, starcoder2 | 0 % (none compiled) | 0 % |
| Tier 2 | mistral‑nemo, mistral | ~70 % compiled | 10‑20 % passed tests (many semantic bugs) |
| Tier 1 | phi4, deepseek‑coder‑v2, codeqwen | > 90 % compiled | 50‑65 % passed tests |
- Key failure patterns – Even Tier 1 models stumble on:
- Function pointers (cannot map to Java interfaces or lambdas).
- `sizeof` expressions (mis‑interpreted as literal numbers).
- Enum‑based switch logic (produces incorrect case handling).
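For the function‑pointer case, the Java idiom a correct translation needs is a functional interface plus lambdas or method references. The hand‑written target below illustrates the mapping the models fail to produce:

```java
import java.util.function.IntBinaryOperator;

public class FnPtr {
    // C: int apply(int (*op)(int, int), int a, int b) { return op(a, b); }
    static int apply(IntBinaryOperator op, int a, int b) {
        return op.applyAsInt(a, b);
    }

    public static void main(String[] args) {
        // The C call apply(add, 2, 3) becomes a lambda or method reference:
        System.out.println(apply((x, y) -> x + y, 2, 3)); // prints 5
        System.out.println(apply(Math::max, 2, 3));       // prints 3
    }
}
```

The difficulty is that nothing syntactically similar exists in C: the model must invent the interface type, which is a multi‑step inference rather than a token‑level substitution.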
- Reasoning ceiling – The authors observe that quantization (int8/int4) reduces the model’s capacity for multi‑step logical reasoning, which is essential for correctly handling pointer aliasing and low‑level memory calculations.
Practical Implications
- Legacy code migration – For straightforward, boilerplate‑heavy C modules (e.g., simple I/O wrappers), Tier 1 quantized models can already generate usable Java code, potentially cutting migration effort by 30‑40 %.
- Tooling integration – The AST‑plus‑prompt pipeline can be wrapped into IDE plugins or CI‑based migration assistants, offering developers an “auto‑suggest” mode that falls back to manual review when the model flags unsupported constructs.
- Cost‑effective deployment – Because the evaluated models run comfortably on a single GPU with low memory footprints, organizations can run translation services in‑house without the expense of full‑scale LLM APIs.
- Safety nets – The clear tiered performance suggests a practical workflow: run the model, automatically compile, and only promote translations that pass unit tests, thereby preventing the introduction of subtle bugs.
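The safety‑net workflow reduces to a simple promotion predicate: accept a translation only if it compiled and its entire derived test suite passed. The `Translation` record and its fields below are hypothetical stand‑ins, not types from the paper's artifacts:

```java
import java.util.List;

public class PromotionGate {
    // Hypothetical result of one model run: generated source, whether it
    // compiled, and the pass/fail outcome of each derived unit test.
    record Translation(String javaSource, boolean compiled, List<Boolean> testResults) {}

    // Promote only runnable translations whose whole test suite is green.
    static boolean promote(Translation t) {
        return t.compiled()
                && !t.testResults().isEmpty()
                && t.testResults().stream().allMatch(Boolean::booleanValue);
    }

    public static void main(String[] args) {
        Translation ok    = new Translation("class A {}", true, List.of(true, true));
        Translation buggy = new Translation("class B {}", true, List.of(true, false));
        System.out.println(promote(ok));    // prints true
        System.out.println(promote(buggy)); // prints false: semantic bug caught by tests
    }
}
```

Rejected translations fall back to manual review, so the model only ever accelerates the easy cases rather than silently corrupting the hard ones.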
Limitations & Future Work
- Scope of benchmark – The test suite focuses on small, self‑contained snippets; large, multi‑file projects with complex build systems were not evaluated.
- Quantization impact – The study does not compare against full‑precision counterparts, leaving open the question of how much performance is lost due to quantization alone.
- Prompt rigidity – The rule‑based prompt, while effective for consistency, may limit the model’s ability to generate idiomatic Java (e.g., using streams or generics).
- Future directions suggested by the authors include:
- Fine‑tuning or LoRA adapters on domain‑specific C‑to‑Java corpora.
- Extending the AST‑prompt framework to support incremental translation of larger codebases.
- Investigating hybrid approaches that combine LLM output with traditional rule‑based transpilers for the hardest constructs (function pointers, low‑level memory arithmetic).
Authors
- Aryan Gupta
- Y. Raghu Reddy
Paper Information
- arXiv ID: 2512.11402v1
- Categories: cs.SE, cs.AI
- Published: December 12, 2025