[Paper] REMODEL-LLM: Transforming C code to Java using LLMs
Source: arXiv - 2512.11402v1
Overview
The paper REMODEL‑LLM evaluates how well compact, quantized large language models (LLMs) can automatically translate C programs into Java. By combining AST‑based code analysis with tightly‑controlled prompting, the authors expose a stark performance hierarchy among 19 sub‑20‑billion‑parameter models, highlighting both the promise and the current limits of using lightweight LLMs for cross‑language code migration.
Key Contributions
- Comprehensive benchmark of 19 quantized LLMs (≤ 20 B parameters) on a curated C‑to‑Java translation suite.
- Hybrid translation pipeline that first decomposes C source into an Abstract Syntax Tree (AST) for semantic grounding, then feeds a rule‑based prompt to the LLM for code generation.
- Tiered performance taxonomy (Tier 1, 2, 3) that clearly separates models capable of producing runnable Java from those that cannot.
- Empirical insight into specific C constructs (function pointers, `sizeof`, enums) that consistently break current quantized models.
- Open‑source artifacts (datasets, prompts, evaluation scripts) to enable reproducibility and further research.
Methodology
- Dataset construction – A set of 200 short C snippets covering a wide range of language features (control flow, memory management, data structures) was assembled, each paired with a hand‑written reference Java translation.
- AST extraction – For every C snippet, the authors generated an AST using the Clang front‑end. The AST served two purposes:
- It provided a language‑agnostic, structural representation that the prompt could reference.
- It allowed automatic detection of unsupported constructs (e.g., pointer arithmetic) before invoking the model.
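The pre-screening step above can be sketched as a simple scan of Clang's textual AST dump (`clang -Xclang -ast-dump -fsyntax-only file.c`). The node-kind markers below are real Clang AST node names, but this pre-check is an illustrative assumption, not the authors' actual implementation:

```java
import java.util.ArrayList;
import java.util.List;

public class AstGuard {
    // Maps a Clang AST-dump marker to a human-readable construct name.
    // UnaryExprOrTypeTraitExpr is Clang's node for sizeof/alignof;
    // "(*)(" appears in function-pointer type spellings.
    static final String[][] MARKERS = {
        {"UnaryExprOrTypeTraitExpr", "sizeof/alignof expression"},
        {"(*)(", "function-pointer type"},
        {"EnumDecl", "enum declaration"},
    };

    // Returns the problematic constructs found in an AST-dump string.
    static List<String> flagUnsupported(String astDump) {
        List<String> hits = new ArrayList<>();
        for (String[] m : MARKERS) {
            if (astDump.contains(m[0])) hits.add(m[1]);
        }
        return hits;
    }

    public static void main(String[] args) {
        // Fragment resembling clang's -ast-dump output for a C snippet.
        String dump = "`-VarDecl ... cmp 'int (*)(int, int)'\n"
                    + "`-UnaryExprOrTypeTraitExpr 'unsigned long' sizeof 'int'";
        System.out.println(flagUnsupported(dump));
        // prints [sizeof/alignof expression, function-pointer type]
    }
}
```

A real pipeline would walk Clang's JSON AST rather than grep its text dump, but the gate is the same: flag the snippet before spending an LLM call on it.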
- Prompt design – A highly constrained, rule‑based prompt was crafted:
- “You are a code translator. Use the following AST nodes to produce equivalent Java code. Do not add any imports beyond `java.util.*`. Do not use unsafe casts.”
- The prompt also included a short “translation checklist” to steer the model toward correct memory‑model handling (e.g., replace manual `malloc` with `new`).
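The checklist's `malloc`‑to‑`new` rule amounts to the kind of rewrite below (a hand‑written target example, not model output): the C allocation `int *buf = malloc(n * sizeof(int));` collapses into a typed Java array.

```java
public class MallocToNew {
    // C original:
    //   int *buf = malloc(n * sizeof(int));
    //   for (int i = 0; i < n; i++) buf[i] = i * i;
    //   ... free(buf);
    static int[] squares(int n) {
        int[] buf = new int[n];              // malloc + sizeof become one typed allocation
        for (int i = 0; i < n; i++) buf[i] = i * i;
        return buf;                          // no free(): the GC reclaims the array
    }

    public static void main(String[] args) {
        System.out.println(squares(10)[3]);  // prints 9
    }
}
```

Note that `sizeof` disappears entirely on the Java side, which is one reason literal‑minded translations of it (see the failure patterns below) go wrong.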
- Model inference – Each of the 19 quantized LLMs was run in a zero‑shot setting (no fine‑tuning) on the same prompt‑AST pair. The generated Java code was then compiled and executed against a suite of unit tests derived from the original C behavior.
- Evaluation metrics –
- Compilation success (runnable vs. non‑runnable).
- Functional correctness (pass rate of unit tests).
- Semantic fidelity (manual inspection for subtle logic errors).
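The compilation‑success metric can be checked entirely in‑process with the standard `javax.tools` API; the harness below is a minimal sketch of that check, an assumption about the paper's setup rather than its actual evaluation script:

```java
import javax.tools.JavaCompiler;
import javax.tools.ToolProvider;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

public class CompileCheck {
    // Returns true when `source` compiles cleanly as class `className`.
    static boolean compiles(String className, String source) throws IOException {
        Path dir = Files.createTempDirectory("remodel");
        Path file = dir.resolve(className + ".java");
        Files.writeString(file, source);
        JavaCompiler javac = ToolProvider.getSystemJavaCompiler();
        // run() returns 0 on success; diagnostics go to stderr.
        return javac.run(null, null, null, file.toString()) == 0;
    }

    public static void main(String[] args) throws IOException {
        System.out.println(compiles("Ok", "public class Ok {}"));            // true
        System.out.println(compiles("Bad", "public class Bad { int x = }")); // false
    }
}
```

Functional correctness then only needs to run on the translations that survive this gate, which is exactly the runnable/non‑runnable split the tier table below reports.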
Results & Findings
| Tier | Representative Models | Compilation Success | Functional Pass Rate |
|---|---|---|---|
| Tier 3 | llama3.1, gemma3, starcoder2 | 0 % (none compiled) | 0 % |
| Tier 2 | mistral‑nemo, mistral | ~70 % compiled | 10‑20 % passed tests (many semantic bugs) |
| Tier 1 | phi4, deepseek‑coder‑v2, codeqwen | > 90 % compiled | 50‑65 % passed tests |
- Key failure patterns – Even Tier 1 models stumble on:
- Function pointers (cannot map to Java interfaces or lambdas).
- `sizeof` expressions (mis‑interpreted as literal numbers).
- Enum‑based switch logic (produces incorrect case handling).
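For the function‑pointer case, the Java idiom a correct translation needs is a functional interface plus lambdas or method references. The hand‑written target below illustrates the mapping the models fail to produce:

```java
import java.util.function.IntBinaryOperator;

public class FnPtr {
    // C: int apply(int (*op)(int, int), int a, int b) { return op(a, b); }
    static int apply(IntBinaryOperator op, int a, int b) {
        return op.applyAsInt(a, b);
    }

    public static void main(String[] args) {
        // The C call apply(add, 2, 3) becomes a lambda or method reference:
        System.out.println(apply((x, y) -> x + y, 2, 3)); // prints 5
        System.out.println(apply(Math::max, 2, 3));       // prints 3
    }
}
```

The difficulty is that nothing syntactically similar exists in C: the model must invent the interface type, which is a multi‑step inference rather than a token‑level substitution.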
- Reasoning ceiling – The authors observe that quantization (int8/int4) reduces the model’s capacity for multi‑step logical reasoning, which is essential for correctly handling pointer aliasing and low‑level memory calculations.
Practical Implications
- Legacy code migration – For straightforward, boilerplate‑heavy C modules (e.g., simple I/O wrappers), Tier 1 quantized models can already generate usable Java code, potentially cutting migration effort by 30‑40 %.
- Tooling integration – The AST‑plus‑prompt pipeline can be wrapped into IDE plugins or CI‑based migration assistants, offering developers an “auto‑suggest” mode that falls back to manual review when the model flags unsupported constructs.
- Cost‑effective deployment – Because the evaluated models run comfortably on a single GPU with low memory footprints, organizations can run translation services in‑house without the expense of full‑scale LLM APIs.
- Safety nets – The clear tiered performance suggests a practical workflow: run the model, automatically compile, and only promote translations that pass unit tests, thereby preventing the introduction of subtle bugs.
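The safety‑net workflow reduces to a simple promotion predicate: accept a translation only if it compiled and its entire derived test suite passed. The `Translation` record and its fields below are hypothetical stand‑ins, not types from the paper's artifacts:

```java
import java.util.List;

public class PromotionGate {
    // Hypothetical result of one model run: generated source, whether it
    // compiled, and the pass/fail outcome of each derived unit test.
    record Translation(String javaSource, boolean compiled, List<Boolean> testResults) {}

    // Promote only runnable translations whose whole test suite is green.
    static boolean promote(Translation t) {
        return t.compiled()
                && !t.testResults().isEmpty()
                && t.testResults().stream().allMatch(Boolean::booleanValue);
    }

    public static void main(String[] args) {
        Translation ok    = new Translation("class A {}", true, List.of(true, true));
        Translation buggy = new Translation("class B {}", true, List.of(true, false));
        System.out.println(promote(ok));    // prints true
        System.out.println(promote(buggy)); // prints false: semantic bug caught by tests
    }
}
```

Rejected translations fall back to manual review, so the model only ever accelerates the easy cases rather than silently corrupting the hard ones.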
Limitations & Future Work
- Scope of benchmark – The test suite focuses on small, self‑contained snippets; large, multi‑file projects with complex build systems were not evaluated.
- Quantization impact – The study does not compare against full‑precision counterparts, leaving open the question of how much performance is lost due to quantization alone.
- Prompt rigidity – The rule‑based prompt, while effective for consistency, may limit the model’s ability to generate idiomatic Java (e.g., using streams or generics).
- Future directions suggested by the authors include:
- Fine‑tuning or LoRA adapters on domain‑specific C‑to‑Java corpora.
- Extending the AST‑prompt framework to support incremental translation of larger codebases.
- Investigating hybrid approaches that combine LLM output with traditional rule‑based transpilers for the hardest constructs (function pointers, low‑level memory arithmetic).
Authors
- Aryan Gupta
- Y. Raghu Reddy
Paper Information
- arXiv ID: 2512.11402v1
- Categories: cs.SE, cs.AI
- Published: December 12, 2025