[Paper] REMODEL-LLM: Transforming C code to Java using LLMs

Published: December 12, 2025 at 04:25 AM EST
4 min read
Source: arXiv - 2512.11402v1

Overview

The paper REMODEL‑LLM evaluates how well compact, quantized large language models (LLMs) can automatically translate C programs into Java. By combining AST‑based code analysis with tightly controlled prompting, the authors expose a stark performance hierarchy among 19 sub‑20‑billion‑parameter models, highlighting both the promise and the current limits of using lightweight LLMs for cross‑language code migration.

Key Contributions

  • Comprehensive benchmark of 19 quantized LLMs (≤ 20 B parameters) on a curated C‑to‑Java translation suite.
  • Hybrid translation pipeline that first decomposes C source into an Abstract Syntax Tree (AST) for semantic grounding, then feeds a rule‑based prompt to the LLM for code generation.
  • Tiered performance taxonomy (Tier 1, 2, 3) that cleanly separates models that can produce runnable Java from those that cannot.
  • Empirical insight into specific C constructs (function pointers, sizeof, enums) that consistently break current quantized models.
  • Open‑source artifacts (datasets, prompts, evaluation scripts) to enable reproducibility and further research.

Methodology

  1. Dataset construction – A set of 200 short C snippets covering a wide range of language features (control flow, memory management, data structures) was assembled, each paired with a hand‑written reference Java translation.
  2. AST extraction – For every C snippet, the authors generated an AST using the Clang front‑end. The AST served two purposes:
    • It provided a language‑agnostic, structural representation that the prompt could reference.
    • It allowed automatic detection of unsupported constructs (e.g., pointer arithmetic) before invoking the model.
  3. Prompt design – A highly constrained, rule‑based prompt was crafted (see the pipeline sketch after this list):
    • “You are a code translator. Use the following AST nodes to produce equivalent Java code. Do not add any imports beyond java.util.*. Do not use unsafe casts.”
    • The prompt also included a short “translation checklist” to steer the model toward correct memory‑model handling (e.g., replace manual malloc with new).
  4. Model inference – Each of the 19 quantized LLMs was run in a zero‑shot setting (no fine‑tuning) on the same prompt‑AST pair. The generated Java code was then compiled and executed against a suite of unit tests derived from the original C behavior.
  5. Evaluation metrics
    • Compilation success (runnable vs. non‑runnable).
    • Functional correctness (pass rate of unit tests).
    • Semantic fidelity (manual inspection for subtle logic errors).
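
Putting steps 2 and 3 together, the sketch below shells out to Clang for a JSON AST dump and wraps it in the rule‑based prompt quoted above. This is an illustration, not the authors' code: the class and method names are invented here, and the clang flags assume Clang 9 or newer (which added -ast-dump=json); the paper only states that the Clang front‑end was used.

```java
import java.io.IOException;
import java.nio.file.Path;

// Illustrative sketch of the AST-plus-prompt pipeline (steps 2 and 3).
public class TranslationPrompter {

    // Step 2: dump the Clang AST for a C snippet as JSON.
    // Assumes Clang >= 9 is on the PATH.
    static String dumpAst(Path cFile) throws IOException, InterruptedException {
        Process p = new ProcessBuilder(
                "clang", "-Xclang", "-ast-dump=json", "-fsyntax-only",
                cFile.toString())
                .redirectErrorStream(true)
                .start();
        String json = new String(p.getInputStream().readAllBytes());
        p.waitFor();
        return json;
    }

    // Step 3: wrap the AST in the rule-based prompt quoted in the paper.
    // The checklist line shown here is the one example the paper gives;
    // the full checklist is not reproduced in this summary.
    static String buildPrompt(String astJson) {
        return String.join("\n",
                "You are a code translator. Use the following AST nodes to",
                "produce equivalent Java code. Do not add any imports beyond",
                "java.util.*. Do not use unsafe casts.",
                "Translation checklist: replace manual malloc with new.",
                "AST:",
                astJson);
    }

    public static void main(String[] args) throws Exception {
        String ast = dumpAst(Path.of(args[0]));
        System.out.println(buildPrompt(ast));
    }
}
```

The model call itself (step 4) is omitted: the resulting prompt string would be sent unchanged to each of the 19 quantized models in a zero‑shot setting.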

Results & Findings

Each tier below lists representative models, compilation success, and functional pass rate:

  • Tier 3 – llama3.1, gemma3, starcoder2 – 0 % (none compiled) – 0 %
  • Tier 2 – mistral‑nemo, mistral – ~70 % compiled – 10‑20 % passed tests (many semantic bugs)
  • Tier 1 – phi4, deepseek‑coder‑v2, codeqwen – > 90 % compiled – 50‑65 % passed tests
  • Key failure patterns – Even Tier 1 models stumble on:
    • Function pointers – models fail to map them to Java functional interfaces or lambdas (see the sketch after this list).
    • sizeof expressions (mis‑interpreted as literal numbers).
    • Enum‑based switch logic (produces incorrect case handling).
  • Reasoning ceiling – The authors observe that quantization (int8/int4) reduces the model’s capacity for multi‑step logical reasoning, which is essential for correctly handling pointer aliasing and low‑level memory calculations.
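
The function‑pointer failure is easiest to see against the expected target: idiomatic Java replaces a C function pointer with a functional interface and a lambda at the call site. The C fragment in the comments below is constructed for this post, not taken from the paper's benchmark.

```java
import java.util.function.IntBinaryOperator;

// C original:
//   int apply(int (*op)(int, int), int a, int b) { return op(a, b); }
//   int add(int a, int b) { return a + b; }
//   ... apply(add, 2, 3) ...
public class FunctionPointerExample {

    // The C function-pointer parameter becomes a functional interface.
    static int apply(IntBinaryOperator op, int a, int b) {
        return op.applyAsInt(a, b);
    }

    public static void main(String[] args) {
        // The C call site apply(add, 2, 3) becomes a lambda argument.
        System.out.println(apply((x, y) -> x + y, 2, 3)); // prints 5
    }
}
```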

Practical Implications

  • Legacy code migration – For straightforward, boilerplate‑heavy C modules (e.g., simple I/O wrappers), Tier 1 quantized models can already generate usable Java code, potentially cutting migration effort by 30‑40 %.
  • Tooling integration – The AST‑plus‑prompt pipeline can be wrapped into IDE plugins or CI‑based migration assistants, offering developers an “auto‑suggest” mode that falls back to manual review when the model flags unsupported constructs.
  • Cost‑effective deployment – Because the evaluated models run comfortably on a single GPU with low memory footprints, organizations can run translation services in‑house without the expense of full‑scale LLM APIs.
  • Safety nets – The clear tiered performance suggests a practical workflow: run the model, automatically compile, and only promote translations that pass unit tests, thereby preventing the introduction of subtle bugs. A minimal sketch of the compile gate follows.
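
Below is a minimal sketch of the compile‑before‑promote gate, written against the standard javax.tools API. The class name and command‑line shape are illustrative, not the authors' published evaluation script, and the unit‑test stage is left as a stub.

```java
import javax.tools.JavaCompiler;
import javax.tools.ToolProvider;
import java.nio.file.Path;

// Sketch of the compile-then-promote safety net: reject any model output
// that does not compile before it ever reaches the test suite.
public class TranslationGate {

    /** Returns true only if the candidate .java file compiles cleanly. */
    static boolean compiles(Path javaFile) {
        JavaCompiler compiler = ToolProvider.getSystemJavaCompiler();
        if (compiler == null) {
            throw new IllegalStateException("Run on a JDK, not a bare JRE");
        }
        // run() uses stdin/stdout/stderr for the null streams; 0 means success.
        return compiler.run(null, null, null, javaFile.toString()) == 0;
    }

    public static void main(String[] args) {
        Path candidate = Path.of(args[0]);
        if (!compiles(candidate)) {
            System.err.println("Rejected: does not compile -> manual review");
            return;
        }
        System.out.println("Compiled OK -> forward to unit-test gate");
        // Next gate (omitted): run the unit tests derived from the original C
        // behavior and promote only translations that pass all of them.
    }
}
```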

Limitations & Future Work

  • Scope of benchmark – The test suite focuses on small, self‑contained snippets; large, multi‑file projects with complex build systems were not evaluated.
  • Quantization impact – The study does not compare against full‑precision counterparts, leaving open the question of how much performance is lost due to quantization alone.
  • Prompt rigidity – The rule‑based prompt, while effective for consistency, may limit the model’s ability to generate idiomatic Java (e.g., using streams or generics).
  • Future directions suggested by the authors include:
    • Fine‑tuning or LoRA adapters on domain‑specific C‑to‑Java corpora.
    • Extending the AST‑prompt framework to support incremental translation of larger codebases.
    • Investigating hybrid approaches that combine LLM output with traditional rule‑based transpilers for the hardest constructs (function pointers, low‑level memory arithmetic).

Authors

  • Aryan Gupta
  • Y. Raghu Reddy

Paper Information

  • arXiv ID: 2512.11402v1
  • Categories: cs.SE, cs.AI
  • Published: December 12, 2025