[Paper] BanglaForge: LLM Collaboration with Self-Refinement for Bangla Code Generation

Published: December 22, 2025 at 02:53 AM EST
4 min read
Source: arXiv - 2512.19122v1

Overview

The paper presents BanglaForge, a new framework that turns Bangla‑language function descriptions into working code. By combining retrieval‑augmented prompting, a dual‑model “coder‑reviewer” collaboration, and an iterative self‑refinement loop that uses execution feedback, the authors achieve a Pass@1 score of 84 % on the BLP‑2025 benchmark—far higher than prior attempts for this low‑resource language.

Key Contributions

  • BanglaForge framework: Introduces a retrieval‑augmented, dual‑model pipeline (coder + reviewer) specifically designed for Bangla‑to‑code generation.
  • Self‑refinement loop: Uses execution results to automatically trigger a reviewer model that rewrites buggy or incomplete code, improving robustness without human intervention.
  • Prompt engineering for Bangla: Systematic prompt design that first translates Bangla specifications into English, then feeds the English specification (together with retrieved examples) to the coder to generate code in the target programming language.
  • Benchmark results: Sets a new state‑of‑the‑art Pass@1 of 84 % on the BLP‑2025 Bangla Code Generation benchmark, outperforming baseline LLMs by a large margin.
  • Open‑source resources: Releases the retrieval corpus, prompt templates, and evaluation scripts to foster reproducibility and community extensions.

Methodology

  1. Retrieval‑augmented context – For each input description, BanglaForge first fetches the most relevant code snippets from a curated Bangla‑English parallel corpus using dense vector similarity. These snippets are injected into the prompt to give the LLM concrete examples (a minimal retrieval sketch follows this list).

  2. Dual‑model collaboration

    • Coder model (e.g., GPT‑4‑Turbo) receives the retrieved examples and the Bangla specification, then produces an initial program.
    • Reviewer model (a second LLM with a “debugger” prompt) receives the coder’s output plus the execution result (pass/fail, error messages). It rewrites the code to fix failures or improve edge‑case handling.
  3. Iterative self‑refinement – The coder‑reviewer cycle repeats up to a fixed number of iterations (typically 2–3) or until the program passes all test cases. Because the reviewer sees concrete runtime feedback, it can target the exact failure mode rather than guessing (a sketch of this loop also appears after the list).

  4. Prompt engineering – The authors design a three‑stage prompt:

    • Translation: Convert Bangla description to English using an LLM.
    • Generation: Feed the English spec + retrieved examples to the coder.
    • Refinement: Provide the reviewer with the coder’s output, test results, and a “review” instruction.
  5. Evaluation – Generated programs are run against hidden unit tests from the BLP‑2025 benchmark. Pass@1 is measured as the proportion of problems for which the first generated solution succeeds (see the short Pass@1 snippet after the list).
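
The summary leaves the retrieval implementation open, so the sketch below shows one plausible rendering of step 1 with dense‑vector similarity. The embedding model, the tiny in‑memory corpus, and the `top_k` value are illustrative assumptions, not the paper's configuration.

```python
# Sketch of step 1 (retrieval-augmented context) using dense similarity.
# The embedding model and the tiny in-memory corpus are illustrative only.
import numpy as np
from sentence_transformers import SentenceTransformer

# Hypothetical parallel corpus: (Bangla description, code snippet) pairs.
corpus = [
    ("দুটি সংখ্যার যোগফল নির্ণয় করুন", "def add(a, b):\n    return a + b"),
    ("একটি তালিকার সর্বোচ্চ মান খুঁজুন", "def find_max(xs):\n    return max(xs)"),
]

model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")
corpus_vecs = model.encode([desc for desc, _ in corpus], normalize_embeddings=True)

def retrieve(query: str, top_k: int = 2) -> list[str]:
    """Return the snippets whose descriptions best match the Bangla query."""
    q = model.encode([query], normalize_embeddings=True)[0]
    scores = corpus_vecs @ q          # cosine similarity (vectors normalized)
    best = np.argsort(-scores)[:top_k]
    return [corpus[i][1] for i in best]
```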
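
Steps 2–3 amount to a generate–execute–review loop. The sketch below is a minimal, hedged rendering of that loop: `call_coder` and `call_reviewer` are placeholders for the two LLM endpoints, and the prompts, iteration cap, and assert‑based test harness are assumptions rather than the paper's exact design.

```python
# Sketch of the coder-reviewer self-refinement loop (steps 2-3).
# call_coder / call_reviewer are placeholders for the two LLM endpoints;
# the prompts, iteration cap, and assert-based test harness are assumptions.
import subprocess
import sys
import tempfile

def run_tests(code: str, test_code: str) -> tuple[bool, str]:
    """Run the candidate against assert-based tests; return (passed, stderr)."""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(code + "\n\n" + test_code)
        path = f.name
    proc = subprocess.run([sys.executable, path],
                          capture_output=True, text=True, timeout=10)
    return proc.returncode == 0, proc.stderr

def generate(english_spec: str, examples: list[str], test_code: str,
             call_coder, call_reviewer, max_rounds: int = 3) -> str:
    """Coder drafts a solution; the reviewer patches it using runtime feedback."""
    code = call_coder(
        f"Examples:\n{chr(10).join(examples)}\n\nTask: {english_spec}"
    )
    for _ in range(max_rounds):
        passed, errors = run_tests(code, test_code)
        if passed:
            break
        # The reviewer sees the failing code plus the concrete error trace,
        # so it can target the exact failure mode instead of guessing.
        code = call_reviewer(
            f"Task: {english_spec}\n\nCode:\n{code}\n\nErrors:\n{errors}\n"
            "Rewrite the code so that all tests pass."
        )
    return code
```

With the paper's settings the loop reportedly converges in about 1.7 rounds on average (see the results table below), consistent with most first‑pass failures needing only one targeted review.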
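
Pass@1 as defined in step 5 is simply the fraction of problems solved on the first attempt; a minimal computation for concreteness (the 84/100 split is illustrative):

```python
# Pass@1 (step 5): proportion of problems whose first generated
# solution passes all hidden unit tests.
def pass_at_1(first_attempt_passed: list[bool]) -> float:
    return sum(first_attempt_passed) / len(first_attempt_passed)

# e.g. 84 of 100 problems solved on the first try -> 0.84
print(pass_at_1([True] * 84 + [False] * 16))
```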

Results & Findings

| Metric | BanglaForge | Baseline LLM (no retrieval) | Prior State‑of‑the‑Art |
| --- | --- | --- | --- |
| Pass@1 | 84.0 % | 58.2 % | 71.5 % |
| Avg. refinement rounds | 1.7 | – | – |
| Retrieval hit‑rate (relevant snippet found) | 92 % | – | – |

  • Retrieval matters: Adding the most similar code snippet boosts Pass@1 by ~12 % compared to a plain in‑context LLM.
  • Self‑refinement gains: The reviewer model fixes ~70 % of the failures produced by the coder in the first pass, leading to the final 84 % success rate.
  • Language bridge works: Translating Bangla to English before generation avoids the need for a Bangla‑trained code model, leveraging the strong English‑code capabilities of existing LLMs.

Practical Implications

  • Rapid prototyping for Bangla‑speaking developers – Teams can describe a function in Bangla and obtain a ready‑to‑run implementation, cutting down on boilerplate coding time.
  • Low‑resource language support – BanglaForge demonstrates a recipe (retrieval + dual‑model refinement) that can be adapted to other under‑represented languages where large code‑centric datasets are scarce.
  • Automated code review pipelines – The reviewer component can be repurposed as a lightweight “AI code reviewer” that automatically patches failing snippets in CI/CD workflows.
  • Education & onboarding – Teaching programming concepts in Bangla becomes easier when students can see immediate, executable examples generated from natural‑language prompts.

Limitations & Future Work

  • Dependence on a high‑quality retrieval corpus – The system’s performance drops if relevant Bangla‑English code pairs are missing; building and maintaining such a corpus for other domains remains a challenge.
  • Translation bottleneck – Relying on an intermediate English translation adds latency and can introduce subtle semantic drift, especially for domain‑specific terminology.
  • Scalability of the reviewer – The current reviewer model is a full‑size LLM; future work could explore smaller, fine‑tuned models to reduce inference cost.
  • Generalization to larger projects – The study focuses on single‑function generation; extending the pipeline to multi‑file or full‑application synthesis is an open research direction.

BanglaForge offers a compelling blueprint for bringing LLM‑powered code generation to low‑resource languages, and its modular design invites the community to iterate, adapt, and scale the approach across languages and development contexts.

Authors

  • Mahir Labib Dihan
  • Sadif Ahmed
  • Md Nafiu Rahman

Paper Information

  • arXiv ID: 2512.19122v1
  • Categories: cs.SE, cs.CL
  • Published: December 22, 2025