[Paper] BanglaForge: LLM Collaboration with Self-Refinement for Bangla Code Generation

Published: December 22, 2025 at 02:53 AM EST
4 min read
Source: arXiv - 2512.19122v1

Overview

The paper presents BanglaForge, a new framework that turns Bangla‑language function descriptions into working code. By combining retrieval‑augmented prompting, a dual‑model “coder‑reviewer” collaboration, and an iterative self‑refinement loop that uses execution feedback, the authors achieve a Pass@1 score of 84 % on the BLP‑2025 benchmark—far higher than prior attempts for this low‑resource language.

Key Contributions

  • BanglaForge framework: Introduces a retrieval‑augmented, dual‑model pipeline (coder + reviewer) specifically designed for Bangla‑to‑code generation.
  • Self‑refinement loop: Uses execution results to automatically trigger a reviewer model that rewrites buggy or incomplete code, improving robustness without human intervention.
  • Prompt engineering for Bangla: Systematic prompt design that first translates Bangla specifications into English, then feeds the English specification (together with retrieved examples) to the coder to generate code in the target programming language.
  • Benchmark results: Sets a new state‑of‑the‑art Pass@1 of 84 % on the BLP‑2025 Bangla Code Generation benchmark, outperforming baseline LLMs by a large margin.
  • Open‑source resources: Releases the retrieval corpus, prompt templates, and evaluation scripts to foster reproducibility and community extensions.

Methodology

  1. Retrieval‑augmented context – For each input description, BanglaForge first fetches the most relevant code snippets from a curated Bangla‑English parallel corpus using dense vector similarity. These snippets are injected into the prompt to give the LLM concrete examples (a minimal retrieval sketch follows this list).

  2. Dual‑model collaboration

    • Coder model (e.g., GPT‑4‑Turbo) receives the retrieved examples and the Bangla specification, then produces an initial program.
    • Reviewer model (a second LLM with a “debugger” prompt) receives the coder’s output plus the execution result (pass/fail, error messages). It rewrites the code to fix failures or improve edge‑case handling.
  3. Iterative self‑refinement – The coder‑reviewer cycle repeats up to a fixed number of iterations (typically 2–3) or until the program passes all test cases. Because the reviewer sees concrete runtime feedback, it can target the exact failure mode rather than guessing (a sketch of this loop also appears after the list).

  4. Prompt engineering – The authors design a three‑stage prompt:

    • Translation: Convert Bangla description to English using an LLM.
    • Generation: Feed the English spec + retrieved examples to the coder.
    • Refinement: Provide the reviewer with the coder’s output, test results, and a “review” instruction.
  5. Evaluation – Generated programs are run against hidden unit tests from the BLP‑2025 benchmark. Pass@1 is measured as the proportion of problems for which the first generated solution succeeds (see the short Pass@1 snippet after the list).
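
The summary leaves the retrieval implementation open, so the sketch below shows one plausible rendering of step 1 with dense‑vector similarity. The embedding model, the tiny in‑memory corpus, and the `top_k` value are illustrative assumptions, not the paper's configuration.

```python
# Sketch of step 1 (retrieval-augmented context) using dense similarity.
# The embedding model and the tiny in-memory corpus are illustrative only.
import numpy as np
from sentence_transformers import SentenceTransformer

# Hypothetical parallel corpus: (Bangla description, code snippet) pairs.
corpus = [
    ("দুটি সংখ্যার যোগফল নির্ণয় করুন", "def add(a, b):\n    return a + b"),
    ("একটি তালিকার সর্বোচ্চ মান খুঁজুন", "def find_max(xs):\n    return max(xs)"),
]

model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")
corpus_vecs = model.encode([desc for desc, _ in corpus], normalize_embeddings=True)

def retrieve(query: str, top_k: int = 2) -> list[str]:
    """Return the snippets whose descriptions best match the Bangla query."""
    q = model.encode([query], normalize_embeddings=True)[0]
    scores = corpus_vecs @ q          # cosine similarity (vectors normalized)
    best = np.argsort(-scores)[:top_k]
    return [corpus[i][1] for i in best]
```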
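
Steps 2–3 amount to a generate–execute–review loop. The sketch below is a minimal, hedged rendering of that loop: `call_coder` and `call_reviewer` are placeholders for the two LLM endpoints, and the prompts, iteration cap, and assert‑based test harness are assumptions rather than the paper's exact design.

```python
# Sketch of the coder-reviewer self-refinement loop (steps 2-3).
# call_coder / call_reviewer are placeholders for the two LLM endpoints;
# the prompts, iteration cap, and assert-based test harness are assumptions.
import subprocess
import sys
import tempfile

def run_tests(code: str, test_code: str) -> tuple[bool, str]:
    """Run the candidate against assert-based tests; return (passed, stderr)."""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(code + "\n\n" + test_code)
        path = f.name
    proc = subprocess.run([sys.executable, path],
                          capture_output=True, text=True, timeout=10)
    return proc.returncode == 0, proc.stderr

def generate(english_spec: str, examples: list[str], test_code: str,
             call_coder, call_reviewer, max_rounds: int = 3) -> str:
    """Coder drafts a solution; the reviewer patches it using runtime feedback."""
    code = call_coder(
        f"Examples:\n{chr(10).join(examples)}\n\nTask: {english_spec}"
    )
    for _ in range(max_rounds):
        passed, errors = run_tests(code, test_code)
        if passed:
            break
        # The reviewer sees the failing code plus the concrete error trace,
        # so it can target the exact failure mode instead of guessing.
        code = call_reviewer(
            f"Task: {english_spec}\n\nCode:\n{code}\n\nErrors:\n{errors}\n"
            "Rewrite the code so that all tests pass."
        )
    return code
```

With the paper's settings the loop reportedly converges in about 1.7 rounds on average (see the results table below), consistent with most first‑pass failures needing only one targeted review.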
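
Pass@1 as defined in step 5 is simply the fraction of problems solved on the first attempt; a minimal computation for concreteness (the 84/100 split is illustrative):

```python
# Pass@1 (step 5): proportion of problems whose first generated
# solution passes all hidden unit tests.
def pass_at_1(first_attempt_passed: list[bool]) -> float:
    return sum(first_attempt_passed) / len(first_attempt_passed)

# e.g. 84 of 100 problems solved on the first try -> 0.84
print(pass_at_1([True] * 84 + [False] * 16))
```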

Results & Findings

| Metric | BanglaForge | Baseline LLM (no retrieval) | Prior State‑of‑the‑Art |
| --- | --- | --- | --- |
| Pass@1 | 84.0 % | 58.2 % | 71.5 % |
| Avg. refinement rounds | 1.7 | – | – |
| Retrieval hit‑rate (relevant snippet found) | 92 % | – | – |

  • Retrieval matters: Adding the most similar code snippet boosts Pass@1 by ~12 % compared to a plain in‑context LLM.
  • Self‑refinement gains: The reviewer model fixes ~70 % of the failures produced by the coder in the first pass, leading to the final 84 % success rate.
  • Language bridge works: Translating Bangla to English before generation avoids the need for a Bangla‑trained code model, leveraging the strong English‑code capabilities of existing LLMs.

Practical Implications

  • Rapid prototyping for Bangla‑speaking developers – Teams can describe a function in Bangla and obtain a ready‑to‑run implementation, cutting down on boilerplate coding time.
  • Low‑resource language support – BanglaForge demonstrates a recipe (retrieval + dual‑model refinement) that can be adapted to other under‑represented languages where large code‑centric datasets are scarce.
  • Automated code review pipelines – The reviewer component can be repurposed as a lightweight “AI code reviewer” that automatically patches failing snippets in CI/CD workflows.
  • Education & onboarding – Teaching programming concepts in Bangla becomes easier when students can see immediate, executable examples generated from natural‑language prompts.

Limitations & Future Work

  • Dependence on a high‑quality retrieval corpus – The system’s performance drops if relevant Bangla‑English code pairs are missing; building and maintaining such a corpus for other domains remains a challenge.
  • Translation bottleneck – Relying on an intermediate English translation adds latency and can introduce subtle semantic drift, especially for domain‑specific terminology.
  • Scalability of the reviewer – The current reviewer model is a full‑size LLM; future work could explore smaller, fine‑tuned models to reduce inference cost.
  • Generalization to larger projects – The study focuses on single‑function generation; extending the pipeline to multi‑file or full‑application synthesis is an open research direction.

BanglaForge offers a compelling blueprint for bringing LLM‑powered code generation to low‑resource languages, and its modular design invites the community to iterate, adapt, and scale the approach across languages and development contexts.

Authors

  • Mahir Labib Dihan
  • Sadif Ahmed
  • Md Nafiu Rahman

Paper Information

  • arXiv ID: 2512.19122v1
  • Categories: cs.SE, cs.CL
  • Published: December 22, 2025