[Paper] FullStack-Agent: Enhancing Agentic Full-Stack Web Coding via Development-Oriented Testing and Repository Back-Translation
Source: arXiv - 2602.03798v1
Overview
FullStack‑Agent is a new LLM‑driven system that goes beyond generating pretty front‑ends and actually builds complete, production‑grade web applications—frontend, backend, and database. By combining a multi‑agent coding framework, a self‑learning data pipeline, and a dedicated benchmark, the authors show that large language models can reliably handle the full stack, opening the door to automated web development for non‑experts.
Key Contributions
- FullStack‑Dev: A multi‑agent architecture that integrates planning, code editing, repository navigation, and bug localization to manage end‑to‑end web development tasks.
- FullStack‑Learn: A data‑scaling/self‑improvement loop that back‑translates crawled and synthetically generated web repositories, fine‑tuning the underlying LLM without human annotation.
- FullStack‑Bench: The first systematic benchmark that evaluates generated sites on frontend rendering, backend API correctness, and database operations.
- Performance gains: FullStack‑Dev improves over the previous state‑of‑the‑art by 8.7 % (frontend), 38.2 % (backend), and 15.9 % (database). FullStack‑Learn further lifts a 30B model by 9.7 %, 9.5 %, and 2.8 % on the same metrics.
- Open‑source release: All code, data, and evaluation scripts are publicly available, encouraging reproducibility and community extensions.
Methodology
1. **Multi‑Agent Planning & Execution**
- A Planner LLM sketches the overall architecture (routing, data models, UI components).
- Editor agents iteratively write or modify code files, guided by a Navigator that can query the repository tree and retrieve relevant snippets.
- A Debugger agent runs unit/integration tests, pinpoints failing lines, and asks the Editor to apply patches.
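The plan → edit → debug cycle above can be sketched as a simple orchestration loop. This is an illustrative sketch only: the agent interfaces (`planner`, `editor`, `navigator`, `debugger`) and the `Repo` container are hypothetical names, not the paper's actual API.

```python
# Hypothetical sketch of the plan -> edit -> debug loop.
# All agent interfaces below are assumptions; the paper's code may differ.
from dataclasses import dataclass, field

@dataclass
class Repo:
    files: dict = field(default_factory=dict)  # path -> source text

def run_dev_loop(task, planner, editor, navigator, debugger, max_rounds=5):
    repo = Repo()
    plan = planner(task)                      # routing, data models, UI components
    for step in plan:
        context = navigator(repo, step)       # retrieve relevant snippets
        for path, code in editor(step, context).items():
            repo.files[path] = code
    for _ in range(max_rounds):
        failures = debugger(repo)             # run tests, localize failing lines
        if not failures:
            break
        for failure in failures:
            context = navigator(repo, failure)
            for path, code in editor(failure, context).items():
                repo.files[path] = code
    return repo
```

The key design point is that the Editor never sees the whole repository: the Navigator narrows the context per step, which is what lets the loop scale beyond single-file generation.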
2. **Development‑Oriented Testing**
- For each generated project, the system automatically spins up a containerized environment, runs a suite of frontend (Selenium‑style), backend (API), and database (SQL) tests, and records pass/fail signals used by the Debugger.
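A minimal shape for such a three-layer harness might look as follows. The specific test commands (Playwright for UI, pytest suites for API and SQL checks) are assumptions for illustration; the paper does not specify its tooling.

```python
# Illustrative three-layer test harness producing pass/fail signals
# for the Debugger. Commands below are assumptions, not the paper's setup.
import subprocess

def run_layer(cmd, cwd, runner=subprocess.run):
    """Run one test layer inside the project directory; True means it passed."""
    result = runner(cmd, cwd=cwd, capture_output=True, text=True)
    return result.returncode == 0

def collect_signals(project_dir, runner=subprocess.run):
    layers = {
        "frontend": ["npx", "playwright", "test"],  # Selenium-style UI checks
        "backend":  ["pytest", "tests/api"],        # HTTP API correctness
        "database": ["pytest", "tests/db"],         # SQL schema & queries
    }
    return {name: run_layer(cmd, project_dir, runner)
            for name, cmd in layers.items()}
```

The `runner` parameter is injected so the harness itself is testable without a container; in production it would be `subprocess.run` against the containerized environment.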
3. **Self‑Improvement via Back‑Translation**
- The authors crawl thousands of open‑source web repos, then reverse‑engineer them: the agents attempt to recreate the repo from a high‑level description, compare the result to the original, and generate correction data.
- This synthetic “error‑corrected” dataset is used to fine‑tune the backbone LLM (30B parameter model) in a continual learning loop, improving its ability to reason about full‑stack code.
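The back-translation loop can be summarized schematically: describe a real repo, regenerate it from the description, diff the attempt against the original, and keep each mismatch as a correction pair. The helper names (`describe`, `generate`, `diff_repos`) and the record format are placeholders, not the paper's implementation.

```python
# Schematic of the back-translation data pipeline. Helper functions and
# the training-record format are hypothetical.

def back_translate(repos, describe, generate, diff_repos):
    training_data = []
    for original in repos:
        spec = describe(original)      # high-level natural-language description
        attempt = generate(spec)       # agent recreates the repo from the spec
        for path, (wrong, right) in diff_repos(attempt, original).items():
            # Each mismatch becomes an "error-corrected" fine-tuning pair.
            training_data.append({"spec": spec, "path": path,
                                  "generated": wrong, "corrected": right})
    return training_data
```

Because the original repository acts as ground truth, the pipeline needs no human annotation: the diff itself supplies the supervision signal.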
4. **Benchmark Construction**
- FullStack‑Bench contains balanced test cases across three dimensions (frontend UI, backend logic, database schema & queries) with hidden ground truth, enabling fair comparison of different agents.
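Scoring such a benchmark reduces to a per-dimension pass rate over hidden test outcomes. The result-record format below is an assumption for illustration:

```python
# Per-dimension pass-rate computation for a FullStack-Bench-style benchmark.
# The outcome-record shape is assumed, not taken from the paper.

def pass_rates(results):
    """results: list of {"dimension": str, "passed": bool} test outcomes."""
    totals, passed = {}, {}
    for r in results:
        d = r["dimension"]
        totals[d] = totals.get(d, 0) + 1
        passed[d] = passed.get(d, 0) + int(r["passed"])
    return {d: passed[d] / totals[d] for d in totals}
```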
Results & Findings
| Metric | FullStack‑Dev (improvement vs. prior SOTA) | FullStack‑Learn (additional gain, 30B model) |
|---|---|---|
| Frontend pass rate | +8.7 % | +9.7 % |
| Backend pass rate | +38.2 % | +9.5 % |
| Database pass rate | +15.9 % | +2.8 % |
- Backend leap: The 38 % boost shows the planner’s ability to correctly wire APIs, authentication, and data validation—areas where earlier agents usually stumble.
- Self‑learning impact: Even a modest 30B model gains up to roughly 10 % (9.7 % frontend, 9.5 % backend) after a single back‑translation round, confirming that the synthetic data is high‑quality and directly relevant.
- Robustness: Across 500+ generated sites, the Debugger reduced the average number of failing tests from 4.3 to 0.9, demonstrating effective automated bug localization.
Practical Implications
- Rapid prototyping for startups: Developers can describe a product idea in natural language and receive a ready‑to‑deploy full‑stack scaffold, cutting weeks of boilerplate work.
- Low‑code platforms: FullStack‑Agent can serve as the AI “engine” behind visual builders, automatically handling the hidden server‑side code that most low‑code tools omit.
- Automated migration & modernization: By feeding legacy codebases into the back‑translation pipeline, organizations could generate updated stacks (e.g., moving from monolith to micro‑services) with minimal manual effort.
- Education & onboarding: New engineers can experiment with end‑to‑end web projects without needing deep knowledge of each layer, accelerating learning curves.
- Continuous integration: The built‑in testing and debugging loop can be plugged into CI pipelines to auto‑repair failing builds in large codebases.
Limitations & Future Work
- Scalability to large codebases: The current system is evaluated on medium‑size demo projects; handling enterprise‑scale monoliths may require hierarchical planning and more sophisticated dependency analysis.
- Security & compliance: Generated code inherits the same security risks as any LLM output (e.g., injection vulnerabilities); a dedicated security audit module is still needed.
- Domain‑specific extensions: While the benchmark covers generic CRUD apps, specialized domains (e.g., real‑time streaming, ML inference services) are not yet addressed.
- Human‑in‑the‑loop refinement: The authors note that occasional manual guidance (e.g., clarifying ambiguous requirements) can dramatically improve outcomes, suggesting future work on seamless human‑AI collaboration interfaces.
FullStack‑Agent demonstrates that with the right orchestration of planning, testing, and self‑learning, LLMs can move from “pretty UI generators” to true full‑stack developers—an exciting step toward AI‑augmented software engineering.
Authors
- Zimu Lu
- Houxing Ren
- Yunqiao Yang
- Ke Wang
- Zhuofan Zong
- Mingjie Zhan
- Hongsheng Li
Paper Information
- arXiv ID: 2602.03798v1
- Categories: cs.SE, cs.CL, cs.CV
- Published: February 3, 2026