[Paper] NNGPT: Rethinking AutoML with Large Language Models
Source: arXiv - 2511.20333v1
Overview
The paper introduces NNGPT, an open‑source AutoML framework that turns a large language model (LLM) into a self‑improving engine for designing, training, and evaluating neural networks—especially for computer‑vision tasks. By closing the loop between model generation, performance assessment, and LLM fine‑tuning, NNGPT can continuously expand its own “catalog” of viable architectures without human intervention.
Key Contributions
- Unified LLM‑driven AutoML pipeline that combines architecture synthesis, hyper‑parameter optimization, early‑stop/accuracy prediction, and code‑aware model generation in a single workflow.
- Self‑improving loop: generated models are executed, their results fed back to fine‑tune the LLM, effectively growing the dataset of neural‑network designs.
- NN‑RAG (Neural‑Network Retrieval‑Augmented Generation): a retrieval‑augmented module that assembles PyTorch code blocks from a curated corpus, achieving 73 % executability on 1,289 target specifications.
- Competitive performance with far fewer trials: one‑shot accuracy prediction matches traditional search‑based AutoML; HPO reaches RMSE 0.60 vs. Optuna’s 0.64; code‑aware predictor hits RMSE 0.14 (Pearson r = 0.78).
- Scalable generation: over 5,000 validated models have already been produced, demonstrating the framework’s ability to autonomously explore the design space.
Methodology
- Prompt‑based generation – A single natural‑language prompt is fed to a pre‑trained LLM (e.g., GPT‑4) that emits a full PyTorch pipeline: data preprocessing, model architecture, and hyper‑parameters.
- Execution & evaluation – The generated script is run end‑to‑end on a target dataset. Metrics (accuracy, training time, early‑stop signals) are recorded.
- Feedback loop – Results are stored in the LEMUR dataset, an audited collection of model specifications and outcomes. The LLM is then fine‑tuned on this growing corpus, improving its next‑generation quality.
- Retrieval‑augmented synthesis (NN‑RAG) – When the LLM needs to produce a specific code block (e.g., a custom residual unit), it first retrieves similar, proven snippets from the LEMUR corpus, then adapts them to the current context.
- Auxiliary predictors – Lightweight regression models trained on LEMUR predict final accuracy or early‑stop points from the generated code, allowing the system to discard low‑promise candidates before costly training (a minimal sketch of this generate‑filter‑execute loop follows this list).
- Reinforcement learning – The whole pipeline is treated as an RL environment where the reward is the validation performance; policy updates further steer the LLM toward high‑yield designs.
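Taken together, the bullets above form a generate‑filter‑execute‑record loop. The sketch below is a minimal illustration under stated assumptions: `generate_pipeline`, `predict_accuracy`, and the in‑memory `lemur_records` list are hypothetical stand‑ins for the paper’s LLM call, code‑aware predictor, and LEMUR dataset, not the released NNGPT interfaces, and the LLM call is stubbed so the example runs end to end.

```python
# Hypothetical sketch of an NNGPT-style loop: generate -> predict -> execute -> record.
# All names (generate_pipeline, predict_accuracy, lemur_records) are illustrative
# placeholders, not the released NNGPT API; the LLM call is stubbed out.

import subprocess
import sys
import tempfile

lemur_records = []  # outcomes accumulate here and later fine-tune the LLM


def generate_pipeline(prompt: str) -> str:
    # Placeholder for the LLM call that should return a complete PyTorch training script.
    # A trivial script is returned so this sketch executes end to end.
    return 'print("val_accuracy=0.42")'


def predict_accuracy(source_code: str) -> float:
    # Placeholder for the code-aware regressor trained on prior LEMUR records.
    return 0.42


def run_script(source_code: str) -> dict:
    # Execute the generated script in a subprocess and capture its reported metrics.
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(source_code)
        path = f.name
    proc = subprocess.run([sys.executable, path], capture_output=True, text=True, timeout=3600)
    return {"ok": proc.returncode == 0, "stdout": proc.stdout}


def one_iteration(task_prompt: str, threshold: float = 0.3) -> None:
    code = generate_pipeline(task_prompt)
    if predict_accuracy(code) < threshold:
        return  # prune low-promise candidates before paying for training
    metrics = run_script(code)
    # Every outcome, including failures, becomes a training example for the next fine-tune.
    lemur_records.append({"prompt": task_prompt, "code": code, "metrics": metrics})


one_iteration("Design a small CNN classifier for CIFAR-10.")
print(lemur_records[-1]["metrics"]["stdout"])
```

In the full system, the recorded outcomes would periodically fine‑tune the LLM and retrain the predictor, closing the self‑improvement loop described above.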
Results & Findings
| Component | Metric | NNGPT Performance | Baseline / Prior Art |
|---|---|---|---|
| NN‑RAG executability | % of generated scripts that run without error | 73 % (1,289 targets) | < 50 % for vanilla LLM generation |
| Hyper‑parameter optimization (HPO) | RMSE of predicted vs. actual performance | 0.60 | Optuna 0.64 |
| Code‑aware accuracy predictor | RMSE / Pearson r | 0.14 / 0.78 | N/A (first of its kind) |
| One‑shot prediction vs. search‑based AutoML | Final validation accuracy | Comparable (within 1 % of multi‑trial search) | Requires dozens of trials |
| Overall model generation | Number of validated models produced | >5,000 | — |
These numbers show that NNGPT can generate usable, high‑performing models with far fewer compute cycles than traditional AutoML tools, while also learning from each run to become better over time.
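The code‑aware accuracy predictor (RMSE 0.14, Pearson r = 0.78) is the component that makes this pruning possible. Below is a minimal sketch of what such a predictor could look like; the featurization (character n‑gram TF‑IDF) and model (ridge regression) are assumptions for illustration, since the paper describes the predictor only as a lightweight regressor trained on LEMUR.

```python
# Hedged sketch of a code-aware accuracy predictor: regress final accuracy directly from
# generated source code.  Features and model are illustrative assumptions, not the paper's.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline

# Toy stand-in for LEMUR: (generated script, observed validation accuracy) pairs.
scripts = [
    "model = torchvision.models.resnet18(num_classes=10)",
    "model = nn.Sequential(nn.Flatten(), nn.Linear(3072, 10))",
    "model = torchvision.models.mobilenet_v3_small(num_classes=10)",
]
accuracies = [0.93, 0.41, 0.88]

predictor = make_pipeline(
    TfidfVectorizer(analyzer="char_wb", ngram_range=(3, 5)),
    Ridge(alpha=1.0),
)
predictor.fit(scripts, accuracies)

# Score a new candidate before committing GPU hours to training it.
candidate = "model = torchvision.models.resnet34(num_classes=10)"
print(predictor.predict([candidate])[0])
```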
Practical Implications
- Rapid prototyping – Developers can obtain a ready‑to‑run PyTorch model for a new vision dataset with a single prompt, cutting weeks of manual architecture search.
- Cost‑effective AutoML – By predicting performance early and pruning bad candidates, organizations can slash GPU hours, making AutoML viable for smaller teams or edge‑device development.
- Continuous improvement – As more models are generated in‑house, the LLM fine‑tunes itself on proprietary data, yielding a custom AutoML engine that adapts to a company’s specific data distribution.
- Plug‑and‑play integration – The PyTorch adapter is framework‑agnostic; the same pipeline can be swapped for TensorFlow or JAX with minimal changes, facilitating adoption across existing codebases (a hypothetical adapter interface is sketched after this list).
- Open‑source ecosystem – With code, prompts, and checkpoints slated for public release, the community can extend NN‑RAG, contribute new retrieval corpora, or specialize the system for domains beyond vision (e.g., NLP or reinforcement learning).
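One common way to structure such framework‑agnostic integration is a thin adapter layer. The sketch below is a hypothetical interface; the class and method names are illustrative assumptions rather than NNGPT’s released adapter API.

```python
# Hypothetical adapter interface showing how a generation/evaluation loop could swap
# PyTorch for another backend.  Names are illustrative, not NNGPT's released API.

from abc import ABC, abstractmethod


class FrameworkAdapter(ABC):
    """Minimal contract the generation/evaluation loop needs from a backend."""

    @abstractmethod
    def build(self, model_code: str):
        """Materialize a model object from generated source code."""

    @abstractmethod
    def train(self, model, dataset, epochs: int) -> dict:
        """Train the model and return metrics such as {'val_accuracy': 0.9}."""


class TorchAdapter(FrameworkAdapter):
    def build(self, model_code: str):
        namespace: dict = {}
        exec(model_code, namespace)  # generated code is expected to define `model`
        return namespace["model"]

    def train(self, model, dataset, epochs: int) -> dict:
        # Real PyTorch training loop omitted; return a placeholder metric for the sketch.
        return {"val_accuracy": 0.0}
```

A TensorFlow or JAX backend would then only need to implement the same two methods, leaving the rest of the pipeline untouched.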
Limitations & Future Work
- Domain focus – Current experiments are limited to computer‑vision tasks; extending to other modalities may require new retrieval corpora and prompt engineering.
- LLM size dependency – High‑quality generation still relies on large proprietary models (e.g., GPT‑4); performance may degrade with smaller open models.
- Execution failures – Although NN‑RAG improves executability to 73 %, roughly a quarter (27 %) of generated scripts still fail to run, indicating room for better code sanitization or static analysis.
- Scalability of feedback loop – Fine‑tuning the LLM on a continuously growing LEMUR dataset could become computationally expensive; incremental or adapter‑based training strategies are suggested.
- Reinforcement learning stability – The RL component is nascent and can suffer from high variance; future work will explore more stable policy‑gradient methods and curriculum learning.
The authors plan to address these points by expanding the corpus to multimodal datasets, experimenting with open‑source LLMs, and integrating static code checkers to raise the executability ceiling.
Authors
- Roman Kochnev
- Waleed Khalid
- Tolgay Atinc Uzun
- Xi Zhang
- Yashkumar Sanjaybhai Dhameliya
- Furui Qin
- Chandini Vysyaraju
- Raghuvir Duvvuri
- Avi Goyal
- Dmitry Ignatov
- Radu Timofte
Paper Information
- arXiv ID: 2511.20333v1
- Categories: cs.AI, cs.LG, cs.NE
- Published: November 25, 2025