[Paper] NNGPT: Rethinking AutoML with Large Language Models
Source: arXiv - 2511.20333v1
Overview
The paper introduces NNGPT, an open‑source AutoML framework that turns a large language model (LLM) into a self‑improving engine for designing, training, and evaluating neural networks—especially for computer‑vision tasks. By closing the loop between model generation, performance assessment, and LLM fine‑tuning, NNGPT can continuously expand its own “catalog” of viable architectures without human intervention.
Key Contributions
- Unified LLM‑driven AutoML pipeline that combines architecture synthesis, hyper‑parameter optimization, early‑stop/accuracy prediction, and code‑aware model generation in a single workflow.
- Self‑improving loop: generated models are executed, their results fed back to fine‑tune the LLM, effectively growing the dataset of neural‑network designs.
- NN‑RAG (Neural‑Network Retrieval‑Augmented Generation): a retrieval‑augmented module that assembles PyTorch code blocks from a curated corpus, achieving 73 % executability on 1,289 target specifications.
- Competitive performance with far fewer trials: one‑shot accuracy prediction matches traditional search‑based AutoML; HPO reaches RMSE 0.60 vs. Optuna’s 0.64; code‑aware predictor hits RMSE 0.14 (Pearson r = 0.78).
- Scalable generation: over 5,000 validated models have already been produced, demonstrating the framework’s ability to autonomously explore the design space.
Methodology
- Prompt‑based generation – A single natural‑language prompt is fed to a pre‑trained LLM (e.g., GPT‑4) that emits a full PyTorch pipeline: data preprocessing, model architecture, and hyper‑parameters.
- Execution & evaluation – The generated script is run end‑to‑end on a target dataset. Metrics (accuracy, training time, early‑stop signals) are recorded.
- Feedback loop – Results are stored in the LEMUR dataset, an audited collection of model specifications and outcomes. The LLM is then fine‑tuned on this growing corpus, improving its next‑generation quality.
- Retrieval‑augmented synthesis (NN‑RAG) – When the LLM needs to produce a specific code block (e.g., a custom residual unit), it first retrieves similar, proven snippets from the LEMUR corpus, then adapts them to the current context.
- Auxiliary predictors – Lightweight regression models trained on LEMUR predict final accuracy or early‑stop points from the generated code, allowing the system to discard low‑promise candidates before costly training (a minimal sketch of this generate‑filter‑execute loop follows this list).
- Reinforcement learning – The whole pipeline is treated as an RL environment where the reward is the validation performance; policy updates further steer the LLM toward high‑yield designs.
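Taken together, the bullets above form a generate‑filter‑execute‑record loop. The sketch below is a minimal illustration under stated assumptions: `generate_pipeline`, `predict_accuracy`, and the in‑memory `lemur_records` list are hypothetical stand‑ins for the paper’s LLM call, code‑aware predictor, and LEMUR dataset, not the released NNGPT interfaces, and the LLM call is stubbed so the example runs end to end.

```python
# Hypothetical sketch of an NNGPT-style loop: generate -> predict -> execute -> record.
# All names (generate_pipeline, predict_accuracy, lemur_records) are illustrative
# placeholders, not the released NNGPT API; the LLM call is stubbed out.

import subprocess
import sys
import tempfile

lemur_records = []  # outcomes accumulate here and later fine-tune the LLM


def generate_pipeline(prompt: str) -> str:
    # Placeholder for the LLM call that should return a complete PyTorch training script.
    # A trivial script is returned so this sketch executes end to end.
    return 'print("val_accuracy=0.42")'


def predict_accuracy(source_code: str) -> float:
    # Placeholder for the code-aware regressor trained on prior LEMUR records.
    return 0.42


def run_script(source_code: str) -> dict:
    # Execute the generated script in a subprocess and capture its reported metrics.
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(source_code)
        path = f.name
    proc = subprocess.run([sys.executable, path], capture_output=True, text=True, timeout=3600)
    return {"ok": proc.returncode == 0, "stdout": proc.stdout}


def one_iteration(task_prompt: str, threshold: float = 0.3) -> None:
    code = generate_pipeline(task_prompt)
    if predict_accuracy(code) < threshold:
        return  # prune low-promise candidates before paying for training
    metrics = run_script(code)
    # Every outcome, including failures, becomes a training example for the next fine-tune.
    lemur_records.append({"prompt": task_prompt, "code": code, "metrics": metrics})


one_iteration("Design a small CNN classifier for CIFAR-10.")
print(lemur_records[-1]["metrics"]["stdout"])
```

In the full system, the recorded outcomes would periodically fine‑tune the LLM and retrain the predictor, closing the self‑improvement loop described above.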
Results & Findings
| Component | Metric | NNGPT Performance | Baseline / Prior Art |
|---|---|---|---|
| NN‑RAG executability | % of generated scripts that run without error | 73 % (1,289 targets) | < 50 % for vanilla LLM generation |
| Hyper‑parameter optimization (HPO) | RMSE of predicted vs. actual performance | 0.60 | Optuna 0.64 |
| Code‑aware accuracy predictor | RMSE / Pearson r | 0.14 / 0.78 | N/A (first of its kind) |
| One‑shot prediction vs. search‑based AutoML | Final validation accuracy | Comparable (within 1 % of multi‑trial search) | Requires dozens of trials |
| Overall model generation | Number of validated models produced | >5,000 | — |
These numbers show that NNGPT can generate usable, high‑performing models with far fewer compute cycles than traditional AutoML tools, while also learning from each run to become better over time.
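The code‑aware accuracy predictor (RMSE 0.14, Pearson r = 0.78) is the component that makes this pruning possible. Below is a minimal sketch of what such a predictor could look like; the featurization (character n‑gram TF‑IDF) and model (ridge regression) are assumptions for illustration, since the paper describes the predictor only as a lightweight regressor trained on LEMUR.

```python
# Hedged sketch of a code-aware accuracy predictor: regress final accuracy directly from
# generated source code.  Features and model are illustrative assumptions, not the paper's.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline

# Toy stand-in for LEMUR: (generated script, observed validation accuracy) pairs.
scripts = [
    "model = torchvision.models.resnet18(num_classes=10)",
    "model = nn.Sequential(nn.Flatten(), nn.Linear(3072, 10))",
    "model = torchvision.models.mobilenet_v3_small(num_classes=10)",
]
accuracies = [0.93, 0.41, 0.88]

predictor = make_pipeline(
    TfidfVectorizer(analyzer="char_wb", ngram_range=(3, 5)),
    Ridge(alpha=1.0),
)
predictor.fit(scripts, accuracies)

# Score a new candidate before committing GPU hours to training it.
candidate = "model = torchvision.models.resnet34(num_classes=10)"
print(predictor.predict([candidate])[0])
```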
Practical Implications
- Rapid prototyping – Developers can obtain a ready‑to‑run PyTorch model for a new vision dataset with a single prompt, cutting weeks of manual architecture search.
- Cost‑effective AutoML – By predicting performance early and pruning bad candidates, organizations can slash GPU hours, making AutoML viable for smaller teams or edge‑device development.
- Continuous improvement – As more models are generated in‑house, the LLM fine‑tunes itself on proprietary data, yielding a custom AutoML engine that adapts to a company’s specific data distribution.
- Plug‑and‑play integration – The PyTorch adapter is framework‑agnostic; the same pipeline can be swapped for TensorFlow or JAX with minimal changes, facilitating adoption across existing codebases (a hypothetical adapter interface is sketched after this list).
- Open‑source ecosystem – With code, prompts, and checkpoints slated for public release, the community can extend NN‑RAG, contribute new retrieval corpora, or specialize the system for domains beyond vision (e.g., NLP or reinforcement learning).
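One common way to structure such framework‑agnostic integration is a thin adapter layer. The sketch below is a hypothetical interface; the class and method names are illustrative assumptions rather than NNGPT’s released adapter API.

```python
# Hypothetical adapter interface showing how a generation/evaluation loop could swap
# PyTorch for another backend.  Names are illustrative, not NNGPT's released API.

from abc import ABC, abstractmethod


class FrameworkAdapter(ABC):
    """Minimal contract the generation/evaluation loop needs from a backend."""

    @abstractmethod
    def build(self, model_code: str):
        """Materialize a model object from generated source code."""

    @abstractmethod
    def train(self, model, dataset, epochs: int) -> dict:
        """Train the model and return metrics such as {'val_accuracy': 0.9}."""


class TorchAdapter(FrameworkAdapter):
    def build(self, model_code: str):
        namespace: dict = {}
        exec(model_code, namespace)  # generated code is expected to define `model`
        return namespace["model"]

    def train(self, model, dataset, epochs: int) -> dict:
        # Real PyTorch training loop omitted; return a placeholder metric for the sketch.
        return {"val_accuracy": 0.0}
```

A TensorFlow or JAX backend would then only need to implement the same two methods, leaving the rest of the pipeline untouched.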
Limitations & Future Work
- Domain focus – Current experiments are limited to computer‑vision tasks; extending to other modalities may require new retrieval corpora and prompt engineering.
- LLM size dependency – High‑quality generation still relies on large proprietary models (e.g., GPT‑4); performance may degrade with smaller open models.
- Execution failures – Although NN‑RAG improves executability to 73 %, roughly a quarter (27 %) of generated scripts still fail to run, indicating room for better code sanitization or static analysis.
- Scalability of feedback loop – Fine‑tuning the LLM on a continuously growing LEMUR dataset could become computationally expensive; incremental or adapter‑based training strategies are suggested.
- Reinforcement learning stability – The RL component is nascent and can suffer from high variance; future work will explore more stable policy‑gradient methods and curriculum learning.
The authors plan to address these points by expanding the corpus to multimodal datasets, experimenting with open‑source LLMs, and integrating static code checkers to raise the executability ceiling.
Authors
- Roman Kochnev
- Waleed Khalid
- Tolgay Atinc Uzun
- Xi Zhang
- Yashkumar Sanjaybhai Dhameliya
- Furui Qin
- Chandini Vysyaraju
- Raghuvir Duvvuri
- Avi Goyal
- Dmitry Ignatov
- Radu Timofte
Paper Information
- arXiv ID: 2511.20333v1
- Categories: cs.AI, cs.LG, cs.NE
- Published: November 25, 2025