[Paper] TREX: Automating LLM Fine-tuning via Agent-Driven Tree-based Exploration
Source: arXiv - 2604.14116v1
Overview
The paper presents TREX, a multi‑agent framework that automates the full life‑cycle of fine‑tuning large language models (LLMs). By treating the iterative fine‑tuning process as a tree‑structured search, TREX can plan, execute, and learn from multiple training experiments without human intervention, showing consistent performance gains across a suite of real‑world tasks.
Key Contributions
- Agent‑driven pipeline – Introduces two cooperating agents (Researcher & Executor) that handle everything from requirement gathering to model evaluation.
- Tree‑based exploration – Models multi‑round fine‑tuning as a searchable tree, enabling systematic planning, result reuse, and high‑level insight extraction.
- FT‑Bench – A new benchmark of 10 realistic fine‑tuning scenarios (e.g., capability upgrades, domain‑specific adaptation) to evaluate automated training systems.
- Empirical validation – Demonstrates that TREX outperforms baseline manual and naïve automated pipelines on all FT‑Bench tasks.
- Open‑source potential – The architecture is modular, making it straightforward to plug in different LLM back‑ends, data sources, or evaluation metrics.
Methodology
- Problem framing – Fine‑tuning is cast as a sequential decision problem: each experiment (choice of data, hyper‑parameters, curriculum, etc.) leads to a new state (model performance).
- Researcher agent
  - Parses a high‑level user requirement (e.g., “improve medical QA”).
  - Conducts open‑domain literature and data searches, curates candidate datasets, and proposes a training strategy (data mix, learning rate schedule, etc.).
- Executor agent
  - Materializes the Researcher’s plan: builds data pipelines, launches training jobs, and collects evaluation metrics.
  - Returns results and logs back to the Researcher.
- Tree‑based search
  - Each node represents a specific fine‑tuning configuration and its outcome.
  - The system expands promising nodes, prunes under‑performing branches, and re‑uses artifacts (e.g., pre‑processed datasets) across branches.
  - A lightweight meta‑learner distills patterns from visited nodes to guide future proposals (e.g., “learning rate 2e‑5 works well for domain X”).
- Iterative loop – The agents repeat the propose‑execute‑evaluate cycle until a stopping criterion (budget, convergence, or target metric) is met.
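
The propose‑execute‑evaluate cycle over a tree can be sketched roughly as follows. This is a minimal illustration, not the paper's implementation: `Node`, `tree_search`, `propose`, `execute`, and `prune_margin` are hypothetical names, the expansion order is plain best‑first, and artifact reuse and the meta‑learner are left out. The toy `propose` and `execute` functions stand in for the Researcher and Executor agents.

```python
from __future__ import annotations

import heapq
from dataclasses import dataclass, field
from typing import Callable, Optional


@dataclass
class Node:
    """One fine-tuning experiment: a configuration plus its measured score."""
    config: dict
    score: float
    parent: Optional[Node] = None
    children: list[Node] = field(default_factory=list)


def tree_search(
    root_config: dict,
    propose: Callable[[Node], list[dict]],  # hypothetical stand-in for the Researcher
    execute: Callable[[dict], float],       # hypothetical stand-in for the Executor
    budget: int,
    prune_margin: float,
) -> Node:
    """Best-first expansion that stops when the run budget or the frontier is exhausted."""
    root = Node(root_config, execute(root_config))
    best, runs, tie = root, 1, 0
    # heapq is a min-heap, so scores are negated; `tie` keeps entries comparable.
    frontier = [(-root.score, tie, root)]
    while frontier and runs < budget:
        _, _, node = heapq.heappop(frontier)
        for config in propose(node):
            if runs >= budget:
                break
            child = Node(config, execute(config), parent=node)
            runs += 1
            node.children.append(child)
            if child.score > best.score:
                best = child
            # Prune: a child more than prune_margin below the best score is never expanded.
            if child.score >= best.score - prune_margin:
                tie += 1
                heapq.heappush(frontier, (-child.score, tie, child))
    return best


# Toy problem (an assumption for illustration): the score peaks at x == 3.
def toy_execute(config: dict) -> float:
    return -float(abs(config["x"] - 3))


def toy_propose(node: Node) -> list[dict]:
    x = node.config["x"]
    return [{"x": x - 1}, {"x": x + 1}]


best = tree_search({"x": 0}, toy_propose, toy_execute, budget=10, prune_margin=0.5)
print(best.config, best.score)
```

In a real system each `execute` call would be a full training and evaluation job, so the budget check inside the inner loop matters more than it appears to here.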
Results & Findings
| FT‑Bench Task | Baseline (manual) | Naïve Auto‑Tune | TREX (best leaf) |
|---|---|---|---|
| General QA improvement | +3.2 % EM | +4.1 % EM | +6.8 % EM |
| Legal document summarization | +2.5 % ROUGE‑L | +3.0 % ROUGE‑L | +5.4 % ROUGE‑L |
| Code generation (Python) | +1.8 % Pass@1 | +2.2 % Pass@1 | +4.7 % Pass@1 |
| … (7 more) | … | … | … |
- Consistent gains: TREX outperformed both human‑crafted baselines and a simple grid‑search auto‑tuner on every task.
- Efficiency: By re‑using data recipes and pruning low‑yield branches, TREX reduced total GPU hours by ~30 % compared to exhaustive search.
- Insight extraction: The meta‑learner surfaced actionable rules (e.g., “mix 70 % domain data with 30 % general data for legal tasks”) that were later verified by the authors in a separate ablation study.
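
A data‑mix rule like the 70 % / 30 % example above could be applied along these lines. The sketch is illustrative only: `mix_datasets` and its parameters are hypothetical names, the examples are placeholder tuples, and the paper's actual data pipeline may sample differently (for instance by tokens rather than by examples).

```python
import random


def mix_datasets(domain, general, domain_fraction, total, seed=0):
    """Sample `total` examples, with `domain_fraction` of them drawn from `domain`."""
    rng = random.Random(seed)  # fixed seed so the mix is reproducible
    n_domain = round(total * domain_fraction)
    n_general = total - n_domain
    if n_domain > len(domain) or n_general > len(general):
        raise ValueError("not enough examples to satisfy the requested mix")
    # Sample without replacement from each pool, then shuffle the combined set.
    mixed = rng.sample(domain, n_domain) + rng.sample(general, n_general)
    rng.shuffle(mixed)
    return mixed


# Placeholder corpora (assumptions): each example is tagged with its source.
domain_pool = [("domain", i) for i in range(100)]
general_pool = [("general", i) for i in range(100)]

mixed = mix_datasets(domain_pool, general_pool, domain_fraction=0.7, total=10)
n_from_domain = sum(1 for source, _ in mixed if source == "domain")
print(n_from_domain, len(mixed) - n_from_domain)
```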
Practical Implications
- Rapid prototyping – Teams can feed a high‑level goal (e.g., “boost sentiment analysis on product reviews”) and let TREX generate a fine‑tuned model without hand‑crafting data pipelines or hyper‑parameter sweeps.
- Cost‑effective scaling – The tree search reuses intermediate artifacts such as pre‑processed datasets, which cuts redundant preprocessing and training runs; the summary above reports roughly 30 % fewer GPU hours than exhaustive search.
- Continuous improvement loops – TREX can be hooked into CI/CD pipelines for LLM products, automatically re‑training when new data arrives or when performance drifts.
- Democratizing LLM customization – Smaller organizations lacking deep ML expertise can leverage the agent system to obtain domain‑adapted models that would otherwise require specialist effort.
- Integration points – The modular agents can be swapped for proprietary data crawlers, internal evaluation suites, or custom hardware schedulers, making TREX adaptable to existing MLOps stacks.
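
One way to picture the swappable agents is as two small interfaces that a driver loop depends on. The sketch below is an assumption about how such a boundary might look, not TREX's actual API: `Plan`, `Researcher`, `Executor`, `run_rounds`, and the two toy classes are all hypothetical names.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Protocol


@dataclass
class Plan:
    """Hypothetical training plan handed from the Researcher to the Executor."""
    dataset: str
    learning_rate: float


class Researcher(Protocol):
    def propose(self, goal: str, history: list[tuple[Plan, float]]) -> Plan: ...


class Executor(Protocol):
    def run(self, plan: Plan) -> float: ...


def run_rounds(goal: str, researcher: Researcher, executor: Executor, rounds: int):
    """Drive the loop against the interfaces only, so either side can be swapped."""
    history: list[tuple[Plan, float]] = []
    for _ in range(rounds):
        plan = researcher.propose(goal, history)
        history.append((plan, executor.run(plan)))
    return history


class FixedSweepResearcher:
    """Toy stand-in that walks through a fixed list of learning rates."""
    def __init__(self, rates):
        self.rates = rates

    def propose(self, goal, history):
        rate = self.rates[len(history) % len(self.rates)]
        return Plan(dataset="internal-corpus", learning_rate=rate)


class DryRunExecutor:
    """Toy stand-in that returns a made-up score instead of training anything."""
    def run(self, plan):
        return 1.0 if plan.learning_rate == 2e-5 else 0.5


history = run_rounds("improve medical QA", FixedSweepResearcher([1e-5, 2e-5, 5e-5]),
                     DryRunExecutor(), rounds=3)
best_plan, best_score = max(history, key=lambda item: item[1])
print(best_plan, best_score)
```

Because `Protocol` uses structural typing, an in‑house scheduler or evaluation suite would only need matching method signatures, not a shared base class.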
Limitations & Future Work
- Search space explosion – Pruning mitigates the problem, but very large hyper‑parameter or data‑mix spaces can still overwhelm the system unless tighter priors narrow the search.
- Dependence on external data quality – The Researcher’s literature and dataset mining relies on open‑source resources, so noisy or biased sources can propagate into the fine‑tuned model.
- Evaluation bottleneck – Accurate assessment of each leaf often requires running the model on task‑specific benchmarks, which can be time‑consuming for large models.
- Future directions suggested by the authors include:
  - Incorporating reinforcement‑learning‑based policy search to better balance exploration vs. exploitation.
  - Extending TREX to handle multi‑modal models (e.g., vision‑language).
  - Adding safety and alignment checks as first‑class constraints during the search.
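
To make the search‑space concern above concrete, the short calculation below counts configurations for a small, made‑up grid. The four knobs and their values are assumptions chosen for illustration and do not come from the paper.

```python
from math import prod


def grid_size(space: dict[str, list]) -> int:
    """Number of distinct configurations in a full grid over `space`."""
    return prod(len(values) for values in space.values())


# Hypothetical search space: four knobs with three or four options each.
space = {
    "learning_rate": [1e-5, 2e-5, 5e-5, 1e-4],
    "epochs": [1, 2, 3],
    "domain_fraction": [0.5, 0.7, 0.9],
    "batch_size": [16, 32, 64, 128],
}

per_round = grid_size(space)           # 4 * 3 * 3 * 4 = 144 configurations
three_round_paths = per_round ** 3     # distinct three-round sequences of choices
print(per_round, three_round_paths)    # prints: 144 2985984
```

Even this modest grid yields close to three million possible three‑round paths, which is why pruning and informed proposals matter.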
Authors
- Zerun Ma
- Guoqiang Wang
- Xinchen Xie
- Yicheng Chen
- He Du
- Bowen Li
- Yanan Sun
- Wenran Liu
- Kai Chen
- Yining Li
Paper Information
- arXiv ID: 2604.14116v1
- Categories: cs.AI, cs.CL
- Published: April 15, 2026