[Paper] TREX: Automating LLM Fine-tuning via Agent-Driven Tree-based Exploration

Published: April 15, 2026 at 01:38 PM EDT

Source: arXiv - 2604.14116v1

Overview

The paper presents TREX, a multi‑agent framework that automates the full life‑cycle of fine‑tuning large language models (LLMs). By treating the iterative fine‑tuning process as a tree‑structured search, TREX can plan, execute, and learn from multiple training experiments without human intervention, showing consistent performance gains across a suite of real‑world tasks.

Key Contributions

Agent‑driven pipeline – Introduces two cooperating agents (Researcher & Executor) that handle everything from requirement gathering to model evaluation.
Tree‑based exploration – Models multi‑round fine‑tuning as a searchable tree, enabling systematic planning, result reuse, and high‑level insight extraction.
FT‑Bench – A new benchmark of 10 realistic fine‑tuning scenarios (e.g., capability upgrades, domain‑specific adaptation) to evaluate automated training systems.
Empirical validation – Demonstrates that TREX outperforms both the manual baseline and a naïve automated pipeline on all FT‑Bench tasks.
Open‑source potential – The architecture is modular, making it straightforward to plug in different LLM back‑ends, data sources, or evaluation metrics.

Methodology

Problem framing – Fine‑tuning is cast as a sequential decision problem: each experiment (choice of data, hyper‑parameters, curriculum, etc.) leads to a new state (model performance).
Researcher agent
- Parses a high‑level user requirement (e.g., “improve medical QA”).
- Conducts open‑domain literature and data searches, curates candidate datasets, and proposes a training strategy (data mix, learning rate schedule, etc.).
Executor agent
- Materializes the Researcher’s plan: builds data pipelines, launches training jobs, and collects evaluation metrics.
- Returns results and logs back to the Researcher.
Tree‑based search
- Each node represents a specific fine‑tuning configuration and its outcome.
- The system expands promising nodes, prunes under‑performing branches, and re‑uses artifacts (e.g., pre‑processed datasets) across branches.
- A lightweight meta‑learner distills patterns from visited nodes to guide future proposals (e.g., “learning rate 2e‑5 works well for domain X”).
Iterative loop – The agents repeat the propose‑execute‑evaluate cycle until a stopping criterion (budget, convergence, or target metric) is met.
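
The summary does not say how TREX decides which node to expand, so the following is only a minimal sketch of the propose‑execute‑evaluate loop, assuming a best‑first policy with a fixed pruning margin. `Node`, `explore`, `toy_propose`, and `toy_execute` are hypothetical names, and the score function is synthetic rather than a real training run.

```python
from __future__ import annotations

import heapq
import math
from dataclasses import dataclass, field
from typing import Callable, Optional


@dataclass
class Node:
    """One fine-tuning configuration and its measured outcome."""
    config: dict
    score: float
    parent: Optional[Node] = None
    children: list[Node] = field(default_factory=list)


def explore(
    root_config: dict,
    propose: Callable[[Node], list[dict]],  # stand-in for the Researcher
    execute: Callable[[dict], float],       # stand-in for the Executor
    budget: int,
    target: float,
    prune_margin: float = 0.05,
) -> Node:
    """Best-first expansion that stops on a run budget or a target score."""
    root = Node(root_config, execute(root_config))
    best, runs, counter = root, 1, 1
    # Max-heap via negated scores; the counter breaks ties between equal scores.
    frontier = [(-root.score, 0, root)]
    while frontier and runs < budget and best.score < target:
        _, _, node = heapq.heappop(frontier)
        if node.score < best.score - prune_margin:
            continue  # prune: this branch lags too far behind the best node
        for config in propose(node):
            if runs >= budget:
                break
            child = Node(config, execute(config), parent=node)
            runs += 1
            node.children.append(child)
            if child.score > best.score:
                best = child
            heapq.heappush(frontier, (-child.score, counter, child))
            counter += 1
    return best


def toy_execute(config: dict) -> float:
    # Synthetic score that peaks at lr = 2e-5; stands in for a training run.
    return 1.0 - abs(math.log10(config["lr"]) - math.log10(2e-5))


def toy_propose(node: Node) -> list[dict]:
    # Hypothetical proposal rule: try doubling and halving the learning rate.
    lr = node.config["lr"]
    return [{"lr": lr * 2}, {"lr": lr / 2}]


best = explore({"lr": 1.6e-4}, toy_propose, toy_execute, budget=20, target=0.99)
print(best.config, round(best.score, 3))
```

In this toy setting the loop walks from the starting learning rate toward the synthetic optimum and stops once the target score is reached. A real system would replace `toy_propose` with the Researcher's proposals and `toy_execute` with an actual training and evaluation job.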

Results & Findings

FT‑Bench Task	Baseline (manual)	Naïve Auto‑Tune	TREX (best leaf)
General QA improvement	+3.2 % EM	+4.1 % EM	+6.8 % EM
Legal document summarization	+2.5 % ROUGE‑L	+3.0 % ROUGE‑L	+5.4 % ROUGE‑L
Code generation (Python)	+1.8 % Pass@1	+2.2 % Pass@1	+4.7 % Pass@1
… (7 more)	…	…	…

Consistent gains: TREX outperformed both human‑crafted baselines and a simple grid‑search auto‑tuner on every task.
Efficiency: By re‑using data recipes and pruning low‑yield branches, TREX reduced total GPU hours by ~30 % compared to exhaustive search.
Insight extraction: The meta‑learner surfaced actionable rules (e.g., “mix 70 % domain data with 30 % general data for legal tasks”) that were later verified by the authors in a separate ablation study.
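
A rule such as the 70/30 mix quoted above can be expressed as a simple sampling step. This is a sketch under assumptions: `mix_datasets` is a hypothetical helper, examples are plain Python objects, and sampling is done without replacement. It does not reflect the paper's actual data pipeline.

```python
import random


def mix_datasets(domain, general, domain_fraction, total, seed=0):
    """Return `total` shuffled examples, with `domain_fraction` drawn from `domain`."""
    if not 0.0 <= domain_fraction <= 1.0:
        raise ValueError("domain_fraction must be between 0 and 1")
    n_domain = round(total * domain_fraction)
    n_general = total - n_domain
    if n_domain > len(domain) or n_general > len(general):
        raise ValueError("not enough examples to sample without replacement")
    rng = random.Random(seed)  # fixed seed keeps the mix reproducible
    mixed = rng.sample(domain, n_domain) + rng.sample(general, n_general)
    rng.shuffle(mixed)
    return mixed


# Placeholder examples; real entries would be training records.
legal = [("legal", i) for i in range(100)]
general = [("general", i) for i in range(100)]
mixed = mix_datasets(legal, general, domain_fraction=0.7, total=10)
```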

Practical Implications

Rapid prototyping – Teams can feed a high‑level goal (e.g., “boost sentiment analysis on product reviews”) and let TREX generate a fine‑tuned model without hand‑crafting data pipelines or hyper‑parameter sweeps.
Cost‑effective scaling – The tree‑search reuses intermediate artifacts, cutting down on redundant preprocessing and training runs, which translates to lower cloud compute bills.
Continuous improvement loops – TREX can be hooked into CI/CD pipelines for LLM products, automatically re‑training when new data arrives or when performance drifts.
Democratizing LLM customization – Smaller organizations lacking deep ML expertise can leverage the agent system to obtain domain‑adapted models that would otherwise require specialist effort.
Integration points – The modular agents can be swapped for proprietary data crawlers, internal evaluation suites, or custom hardware schedulers, making TREX adaptable to existing MLOps stacks.
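
For the continuous‑improvement scenario, a CI job needs some rule for deciding when performance has drifted enough to trigger re‑training. The check below is one possible rule, not something described in the paper: `should_retrain`, the tolerance, and the minimum sample count are all illustrative assumptions.

```python
from statistics import mean


def should_retrain(recent_scores, baseline_score, tolerance=0.02, min_samples=3):
    """Flag drift when the recent average falls below baseline minus tolerance."""
    if len(recent_scores) < min_samples:
        return False  # too few evaluations to judge drift
    return mean(recent_scores) < baseline_score - tolerance


# Hypothetical evaluation scores collected by a scheduled CI job.
drifted = should_retrain([0.80, 0.79, 0.78], baseline_score=0.85)
stable = should_retrain([0.84, 0.85, 0.86], baseline_score=0.85)
```

When the check returns `True`, the pipeline would hand the original requirement and the new data back to the agents for another round of exploration.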

Limitations & Future Work

Search space explosion – Although pruning mitigates the problem, extremely large hyper‑parameter or data‑mix spaces can still overwhelm the system without tighter priors.
Dependency on quality of external data – The Researcher’s literature and dataset mining relies on open‑source resources; noisy or biased sources can propagate into the fine‑tuned model.
Evaluation bottleneck – Accurate assessment of each leaf often requires running the model on task‑specific benchmarks, which can be time‑consuming for large models.
Future directions suggested by the authors include:
1. Incorporating reinforcement‑learning‑based policy search to better balance exploration vs. exploitation.
2. Extending TREX to handle multi‑modal models (e.g., vision‑language).
3. Adding safety and alignment checks as first‑class constraints during the search.
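
To make the search‑space explosion limitation concrete, the arithmetic below counts configurations for a small, made‑up space. The option lists and the names `grid_size` and `sequence_count` are assumptions for illustration and do not come from the paper.

```python
import math


def grid_size(options):
    """Number of distinct configurations in a full Cartesian grid."""
    return math.prod(len(values) for values in options.values())


def sequence_count(options, rounds):
    """Number of multi-round plans if any configuration may be chosen each round."""
    return grid_size(options) ** rounds


# Hypothetical option lists; a real search space would be task-specific.
space = {
    "learning_rate": [1e-5, 2e-5, 5e-5, 1e-4],
    "domain_fraction": [0.5, 0.6, 0.7, 0.8, 0.9],
    "epochs": [1, 2, 3],
    "batch_size": [8, 16, 32, 64],
}

print(grid_size(space))          # 4 * 5 * 3 * 4 = 240
print(sequence_count(space, 3))  # 240 ** 3 = 13824000
```

Even this modest space yields 240 single‑round configurations and over 13 million three‑round plans, which is why pruning and priors matter.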

Authors

Zerun Ma
Guoqiang Wang
Xinchen Xie
Yicheng Chen
He Du
Bowen Li
Yanan Sun
Wenran Liu
Kai Chen
Yining Li

Paper Information

arXiv ID: 2604.14116v1
Categories: cs.AI, cs.CL
Published: April 15, 2026
