[Paper] UPA: Unsupervised Prompt Agent via Tree-Based Search and Selection

Published: January 30, 2026 at 01:39 PM EST

Source: arXiv - 2601.23273v1

Overview

The paper introduces UPA (Unsupervised Prompt Agent), a new way to automatically improve prompts for large language models (LLMs) without any labeled reward data. By treating prompt refinement as a tree‑structured search problem and leveraging only pairwise comparisons from the LLM itself, UPA can discover high‑quality prompts in fully unsupervised settings—something that was previously thought to require supervised feedback.

Key Contributions

  • Fully unsupervised prompt optimization: No human‑annotated scores or task‑specific rewards are needed.
  • Tree‑based exploration: Prompts are explored via an evolving search tree, allowing systematic coverage of the combinatorial prompt space.
  • Order‑invariant pairwise comparison: Relies on relative LLM judgments (“Is Prompt A better than Prompt B?”) rather than absolute scores, so results do not depend on a calibrated scoring scale or the order in which prompts are presented.
  • Two‑stage selection framework:
    1. Path‑wise Bayesian aggregation of local comparisons (via a Bradley‑Terry‑Luce model) to prune low‑confidence candidates.
    2. Global tournament‑style comparisons to infer a latent quality ranking and pick the best prompt.
  • Empirical superiority: Across several benchmark tasks (e.g., text classification, reasoning, code generation), UPA outperforms state‑of‑the‑art supervised and unsupervised prompt‑search methods.
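As a concrete illustration, the order‑invariant comparison in the third bullet can be sketched in a few lines of Python. Everything here is hypothetical scaffolding, not the paper's code: `judge` stands in for any LLM call that answers "first" or "second", and the question template is invented for this sketch. Asking in both presentation orders and keeping only consistent answers is one simple way to cancel position bias.

```python
def pairwise_preference(judge, task, prompt_a, prompt_b):
    """Order-invariant pairwise comparison (sketch).

    `judge` is a hypothetical callable wrapping an LLM judge; it
    receives a comparison question and returns the string "first"
    or "second". The same question is asked in both presentation
    orders, and only consistent answers count, which removes
    position bias.
    """
    q = ("Task: {t}\n"
         "Which of these two prompts yields a better answer?\n"
         "First: {x}\nSecond: {y}\n"
         "Answer 'first' or 'second'.")
    forward = judge(q.format(t=task, x=prompt_a, y=prompt_b))
    backward = judge(q.format(t=task, x=prompt_b, y=prompt_a))
    if forward == "first" and backward == "second":
        return "A"    # prompt_a preferred in both orders
    if forward == "second" and backward == "first":
        return "B"    # prompt_b preferred in both orders
    return "tie"      # inconsistent judgments carry no signal
```

A deterministic stub judge (for example, one that always prefers the more detailed prompt) is enough to unit-test this plumbing before plugging in a real model.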

Methodology

  1. Prompt Space as a Tree

    • Each node represents a concrete prompt.
    • Children are generated by applying simple edit operations (add/remove a sentence, swap wording, change temperature, etc.).
    • Starting from a seed prompt, the tree expands iteratively, exploring diverse variations.
  2. Local Pairwise Comparisons

    • For any two sibling prompts, the LLM is asked a relative question: “Which of these two prompts yields a better answer for the task?”
    • The LLM returns a binary preference; this is order‑invariant (it doesn’t need a numeric score).
  3. Bayesian Aggregation (Stage 1)

    • The Bradley‑Terry‑Luce (BTL) model treats each comparison as evidence about an underlying latent “prompt quality”.
    • Using Bayesian inference, UPA computes a posterior distribution for each node’s quality and discards branches whose confidence is low, focusing the search on promising regions.
  4. Global Tournament (Stage 2)

    • The surviving candidates are pitted against each other in a round‑robin style tournament, again using LLM pairwise judgments.
    • The BTL model aggregates these global comparisons to produce a final ranking, from which the top‑ranked prompt is selected.
  5. Iterative Loop

    • The process repeats: the best prompt becomes the new root, new edits are generated, and the two‑stage selection runs again until a stopping criterion (budget, convergence) is met.
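The selection machinery above can be sketched with a maximum-likelihood Bradley‑Terry fit. Note the simplification: the paper's Stage 1 computes a full Bayesian posterior over BTL strengths, whereas this sketch uses the standard minorization‑maximization (MM) point estimate. The `(winner, loser)` pair format and the `btl_strengths` helper are assumptions made for the example.

```python
from collections import defaultdict

def btl_strengths(comparisons, n_items, iters=100):
    """Fit Bradley-Terry-Luce strengths from pairwise outcomes (sketch).

    comparisons: list of (winner, loser) item indices, e.g. collected
    from LLM pairwise judgments. Returns one positive strength per
    item; the top-ranked prompt is simply the argmax. Uses the classic
    minorization-maximization update (maximum likelihood, not the
    paper's Bayesian posterior).
    """
    wins = defaultdict(int)    # wins[i]: comparisons item i won
    games = defaultdict(int)   # games[(i, j)]: matches between i and j
    for w, l in comparisons:
        wins[w] += 1
        games[(min(w, l), max(w, l))] += 1

    p = [1.0] * n_items
    for _ in range(iters):
        new_p = []
        for i in range(n_items):
            denom = sum(n / (p[a] + p[b])
                        for (a, b), n in games.items() if i in (a, b))
            new_p.append(wins[i] / denom if denom > 0 else p[i])
        total = sum(new_p)
        p = [x * n_items / total for x in new_p]   # rescale for stability
    return p

# Stage-2 flavor: rank surviving candidates from round-robin results.
results = [(2, 0), (2, 1), (1, 0), (2, 0), (2, 1), (1, 0)]
strengths = btl_strengths(results, n_items=3)
best = max(range(3), key=lambda i: strengths[i])   # candidate 2 wins every match
```

In Stage 1 the same aggregation would run over the local comparisons along each root-to-leaf path, pruning nodes whose estimated strength is low; Stage 2 reuses it over the tournament results of the survivors.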

Results & Findings

| Task / Dataset | Baseline (Supervised) | UPA (Unsupervised) | Gain |
| --- | --- | --- | --- |
| Sentiment Classification (SST‑2) | 89.2 % accuracy (RL‑based prompt search) | 91.5 % | +2.3 pts |
| Multi‑choice QA (ARC‑Easy) | 71.0 % (few‑shot prompting) | 73.8 % | +2.8 pts |
| Code Generation (HumanEval) | 45.6 % pass@1 (gradient‑based prompt tuning) | 48.9 % | +3.3 pts |
| Open‑ended Reasoning (GSM‑8K) | 68.4 % (self‑consistency) | 70.7 % | +2.3 pts |
  • Consistency across domains: UPA’s advantage holds for classification, reasoning, and generation tasks.
  • Sample efficiency: With the same comparison budget (≈ 500 pairwise queries), UPA finds better prompts than supervised reinforcement‑learning agents that rely on human‑rated rewards.
  • Robustness to LLM noise: The Bayesian aggregation smooths out occasional inconsistent LLM judgments, leading to stable convergence.

Practical Implications

| Who Benefits | How They Can Use UPA |
| --- | --- |
| Developers building AI‑powered products | Automatically tailor prompts for a specific UI or downstream API without hiring prompt engineers or collecting labeled reward data. |
| MLOps teams | Integrate UPA as a plug‑in in CI pipelines: each model version gets a fresh, unsupervised prompt search that adapts to subtle changes in model behavior. |
| LLM service providers | Offer “prompt‑as‑a‑service”: customers supply a seed prompt and UPA returns an optimized version, reducing support tickets caused by poor prompt performance. |
| Researchers | Use UPA as a baseline for studying prompt robustness; the tree‑search framework can be extended with domain‑specific edit operators (e.g., code syntax transformations). |

Key take‑away: You no longer need a curated reward model or human‑in‑the‑loop labeling to get high‑quality prompts. UPA turns the LLM itself into a reliable judge, making prompt engineering scalable and cost‑effective.

Limitations & Future Work

  • Dependence on LLM consistency: If the underlying model gives highly contradictory pairwise answers (e.g., on ambiguous tasks), the BTL aggregation may struggle.
  • Search operator design: The current edit set is handcrafted; richer or task‑specific operators could further improve coverage but may increase the search space dramatically.
  • Scalability to very large prompt spaces: While the tree pruning helps, extremely high‑dimensional prompt representations could still require prohibitive numbers of comparisons.
  • Future directions suggested by the authors:
    1. Learning adaptive edit operators via meta‑learning.
    2. Combining unsupervised pairwise feedback with occasional cheap human checks to tighten the BTL estimates.
    3. Extending the framework to multi‑modal prompts (e.g., text + image instructions).

UPA shows that sophisticated, agent‑style prompt optimization is viable even when you have no labeled reward data. For developers looking to squeeze extra performance out of existing LLMs, it offers a practical, plug‑and‑play solution that bridges the gap between research‑grade prompt tuning and production‑ready deployment.

Authors

  • Siran Peng
  • Weisong Zhao
  • Tianyu Fu
  • Chenxu Zhao
  • Tianshuo Zhang
  • Haoyuan Zhang
  • Xiangyu Zhu
  • Minghui Wu
  • Zhen Lei

Paper Information

  • arXiv ID: 2601.23273v1
  • Categories: cs.CL
  • Published: January 30, 2026