[Paper] Routing with Generated Data: Annotation-Free LLM Skill Estimation and Expert Selection
Source: arXiv - 2601.09692v1
Overview
The paper tackles a real‑world snag in the growing ecosystem of large language model (LLM) routers: how to train a router when you have no human‑labeled data. Instead of relying on expensive annotation pipelines, the authors propose generating synthetic queries and answers from a “generator” LLM and then using those to teach the router which expert model to call. Their experiments show that a carefully designed router can still pick the right expert—even when the synthetic data are noisy—opening the door to truly annotation‑free model orchestration.
Key Contributions
- Introduces the “Routing with Generated Data” (RGD) setting, where routers are trained solely on LLM‑generated query‑answer pairs.
- Systematic benchmark across four heterogeneous tasks and 12 candidate models, comparing query‑answer routers (using both the synthetic query and its generated answer) with query‑only routers (using only the query).
- Empirical finding: query‑only routers degrade more gracefully than query‑answer routers as the quality of the generator LLM drops.
- Diagnostic analysis that isolates two essential properties of a good generator (a measurement sketch follows this list):
  - Self‑consistency: the generator must answer its own questions accurately.
  - Performance spread: the generated queries must differentiate the strengths of the candidate models.
- Proposes CASCAL, a novel query‑only routing algorithm that:
  - Estimates each expert's correctness via consensus voting among the pool.
  - Discovers skill niches for each model using hierarchical clustering of consensus patterns.
- Demonstrates robustness: CASCAL outperforms the strongest query‑answer router by 4.6% absolute accuracy when trained on low‑quality generated data.
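To make those two diagnostics concrete, here is a minimal Python sketch, not taken from the paper: `generate_fn` is a hypothetical callable that sends a prompt to the generator LLM and returns a sampled completion, and `correct` is a 0/1 correctness matrix estimated by whatever signal is available (e.g., the consensus voting used by CASCAL, described under Methodology).

```python
import numpy as np
from collections import Counter

def self_consistency_rate(queries, generate_fn, k=5):
    """Fraction of generated queries on which the generator's k sampled
    answers converge to a clear majority answer."""
    stable = 0
    for q in queries:
        samples = [generate_fn(q) for _ in range(k)]
        if Counter(samples).most_common(1)[0][1] / k >= 0.5:
            stable += 1
    return stable / len(queries)

def performance_spread(correct):
    """correct: (n_experts, n_queries) 0/1 matrix of estimated correctness.
    Useful generated queries should separate strong experts from weak ones."""
    per_expert = correct.mean(axis=1)
    return float(per_expert.max() - per_expert.min())
```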
Methodology
Data Generation
- A high‑capacity “generator” LLM receives a high‑level task description (e.g., “summarize a news article”).
- It autonomously creates a set of synthetic queries (input prompts) and, optionally, synthetic answers (its own completions).
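A minimal sketch of this step, assuming only a hypothetical `generate_fn` callable wrapping the generator LLM; the paper's actual prompting strategy may differ.

```python
from typing import Callable, List, Optional, Tuple

def synthesize_routing_data(
    task_description: str,
    generate_fn: Callable[[str], str],
    n_queries: int = 100,
    with_answers: bool = True,
) -> List[Tuple[str, Optional[str]]]:
    """Ask the generator LLM for synthetic queries and, optionally,
    its own answers to them."""
    pairs = []
    for i in range(n_queries):
        # One query per call, to encourage variety across the set.
        query = generate_fn(
            f"Task: {task_description}\n"
            f"Write example input #{i + 1} for this task. "
            "Output only the input, nothing else."
        )
        answer = generate_fn(query) if with_answers else None
        pairs.append((query, answer))
    return pairs
```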
Router Training Variants
- Query‑Answer Router: Trained on the (query, answer) pair, treating the generated answer as a proxy label for the downstream task.
- Query‑Only Router: Trained only on the query, discarding the generated answer.
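One plausible reading of the two variants, sketched below (this is not the paper's exact recipe): both fit the same query‑conditioned classifier, and they differ only in where the training labels come from. `expert_outputs` is a hypothetical (n_experts, n_queries) array of normalized answer strings, and exact‑match comparison stands in for a real grading metric.

```python
import numpy as np
from collections import Counter
from sklearn.linear_model import LogisticRegression

def labels_from_answers(expert_outputs, generated_answers):
    """Query-answer variant: grade each expert against the generator's
    answer (used as a proxy reference) and label each query accordingly."""
    correct = expert_outputs == np.asarray(generated_answers)[None, :]
    # argmax over a boolean column picks the first matching expert
    # (falling back to expert 0 when nobody matches).
    return correct.argmax(axis=0)

def labels_from_consensus(expert_outputs):
    """Query-only variant: no reference answer is kept, so the experts'
    own majority answer serves as the correctness signal instead."""
    n_experts, n_queries = expert_outputs.shape
    labels = np.zeros(n_queries, dtype=int)
    for q in range(n_queries):
        majority = Counter(expert_outputs[:, q]).most_common(1)[0][0]
        labels[q] = int(np.argmax(expert_outputs[:, q] == majority))
    return labels

def train_router(query_embs, labels):
    """Both variants share this step; only the label source differs."""
    return LogisticRegression(max_iter=1000).fit(query_embs, labels)
```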
CASCAL (Consensus‑Based Skill‑Clustering Router)
- Consensus Voting: For each synthetic query, all candidate models generate answers. The router records which models agree with the majority answer, using this as a soft “correctness” signal.
- Hierarchical Clustering: Models are grouped based on similarity of their consensus patterns, revealing niche expertise (e.g., one model excels at math, another at code).
- Routing Decision: At inference time, a new user query is matched to the nearest skill cluster, and the router selects the model(s) most likely to succeed.
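The summary leaves the exact clustering target ambiguous, so the sketch below is one plausible instantiation rather than the authors' implementation: it assumes precomputed query embeddings (`query_embs`) and normalized string answers from every candidate model, clusters the synthetic queries hierarchically, and assigns each cluster the expert with the highest consensus agreement.

```python
import numpy as np
from collections import Counter
from scipy.cluster.hierarchy import fcluster, linkage

def consensus_matrix(expert_outputs):
    """expert_outputs: (n_experts, n_queries) normalized answer strings.
    Returns a 0/1 matrix: did each expert agree with the majority answer?"""
    n_experts, n_queries = expert_outputs.shape
    agree = np.zeros((n_experts, n_queries))
    for q in range(n_queries):
        majority = Counter(expert_outputs[:, q]).most_common(1)[0][0]
        agree[:, q] = expert_outputs[:, q] == majority
    return agree

def build_skill_clusters(query_embs, agree, n_clusters=8):
    """Cluster the synthetic queries, then record each cluster's centroid
    and its consensus-best expert."""
    labels = fcluster(linkage(query_embs, method="ward"),
                      t=n_clusters, criterion="maxclust")
    centroids, best_expert = {}, {}
    for c in np.unique(labels):
        members = labels == c
        centroids[c] = query_embs[members].mean(axis=0)
        best_expert[c] = int(agree[:, members].mean(axis=1).argmax())
    return centroids, best_expert

def route(query_emb, centroids, best_expert):
    """Send a new query to the strongest expert of its nearest cluster."""
    nearest = min(centroids,
                  key=lambda c: np.linalg.norm(query_emb - centroids[c]))
    return best_expert[nearest]
```

Note that nothing here requires a reference answer at inference time: routing reduces to a nearest‑centroid lookup plus a table of per‑cluster experts, which keeps the added latency small.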
Evaluation
- Four benchmarks (e.g., open‑domain QA, code generation, summarization, reasoning) covering diverse input distributions.
- Twelve candidate LLMs ranging from open‑source 7B‑parameter models to proprietary 175B‑parameter systems.
- Varying generator quality by swapping in weaker or stronger LLMs to test robustness.
Results & Findings
| Generator setting | Generator accuracy | Best query‑answer router accuracy | Best query‑only router accuracy | CASCAL accuracy |
|---|---|---|---|---|
| High quality (GPT‑4) | 92% | 88% | 90% | 91% |
| Medium quality (GPT‑3.5) | 84% | 80% | 84% | 86% |
| Low quality (LLaMA‑2‑7B) | 71% | 63% | 68% | 67% |
- Degradation Curve: Query‑answer routers lose ~9% absolute accuracy when moving from high‑ to low‑quality generators, while query‑only routers lose only ~4%.
- Generator Diagnostics: Filtering out generated queries that the generator cannot answer consistently (self‑consistency check) recovers ~2–3% accuracy; a sketch of this filter follows the list.
- CASCAL Advantage: Even with the weakest generator, CASCAL matches the performance of a query‑answer router trained on a much stronger generator, confirming its resilience to noisy synthetic data.
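A sketch of that self‑consistency filter, again assuming a hypothetical `generate_fn` wrapping the generator LLM: each generated query is re‑answered k times, and queries without a stable majority answer are dropped before router training.

```python
from collections import Counter

def self_consistency_filter(queries, generate_fn, k=5, min_agreement=0.6):
    """Drop generated queries that the generator itself answers inconsistently."""
    kept = []
    for q in queries:
        samples = [generate_fn(q) for _ in range(k)]
        top_count = Counter(samples).most_common(1)[0][1]
        if top_count / k >= min_agreement:  # a stable majority answer exists
            kept.append(q)
    return kept
```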
Practical Implications
- Zero‑Annotation Orchestration: Companies can deploy a router for a fleet of specialist LLMs without building a costly labeled dataset for each new domain.
- Dynamic Skill Discovery: CASCAL’s clustering automatically surfaces which models are best at which sub‑tasks, enabling “model‑as‑a‑service” platforms to expose fine‑grained expertise to developers.
- Cost‑Effective Scaling: By using a modest generator (e.g., an open‑source 7B model) to synthesize routing data, organizations can keep the overall compute budget low while still achieving near‑optimal routing performance.
- Robustness to Distribution Shift: Since the router learns from a wide variety of generated queries, it is less prone to overfitting to a narrow, manually curated benchmark, making it more reliable on real user traffic.
- Plug‑and‑Play Integration: CASCAL’s consensus‑voting step can be implemented as a lightweight pre‑filter before invoking expensive expert models, reducing latency and API costs.
Limitations & Future Work
- Generator Dependency: Although CASCAL tolerates weaker generators, the overall quality still caps the upper bound of routing performance; extremely poor generators may produce queries that fail to differentiate models.
- Consensus Assumption: The method assumes that the majority answer among the model pool is a reasonable proxy for correctness, which may not hold for highly specialized or novel tasks where all models err similarly.
- Scalability of Clustering: Hierarchical clustering on large model pools (hundreds of experts) could become computationally intensive; future work could explore more scalable clustering or online updating mechanisms.
- Evaluation Breadth: The study focuses on four benchmarks; extending to multimodal tasks (vision‑language, audio) would test the generality of the approach.
- Security & Bias: Synthetic data inherit the biases of the generator LLM, potentially propagating them into routing decisions; mitigation strategies (e.g., bias‑aware filtering) remain an open research direction.
Bottom line: By turning LLMs into their own data generators and leveraging consensus‑driven routing, the authors demonstrate a practical pathway to annotation‑free expert selection—an advance that could streamline the deployment of heterogeneous LLM ecosystems in production settings.
Authors
- Tianyi Niu
- Justin Chih‑Yao Chen
- Genta Indra Winata
- Shi‑Xiong Zhang
- Supriyo Chakraborty
- Sambit Sahu
- Yue Zhang
- Elias Stengel‑Eskin
- Mohit Bansal
Paper Information
- arXiv ID: 2601.09692v1
- Categories: cs.CL, cs.AI, cs.LG
- Published: January 14, 2026
- PDF: https://arxiv.org/pdf/2601.09692v1