[Paper] Routing with Generated Data: Annotation-Free LLM Skill Estimation and Expert Selection
Source: arXiv - 2601.09692v1
Overview
The paper tackles a real‑world snag in the growing ecosystem of large language model (LLM) routers: how to train a router when you have no human‑labeled data. Instead of relying on expensive annotation pipelines, the authors propose generating synthetic queries and answers from a “generator” LLM and then using those to teach the router which expert model to call. Their experiments show that a carefully designed router can still pick the right expert—even when the synthetic data are noisy—opening the door to truly annotation‑free model orchestration.
Key Contributions
- Introduces the “Routing with Generated Data” (RGD) setting, where routers are trained solely on LLM‑generated query‑answer pairs.
- Systematic benchmark across four heterogeneous tasks and 12 candidate models, comparing query‑answer routers (using both the synthetic query and its generated answer) with query‑only routers (using only the query).
- Empirical finding: query‑only routers degrade more gracefully than query‑answer routers as the quality of the generator LLM drops.
- Diagnostic analysis that isolates two essential properties of a good generator (a measurement sketch follows this list):
  - Self‑consistency: the generator must answer its own questions accurately.
  - Performance spread: the generated queries must differentiate the strengths of the candidate models.
- Proposes CASCAL, a novel query‑only routing algorithm that:
  - Estimates each expert's correctness via consensus voting among the pool.
  - Discovers skill niches for each model using hierarchical clustering of consensus patterns.
- Demonstrates robustness: CASCAL outperforms the strongest query‑answer router by 4.6% absolute accuracy when trained on low‑quality generated data.
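To make those two diagnostics concrete, here is a minimal Python sketch, not taken from the paper: `generate_fn` is a hypothetical callable that sends a prompt to the generator LLM and returns a sampled completion, and `correct` is a 0/1 correctness matrix estimated by whatever signal is available (e.g., the consensus voting used by CASCAL, described under Methodology).

```python
import numpy as np
from collections import Counter

def self_consistency_rate(queries, generate_fn, k=5):
    """Fraction of generated queries on which the generator's k sampled
    answers converge to a clear majority answer."""
    stable = 0
    for q in queries:
        samples = [generate_fn(q) for _ in range(k)]
        if Counter(samples).most_common(1)[0][1] / k >= 0.5:
            stable += 1
    return stable / len(queries)

def performance_spread(correct):
    """correct: (n_experts, n_queries) 0/1 matrix of estimated correctness.
    Useful generated queries should separate strong experts from weak ones."""
    per_expert = correct.mean(axis=1)
    return float(per_expert.max() - per_expert.min())
```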
Methodology
Data Generation
- A high‑capacity “generator” LLM receives a high‑level task description (e.g., “summarize a news article”).
- It autonomously creates a set of synthetic queries (input prompts) and, optionally, synthetic answers (its own completions).
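A minimal sketch of this step, assuming only a hypothetical `generate_fn` callable wrapping the generator LLM; the paper's actual prompting strategy may differ.

```python
from typing import Callable, List, Optional, Tuple

def synthesize_routing_data(
    task_description: str,
    generate_fn: Callable[[str], str],
    n_queries: int = 100,
    with_answers: bool = True,
) -> List[Tuple[str, Optional[str]]]:
    """Ask the generator LLM for synthetic queries and, optionally,
    its own answers to them."""
    pairs = []
    for i in range(n_queries):
        # One query per call, to encourage variety across the set.
        query = generate_fn(
            f"Task: {task_description}\n"
            f"Write example input #{i + 1} for this task. "
            "Output only the input, nothing else."
        )
        answer = generate_fn(query) if with_answers else None
        pairs.append((query, answer))
    return pairs
```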
Router Training Variants
- Query‑Answer Router: Trained on the (query, answer) pair, treating the generated answer as a proxy label for the downstream task.
- Query‑Only Router: Trained only on the query, discarding the generated answer.
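One plausible reading of the two variants, sketched below (this is not the paper's exact recipe): both fit the same query‑conditioned classifier, and they differ only in where the training labels come from. `expert_outputs` is a hypothetical (n_experts, n_queries) array of normalized answer strings, and exact‑match comparison stands in for a real grading metric.

```python
import numpy as np
from collections import Counter
from sklearn.linear_model import LogisticRegression

def labels_from_answers(expert_outputs, generated_answers):
    """Query-answer variant: grade each expert against the generator's
    answer (used as a proxy reference) and label each query accordingly."""
    correct = expert_outputs == np.asarray(generated_answers)[None, :]
    # argmax over a boolean column picks the first matching expert
    # (falling back to expert 0 when nobody matches).
    return correct.argmax(axis=0)

def labels_from_consensus(expert_outputs):
    """Query-only variant: no reference answer is kept, so the experts'
    own majority answer serves as the correctness signal instead."""
    n_experts, n_queries = expert_outputs.shape
    labels = np.zeros(n_queries, dtype=int)
    for q in range(n_queries):
        majority = Counter(expert_outputs[:, q]).most_common(1)[0][0]
        labels[q] = int(np.argmax(expert_outputs[:, q] == majority))
    return labels

def train_router(query_embs, labels):
    """Both variants share this step; only the label source differs."""
    return LogisticRegression(max_iter=1000).fit(query_embs, labels)
```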
CASCAL (Consensus‑Based Skill‑Clustering Router)
- Consensus Voting: For each synthetic query, all candidate models generate answers. The router records which models agree with the majority answer, using this as a soft “correctness” signal.
- Hierarchical Clustering: Models are grouped based on similarity of their consensus patterns, revealing niche expertise (e.g., one model excels at math, another at code).
- Routing Decision: At inference time, a new user query is matched to the nearest skill cluster, and the router selects the model(s) most likely to succeed.
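The summary leaves the exact clustering target ambiguous, so the sketch below is one plausible instantiation rather than the authors' implementation: it assumes precomputed query embeddings (`query_embs`) and normalized string answers from every candidate model, clusters the synthetic queries hierarchically, and assigns each cluster the expert with the highest consensus agreement.

```python
import numpy as np
from collections import Counter
from scipy.cluster.hierarchy import fcluster, linkage

def consensus_matrix(expert_outputs):
    """expert_outputs: (n_experts, n_queries) normalized answer strings.
    Returns a 0/1 matrix: did each expert agree with the majority answer?"""
    n_experts, n_queries = expert_outputs.shape
    agree = np.zeros((n_experts, n_queries))
    for q in range(n_queries):
        majority = Counter(expert_outputs[:, q]).most_common(1)[0][0]
        agree[:, q] = expert_outputs[:, q] == majority
    return agree

def build_skill_clusters(query_embs, agree, n_clusters=8):
    """Cluster the synthetic queries, then record each cluster's centroid
    and its consensus-best expert."""
    labels = fcluster(linkage(query_embs, method="ward"),
                      t=n_clusters, criterion="maxclust")
    centroids, best_expert = {}, {}
    for c in np.unique(labels):
        members = labels == c
        centroids[c] = query_embs[members].mean(axis=0)
        best_expert[c] = int(agree[:, members].mean(axis=1).argmax())
    return centroids, best_expert

def route(query_emb, centroids, best_expert):
    """Send a new query to the strongest expert of its nearest cluster."""
    nearest = min(centroids,
                  key=lambda c: np.linalg.norm(query_emb - centroids[c]))
    return best_expert[nearest]
```

Note that nothing here requires a reference answer at inference time: routing reduces to a nearest‑centroid lookup plus a table of per‑cluster experts, which keeps the added latency small.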
Evaluation
- Four benchmarks (e.g., open‑domain QA, code generation, summarization, reasoning) covering diverse input distributions.
- Twelve candidate LLMs ranging from open‑source 7B‑parameter models to proprietary 175B‑parameter systems.
- Varying generator quality by swapping in weaker or stronger LLMs to test robustness.
Results & Findings
| Generator setting | Generator accuracy | Best query‑answer router accuracy | Best query‑only router accuracy | CASCAL accuracy |
|---|---|---|---|---|
| High quality (GPT‑4) | 92% | 88% | 90% | 91% |
| Medium quality (GPT‑3.5) | 84% | 80% | 84% | 86% |
| Low quality (LLaMA‑2‑7B) | 71% | 63% | 68% | 67% |
- Degradation Curve: Query‑answer routers lose ~9% absolute accuracy when moving from high‑ to low‑quality generators, while query‑only routers lose only ~4%.
- Generator Diagnostics: Filtering out generated queries that the generator cannot answer consistently (self‑consistency check) recovers ~2–3% accuracy; a sketch of this filter follows the list.
- CASCAL Advantage: Even with the weakest generator, CASCAL matches the performance of a query‑answer router trained on a much stronger generator, confirming its resilience to noisy synthetic data.
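A sketch of that self‑consistency filter, again assuming a hypothetical `generate_fn` wrapping the generator LLM: each generated query is re‑answered k times, and queries without a stable majority answer are dropped before router training.

```python
from collections import Counter

def self_consistency_filter(queries, generate_fn, k=5, min_agreement=0.6):
    """Drop generated queries that the generator itself answers inconsistently."""
    kept = []
    for q in queries:
        samples = [generate_fn(q) for _ in range(k)]
        top_count = Counter(samples).most_common(1)[0][1]
        if top_count / k >= min_agreement:  # a stable majority answer exists
            kept.append(q)
    return kept
```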
Practical Implications
- Zero‑Annotation Orchestration: Companies can deploy a router for a fleet of specialist LLMs without building a costly labeled dataset for each new domain.
- Dynamic Skill Discovery: CASCAL’s clustering automatically surfaces which models are best at which sub‑tasks, enabling “model‑as‑a‑service” platforms to expose fine‑grained expertise to developers.
- Cost‑Effective Scaling: By using a modest generator (e.g., an open‑source 7B model) to synthesize routing data, organizations can keep the overall compute budget low while still achieving near‑optimal routing performance.
- Robustness to Distribution Shift: Since the router learns from a wide variety of generated queries, it is less prone to overfitting to a narrow, manually curated benchmark, making it more reliable on real user traffic.
- Plug‑and‑Play Integration: CASCAL’s consensus‑voting step can be implemented as a lightweight pre‑filter before invoking expensive expert models, reducing latency and API costs.
Limitations & Future Work
- Generator Dependency: Although CASCAL tolerates weaker generators, the overall quality still caps the upper bound of routing performance; extremely poor generators may produce queries that fail to differentiate models.
- Consensus Assumption: The method assumes that the majority answer among the model pool is a reasonable proxy for correctness, which may not hold for highly specialized or novel tasks where all models err similarly.
- Scalability of Clustering: Hierarchical clustering on large model pools (hundreds of experts) could become computationally intensive; future work could explore more scalable clustering or online updating mechanisms.
- Evaluation Breadth: The study focuses on four benchmarks; extending to multimodal tasks (vision‑language, audio) would test the generality of the approach.
- Security & Bias: Synthetic data inherit the biases of the generator LLM, potentially propagating them into routing decisions; mitigation strategies (e.g., bias‑aware filtering) remain an open research direction.
Bottom line: By turning LLMs into their own data generators and leveraging consensus‑driven routing, the authors demonstrate a practical pathway to annotation‑free expert selection—an advance that could streamline the deployment of heterogeneous LLM ecosystems in production settings.
Authors
- Tianyi Niu
- Justin Chih‑Yao Chen
- Genta Indra Winata
- Shi‑Xiong Zhang
- Supriyo Chakraborty
- Sambit Sahu
- Yue Zhang
- Elias Stengel‑Eskin
- Mohit Bansal
Paper Information
- arXiv ID: 2601.09692v1
- Categories: cs.CL, cs.AI, cs.LG
- Published: January 14, 2026
- PDF: https://arxiv.org/pdf/2601.09692v1