[Paper] Routing with Generated Data: Annotation-Free LLM Skill Estimation and Expert Selection

Published: January 14, 2026 at 01:43 PM EST
5 min read
Source: arXiv - 2601.09692v1

Overview

The paper tackles a real‑world snag in the growing ecosystem of large language model (LLM) routers: how to train a router when you have no human‑labeled data. Instead of relying on expensive annotation pipelines, the authors propose generating synthetic queries and answers from a “generator” LLM and then using those to teach the router which expert model to call. Their experiments show that a carefully designed router can still pick the right expert—even when the synthetic data are noisy—opening the door to truly annotation‑free model orchestration.

Key Contributions

  • Introduces the “Routing with Generated Data” (RGD) setting, where routers are trained solely on LLM‑generated query‑answer pairs.
  • Systematic benchmark across four heterogeneous tasks and 12 candidate models, comparing query‑answer routers (using both the synthetic query and its generated answer) with query‑only routers (using only the query).
  • Empirical finding: query‑only routers degrade more gracefully than query‑answer routers as the quality of the generator LLM drops.
  • Diagnostic analysis that isolates two essential properties of a good generator (both are sketched in code after this list):
    1. Self‑consistency – the generator must answer its own questions accurately.
    2. Performance spread – the generated queries must differentiate the strengths of the candidate models.
  • Proposes CASCAL, a novel query‑only routing algorithm that:
    • Estimates each expert’s correctness via consensus voting among the pool.
    • Discovers skill niches for each model using hierarchical clustering of consensus patterns.
  • Demonstrates robustness: CASCAL outperforms the strongest query‑answer router by 4.6% absolute accuracy when trained on low‑quality generated data.
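
To make the two diagnostics concrete, here is one plausible way to operationalize them in Python. This is a sketch under our own assumptions, not the paper's code: `self_consistency` scores the generator by majority agreement among its own sampled answers, and `performance_spread` uses the max-min accuracy gap across the candidate pool; the paper may define both quantities differently.

```python
import numpy as np

# Self-consistency: sample the generator k times per query and measure how
# often its answers agree with the modal answer. gen_answers[i] is a list
# of k sampled (normalized) answers to synthetic query i.
def self_consistency(gen_answers):
    rates = [max(samples.count(a) for a in set(samples)) / len(samples)
             for samples in gen_answers]
    return float(np.mean(rates))  # 1.0 = the generator always agrees with itself

# Performance spread: how strongly the generated queries separate the
# candidate models. model_acc[m] is model m's accuracy on those queries.
def performance_spread(model_acc):
    model_acc = np.asarray(model_acc, dtype=float)
    return float(model_acc.max() - model_acc.min())  # 0 = nothing to route on
```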

Methodology

  1. Data Generation

    • A high‑capacity “generator” LLM receives a high‑level task description (e.g., “summarize a news article”).
    • It autonomously creates a set of synthetic queries (input prompts) and, optionally, synthetic answers (its own completions).
  2. Router Training Variants

    • Query‑Answer Router: Trained on the (query, answer) pair, treating the answer as a proxy label for the downstream task.
    • Query‑Only Router: Trained only on the query, discarding the generated answer.
  3. CASCAL (Consensus‑Based Skill‑Clustering Router)

    • Consensus Voting: For each synthetic query, all candidate models generate answers. The router records which models agree with the majority answer, using this as a soft “correctness” signal.
    • Hierarchical Clustering: Models are grouped based on similarity of their consensus patterns, revealing niche expertise (e.g., one model excels at math, another at code).
    • Routing Decision: At inference time, a new user query is matched to the nearest skill cluster, and the router selects the model(s) most likely to succeed (a minimal sketch of this pipeline follows this list).
  4. Evaluation

    • Four benchmarks (e.g., open‑domain QA, code generation, summarization, reasoning) covering diverse input distributions.
    • Twelve candidate LLMs ranging from open‑source 7B‑parameter models to proprietary 175B‑parameter systems.
    • Varying generator quality by swapping in weaker or stronger LLMs to test robustness.
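
Putting steps 1–3 together, below is a minimal sketch of a CASCAL‑style pipeline. It is our reading of the description above, not the authors' implementation: we assume short string answers that can be compared after normalization, a generic sentence encoder for query embeddings, and SciPy's hierarchical clustering; all function names are ours.

```python
import numpy as np
from collections import Counter
from scipy.cluster.hierarchy import linkage, fcluster

def consensus_votes(answers):
    """answers[i][m]: model m's (normalized) answer to synthetic query i.
    Returns V with V[i, m] = 1 iff model m agrees with the majority answer."""
    V = np.zeros((len(answers), len(answers[0])))
    for i, row in enumerate(answers):
        majority, _ = Counter(row).most_common(1)[0]
        V[i] = [a == majority for a in row]
    return V

def skill_clusters(V, n_clusters=4):
    """Group MODELS whose consensus patterns (columns of V) look alike."""
    Z = linkage(V.T, method="average", metric="hamming")
    return fcluster(Z, t=n_clusters, criterion="maxclust")  # one label per model

def build_router(V, query_embs, model_labels):
    """query_embs: embeddings of the synthetic queries (any sentence encoder).
    Each model cluster gets a centroid over the queries its members tend to
    answer correctly; a new query routes to the nearest centroid's best member."""
    centroids, best = [], []
    for c in sorted(set(model_labels)):
        members = np.where(model_labels == c)[0]
        mask = V[:, members].mean(axis=1) > 0.5   # queries this niche handles well
        if not mask.any():
            mask[:] = True                        # degenerate cluster: use all queries
        centroids.append(query_embs[mask].mean(axis=0))
        best.append(members[V[mask][:, members].mean(axis=0).argmax()])
    centroids = np.stack(centroids)

    def route(new_emb):
        c = np.linalg.norm(centroids - new_emb, axis=1).argmin()
        return int(best[c])                       # index of the expert to call
    return route
```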

Results & Findings

| Setting | Generator Quality | Best Query‑Answer Router Accuracy | Best Query‑Only Router Accuracy | CASCAL Accuracy |
| --- | --- | --- | --- | --- |
| High‑quality generator (GPT‑4) | 92% | 88% | 90% | 91% |
| Medium‑quality generator (GPT‑3.5) | 84% | 80% | 84% | 86% |
| Low‑quality generator (LLaMA‑2‑7B) | 71% | 63% | 68% | 67% |

  • Degradation Curve: Query‑answer routers lose ~9% absolute accuracy when moving from high‑ to low‑quality generators, while query‑only routers lose only ~4%.
  • Generator Diagnostics: Filtering out generated queries that the generator cannot answer consistently (self‑consistency check) recovers ~2–3% accuracy; a minimal filter sketch follows this list.
  • CASCAL Advantage: Even with the weakest generator, CASCAL matches the performance of a query‑answer router trained on a much stronger generator, confirming its resilience to noisy synthetic data.
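
The self‑consistency filter behind that ~2–3% recovery can be a one‑pass pre‑processing step. A minimal sketch, with an illustrative 0.7 agreement threshold that is our choice rather than a value from the paper:

```python
def filter_by_self_consistency(queries, gen_answers, threshold=0.7):
    """Drop synthetic queries the generator cannot answer consistently.
    gen_answers[i]: repeated sampled answers to queries[i]."""
    kept = []
    for q, samples in zip(queries, gen_answers):
        agreement = max(samples.count(a) for a in set(samples)) / len(samples)
        if agreement >= threshold:
            kept.append(q)
    return kept
```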

Practical Implications

  • Zero‑Annotation Orchestration: Companies can deploy a router for a fleet of specialist LLMs without building a costly labeled dataset for each new domain.
  • Dynamic Skill Discovery: CASCAL’s clustering automatically surfaces which models are best at which sub‑tasks, enabling “model‑as‑a‑service” platforms to expose fine‑grained expertise to developers.
  • Cost‑Effective Scaling: By using a modest generator (e.g., an open‑source 7B model) to synthesize routing data, organizations can keep the overall compute budget low while still achieving near‑optimal routing performance.
  • Robustness to Distribution Shift: Since the router learns from a wide variety of generated queries, it is less prone to overfitting to a narrow, manually curated benchmark, making it more reliable on real user traffic.
  • Plug‑and‑Play Integration: CASCAL’s consensus‑voting step can be implemented as a lightweight pre‑filter before invoking expensive expert models, reducing latency and API costs (a short end‑to‑end wiring example follows this list).
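
For concreteness, wiring the earlier sketches together might look like the following. `embed` stands in for whichever sentence encoder produced `query_embs`, and the whole snippet is illustrative rather than the paper's deployment recipe.

```python
# Offline: build the router once from the (optionally filtered) synthetic set.
V = consensus_votes(answers)                    # shape: (n_queries, n_models)
model_labels = skill_clusters(V, n_clusters=4)
route = build_router(V, query_embs, model_labels)

# Online: one cheap embedding plus a nearest-centroid lookup per user
# query, then a single call to the selected expert model.
expert_id = route(embed("Write a function that merges two sorted lists."))
```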

Limitations & Future Work

  • Generator Dependency: Although CASCAL tolerates weaker generators, generator quality still caps attainable routing performance; extremely poor generators may produce queries that fail to differentiate models.
  • Consensus Assumption: The method assumes that the majority answer among the model pool is a reasonable proxy for correctness, which may not hold for highly specialized or novel tasks where all models err similarly.
  • Scalability of Clustering: Hierarchical clustering on large model pools (hundreds of experts) could become computationally intensive; future work could explore more scalable clustering or online updating mechanisms.
  • Evaluation Breadth: The study focuses on four benchmarks; extending to multimodal tasks (vision‑language, audio) would test the generality of the approach.
  • Security & Bias: Synthetic data inherit the biases of the generator LLM, potentially propagating them into routing decisions; mitigation strategies (e.g., bias‑aware filtering) remain an open research direction.

Bottom line: By turning LLMs into their own data generators and leveraging consensus‑driven routing, the authors demonstrate a practical pathway to annotation‑free expert selection—an advance that could streamline the deployment of heterogeneous LLM ecosystems in production settings.

Authors

  • Tianyi Niu
  • Justin Chih‑Yao Chen
  • Genta Indra Winata
  • Shi‑Xiong Zhang
  • Supriyo Chakraborty
  • Sambit Sahu
  • Yue Zhang
  • Elias Stengel‑Eskin
  • Mohit Bansal

Paper Information

  • arXiv ID: 2601.09692v1
  • Categories: cs.CL, cs.AI, cs.LG
  • Published: January 14, 2026