[Paper] HAPS: Hierarchical LLM Routing with Joint Architecture and Parameter Search

Published: January 9, 2026 at 11:22 AM EST
3 min read
Source: arXiv - 2601.05903v1

Overview

The paper presents HAPS, a hierarchical routing system that automatically picks the best large language model (LLM) and its optimal hyper‑parameters for a given task. By coupling architecture selection with parameter tuning, HAPS achieves higher accuracy than prior routing methods that only choose among model families.

Key Contributions

  • Joint Architecture‑Parameter Search: Introduces a two‑level router that first selects an LLM architecture and then selects its hyper‑parameter configuration, rather than treating these decisions separately.
  • Parameter Generation Network (PGN): A shared network that produces candidate parameter settings for both routers, enabling knowledge transfer between architecture and parameter search.
  • Reward‑Augmented Training Objective: Combines task performance rewards with regularization terms to stabilize the hierarchical search and speed up convergence.
  • Empirical Validation: Demonstrates consistent gains on two standard LLM routing benchmarks, outperforming strong baselines such as Mixture‑of‑Experts routing and static model ensembles.
  • Open‑Source Release: Provides a ready‑to‑run implementation (https://github.com/zihangtian/HAPS) to facilitate reproducibility and downstream adoption.

Methodology

  1. Candidate Pool: A set of heterogeneous LLMs (e.g., GPT‑2‑medium, LLaMA‑7B, T5‑XL) is prepared, each with a configurable hyper‑parameter space (learning rate, prompt style, temperature, etc.).

  2. High‑Level Router: A lightweight classifier takes the task description (or input prompt) and outputs a probability distribution over the candidate architectures.

  3. Low‑Level Router: Conditioned on the architecture chosen by the high‑level router, this component selects a concrete parameter configuration from the space defined for that model.

  4. Parameter Generation Network: A neural network that, given the task embedding, generates a set of plausible hyper‑parameter vectors. Both routers query the PGN, allowing them to share learned “good‑parameter” patterns.

  5. Training Objective: The system is optimized with a reward‑augmented loss:

    • Task Reward: Negative log‑likelihood or task‑specific metric (e.g., BLEU, accuracy).
    • Regularization Reward: Encourages diversity among selected architectures and penalizes overly complex parameter settings.

    Gradient‑based updates are applied jointly to the routers and the PGN, using REINFORCE‑style estimators for the discrete routing decisions.
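The two‑level sampling and REINFORCE‑style update described in steps 2–5 can be sketched as a minimal toy, assuming softmax routers over a small discrete candidate pool. This is not the paper's implementation: the model names, the flattened config grid, the reward values, the learning rate, and the running baseline are all illustrative assumptions.

```python
import math
import random

random.seed(0)

# Hypothetical candidate pool and discrete config grids (illustrative only;
# the paper's actual pool and hyper-parameter spaces differ).
ARCHS = ["gpt2-medium", "llama-7b"]
N_CONFIGS = 4  # e.g., a flattened 2x2 grid over (temperature, learning rate)

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

class HierarchicalRouter:
    """Toy two-level router: architecture logits plus per-architecture config logits."""
    def __init__(self):
        self.arch_logits = [0.0] * len(ARCHS)
        self.cfg_logits = {a: [0.0] * N_CONFIGS for a in ARCHS}

    def sample(self):
        # High-level router picks an architecture, low-level router picks a config.
        p_arch = softmax(self.arch_logits)
        a_idx = random.choices(range(len(ARCHS)), weights=p_arch)[0]
        arch = ARCHS[a_idx]
        p_cfg = softmax(self.cfg_logits[arch])
        c_idx = random.choices(range(N_CONFIGS), weights=p_cfg)[0]
        return arch, a_idx, c_idx

    def reinforce_update(self, arch, a_idx, c_idx, reward, baseline, lr=0.5):
        # REINFORCE with a baseline: raise the probability of the sampled
        # (architecture, config) pair when the reward beats the running average.
        adv = reward - baseline
        p_arch = softmax(self.arch_logits)
        for i in range(len(ARCHS)):
            self.arch_logits[i] += lr * adv * ((1.0 if i == a_idx else 0.0) - p_arch[i])
        p_cfg = softmax(self.cfg_logits[arch])
        for j in range(N_CONFIGS):
            self.cfg_logits[arch][j] += lr * adv * ((1.0 if j == c_idx else 0.0) - p_cfg[j])

def toy_reward(arch, c_idx):
    # Stand-in for a task metric: one architecture/config pair is best by construction.
    if arch == "llama-7b":
        return 1.0 if c_idx == 2 else 0.5
    return 0.1

router = HierarchicalRouter()
baseline = 0.0
for _ in range(500):
    arch, a_idx, c_idx = router.sample()
    r = toy_reward(arch, c_idx)
    router.reinforce_update(arch, a_idx, c_idx, r, baseline)
    baseline = 0.9 * baseline + 0.1 * r  # running reward baseline

p_arch = dict(zip(ARCHS, softmax(router.arch_logits)))
best_arch = max(p_arch, key=p_arch.get)
print(best_arch, round(p_arch[best_arch], 3))
```

With the reward gap above, the high‑level router concentrates its probability mass on the better architecture within a few hundred steps; the paper's PGN would replace the per‑architecture logit tables with a shared network conditioned on a task embedding.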

Results & Findings

| Benchmark | Baseline (Static Best Model) | Prior Routing (Mixture‑of‑Experts) | HAPS |
| --- | --- | --- | --- |
| GLUE‑SuperGLUE | 84.2% | 86.7% | 88.5% |
| OpenAI‑Eval (multi‑turn QA) | 71.3 | 73.9 | 76.4 |

  • Performance Boost: HAPS improves average task scores by 2–3% over the strongest existing routing methods.
  • Parameter Efficiency: The selected configurations often use smaller learning rates and lower temperature settings, indicating that the joint search avoids over‑fitting.
  • Speed: Because the high‑level router quickly narrows the architecture pool, inference latency is comparable to using a single model, despite the underlying search.
  • Ablation: Removing the PGN or the reward‑augmented term drops performance by ~1.5%, confirming their importance.

Practical Implications

  • Dynamic Model Selection in Production: Services can automatically route user queries to the most cost‑effective LLM (e.g., a smaller model for simple intents, a larger one for complex reasoning) without manual tuning.
  • Reduced Engineering Overhead: Teams no longer need separate pipelines for architecture benchmarking and hyper‑parameter sweeps; HAPS handles both in a unified, data‑driven loop.
  • Cost Savings: By selecting the minimal‑sized model that meets performance targets, cloud compute spend can be lowered while preserving quality.
  • Plug‑and‑Play Integration: The open‑source code includes adapters for popular frameworks (Hugging Face Transformers, DeepSpeed), making it straightforward to embed HAPS into existing inference stacks.
  • Extensibility: The hierarchical design can be expanded to include hardware‑aware routing (GPU/TPU selection) or privacy constraints (on‑device vs. cloud models).
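The cost‑saving selection policy above ("the minimal‑sized model that meets performance targets") can be illustrated with a short sketch. The model names, quality scores, and per‑token costs here are invented for illustration; a deployed system would use measured benchmark quality and real pricing.

```python
# Hypothetical per-model profile: expected quality score and cost per 1K tokens (USD).
MODEL_PROFILES = {
    "small-3b":  {"quality": 0.78, "cost": 0.0004},
    "mid-13b":   {"quality": 0.86, "cost": 0.0020},
    "large-70b": {"quality": 0.93, "cost": 0.0090},
}

def route_by_target(quality_target: float) -> str:
    """Pick the cheapest model whose expected quality meets the target,
    falling back to the highest-quality model if none qualifies."""
    eligible = [(profile["cost"], name)
                for name, profile in MODEL_PROFILES.items()
                if profile["quality"] >= quality_target]
    if eligible:
        return min(eligible)[1]  # cheapest model that clears the bar
    return max(MODEL_PROFILES, key=lambda n: MODEL_PROFILES[n]["quality"])

print(route_by_target(0.80))  # cheapest model meeting a 0.80 quality target
print(route_by_target(0.95))  # no model qualifies, so fall back to the best one
```

A production router in the HAPS style would replace the static quality numbers with the learned router's per‑query predictions, so simple intents fall through to the small model while complex reasoning escalates.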

Limitations & Future Work

  • Scalability of Candidate Pool: The current experiments use a modest set of LLMs; scaling to dozens of models may increase the high‑level router’s training complexity.
  • Discrete Search Over Hyper‑Parameters: While the PGN generates continuous vectors, the final parameter choices are still discretized, which can miss fine‑grained optimal settings.
  • Task Generalization: HAPS is evaluated on benchmark suites; its ability to generalize to completely new domains (e.g., code generation) remains to be tested.
  • Future Directions: The authors suggest exploring multi‑objective routing (balancing latency, memory, and accuracy), incorporating reinforcement‑learning‑based exploration for larger model catalogs, and extending the framework to multimodal models.

Authors

  • Zihang Tian
  • Rui Li
  • Jingsen Zhang
  • Xiaohe Bo
  • Wei Huo
  • Xu Chen

Paper Information

  • arXiv ID: 2601.05903v1
  • Categories: cs.CL
  • Published: January 9, 2026