[Paper] Parameter-Efficient Fine-Tuning of LLMs with Mixture of Space Experts
Source: arXiv - 2602.14490v1
Overview
The paper introduces Mixture of Space (MoS), a novel parameter‑efficient fine‑tuning (PEFT) framework that lets large language models (LLMs) represent data simultaneously in several geometric manifolds (e.g., Euclidean, hyperbolic, spherical). By extending the popular LoRA technique into MoSLoRA, the authors give LLMs the ability to pick the most suitable geometry for each token or context, leading to noticeably better performance on math‑heavy and reasoning benchmarks.
Key Contributions
- Unified multi‑manifold PEFT: Proposes a Mixture of Space architecture that combines Euclidean, hyperbolic, and spherical experts within a single fine‑tuning layer.
- MoSLoRA: Extends Low‑Rank Adaptation (LoRA) with heterogeneous geometric experts, preserving LoRA’s low‑parameter budget while adding curvature‑aware expressiveness.
- Lightweight routing mechanism: Introduces a computationally cheap selector that decides which geometric expert(s) to activate for a given input, avoiding costly full‑manifold switches.
- Curvature‑optimization insights: Provides empirical analysis of how learning curvature parameters affects training stability and downstream accuracy.
- Strong empirical gains: Demonstrates consistent improvements over state‑of‑the‑art PEFT baselines, e.g., +5.6 points on MATH500 and +15.9 points on MAWPS, without increasing the number of trainable parameters.
Methodology
- Geometric Experts – Each expert is a low‑rank adapter (as in LoRA) that operates on a specific manifold:
  - Euclidean: standard linear transformations.
  - Hyperbolic: uses the Poincaré ball model to capture hierarchical relationships.
  - Spherical: embeds data on a unit sphere to model cyclic or periodic patterns.
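A minimal NumPy sketch of what these three experts could look like. The exponential‑map form of the hyperbolic expert and the projection form of the spherical expert are assumptions for illustration; the paper's exact parameterization is not reproduced here:

```python
import numpy as np

def euclidean_expert(x, A, B):
    """Plain LoRA update: low-rank linear map in flat space."""
    return x @ A @ B

def hyperbolic_expert(x, A, B, c=1.0, eps=1e-7):
    """Assumed form: apply the low-rank map in the tangent space at the
    origin, then project onto the Poincare ball (curvature -c) with the
    exponential map exp_0(v) = tanh(sqrt(c)*||v||) * v / (sqrt(c)*||v||)."""
    v = x @ A @ B
    n = np.linalg.norm(v, axis=-1, keepdims=True) + eps
    return np.tanh(np.sqrt(c) * n) * v / (np.sqrt(c) * n)

def spherical_expert(x, A, B, eps=1e-7):
    """Assumed form: project the low-rank output onto the unit sphere."""
    v = x @ A @ B
    return v / (np.linalg.norm(v, axis=-1, keepdims=True) + eps)
```

Each expert trains only the low‑rank factors A (d×r) and B (r×d), so the per‑expert parameter cost matches vanilla LoRA.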
- Mixture Layer – For every token, the model computes a soft routing vector (via a tiny MLP) that assigns weights to the three experts. The final adaptation is a weighted sum of the expert outputs, allowing the model to blend geometries when needed.
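In code, the routing step could look like the following sketch; the single‑layer router `W_route` is an assumption (the paper describes only a tiny MLP producing soft weights):

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def mixture_layer(x, experts, W_route):
    """Soft-route each token across geometric experts.

    x:        (batch, d) token representations
    experts:  list of callables, each mapping (batch, d) -> (batch, d)
    W_route:  (d, num_experts) weights of a single-layer router
              (assumed; the paper uses a tiny MLP)
    """
    weights = softmax(x @ W_route)                      # (batch, E) soft routing
    outs = np.stack([f(x) for f in experts], axis=-1)   # (batch, d, E)
    return (outs * weights[:, None, :]).sum(axis=-1)    # weighted blend
```

Because the blend is a convex combination, the layer degenerates gracefully to a single‑geometry adapter whenever the router saturates on one expert.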
- Parameter Efficiency – Only the low‑rank matrices and the routing network are trainable; the base LLM weights stay frozen, keeping the total trainable parameter count comparable to vanilla LoRA (≈0.1 % of the full model).
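A back‑of‑envelope check of that budget, using an illustrative 7B‑parameter configuration (the dimensions and adapter placement below are assumptions, not the paper's exact setup):

```python
# Illustrative config: 4096-dim model, 32 layers, rank-8 adapters on
# four projection matrices per layer (assumed placement).
d_model, rank, n_layers, adapted_mats = 4096, 8, 32, 4
per_matrix = 2 * d_model * rank                  # A (d x r) plus B (r x d)
trainable = per_matrix * adapted_mats * n_layers
total = 7_000_000_000
fraction = 100 * trainable / total
print(f"{trainable:,} trainable params = {fraction:.3f}% of the model")
```

This lands at roughly 0.12 %, consistent with the budget the paper reports.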
- Training Procedure:
  - Initialize curvature parameters (e.g., hyperbolic radius) and learn them jointly with the adapters.
  - Apply standard cross‑entropy loss on downstream tasks; curvature updates are regularized to avoid numerical instability.
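The paper does not spell out the regularizer's form; one simple assumed version penalizes curvature drift away from its initialization:

```python
def regularized_loss(ce_loss, curvatures, lam=1e-3, c_init=1.0):
    """Cross-entropy plus an assumed L2 penalty keeping each learned
    curvature close to its initial value, discouraging the large
    curvature swings that cause numerical instability."""
    penalty = sum((c - c_init) ** 2 for c in curvatures)
    return ce_loss + lam * penalty
```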
- Implementation Tricks:
  - Use re‑parameterization to map Euclidean gradients onto the tangent spaces of non‑Euclidean manifolds.
  - Cache manifold‑specific operations to reduce overhead during inference.
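For the Poincaré ball, this tangent‑space re‑parameterization is typically implemented with the exponential/logarithmic map pair at the origin; a sketch (the curvature handling here is an assumption):

```python
import numpy as np

def exp0(v, c=1.0, eps=1e-7):
    """Map a tangent vector at the origin into the Poincare ball."""
    n = np.linalg.norm(v, axis=-1, keepdims=True) + eps
    return np.tanh(np.sqrt(c) * n) * v / (np.sqrt(c) * n)

def log0(y, c=1.0, eps=1e-7):
    """Inverse map: pull a ball point back to the tangent space, where
    ordinary Euclidean gradient updates can be applied."""
    n = np.linalg.norm(y, axis=-1, keepdims=True) + eps
    n = np.clip(n, eps, (1.0 - eps) / np.sqrt(c))  # stay inside the ball
    return np.arctanh(np.sqrt(c) * n) * y / (np.sqrt(c) * n)
```

Optimizing in the tangent space via `log0`/`exp0` lets a standard Euclidean optimizer update hyperbolic parameters without leaving the manifold.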
Results & Findings
| Benchmark | Baseline (LoRA) | MoSLoRA | Gain (pp) |
|---|---|---|---|
| MATH500 | 71.2 % | 76.8 % | +5.6 |
| MAWPS | 42.3 % | 58.2 % | +15.9 |
| SST‑2 | 94.1 % | 94.5 % | +0.4 |
| WikiSQL | 84.7 % | 86.1 % | +1.4 |
- Consistent wins across classification, reasoning, and retrieval‑augmented tasks.
- Training stability improves when curvature parameters are regularized; the routing network converges within the same number of epochs as vanilla LoRA.
- Parameter budget remains essentially unchanged (≈0.12 % of total model parameters).
Practical Implications
- Plug‑and‑play fine‑tuning: Developers can replace a standard LoRA adapter with MoSLoRA in existing pipelines (e.g., the Hugging Face `peft` library) without re‑training the whole model.
- Better handling of hierarchical data: Applications such as knowledge‑graph completion, taxonomy classification, or code‑base navigation can benefit from the hyperbolic expert's ability to capture tree‑like structures.
- Improved reasoning for math/logic tasks: The spherical expert helps model cyclic patterns (e.g., periodic functions), while the mixture enables nuanced reasoning that single‑space adapters miss.
- Low inference overhead: The routing network adds only a few microseconds per token, making MoSLoRA viable for latency‑sensitive services (chatbots, code assistants).
- Future‑proofing: As new manifolds (e.g., product manifolds) become better understood, they can be added as additional experts without redesigning the whole PEFT stack.
Limitations & Future Work
- Manifold selection limited to three spaces; more exotic geometries might further boost performance but increase routing complexity.
- Curvature learning can be unstable on very deep adapters; the paper suggests stronger regularization or curriculum learning as possible fixes.
- Benchmarks focus on English tasks; cross‑lingual or multimodal scenarios remain unexplored.
- Routing interpretability: While the soft weights indicate which geometry is used, deeper analysis of why certain inputs prefer a given manifold is left for future research.
Authors
- Buze Zhang
- Jinkai Tao
- Zilang Zeng
- Neil He
- Ali Maatouk
- Menglin Yang
- Rex Ying
Paper Information
- arXiv ID: 2602.14490v1
- Categories: cs.LG, cs.AI, cs.CL, cs.NE
- Published: February 16, 2026