[Paper] Parameter-Efficient Fine-Tuning of LLMs with Mixture of Space Experts
Source: arXiv - 2602.14490v1
Overview
The paper introduces Mixture of Space (MoS), a novel parameter‑efficient fine‑tuning (PEFT) framework that lets large language models (LLMs) represent data simultaneously in several geometric manifolds (e.g., Euclidean, hyperbolic, spherical). By extending the popular LoRA technique into MoSLoRA, the authors give LLMs the ability to pick the most suitable geometry for each token or context, leading to noticeably better performance on math‑heavy and reasoning benchmarks.
Key Contributions
- Unified multi‑manifold PEFT: Proposes a Mixture of Space architecture that combines Euclidean, hyperbolic, and spherical experts within a single fine‑tuning layer.
- MoSLoRA: Extends Low‑Rank Adaptation (LoRA) with heterogeneous geometric experts, preserving LoRA’s low‑parameter budget while adding curvature‑aware expressiveness.
- Lightweight routing mechanism: Introduces a computationally cheap selector that decides which geometric expert(s) to activate for a given input, avoiding costly full‑manifold switches.
- Curvature‑optimization insights: Provides empirical analysis of how learning curvature parameters affects training stability and downstream accuracy.
- Strong empirical gains: Demonstrates consistent improvements over state‑of‑the‑art PEFT baselines, e.g., +5.6 points on MATH500 and +15.9 points on MAWPS, without increasing the number of trainable parameters.
Methodology
- Geometric Experts – Each expert is a low‑rank adapter (as in LoRA) that operates on a specific manifold:
  - Euclidean: standard linear transformations.
  - Hyperbolic: uses the Poincaré ball model to capture hierarchical relationships.
  - Spherical: embeds data on a unit sphere to model cyclic or periodic patterns.
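A minimal NumPy sketch of what these three experts could look like. The exponential‑map form of the hyperbolic expert and the projection form of the spherical expert are assumptions for illustration; the paper's exact parameterization is not reproduced here:

```python
import numpy as np

def euclidean_expert(x, A, B):
    """Plain LoRA update: low-rank linear map in flat space."""
    return x @ A @ B

def hyperbolic_expert(x, A, B, c=1.0, eps=1e-7):
    """Assumed form: apply the low-rank map in the tangent space at the
    origin, then project onto the Poincare ball (curvature -c) with the
    exponential map exp_0(v) = tanh(sqrt(c)*||v||) * v / (sqrt(c)*||v||)."""
    v = x @ A @ B
    n = np.linalg.norm(v, axis=-1, keepdims=True) + eps
    return np.tanh(np.sqrt(c) * n) * v / (np.sqrt(c) * n)

def spherical_expert(x, A, B, eps=1e-7):
    """Assumed form: project the low-rank output onto the unit sphere."""
    v = x @ A @ B
    return v / (np.linalg.norm(v, axis=-1, keepdims=True) + eps)
```

Each expert trains only the low‑rank factors A (d×r) and B (r×d), so the per‑expert parameter cost matches vanilla LoRA.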
- Mixture Layer – For every token, the model computes a soft routing vector (via a tiny MLP) that assigns weights to the three experts. The final adaptation is a weighted sum of the expert outputs, allowing the model to blend geometries when needed.
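In code, the routing step could look like the following sketch; the single‑layer router `W_route` is an assumption (the paper describes only a tiny MLP producing soft weights):

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def mixture_layer(x, experts, W_route):
    """Soft-route each token across geometric experts.

    x:        (batch, d) token representations
    experts:  list of callables, each mapping (batch, d) -> (batch, d)
    W_route:  (d, num_experts) weights of a single-layer router
              (assumed; the paper uses a tiny MLP)
    """
    weights = softmax(x @ W_route)                      # (batch, E) soft routing
    outs = np.stack([f(x) for f in experts], axis=-1)   # (batch, d, E)
    return (outs * weights[:, None, :]).sum(axis=-1)    # weighted blend
```

Because the blend is a convex combination, the layer degenerates gracefully to a single‑geometry adapter whenever the router saturates on one expert.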
- Parameter Efficiency – Only the low‑rank matrices and the routing network are trainable; the base LLM weights stay frozen, keeping the total trainable parameter count comparable to vanilla LoRA (≈0.1 % of the full model).
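A back‑of‑envelope check of that budget, using an illustrative 7B‑parameter configuration (the dimensions and adapter placement below are assumptions, not the paper's exact setup):

```python
# Illustrative config: 4096-dim model, 32 layers, rank-8 adapters on
# four projection matrices per layer (assumed placement).
d_model, rank, n_layers, adapted_mats = 4096, 8, 32, 4
per_matrix = 2 * d_model * rank                  # A (d x r) plus B (r x d)
trainable = per_matrix * adapted_mats * n_layers
total = 7_000_000_000
fraction = 100 * trainable / total
print(f"{trainable:,} trainable params = {fraction:.3f}% of the model")
```

This lands at roughly 0.12 %, consistent with the budget the paper reports.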
- Training Procedure:
  - Initialize curvature parameters (e.g., hyperbolic radius) and learn them jointly with the adapters.
  - Apply standard cross‑entropy loss on downstream tasks; curvature updates are regularized to avoid numerical instability.
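The paper does not spell out the regularizer's form; one simple assumed version penalizes curvature drift away from its initialization:

```python
def regularized_loss(ce_loss, curvatures, lam=1e-3, c_init=1.0):
    """Cross-entropy plus an assumed L2 penalty keeping each learned
    curvature close to its initial value, discouraging the large
    curvature swings that cause numerical instability."""
    penalty = sum((c - c_init) ** 2 for c in curvatures)
    return ce_loss + lam * penalty
```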
- Implementation Tricks:
  - Use re‑parameterization to map Euclidean gradients onto the tangent spaces of non‑Euclidean manifolds.
  - Cache manifold‑specific operations to reduce overhead during inference.
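For the Poincaré ball, this tangent‑space re‑parameterization is typically implemented with the exponential/logarithmic map pair at the origin; a sketch (the curvature handling here is an assumption):

```python
import numpy as np

def exp0(v, c=1.0, eps=1e-7):
    """Map a tangent vector at the origin into the Poincare ball."""
    n = np.linalg.norm(v, axis=-1, keepdims=True) + eps
    return np.tanh(np.sqrt(c) * n) * v / (np.sqrt(c) * n)

def log0(y, c=1.0, eps=1e-7):
    """Inverse map: pull a ball point back to the tangent space, where
    ordinary Euclidean gradient updates can be applied."""
    n = np.linalg.norm(y, axis=-1, keepdims=True) + eps
    n = np.clip(n, eps, (1.0 - eps) / np.sqrt(c))  # stay inside the ball
    return np.arctanh(np.sqrt(c) * n) * y / (np.sqrt(c) * n)
```

Optimizing in the tangent space via `log0`/`exp0` lets a standard Euclidean optimizer update hyperbolic parameters without leaving the manifold.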
Results & Findings
| Benchmark | Baseline (LoRA) | MoSLoRA | Gain (pp) |
|---|---|---|---|
| MATH500 | 71.2 % | 76.8 % | +5.6 |
| MAWPS | 42.3 % | 58.2 % | +15.9 |
| SST‑2 | 94.1 % | 94.5 % | +0.4 |
| WikiSQL | 84.7 % | 86.1 % | +1.4 |
- Consistent wins across classification, reasoning, and retrieval‑augmented tasks.
- Training stability improves when curvature parameters are regularized; the routing network converges within the same number of epochs as vanilla LoRA.
- Parameter budget remains essentially unchanged (≈0.12 % of total model parameters).
Practical Implications
- Plug‑and‑play fine‑tuning: Developers can replace a standard LoRA adapter with MoSLoRA in existing pipelines (e.g., the Hugging Face `peft` library) without re‑training the whole model.
- Better handling of hierarchical data: Applications such as knowledge‑graph completion, taxonomy classification, or code‑base navigation can benefit from the hyperbolic expert's ability to capture tree‑like structures.
- Improved reasoning for math/logic tasks: The spherical expert helps model cyclic patterns (e.g., periodic functions), while the mixture enables nuanced reasoning that single‑space adapters miss.
- Low inference overhead: The routing network adds only a few microseconds per token, making MoSLoRA viable for latency‑sensitive services (chatbots, code assistants).
- Future‑proofing: As new manifolds (e.g., product manifolds) become better understood, they can be added as additional experts without redesigning the whole PEFT stack.
Limitations & Future Work
- Manifold selection limited to three spaces; more exotic geometries might further boost performance but increase routing complexity.
- Curvature learning can be unstable on very deep adapters; the paper suggests stronger regularization or curriculum learning as possible fixes.
- Benchmarks focus on English tasks; cross‑lingual or multimodal scenarios remain unexplored.
- Routing interpretability: While the soft weights indicate which geometry is used, deeper analysis of why certain inputs prefer a given manifold is left for future research.
Authors
- Buze Zhang
- Jinkai Tao
- Zilang Zeng
- Neil He
- Ali Maatouk
- Menglin Yang
- Rex Ying
Paper Information
- arXiv ID: 2602.14490v1
- Categories: cs.LG, cs.AI, cs.CL, cs.NE
- Published: February 16, 2026