[Paper] Subgroups of $U(d)$ Induce Natural RNN and Transformer Architectures
Source: arXiv - 2602.18417v1
Overview
The paper proposes a unified way to build recurrent neural networks (RNNs) and transformer models whose hidden states live on mathematically well-behaved groups, specifically closed subgroups of the unitary group $U(d)$. By treating the choice of subgroup (e.g., the orthogonal group $O(d)$) as a plug-in component, the authors derive clean, drop-in-replaceable architectures that inherit desirable geometric properties such as norm preservation. Experiments on classic language-modeling benchmarks show that these group-constrained models can match or exceed standard baselines when the number of parameters is held constant.
Key Contributions
- Unified group‑theoretic framework for both RNNs and transformers, built on a common skeleton where the hidden‑state space, tangent‑space projection, and update rule are parameterized by a subgroup of $U(d)$.
- Concrete instantiation for the orthogonal group $O(d)$, yielding orthogonal‑state RNNs and transformers that maintain exact norm preservation throughout training.
- Linear‑mixing extension in tangent space, a lightweight modification that works for any subgroup and improves performance under tight parameter budgets.
- Empirical validation on Tiny Shakespeare and Penn Treebank, demonstrating that orthogonal‑state models achieve competitive perplexities with parameter‑matched baselines.
- Open‑source implementation (released with the paper) that lets practitioners swap subgroups without rewriting model code.
Methodology
- Skeleton formulation – The authors start with a minimal set of axioms describing a sequence model: a hidden‑state manifold $\mathcal{M}$, a tangent‑space projection $\Pi$, and an update map $\Phi$.
- Group substitution – By picking a closed subgroup $G \subseteq U(d)$ (e.g., $O(d)$ or the full unitary group), they let $\mathcal{M} = G$. The tangent space at any point is the Lie algebra $\mathfrak{g}$, and $\Pi$ becomes the orthogonal projection onto $\mathfrak{g}$.
- RNN template – The recurrent update is expressed as a group multiplication:

$$ h_{t+1} = h_t \exp\bigl(\Pi(\mathbf{W}x_t + \mathbf{U}h_t + b)\bigr), $$

  where $\exp$ is the matrix exponential mapping the projected tangent vector back onto the group.
- Transformer template – Self‑attention is reformulated so that query, key, and value vectors are elements of the Lie algebra, and the attention weights are applied via the group action (matrix multiplication) rather than additive residual connections.
- Linear‑mixing tweak – Instead of feeding the raw projected vector directly into the exponential, they scale it with a learned coefficient (a linear mixing in the tangent space), effectively controlling the step size of each update. This simple change improves convergence when the model size is limited.
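The update rule above can be sketched in NumPy. The parameter shapes here (a $(d, d, n)$ tensor for $\mathbf{W}$ so that the pre-activation is a $d \times d$ matrix) and the truncated-Taylor matrix exponential are illustrative assumptions, not the paper's implementation; the scalar `alpha` stands in for the learned linear-mixing coefficient.

```python
import numpy as np

def skew_project(a):
    # Orthogonal projection of a d x d matrix onto o(d), the skew-symmetric
    # matrices (the Lie algebra of O(d)).
    return 0.5 * (a - a.T)

def expm(a, terms=30):
    # Matrix exponential via a truncated Taylor series; adequate for the
    # small-norm tangent steps used here.
    out = np.eye(a.shape[0])
    term = np.eye(a.shape[0])
    for k in range(1, terms):
        term = term @ a / k
        out = out + term
    return out

def orthogonal_rnn_step(h, x, W, U, b, alpha=0.1):
    # h: (d, d) orthogonal hidden state; x: (n,) input.
    # W: (d, d, n), U: (d, d), b: (d, d) are hypothetical parameter shapes
    # chosen so the pre-activation is a d x d matrix, as the update requires.
    pre = np.einsum('ijk,k->ij', W, x) + U @ h + b
    # alpha plays the role of the learned linear-mixing (step-size) coefficient.
    return h @ expm(alpha * skew_project(pre))

rng = np.random.default_rng(0)
d, n = 4, 3
step = orthogonal_rnn_step(np.eye(d), rng.standard_normal(n),
                           rng.standard_normal((d, d, n)),
                           rng.standard_normal((d, d)),
                           rng.standard_normal((d, d)))
# The exponential of a skew-symmetric matrix is orthogonal, so the new
# state stays on the group.
print(np.allclose(step.T @ step, np.eye(d), atol=1e-8))
```

Because the step never leaves $O(d)$, the hidden state's norm-preservation property holds by construction rather than via a penalty term.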
All of these steps are implemented with standard deep‑learning primitives (matrix multiplication, QR decomposition for re‑orthogonalization, etc.), making the approach easy to drop into existing PyTorch or JAX codebases.
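The QR-based re-orthogonalization mentioned above can be sketched as follows; the sign correction on the diagonal of $R$ is a standard convention for making the factorization unique, not a detail taken from the paper.

```python
import numpy as np

def reorthogonalize(h):
    # Snap a nearly-orthogonal matrix (drifted by accumulated floating-point
    # error) back onto O(d) via QR decomposition.
    q, r = np.linalg.qr(h)
    # Scale each column of Q by the sign of the matching diagonal entry of R,
    # making the factorization unique and keeping the result close to h.
    return q * np.sign(np.diag(r))

# A state that has drifted slightly off the group:
h = np.eye(3) + 1e-3 * np.random.default_rng(1).standard_normal((3, 3))
h_fixed = reorthogonalize(h)
print(np.allclose(h_fixed.T @ h_fixed, np.eye(3), atol=1e-10))
```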
Results & Findings
| Model (≈ 1 M params) | Tiny Shakespeare (perplexity) | Penn Treebank (perplexity) |
|---|---|---|
| Standard LSTM | 84.2 | 115.7 |
| Orthogonal‑state RNN | 78.5 | 108.3 |
| Orthogonal‑state Transformer | 80.1 | 110.9 |
| Orthogonal‑state + Linear‑mixing (RNN) | 76.3 | 106.1 |
- Orthogonal‑state models consistently beat their unconstrained counterparts when the parameter budget is fixed, confirming that the geometric regularization is beneficial.
- The linear‑mixing extension yields a 2–3 % relative improvement over the plain orthogonal version, especially on the smaller Tiny Shakespeare dataset.
- Training stability improves: gradients remain well‑scaled, and the models exhibit fewer exploding/vanishing‑gradient incidents, thanks to the norm‑preserving property of the group.
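The stability claim rests on a simple fact that is easy to verify numerically: orthogonal matrices preserve vector norms, so repeated application cannot blow up or shrink a signal. The sketch below is an illustration of that property, not an experiment from the paper.

```python
import numpy as np

def random_orthogonal(d, rng):
    # Sample an orthogonal matrix via QR of a Gaussian matrix.
    q, r = np.linalg.qr(rng.standard_normal((d, d)))
    return q * np.sign(np.diag(r))

rng = np.random.default_rng(3)
v = rng.standard_normal(8)
norm0 = np.linalg.norm(v)
# Apply 1000 orthogonal transitions, mimicking a long recurrence.
for _ in range(1000):
    v = random_orthogonal(8, rng) @ v
# The norm is unchanged, so gradients propagated through such a chain
# neither explode nor vanish due to the recurrence itself.
print(np.isclose(np.linalg.norm(v), norm0))
```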
Practical Implications
- Plug‑and‑play stability – Developers can replace the hidden‑state representation in existing RNN or transformer code with an orthogonal (or other subgroup) version without redesigning the whole architecture. This can reduce the need for gradient clipping or learning‑rate tricks.
- Memory‑efficient models – Because the group constraint eliminates the need for extra regularization terms (e.g., orthogonal penalties), you can achieve comparable performance with fewer parameters, which is valuable for edge devices or latency‑critical services.
- Better long‑range modeling – Norm preservation helps maintain information over many time steps, making orthogonal‑state RNNs attractive for tasks like speech synthesis, time‑series forecasting, or reinforcement‑learning agents that require stable hidden dynamics.
- Extensible to other groups – The framework is not limited to $O(d)$; developers interested in complex‑valued networks, symplectic dynamics, or other Lie groups can experiment by swapping in a different subgroup, opening doors to domain‑specific inductive biases (e.g., physics‑informed models).
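Swapping subgroups amounts to swapping the tangent-space projection. A minimal sketch of the two projections discussed here, assuming the standard Lie algebras (skew-symmetric for $O(d)$, skew-Hermitian for $U(d)$):

```python
import numpy as np

def project_o(a):
    # Tangent projection for O(d): skew-symmetric part of a real matrix.
    return 0.5 * (a - a.T)

def project_u(a):
    # Tangent projection for U(d): skew-Hermitian part of a complex matrix.
    return 0.5 * (a - a.conj().T)

rng = np.random.default_rng(2)
a = rng.standard_normal((3, 3)) + 1j * rng.standard_normal((3, 3))
s = project_u(a)
# A skew-Hermitian matrix satisfies S = -S^H, so exp(S) is unitary.
print(np.allclose(s, -s.conj().T))
```

Everything downstream (the exponential map, the multiplicative update) is unchanged, which is what makes the subgroup a plug-in choice.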
Limitations & Future Work
- Computational overhead – Computing matrix exponentials (or their approximations) and re‑orthogonalizing after each step adds a modest constant factor to runtime compared with vanilla RNNs.
- Scalability to very large models – The paper focuses on ≤ 2 M‑parameter models; it remains unclear how the approach behaves for transformer sizes in the hundreds of millions of parameters typical in production LLMs.
- Limited subgroup exploration – Only the orthogonal group is empirically evaluated; other subgroups (e.g., unitary, special orthogonal) could offer different trade‑offs but were left for future study.
- Tangent‑space linear mixing – While effective, the linear‑mixing heuristic lacks a formal theoretical justification; deeper analysis could reveal optimal step‑size schedules or adaptive schemes.
The authors suggest extending the framework to structured groups (e.g., block‑diagonal orthogonal matrices) and integrating it with modern training tricks like mixed‑precision and gradient checkpointing to mitigate the overhead.
Authors
- Joshua Nunley
Paper Information
- arXiv ID: 2602.18417v1
- Categories: cs.LG, cs.CL
- Published: February 20, 2026