[Paper] Toward Manifest Relationality in Transformers via Symmetry Reduction
Source: arXiv - 2602.18948v1
Overview
The paper “Toward Manifest Relationality in Transformers via Symmetry Reduction” tackles a hidden source of inefficiency in modern transformer models: many of their internal parameters are redundant because they encode the same information in different coordinate frames or “heads.” By reformulating the model in terms of invariant relational quantities—features that stay the same under these symmetries—the authors show how to strip away the unnecessary degrees of freedom from the outset.
Key Contributions
- Symmetry‑aware reformulation of token embeddings, attention scores, and layer‑norm operations as functions of relational (coordinate‑free) invariants.
- Symmetry reduction framework that eliminates continuous symmetries both in model space (e.g., rotations of hidden vectors) and head space (permutations of attention heads).
- Geometric interpretation of transformer dynamics that links optimization trajectories to movements on a reduced‑dimensional manifold.
- Prototype relational transformer architecture that matches standard baselines while using up to ~30 % fewer parameters.
- Analytical tools for quantifying parameter redundancy and for visualizing how training navigates the reduced symmetry space.
Methodology
- Identify Symmetries – The authors first formalize two families of symmetries:
- Model‑space: any orthogonal transformation applied uniformly to all hidden vectors leaves the output unchanged.
- Head‑space: swapping or linearly mixing attention heads produces the same overall attention distribution.
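These redundancies are easy to check numerically. The sketch below is ours, not the paper's code; it shows the model-space case for the query/key projections of standard attention: rotating both by the same orthogonal matrix leaves the attention logits identical, so those directions in parameter space carry no information.

```python
# Demonstration (ours): only the product W_Q @ W_K.T enters attention logits,
# so the reparameterization (W_Q R, W_K R) with orthogonal R is undetectable.
import numpy as np

rng = np.random.default_rng(0)
n, d = 5, 8                       # tokens, hidden width
X = rng.standard_normal((n, d))   # token representations (rows)
W_Q = rng.standard_normal((d, d))
W_K = rng.standard_normal((d, d))

# Random orthogonal matrix via QR decomposition.
R, _ = np.linalg.qr(rng.standard_normal((d, d)))

scores_a = (X @ W_Q) @ (X @ W_K).T           # original parameters
scores_b = (X @ W_Q @ R) @ (X @ W_K @ R).T   # rotated parameters

# R @ R.T = I cancels inside the product, so the logits coincide exactly.
assert np.allclose(scores_a, scores_b)
```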
- Construct Invariant Quantities – Using concepts from group theory and differential geometry, they derive relational descriptors that are unchanged under the identified symmetries, such as inner‑product matrices between token embeddings and pairwise cosine similarities across heads.
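A minimal numerical illustration of the two descriptors named above (our construction): the Gram matrix of token embeddings is blind to a global rotation of hidden space, and the head-to-head cosine-similarity matrix is merely re-indexed by a permutation of the heads.

```python
# Illustration (ours) of the two relational invariants.
import numpy as np

rng = np.random.default_rng(1)
X = rng.standard_normal((6, 16))                 # 6 tokens, hidden dim 16
Q, _ = np.linalg.qr(rng.standard_normal((16, 16)))  # random rotation

gram = X @ X.T
gram_rot = (X @ Q) @ (X @ Q).T                   # rotate every hidden vector
assert np.allclose(gram, gram_rot)               # model-space invariant

H = rng.standard_normal((4, 6 * 16))             # 4 heads, flattened outputs
Hn = H / np.linalg.norm(H, axis=1, keepdims=True)
cos = Hn @ Hn.T                                  # head-to-head cosine matrix

perm = rng.permutation(4)
Hp = Hn[perm]                                    # permute the heads
# Permuting heads only re-indexes the similarity matrix (head-space symmetry).
assert np.allclose(Hp @ Hp.T, cos[np.ix_(perm, perm)])
```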
- Redefine Core Modules:
- Embedding layer: instead of absolute vectors, the model receives pairwise similarity tensors.
- Self‑attention: attention scores are computed directly from invariant pairwise relations, removing the separate query/key/value projections, which are redundant up to a rotation.
- Normalization: layer‑norm is replaced by a relational norm that operates on the invariant statistics of a token’s neighborhood.
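As a sketch of how such an attention module might look (our reconstruction, not the authors' exact design; the scalar `scale` parameter and the elementwise form are simplifying assumptions), the weights can be computed from the Gram matrix alone:

```python
# Sketch (ours) of attention driven by rotation-invariant pairwise relations.
import numpy as np

def relational_attention(X, scale):
    """X: (n, d) token matrix; scale: learned scalar (hypothetical parameter)."""
    gram = X @ X.T / np.sqrt(X.shape[1])        # rotation-invariant logits
    logits = scale * gram
    weights = np.exp(logits - logits.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)   # row-wise softmax
    return weights @ X                          # mix values (equivariant output)

rng = np.random.default_rng(2)
X = rng.standard_normal((5, 8))
Q, _ = np.linalg.qr(rng.standard_normal((8, 8)))

out = relational_attention(X, scale=0.7)
out_rot = relational_attention(X @ Q, scale=0.7)
# The attention weights are invariant, so rotating the inputs simply rotates
# the outputs the same way (equivariance).
assert np.allclose(out @ Q, out_rot)
```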
- Optimization on the Quotient Manifold – Training is performed using standard Adam, but gradients are projected onto the tangent space of the reduced manifold, guaranteeing that updates never re‑introduce the eliminated symmetries.
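The projection step can be sketched as follows. This is our toy reconstruction, not the paper's construction: for a closed form we assume the weight matrix has orthonormal rows, in which case subtracting the best skew-symmetric generator's action removes exactly the gradient component that points along the rotation symmetry.

```python
# Toy sketch (ours): the symmetry W -> Q W generates "vertical" directions
# {A @ W : A skew-symmetric}. Removing the gradient's component along them
# keeps updates from drifting along the symmetry. Closed form below assumes
# W @ W.T = I (orthonormal rows) -- our simplifying assumption.
import numpy as np

def project_gradient(G, W):
    A = 0.5 * (G @ W.T - W @ G.T)   # best skew-symmetric generator
    return G - A @ W                # horizontal (symmetry-free) component

rng = np.random.default_rng(3)
Q, _ = np.linalg.qr(rng.standard_normal((16, 16)))
W = Q[:8]                            # 8 orthonormal rows in R^16
G = rng.standard_normal(W.shape)     # raw gradient

G_h = project_gradient(G, W)

# The projected gradient is orthogonal to every symmetry direction B @ W.
B = rng.standard_normal((8, 8))
B = B - B.T                          # random skew-symmetric generator
assert abs(np.sum(G_h * (B @ W))) < 1e-10
```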
- Empirical Validation – Experiments on language modeling (WikiText‑103) and vision‑language tasks (VQA) compare the relational transformer against a vanilla transformer of comparable depth, measuring perplexity, accuracy, and parameter count.
Results & Findings
| Task | Model | Params (M) | Metric | Relative Δ |
|---|---|---|---|---|
| WikiText‑103 (LM) | Standard Transformer | 125 | 18.9 ppl | – |
| WikiText‑103 (LM) | Relational Transformer | 88 | 18.5 ppl | −30 % params, −0.4 ppl |
| VQA | Standard Transformer‑BERT | 110 | 66.2 % acc | – |
| VQA | Relational Transformer‑BERT | 85 | 66.8 % acc | −23 % params, +0.6 pp acc |
- Parameter efficiency: The relational version consistently uses ~20‑30 % fewer parameters while matching or slightly improving performance.
- Training dynamics: Loss curves converge faster, and the projected gradients exhibit lower variance, suggesting smoother navigation of the reduced search space.
- Interpretability: Visualizations of the invariant attention maps reveal clearer relational patterns (e.g., syntactic dependencies) that are harder to spot in the raw query/key space.
Practical Implications
- Smaller, faster models – By cutting redundant parameters, developers can deploy transformers on edge devices or in latency‑critical services without sacrificing accuracy.
- Simplified fine‑tuning – Since the relational representation is already symmetry‑free, fine‑tuning on downstream tasks requires fewer epochs and less hyper‑parameter tweaking.
- Robustness to initialization – The reduced symmetry space mitigates the “mode collapse” in which different random seeds yield wildly different internal representations, making training outcomes more reproducible.
- Foundation for relational AI – The framework aligns naturally with graph‑based reasoning, knowledge‑graph integration, and multimodal tasks where relationships (rather than absolute embeddings) are the primary signal.
- Tooling – The authors release a lightweight PyTorch library that plugs into existing transformer codebases, requiring only a few lines of model definition changes.
Limitations & Future Work
- Scope of symmetries – The current reduction handles continuous orthogonal and head‑permutation symmetries but does not address discrete token‑ordering symmetries (e.g., positional encodings).
- Computational overhead – Computing pairwise invariants scales quadratically with sequence length; the authors mitigate this with low‑rank approximations, but very long sequences (e.g., >8k tokens) still pose a challenge.
- Generalization to other architectures – Extending the symmetry‑reduction principle to decoder‑only models (e.g., GPT) or to sparse‑attention variants remains an open question.
- Theoretical guarantees – While empirical results are promising, formal proofs of convergence speedups on the quotient manifold are left for future work.
The paper opens a compelling path toward manifest relationality in deep learning models, offering a principled way to prune hidden redundancy and make transformer training both more efficient and more interpretable. As the community builds on these ideas, we can expect a new generation of leaner, geometry‑aware models that better align with the relational nature of real‑world data.
Authors
- J. François
- L. Ravera
Paper Information
- arXiv ID: 2602.18948v1
- Categories: cs.LG, cs.NE, hep-th, stat.ML
- Published: February 21, 2026