[Paper] Learning to Rotate: Temporal and Semantic Rotary Encoding for Sequential Modeling

Published: April 27, 2026 at 01:26 PM EDT
5 min read
Source: arXiv - 2604.24717v1

Overview

The paper “Learning to Rotate: Temporal and Semantic Rotary Encoding for Sequential Modeling” proposes a fresh way to enrich Transformer attention by turning the traditionally static rotary positional encoding (RoPE) into a learnable, signal‑driven component. By letting the rotation space adapt to timestamps, cyclical patterns, and categorical metadata, the authors demonstrate measurable gains on a large‑scale news‑feed recommendation task—without adding noticeable latency or memory cost.

Key Contributions

  • Re‑conceptualization of RoPE – Treats the rotation manifold as a second, orthogonal axis of representation (akin to an “imaginary” dimension) that can be learned rather than hand‑crafted.
  • SIREN‑RoPE architecture – Introduces a dual‑branch Sinusoidal Representation Network (SIREN) that injects heterogeneous signals (continuous time, periodic cycles, categorical tags) into the rotary encoding.
  • Unified semantic‑temporal embedding – Separates token meaning (semantic “real” part) from dynamic relational information (rotational “imaginary” part) within the same attention matrix.
  • Production‑scale validation – Shows consistent improvements on calibration (e.g., click‑through‑rate prediction reliability) and ranking metrics (NDCG, MAP) for a generative recommender serving billions of daily news feed impressions.
  • Negligible overhead – Demonstrates that the added SIREN branches add < 2 % extra FLOPs and < 1 % memory increase, making the approach practical for latency‑sensitive services.
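
To make the learnable‑rotation idea concrete: standard RoPE fixes each rotation angle as a function of the token index, while SIREN‑RoPE learns it from auxiliary signals. A schematic formulation in our own notation (the additive combination of the two branches is an assumption; the summary only says the branch outputs are combined):

```latex
% Standard RoPE: angle fixed by token index m and a frequency schedule
\theta_{m,i} = m\,\omega_i, \qquad \omega_i = 10000^{-2i/d}

% SIREN-RoPE (schematic): angle learned from the token's timestamp t_m
% and categorical metadata c_m via sine-activated MLPs
\theta_{m,i} = f_{\mathrm{temp}}(t_m)_i + f_{\mathrm{sem}}(c_m)_i
```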

Methodology

  1. Baseline Transformer with RoPE – Standard self‑attention uses RoPE to encode token positions as a fixed rotation matrix derived from token indices.
  2. Dual‑branch SIREN
    • Temporal branch receives raw timestamps (e.g., Unix epoch) and learns a smooth sinusoidal mapping via a small SIREN (a multilayer perceptron with sine activations).
    • Semantic‑metadata branch consumes categorical features (e.g., article topic, user segment) encoded as embeddings and also passes them through a SIREN.
  3. Learnable Rotary Matrix – The outputs of both branches are combined to produce a dynamic rotation angle for each token. This angle replaces the fixed index‑based angle in RoPE, rotating each token’s query/key vectors in a signal‑conditioned way (see the sketch after this list).
  4. Integration into Attention – The rotated queries/keys are fed into the usual scaled‑dot‑product attention. No changes are required downstream (e.g., feed‑forward layers, loss functions).
  5. Training – The entire system is trained end‑to‑end on the recommendation objective (a mixture of cross‑entropy for click prediction and a pairwise ranking loss); the SIREN parameters are learned jointly with the rest of the model. A sketch of this combined objective also follows below.
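
A minimal PyTorch sketch of steps 2–4, assuming an additive combination of the two branch outputs and the standard pairwise RoPE rotation; all class and argument names (`SirenMLP`, `SirenRope`, `num_topics`, …) are illustrative, not the authors’ implementation:

```python
import torch
import torch.nn as nn


class SirenMLP(nn.Module):
    """Small MLP with sine activations (a SIREN) mapping signals to angles."""

    def __init__(self, in_dim: int, hidden: int, out_dim: int, w0: float = 30.0):
        super().__init__()
        self.fc1 = nn.Linear(in_dim, hidden)
        self.fc2 = nn.Linear(hidden, out_dim)
        self.w0 = w0  # frequency scale applied before the sine activation

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.fc2(torch.sin(self.w0 * self.fc1(x)))


class SirenRope(nn.Module):
    """Learns signal-conditioned rotation angles and applies them to q/k."""

    def __init__(self, head_dim: int, num_topics: int, topic_dim: int = 16):
        super().__init__()
        assert head_dim % 2 == 0, "RoPE rotates pairs of dimensions"
        half = head_dim // 2
        self.topic_emb = nn.Embedding(num_topics, topic_dim)
        self.temporal = SirenMLP(1, 64, half)          # raw timestamps -> angles
        self.semantic = SirenMLP(topic_dim, 64, half)  # metadata -> angles

    @staticmethod
    def _rotate(x: torch.Tensor, theta: torch.Tensor) -> torch.Tensor:
        # x: (batch, heads, seq, head_dim); theta: (batch, seq, head_dim // 2)
        cos = theta.cos().unsqueeze(1)  # broadcast over the heads axis
        sin = theta.sin().unsqueeze(1)
        x1, x2 = x[..., 0::2], x[..., 1::2]
        out = torch.empty_like(x)
        out[..., 0::2] = x1 * cos - x2 * sin  # 2-D rotation per dimension pair
        out[..., 1::2] = x1 * sin + x2 * cos
        return out

    def forward(self, q, k, timestamps, topics):
        # timestamps: (batch, seq, 1) float; topics: (batch, seq) int64
        theta = self.temporal(timestamps) + self.semantic(self.topic_emb(topics))
        return self._rotate(q, theta), self._rotate(k, theta)
```

The rotated queries and keys then flow into ordinary scaled‑dot‑product attention, matching step 4.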
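
For step 5, a hedged sketch of the joint objective; the BPR‑style pairwise term and the mixing weight `lam` are assumptions, since the summary does not name the exact ranking loss or weighting:

```python
import torch
import torch.nn.functional as F


def joint_loss(click_logits: torch.Tensor, click_labels: torch.Tensor,
               pos_scores: torch.Tensor, neg_scores: torch.Tensor,
               lam: float = 0.5) -> torch.Tensor:
    # Cross-entropy on click prediction (binary labels in {0, 1}).
    ce = F.binary_cross_entropy_with_logits(click_logits, click_labels.float())
    # Pairwise ranking: a clicked item should outscore an unclicked one.
    rank = -F.logsigmoid(pos_scores - neg_scores).mean()
    return ce + lam * rank  # lam balances the two objectives (assumed)
```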

Results & Findings

| Metric | Baseline (RoPE) | +SIREN‑RoPE | Δ (relative) |
| --- | --- | --- | --- |
| Click‑through‑rate (CTR) calibration (ECE, lower is better) | 0.112 | 0.098 | −12.5 % |
| NDCG@10 | 0.421 | 0.438 | +4.0 % |
| MAP | 0.357 | 0.371 | +3.9 % |
| Inference latency per request (ms) | 12.3 | 12.5 | +1.6 % |
| GPU memory (MiB) | 4,800 | 4,860 | +1.3 % |

  • Consistent gains across multiple downstream objectives indicate that the learned rotation captures useful temporal and contextual cues that static RoPE cannot.
  • Robustness: Improvements held across different traffic slices (e.g., peak vs. off‑peak hours) and for both cold‑start and long‑tail items.
  • Ablation: Removing either the temporal or metadata branch reduced the lift by roughly half, confirming that both signal types contribute meaningfully.

Practical Implications

  • Better time‑aware recommendations – Services that need to respect recency, periodicity (e.g., daily news cycles), or event‑driven spikes can now encode these signals directly into the attention mechanism without building separate time‑aware modules.
  • Lightweight upgrade path – Existing Transformer‑based pipelines (e.g., BERT, GPT, or custom ranking models) can adopt SIREN‑RoPE by swapping the RoPE layer; no architectural overhaul is required (see the usage sketch after this list).
  • Improved model calibration – More reliable probability estimates translate to better A/B testing, budget allocation, and downstream decision‑making (e.g., throttling or fairness constraints).
  • Extensible to other domains – Any sequential task where auxiliary signals exist (speech with pitch contours, IoT streams with sensor IDs, code with version tags) can benefit from a learnable rotary space.
  • Minimal cost – The added compute and memory fit comfortably within typical production budgets, making it attractive for latency‑critical environments like real‑time recommendation or ad‑ranking.
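
As a rough illustration of the upgrade path above, this shows how the `SirenRope` sketch from the Methodology section could slot in where a fixed RoPE call used to sit; all shapes and names here are hypothetical:

```python
import torch  # assumes SirenRope from the Methodology sketch is in scope

batch, heads, seq, head_dim = 2, 4, 8, 32
q = torch.randn(batch, heads, seq, head_dim)
k = torch.randn(batch, heads, seq, head_dim)
ts = torch.rand(batch, seq, 1)                 # e.g. normalized timestamps
topics = torch.randint(0, 100, (batch, seq))   # categorical metadata IDs

rope = SirenRope(head_dim=head_dim, num_topics=100)
q, k = rope(q, k, ts, topics)                  # replaces the fixed RoPE step
# Everything downstream of the rotation is untouched:
attn = torch.softmax(q @ k.transpose(-2, -1) / head_dim ** 0.5, dim=-1)
```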

Limitations & Future Work

  • Signal engineering required – The approach assumes that relevant auxiliary signals are available and can be pre‑processed into numeric form; missing or noisy metadata may limit benefits.
  • Scalability of SIREN depth – While the paper uses shallow SIREN networks to keep overhead low, deeper or wider variants could capture richer dynamics but may introduce latency trade‑offs that need careful profiling.
  • Generalization beyond news feeds – Experiments are confined to a single large‑scale news‑feed dataset; broader validation on other sequence tasks (e.g., language modeling, video captioning) is needed to confirm universality.
  • Theoretical understanding – The paper opens an intriguing “imaginary” axis in attention, but a formal analysis of why certain signal families improve attention alignment remains an open research question.

Future work could explore automated signal discovery (e.g., using meta‑learning to select which temporal or categorical cues to feed), hierarchical rotary encodings for multi‑scale sequences, and tighter integration with other positional schemes such as ALiBi or relative bias matrices.

Authors

  • Hailing Cheng
  • Daqi Sun
  • Xinyu Lu

Paper Information

  • arXiv ID: 2604.24717v1
  • Categories: cs.AI
  • Published: April 27, 2026
  • PDF: Download PDF
