[Paper] Provable Benefits of Sinusoidal Activation for Modular Addition

Published: November 28, 2025 at 01:37 PM EST
5 min read
Source: arXiv - 2511.23443v1

Overview

The paper investigates how the choice of activation function influences a neural network’s ability to learn modular addition, a fundamental arithmetic operation that underlies many cryptographic protocols and error‑correcting codes. By comparing sine (sinusoidal) activations with the ubiquitous ReLU, the authors show that sinusoidal networks can represent and generalize modular addition far more efficiently, both in network size and in required training data.

Key Contributions

  • Expressivity breakthrough: Proves that a two‑layer sine‑MLP with only two hidden units can exactly compute modular addition for any fixed input length, and that with a bias term it can do so uniformly for all lengths (one such construction is sketched after this list).
  • ReLU limitation: Shows that ReLU networks need a hidden‑layer width that grows linearly with the input length $m$ to achieve the same exactness, and that they cannot simultaneously fit two different lengths that have distinct residues modulo the modulus $p$.
  • Generalization theory: Introduces a novel Natarajan‑dimension bound for constant‑width sine networks, yielding a near‑optimal sample complexity of $\widetilde{O}(p)$ for empirical risk minimization (ERM).
  • Margin‑based over‑parameterized analysis: Derives width‑independent, margin‑driven generalization guarantees for sine networks when they are heavily over‑parameterized.
  • Empirical validation: Demonstrates that sine‑activated networks consistently outperform ReLU counterparts on both interpolation (fitting training data) and extrapolation (predicting unseen sequence lengths), across a range of problem sizes.
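
To make the two‑unit expressivity claim concrete, here is one natural construction for a fixed input length, based on the first Fourier harmonic (our reconstruction; the paper's exact weights may differ). With $s=\sum_{i=1}^m x_i$ and $\omega = 2\pi/p$, take the two hidden units

$$ h_1 = \sin(\omega s), \qquad h_2 = \cos(\omega s), $$

and a linear read‑out whose $k$-th logit is

$$ g_k = \sin(\omega k)\,h_1 + \cos(\omega k)\,h_2 = \cos\bigl(\omega (s-k)\bigr), $$

which equals $1$ exactly when $k \equiv s \pmod p$, so $\arg\max_{k} g_k = s \bmod p$.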

Methodology

  1. Problem setup – The task is to learn the function
    $$ f(x_1,\dots,x_m)=\Bigl(\sum_{i=1}^m x_i\Bigr) \bmod p, $$
    where each $x_i$ is an integer in $\{0,\dots,p-1\}$. The authors treat this as a classification problem with $p$ possible outputs.

  2. Network architectures

    • Sine MLP: A two‑layer feed‑forward network where the hidden units apply $\sin(\cdot)$ (or $\cos(\cdot)$), followed by a linear read‑out.
    • ReLU MLP: Same depth but with the standard piecewise‑linear ReLU activation.
  3. Expressivity analysis – Using trigonometric identities (e.g., the discrete Fourier transform of the modular sum), they construct explicit weight settings that realize the exact modular‑addition mapping with only two sine units; a numerical check of this construction is sketched right after this list. For ReLU, they prove a lower bound on the necessary width via combinatorial arguments about linear regions.

  4. Generalization bounds

    • Natarajan dimension: They compute the Natarajan dimension of the hypothesis class defined by constant‑width sine networks, leading to a sample‑complexity bound that scales only with the modulus $p$.
    • Margin analysis: In the over‑parameterized regime, they bound the Rademacher complexity using the network’s margin, showing that width does not appear in the final bound.
  5. Experiments – Synthetic modular‑addition datasets are generated for varying lengths $m$ and moduli $p$. Both sine and ReLU networks are trained with standard SGD/Adam, and performance is measured on (a) interpolation (same length as training) and (b) extrapolation (longer lengths).
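
As a quick sanity check of steps 1 and 3, the sketch below generates synthetic modular‑addition data and evaluates the explicit two‑unit sine/cos construction described above. This is our reconstruction, not the paper's code; the function name `sine_two_unit_predict` and the modulus, length, and sample count are arbitrary choices.

```python
import numpy as np

# Minimal numerical check of the two-hidden-unit sine/cos construction
# (our reconstruction; the paper's exact weights may differ).
# Hidden layer: [sin(2*pi*s/p), cos(2*pi*s/p)] with s = sum(x) (all-ones input weights).
# Read-out row for class k: [sin(2*pi*k/p), cos(2*pi*k/p)], so the k-th logit equals
# cos(2*pi*(s - k)/p), which is maximized exactly when k ≡ s (mod p).

def sine_two_unit_predict(X, p):
    """Predict (sum of digits) mod p with a 2-hidden-unit sine/cos network."""
    omega = 2 * np.pi / p
    s = X.sum(axis=1)                                                   # pre-activation w·x, w = all-ones
    hidden = np.stack([np.sin(omega * s), np.cos(omega * s)], axis=1)   # shape (n, 2)
    ks = np.arange(p)
    readout = np.stack([np.sin(omega * ks), np.cos(omega * ks)], axis=0)  # shape (2, p)
    logits = hidden @ readout                                            # logit_k = cos(omega * (s - k))
    return logits.argmax(axis=1)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    p, m, n = 97, 25, 10_000                     # hypothetical modulus, length, sample count
    X = rng.integers(0, p, size=(n, m))
    y = X.sum(axis=1) % p                        # labels as in the problem setup above
    acc = (sine_two_unit_predict(X, p) == y).mean()
    print(f"exact-construction accuracy: {acc:.4f}")  # expect 1.0 (up to float precision)
```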

Results & Findings

| Setting | Network | Width needed for exact fit | Test accuracy (interpolation) | Test accuracy (extrapolation) |
|---|---|---|---|---|
| Fixed $m$ | Sine (2‑unit) | 2 | 100 % | 100 % (even for unseen lengths) |
| Fixed $m$ | ReLU | $\Theta(m)$ | ≈ 100 % (when width matches bound) | Drops sharply for longer lengths |
| Varying $m$ | Sine (2‑unit + bias) | 2 | 100 % | 100 % up to lengths far beyond training |
| Varying $m$ | ReLU | $\Theta(m)$ | 100 % (only when width scales) | Fails to generalize beyond trained length |

  • Sample complexity: Empirical curves confirm the $\widetilde{O}(p)$ scaling predicted by the Natarajan‑dimension bound: doubling $p$ roughly doubles the number of training examples needed for a target error.
  • Margin effects: Networks trained with larger margins (via weight decay or explicit margin loss) exhibit tighter generalization, matching the theoretical margin‑based bound.
  • Robustness: Sine networks remain stable under noisy inputs and modest weight perturbations, whereas ReLU networks show higher variance in predictions.
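
For readers who want to reproduce the qualitative sine‑vs‑ReLU comparison above, here is a minimal, hypothetical training sketch. The zero‑padding encoding (zeros leave the modular sum unchanged), the modulus, widths, and hyperparameters are our choices, not the paper's, so the exact numbers will differ.

```python
import torch
import torch.nn as nn

# Hypothetical sine-vs-ReLU comparison on modular addition.
# Inputs of different lengths are zero-padded to MAX_LEN; the paper's exact
# encoding, hyperparameters, and evaluation protocol may differ.

P, TRAIN_LEN, MAX_LEN, WIDTH = 23, 10, 20, 128
torch.manual_seed(0)

class MLP(nn.Module):
    """Two-layer MLP with a pluggable elementwise activation."""
    def __init__(self, activation):
        super().__init__()
        self.fc1 = nn.Linear(MAX_LEN, WIDTH)
        self.fc2 = nn.Linear(WIDTH, P)
        self.act = activation
    def forward(self, x):
        return self.fc2(self.act(self.fc1(x)))

def make_batch(n, length):
    """Random modular-addition examples of a given length, zero-padded to MAX_LEN."""
    x = torch.randint(0, P, (n, length))
    y = x.sum(dim=1) % P
    x = torch.cat([x, torch.zeros(n, MAX_LEN - length, dtype=torch.long)], dim=1)
    return x.float(), y

def accuracy(model, length, n=2000):
    x, y = make_batch(n, length)
    with torch.no_grad():
        return (model(x).argmax(dim=1) == y).float().mean().item()

for name, act in [("sine", torch.sin), ("relu", torch.relu)]:
    model = MLP(act)
    opt = torch.optim.Adam(model.parameters(), lr=1e-3)
    loss_fn = nn.CrossEntropyLoss()
    for step in range(5000):
        x, y = make_batch(256, TRAIN_LEN)       # train only on length TRAIN_LEN
        opt.zero_grad()
        loss_fn(model(x), y).backward()
        opt.step()
    print(f"{name}: interpolation (len {TRAIN_LEN}) acc = {accuracy(model, TRAIN_LEN):.3f}, "
          f"extrapolation (len {MAX_LEN}) acc = {accuracy(model, MAX_LEN):.3f}")
```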

Practical Implications

  1. Cryptography & Secure Computation – Many protocols require modular arithmetic (e.g., secret sharing, homomorphic encryption). Sine‑based neural surrogates could provide fast, differentiable approximations that retain exactness for small‑scale prototypes.

  2. Error‑correcting codes – Decoding algorithms often involve modular sums. Embedding sine‑MLPs into end‑to‑end learned decoders could reduce model size dramatically while preserving exact decoding logic.

  3. Resource‑constrained devices – The constant‑width, two‑unit sine network can be implemented with minimal memory and compute, making it attractive for microcontrollers or edge AI chips that need arithmetic reasoning.

  4. Neural architecture design – The work suggests a broader design principle: periodic activations can encode arithmetic structure more compactly than piecewise‑linear functions. Practitioners building models for tasks with inherent modular or cyclic patterns (e.g., time‑of‑day forecasting, robotics joint angles) might experiment with sinusoidal activations; a small drop‑in sketch follows this list.

  5. Generalization‑focused training – The margin‑based analysis provides a concrete recipe (regularization, larger hidden‑layer norms) to achieve width‑independent generalization, useful when scaling models beyond the minimal architecture.
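
As referenced in point 4, a periodic activation is easy to try as a drop‑in replacement for ReLU. The `Sine` module below is our sketch (not the paper's implementation); the frequency scale `w0` and the example architecture are arbitrary, and SIREN‑style initialization details are omitted.

```python
import torch
import torch.nn as nn

class Sine(nn.Module):
    """Elementwise periodic activation: x -> sin(w0 * x)."""
    def __init__(self, w0: float = 1.0):
        super().__init__()
        self.w0 = w0  # frequency scale for the pre-activations
    def forward(self, x):
        return torch.sin(self.w0 * x)

# Example: swap ReLU for Sine in an otherwise standard head for a cyclic feature
# such as time-of-day (hypothetical dimensions).
model = nn.Sequential(nn.Linear(8, 64), Sine(w0=1.0), nn.Linear(64, 1))
```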

Limitations & Future Work

  • Scalability to large moduli: While the theory guarantees $\widetilde{O}(p)$ samples, training becomes costly for cryptographically sized $p$ (e.g., 2048‑bit moduli). Efficient training tricks or hierarchical decompositions are needed.
  • Extension beyond addition: The paper focuses on modular addition; it remains open whether similar sinusoidal expressivity holds for multiplication, exponentiation, or more complex group operations.
  • Hardware considerations: Implementing high‑frequency sine activations on fixed‑point hardware may introduce quantization errors; exploring approximations (e.g., lookup tables or piecewise sinusoidal approximations) is a practical next step. A toy lookup‑table sketch follows this list.
  • Broader activation families: Investigating other periodic functions (e.g., cosine, sawtooth, or learned Fourier bases) could reveal trade‑offs between expressivity, training stability, and hardware friendliness.
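
To illustrate the lookup‑table idea from the hardware bullet above, here is a toy sketch (ours, not a proposal from the paper): quantize the phase into a fixed number of bins and replace `sin(x)` with a table read, as one might on fixed‑point hardware. The table size and the `lut_sin` helper are arbitrary choices.

```python
import numpy as np

LUT_SIZE = 256
LUT = np.sin(2 * np.pi * np.arange(LUT_SIZE) / LUT_SIZE)  # precomputed sine table

def lut_sin(x):
    """Approximate sin(x) with a 256-entry lookup table indexed by the wrapped phase."""
    phase = (x / (2 * np.pi)) % 1.0                        # wrap into [0, 1)
    idx = np.round(phase * LUT_SIZE).astype(int) % LUT_SIZE
    return LUT[idx]

x = np.linspace(-10, 10, 10_000)
print("max abs error:", np.max(np.abs(lut_sin(x) - np.sin(x))))  # roughly pi / LUT_SIZE
```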

Overall, the study provides a compelling case for revisiting sinusoidal activations when the target task has an underlying modular or periodic structure, opening new avenues for compact, generalizable neural models in both research and production settings.

Authors

  • Tianlong Huang
  • Zhiyuan Li

Paper Information

  • arXiv ID: 2511.23443v1
  • Categories: cs.LG, stat.ML
  • Published: November 28, 2025