[Paper] Geometric Monomial (GEM): a family of rational 2N-differentiable activation functions
Source: arXiv - 2604.21677v1
Overview
The paper introduces Geometric Monomial (GEM), a new family of activation functions that are smooth up to the (2N)-th derivative while still behaving like the popular ReLU. By using a log‑logistic cumulative distribution function (CDF) and purely rational arithmetic, GEM‑based activations can be evaluated efficiently on CPUs, GPUs, and even edge accelerators—yet they provide the gradient‑friendly properties that many modern architectures (CNNs, Vision Transformers, LLMs) crave.
Key Contributions
- $C^{2N}$-smooth activation family – a mathematically grounded set of functions whose first $2N$ derivatives are continuous, addressing ReLU’s non‑smooth “kink”.
- Three concrete variants:
  - GEM – the base smooth activation.
  - E‑GEM – adds an $\varepsilon$ scaling parameter that lets the function approximate ReLU arbitrarily well in any $L^{p}$ norm.
  - SE‑GEM – a piecewise version that guarantees no dead neurons while preserving $C^{2N}$ junction smoothness.
- Empirical “N‑ablation” study – shows that $N=1$ is optimal for typical deep CNNs, while $N=2$ works better for transformer‑style models.
- State‑of‑the‑art results on a range of benchmarks:
  - CIFAR‑100 + ResNet‑56: GEM narrows the gap to GELU from 6.10 % to 2.12 % (E‑GEM narrows it further, to 0.62 %).
  - CIFAR‑10 + ResNet‑56: SE‑GEM ($\varepsilon=10^{-4}$) outperforms GELU (92.51 % vs. 92.44 %).
  - MNIST: E‑GEM matches the best baseline (99.23 %).
  - GPT‑2 (124 M): GEM yields the lowest perplexity (72.57 vs. 73.76 for GELU).
  - BERT‑small: E‑GEM ($\varepsilon=10$) achieves the best validation loss (6.656).
Methodology
- Design of the gate – The activation’s “gate” follows a log‑logistic CDF, giving a smooth S‑shaped curve that can be expressed with simple rational functions (ratios of polynomials); see the code sketch after this list.
- Smoothness control via $N$ – Raising the base rational expression to the power $N$ yields a family continuously differentiable up to order $2N$. In practice, $N=1$ or $N=2$ suffices to reap the benefits without heavy computational cost.
- $\varepsilon$-parameterization (E‑GEM) – Scaling the input by a factor $\varepsilon$ stretches or compresses the activation, allowing it to mimic ReLU as closely as desired in an $L^{p}$ sense. Small $\varepsilon$ values make the function steeper (more ReLU‑like), while larger values give a gentler, more GELU‑like shape.
- Dead‑neuron protection (SE‑GEM) – The piecewise construction ensures that the derivative never hits zero for any finite input, eliminating the classic “dying ReLU” problem while keeping $C^{2N}$ smoothness at the junctions.
- Experimental protocol – Systematic ablation over $N$ and $\varepsilon$ across several model families (ResNet‑56, Vision Transformers, GPT‑2, BERT‑small) and datasets (MNIST, CIFAR‑10/100). Comparisons against standard activations (ReLU, GELU, Swish, Mish) use identical training pipelines to isolate the effect of the activation itself.
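The summary does not reproduce the paper’s exact formulas, so the PyTorch sketch below is only a plausible reconstruction of the ingredients described above: a purely rational, log‑logistic‑style gate on the positive half‑line whose exponent sets the $C^{2N}$ junction smoothness, plus an optional $\varepsilon$ rescaling in the spirit of E‑GEM. The class name, the exact gate, and the $\varepsilon$ parameterization are assumptions, not the paper’s definitions.

```python
from typing import Optional

import torch
import torch.nn as nn


class GEMLike(nn.Module):
    """Hypothetical GEM-style activation: a rational log-logistic gate, C^{2N}-smooth."""

    def __init__(self, n: int = 1, eps: Optional[float] = None):
        super().__init__()
        self.n = n        # N controls smoothness: the junction at 0 is C^{2N}
        self.eps = eps    # optional E-GEM-style rescaling; None means plain GEM

    def _gem(self, x: torch.Tensor) -> torch.Tensor:
        # Log-logistic CDF gate on the positive half-line (shape 2N, scale 1):
        #   F(x) = x^{2N} / (1 + x^{2N}) for x >= 0, and 0 for x < 0.
        # Gating the identity gives x * F(x) ~ x^{2N+1} near 0 (its first 2N
        # derivatives vanish, matching the zero branch) and ~ x for large x.
        xp = torch.clamp(x, min=0.0)
        p = xp ** (2 * self.n)
        return xp * p / (1.0 + p)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        if self.eps is None:
            return self._gem(x)
        # Assumed E-GEM rescaling f_eps(x) = eps * f(x / eps): the transition
        # region shrinks as eps -> 0, so the function approaches ReLU.
        return self.eps * self._gem(x / self.eps)


if __name__ == "__main__":
    act = GEMLike(n=1, eps=1e-4)
    print(act(torch.linspace(-2.0, 2.0, 5)))
```

With a construction like this, the output behaves as $x^{2N+1}$ near zero (so the first $2N$ derivatives agree with the zero branch) and approaches $x$ for large inputs, which matches the ReLU‑like, purely rational behaviour the paper describes.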
Results & Findings
| Model / Dataset | Activation | Accuracy / Perplexity / Loss | Notable Δ vs. GELU |
|---|---|---|---|
| ResNet‑56 (CIFAR‑100) | GEM (N=2) | – | Gap to GELU narrows from 6.10 % to 2.12 % |
| ResNet‑56 (CIFAR‑100) | E‑GEM (ε≈10⁻⁴) | – | Gap narrows further, to 0.62 % |
| ResNet‑56 (CIFAR‑10) | SE‑GEM (ε=10⁻⁴) | 92.51 % | + 0.07 % over GELU |
| MNIST (simple MLP) | E‑GEM | 99.23 % | Ties best baseline |
| GPT‑2 (124 M) | GEM (N=1) | Perplexity 73.32 | Beats GELU (73.76) |
| GPT‑2 (124 M) | GEM (N=2) | Perplexity 72.57 | Best overall |
| BERT‑small | E‑GEM (ε=10) | Val‑loss 6.656 | Best among all tested activations |
Key Takeaways
- Smoothness matters: Even a single extra order of derivative continuity ($N=1$) already narrows the performance gap to GELU for deep CNNs.
- Task‑specific $\varepsilon$: Small $\varepsilon$ ($\approx 10^{-4}$–$10^{-6}$) works best for very deep convolutional stacks, whereas larger $\varepsilon$ ($\approx 10$) benefits shallow transformer models where gradients are less constrained.
- No dead neurons: SE‑GEM consistently avoids the “dying” phenomenon without sacrificing accuracy, a practical win for production pipelines that monitor activation health.
Practical Implications
- Drop‑in replacement for ReLU/GELU – Because GEM, E‑GEM, and SE‑GEM are expressed with rational functions, they can be implemented with a handful of arithmetic ops and a single division; no exotic kernels or approximations are required. Existing deep‑learning frameworks (PyTorch, TensorFlow, JAX) can add them as custom ops with negligible overhead (a swap helper is sketched after this list).
- Improved training stability – Higher‑order smoothness reduces gradient “shocks” at the activation boundary, leading to smoother loss curves and potentially fewer training restarts for very deep or large‑batch setups.
- Edge‑friendly inference – Rational arithmetic suits integer‑only or fixed‑point hardware (e.g., microcontrollers, ASICs) because divisions can be approximated with multiplication by a pre‑computed reciprocal (see the fixed‑point sketch after this list). This opens the door to smoother activations on latency‑critical inference workloads.
- Better transformer performance – The finding that $N=2$ benefits transformer‑style models suggests that language‑model developers can experiment with GEM at $N=2$ to squeeze out a few perplexity points without changing the architecture or training schedule.
- Mitigating dead‑neuron bugs – SE‑GEM’s guarantee of non‑zero gradients eliminates a whole class of debugging headaches (e.g., layers that stop learning because all ReLUs have saturated to zero).
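To illustrate the drop‑in claim, here is a small helper (hypothetical, not from the paper) that recursively swaps every `nn.GELU` in an existing PyTorch model for any replacement factory, such as the `GEMLike` sketch from the Methodology section.

```python
import torch.nn as nn


def swap_activations(model: nn.Module, old_cls: type, make_new) -> nn.Module:
    """Recursively replace every submodule of type `old_cls` with a fresh
    module produced by `make_new` (e.g. lambda: GEMLike(n=2))."""
    for name, child in model.named_children():
        if isinstance(child, old_cls):
            setattr(model, name, make_new())
        else:
            swap_activations(child, old_cls, make_new)
    return model


# Example: swap GELU in a toy MLP; substitute the GEMLike sketch above for nn.SiLU.
mlp = nn.Sequential(nn.Linear(16, 32), nn.GELU(), nn.Linear(32, 10))
swap_activations(mlp, nn.GELU, nn.SiLU)
```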
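The fixed‑point remark is easiest to see for a divisor that is known ahead of time (a constant baked into a quantized kernel): the division is replaced offline by a precomputed reciprocal plus a multiply‑and‑shift at inference. The snippet below is a generic Q16 idiom, not code from the paper; input‑dependent denominators would instead need a reciprocal table or an iterative approximation.

```python
FRAC_BITS = 16                     # Q16 fixed point: a real r is stored as round(r * 2**16)


def precompute_reciprocal(d: float) -> int:
    """Offline: store 1/d as a Q16 integer so inference never divides."""
    return round((1.0 / d) * (1 << FRAC_BITS))


def divide_by_constant(x_q: int, recip_q: int) -> int:
    """At inference: x/d is approximated by (x_q * recip_q) >> FRAC_BITS,
    i.e. one integer multiply and one shift instead of a division."""
    return (x_q * recip_q) >> FRAC_BITS


# Example: 3.5 / 1.25 using only integer arithmetic (expect roughly 2.8).
recip = precompute_reciprocal(1.25)
x_q = round(3.5 * (1 << FRAC_BITS))
print(divide_by_constant(x_q, recip) / (1 << FRAC_BITS))
```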
Limitations & Future Work
- Computational cost vs. ReLU – Although rational arithmetic is cheap, it is still more expensive than the single‑comparison ReLU. For ultra‑high‑throughput inference (e.g., serving billions of requests per day), the trade‑off must be measured.
- Hyper‑parameter sensitivity – The $\varepsilon$ scale needs to be tuned per model family; the paper provides heuristics (small $\varepsilon$ for deep CNNs, larger for shallow transformers), but an automated selection method is still missing.
- Limited architectural diversity – Experiments focus on ResNet‑56, standard Vision Transformers, GPT‑2, and BERT‑small. It remains to be seen how GEM behaves in newer architectures such as diffusion models, graph neural networks, or massive LLMs (e.g., 70 B+ parameters).
- Theoretical analysis of generalization – While smoothness is argued to aid optimization, a formal link to generalization error or robustness (e.g., adversarial resistance) is not explored.
Future directions could include developing an adaptive $\varepsilon$ schedule that changes during training, integrating GEM into hardware‑accelerated kernels, and extending the smoothness analysis to understand its impact on model calibration and uncertainty estimation.
Authors
- Eylon E. Krause
Paper Information
- arXiv ID: 2604.21677v1
- Categories: cs.LG, cs.AI, cs.NE
- Published: April 23, 2026