[Paper] Geometric Monomial (GEM): A Family of Rational 2N‑Times Differentiable Activation Functions

Published: April 23, 2026, 10:42 PM GMT+9

Source: arXiv - 2604.21677v1

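Overview

This paper introduces **Geometric Monomial (GEM)**, a new family of activation functions that are smooth up to the (2N)-th derivative while still behaving like the popular ReLU. Because they are built from a log‑logistic cumulative distribution function (CDF) and purely rational arithmetic, GEM‑based activations can be evaluated efficiently on CPUs, GPUs, and edge accelerators, and they provide the gradient‑friendly properties that many modern architectures (CNNs, Vision Transformers, LLMs) rely on.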
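
For reference, the standard log‑logistic CDF with scale \alpha > 0 and shape \beta > 0 becomes a ratio of polynomials whenever \beta is a positive integer. The choice \beta = 2N below is an illustrative assumption made here to connect the CDF to the (2N) in the title; it is not a formula quoted from the paper.

```latex
\begin{align*}
F(x;\alpha,\beta)
  &= \frac{1}{1 + (x/\alpha)^{-\beta}}
   = \frac{x^{\beta}}{\alpha^{\beta} + x^{\beta}},
  \qquad x > 0, \quad F(0;\alpha,\beta) = 0, \\[4pt]
F(x;\alpha,2N)
  &= \frac{x^{2N}}{\alpha^{2N} + x^{2N}}
  \qquad \text{(assumed shape } \beta = 2N\text{)}.
\end{align*}
```

Under that assumption the gate needs only multiplications, one addition, and one division, which matches the "purely rational arithmetic" description.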

Key Contributions

(C^{2N})-smooth activation family – A mathematically grounded set of functions whose derivatives are continuous up to order (2N), removing the non‑smooth “kink” of ReLU.
Three concrete variants
- GEM – The base smooth activation.
- E‑GEM – Adds an (\varepsilon) scaling parameter so that the function can approximate ReLU arbitrarily closely in any (L^{p}) norm.
- SE‑GEM – A piecewise version that guarantees no *dead neurons* while keeping (C^{2N}) smoothness at the junctions.
Empirical “N‑ablation” study – Shows that (N=1) is optimal for standard deep CNNs, while (N=2) performs better for transformer‑style models.
Competitive or best results across several benchmarks:
- CIFAR‑100 + ResNet‑56: GEM narrows the gap to GELU from **6.10 %** to **2.12 %** (E‑GEM narrows it further to **0.62 %**).
- CIFAR‑10 + ResNet‑56: SE‑GEM ((\varepsilon=10^{-4})) outperforms GELU (92.51 % vs 92.44 %).
- MNIST: E‑GEM matches the best baseline (99.23 %).
- GPT‑2 (124 M): GEM achieves the lowest perplexity (72.57 vs 73.76 for GELU).
- BERT‑small: E‑GEM ((\varepsilon=10)) achieves the best validation loss (6.656).

Methodology

Design of the gate – The activation’s “gate” follows a log‑logistic CDF, giving a smooth S‑shaped curve that can be expressed with simple rational functions (ratios of polynomials).
Smoothness control via (N) – Raising the base rational expression to the power (N) yields a family continuously differentiable up to order (2N). In practice, (N=1) or (N=2) is enough to reap the benefits without heavy computational cost.
(\varepsilon)-parameterization (E‑GEM) – A scale parameter (\varepsilon) stretches or compresses the activation, allowing it to mimic ReLU as closely as desired in an (L^{p}) sense. Small (\varepsilon) values make the function steeper (more ReLU‑like), while larger values give a gentler, more GELU‑like shape.
Dead‑neuron protection (SE‑GEM) – The piecewise construction ensures that the derivative never hits zero for any finite input, eliminating the classic “dying ReLU” problem while keeping the (C^{2N}) smoothness at the junctions.
Experimental protocol – Systematic ablation over (N) and (\varepsilon) across several model families (ResNet‑56, Vision Transformers, GPT‑2, BERT‑small) and datasets (MNIST, CIFAR‑10/100). Comparisons against standard activations (ReLU, GELU, Swish, Mish) use identical training pipelines to isolate the effect of the activation itself.
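
The interplay of the gate, (N), and (\varepsilon) can be illustrated with a minimal Python sketch. The functional form used here, f(x) = x · x^(2N) / (ε^(2N) + x^(2N)) for x > 0 and 0 otherwise, is an assumption chosen because it combines a log‑logistic gate with rational arithmetic and is (C^{2N}) at the origin; the summary does not give the paper's exact definition, and the names `gated_monomial` and `max_gap_to_relu` are hypothetical.

```python
def relu(x: float) -> float:
    """Plain ReLU, used as the reference shape."""
    return max(0.0, x)


def gated_monomial(x: float, n: int = 1, eps: float = 1.0) -> float:
    """Hypothetical ReLU-like activation with a log-logistic gate.

    Assumed form (NOT taken from the paper):
        f(x) = x * x^(2n) / (eps^(2n) + x^(2n))   for x > 0
        f(x) = 0                                  for x <= 0
    Near x = 0+ the function behaves like x^(2n+1) / eps^(2n), so its
    first 2n derivatives vanish there and match the zero branch.
    """
    if x <= 0.0:
        return 0.0
    power = x ** (2 * n)
    return x * power / (eps ** (2 * n) + power)


def max_gap_to_relu(n: int, eps: float,
                    lo: float = -3.0, hi: float = 3.0,
                    steps: int = 6000) -> float:
    """Largest |ReLU(x) - f(x)| on a uniform grid over [lo, hi]."""
    gap = 0.0
    for i in range(steps + 1):
        x = lo + (hi - lo) * i / steps
        gap = max(gap, abs(relu(x) - gated_monomial(x, n=n, eps=eps)))
    return gap


if __name__ == "__main__":
    # Shrinking eps pulls the curve toward ReLU.
    for eps in (1.0, 0.1, 0.01):
        print(f"eps={eps:g}  max gap on [-3, 3]: {max_gap_to_relu(1, eps):.4f}")
```

For this assumed form with (N=1), the gap to ReLU is x·ε²/(ε² + x²), which peaks at x = ε with value ε/2, so the worst‑case deviation shrinks linearly with (\varepsilon). That is in line with the statement that small (\varepsilon) gives a more ReLU‑like shape.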

Results & Findings

Model / Dataset	Activation	Accuracy / Perplexity / Loss	Notable Δ vs. GELU
ResNet‑56 (CIFAR‑100)	GEM (N=2)	–	↓ 6.10 % gap
ResNet‑56 (CIFAR‑100)	E‑GEM (ε≈10⁻⁴)	–	↓ 0.62 % gap
ResNet‑56 (CIFAR‑10)	SE‑GEM (ε=10⁻⁴)	92.51 %	+ 0.07 % (vs. GELU)
MNIST (simple MLP)	E‑GEM	99.23 %	On par with the best baseline
GPT‑2 (124 M)	GEM (N=1)	Perplexity 73.32	Better than GELU (73.76)
GPT‑2 (124 M)	GEM (N=2)	Perplexity 72.57	Best overall
BERT‑small	E‑GEM (ε=10)	Val‑loss 6.656	Best of all tested activations
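
To put the table's deltas on a common footing, the short Python sketch below recomputes them from the reported figures only. The helper names are illustrative, and no numbers beyond those in the table are used.

```python
# Figures copied from the results table (lower perplexity is better).
GELU_CIFAR10_ACC = 92.44
SE_GEM_CIFAR10_ACC = 92.51
GELU_GPT2_PPL = 73.76
GEM_GPT2_PPL = {1: 73.32, 2: 72.57}  # keyed by N


def accuracy_gain(candidate: float, baseline: float) -> float:
    """Accuracy difference in percentage points, rounded to 2 decimals."""
    return round(candidate - baseline, 2)


def perplexity_reduction(candidate: float, baseline: float):
    """Return (absolute, relative %) perplexity reduction vs. the baseline."""
    absolute = baseline - candidate
    relative = 100.0 * absolute / baseline
    return round(absolute, 2), round(relative, 2)


if __name__ == "__main__":
    print("CIFAR-10 gain:", accuracy_gain(SE_GEM_CIFAR10_ACC, GELU_CIFAR10_ACC))
    for n, ppl in GEM_GPT2_PPL.items():
        print(f"GPT-2, N={n}:", perplexity_reduction(ppl, GELU_GPT2_PPL))
```

On these figures the GPT‑2 improvement for (N=2) is 1.19 perplexity points, about 1.6 % relative to GELU, and the CIFAR‑10 gain is 0.07 percentage points.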

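Key Takeaways

Smoothness matters: Adding even one level of derivative continuity ((N=1)) substantially narrows the performance gap to GELU on deep CNNs.
Task‑specific (\varepsilon): Small (\varepsilon) (≈10⁻⁴–10⁻⁶) works best for very deep convolutional stacks, while large (\varepsilon) (≈10) helps shallow transformer models, where gradients are less constrained.
No dead neurons: SE‑GEM consistently prevents the “dead neuron” phenomenon without sacrificing accuracy, a practical advantage for production pipelines that monitor activation health.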

Practical Implications

Drop‑in replacement for ReLU/GELU – Because GEM, E‑GEM, and SE‑GEM are expressed with rational functions, they can be implemented with a handful of arithmetic ops and a single division; no exotic kernels or approximations are required. Existing deep‑learning frameworks (PyTorch, TensorFlow, JAX) can add them as custom ops with low overhead.
Improved training stability – Higher‑order smoothness reduces gradient “shocks” at the activation boundary, leading to smoother loss curves and potentially fewer training restarts for very deep or large‑batch setups.
Edge‑friendly inference – Rational arithmetic is friendly to integer‑only or fixed‑point hardware (e.g., microcontrollers, ASICs) because the one division can be approximated by a reciprocal estimate (for example a lookup table or a few Newton–Raphson steps) followed by a multiplication. This opens the door for smoother activations on latency‑critical inference workloads.
Better transformer performance – The finding that (N=2) benefits transformer‑style models suggests that language‑model developers can experiment with GEM‑2 to gain roughly one perplexity point (72.57 vs 73.76 for GELU on GPT‑2 124 M) without changing the architecture or training schedule.
Mitigating dead‑neuron bugs – SE‑GEM’s guarantee of non‑zero gradients eliminates a whole class of debugging headaches (e.g., layers that stop learning because all ReLUs have saturated to zero).
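
The edge‑friendly point can be made concrete with an integer‑only Python sketch in Q16 fixed point. It evaluates the same assumed (N=1) form x³ / (1 + x²) for x > 0; that form and the helper names are illustrative assumptions, not the paper's kernel. It uses plain integer division rather than a reciprocal approximation, to keep the example short.

```python
FRAC_BITS = 16
ONE = 1 << FRAC_BITS  # 1.0 in Q16 fixed point


def to_fixed(x: float) -> int:
    """Convert a float to Q16 (only used at the boundary of the sketch)."""
    return int(round(x * ONE))


def to_float(q: int) -> float:
    """Convert a Q16 integer back to a float for inspection."""
    return q / ONE


def gated_monomial_q16(x_q: int) -> int:
    """Integer-only evaluation of the ASSUMED form x^3 / (1 + x^2), x > 0.

    Python integers never overflow; a real kernel would have to bound the
    input range or widen the intermediates to avoid overflow.
    """
    if x_q <= 0:
        return 0
    x2_q = (x_q * x_q) >> FRAC_BITS   # x^2 in Q16
    numerator = x_q * x2_q            # x^3, scaled by 2^32 (two Q16 factors)
    denominator = ONE + x2_q          # 1 + x^2 in Q16
    return numerator // denominator   # scale 2^32 / 2^16 -> Q16 result


if __name__ == "__main__":
    for x in (-1.0, 0.25, 1.0, 1.5, 4.0):
        approx = to_float(gated_monomial_q16(to_fixed(x)))
        exact = x ** 3 / (1.0 + x * x) if x > 0 else 0.0
        print(f"x={x:5.2f}  fixed={approx:.5f}  float={exact:.5f}")
```

The whole evaluation uses two multiplications, one shift, one addition, and one division, which is the kind of operation budget the fixed‑point argument relies on.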

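Limitations & Future Work

Compute cost vs. ReLU – Rational arithmetic is cheap, but it still costs more than ReLU's single comparison. For very high‑throughput inference (e.g., billions of requests per day), this trade‑off should be measured.
Hyperparameter sensitivity – The (\varepsilon) scale has to be tuned per model family; the paper offers a heuristic (small (\varepsilon) for deep CNNs, large (\varepsilon) for shallow transformers), but there is no automatic selection method yet.
Limited architectural diversity – Experiments focus on ResNet‑56, a standard Vision Transformer, GPT‑2, and BERT‑small. How GEM behaves on other architectures (e.g., diffusion models, graph neural networks, or large LLMs at the 70 B+ parameter scale) has not been verified.
Theoretical analysis of generalization – The paper argues that smoothness helps optimization, but it does not explore a quantitative link to generalization error or robustness (e.g., resistance to adversarial attacks).

Future work could include developing adaptive (\varepsilon) schedules that change during training, integrating GEM into hardware‑accelerated kernels, and extending the smoothness analysis to understand its effect on model calibration and uncertainty estimation.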

Authors

Eylon E. Krause

Paper Information

arXiv ID: 2604.21677v1
Categories: cs.LG, cs.AI, cs.NE
Published: April 23, 2026
