[Paper] Enhancing Burmese News Classification with Kolmogorov-Arnold Network Head Fine-tuning

Published: November 26, 2025 at 12:50 AM EST
3 min read

Source: arXiv - 2511.21081v1

Overview

The paper investigates a simple yet powerful tweak for Burmese news classification: swapping the usual dense‑layer (MLP) head with a Kolmogorov‑Arnold Network (KAN) head. By fine‑tuning only this classification layer on top of frozen embeddings (TF‑IDF, fastText, or multilingual transformers), the authors show that KANs can match or beat traditional MLPs while often being faster and more parameter‑efficient—an attractive proposition for low‑resource language projects.

Key Contributions

  • Introduced KAN‑based classification heads (FourierKAN, EfficientKAN, FasterKAN) for low‑resource text classification.
  • Benchmarked KAN heads against standard MLPs across four embedding families (TF‑IDF, fastText, mBERT, Distil‑mBERT).
  • Achieved state‑of‑the‑art F1‑score (0.928) on Burmese news classification using EfficientKAN + fastText.
  • Demonstrated a speed‑accuracy trade‑off: FasterKAN delivers near‑MLP performance with lower latency.
  • Provided an open‑source reproducible pipeline that can be adapted to any language with limited labeled data.

Methodology

  1. Data & Task – A curated Burmese news dataset (multiple categories) was split into train/validation/test sets.
  2. Embeddings – Four pre‑computed representations were used:
    • Sparse TF‑IDF vectors
    • Dense fastText word‑averages (pre‑trained on Burmese corpora)
    • Multilingual BERT (mBERT) and its distilled variant (Distil‑mBERT) – both frozen during training.
  3. Classification Heads – For each embedding, three KAN variants were instantiated:
    • FourierKAN – builds each neuron as a sum of Fourier basis functions.
    • EfficientKAN – uses spline‑based basis functions for compact, differentiable mappings.
    • FasterKAN – a grid‑based approximation that trades a tiny amount of expressivity for speed.
    The baseline head is a classic two‑layer MLP with ReLU activation.
  4. Training – Only the head parameters are fine‑tuned (≈ 1–2 % of total model parameters), using the Adam optimizer, early stopping on validation F1, and a class‑balanced cross‑entropy loss (a minimal sketch of this head‑only setup follows the list).
  5. Evaluation – Macro‑averaged F1, inference latency (ms per sample), and parameter count were recorded for each head‑embedding pair.
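
The snippet below is a minimal sketch of this head‑only setup, not the authors' released pipeline: it implements the "sum of Fourier basis functions" idea behind FourierKAN from scratch and trains only the head on pre‑computed embeddings with Adam and cross‑entropy. The names (`FourierKANHead`, `train_head`), the 300‑dimensional fastText input, the five‑class label space, and the full‑batch loop with a crude early‑stopping surrogate are all illustrative assumptions.

```python
# Hedged sketch: a KAN-style classification head fine-tuned on frozen embeddings.
import torch
import torch.nn as nn
import torch.nn.functional as F


class FourierKANHead(nn.Module):
    """Maps a frozen embedding to class logits via learnable Fourier series."""

    def __init__(self, in_dim: int, num_classes: int, grid_size: int = 5):
        super().__init__()
        self.grid_size = grid_size
        # One (cos, sin) coefficient per (output class, input feature, frequency).
        self.coeffs = nn.Parameter(
            torch.randn(2, num_classes, in_dim, grid_size) / (in_dim * grid_size) ** 0.5
        )
        self.bias = nn.Parameter(torch.zeros(num_classes))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, in_dim); frequencies k = 1..grid_size
        k = torch.arange(1, self.grid_size + 1, device=x.device, dtype=x.dtype)
        angles = x.unsqueeze(-1) * k                      # (batch, in_dim, grid)
        cos, sin = torch.cos(angles), torch.sin(angles)
        logits = (
            torch.einsum("big,oig->bo", cos, self.coeffs[0])
            + torch.einsum("big,oig->bo", sin, self.coeffs[1])
        )
        return logits + self.bias


def train_head(head, train_emb, train_y, val_emb, val_y, epochs=50, lr=1e-3):
    """Fine-tune only the head; the encoder that produced the embeddings stays frozen."""
    opt = torch.optim.Adam(head.parameters(), lr=lr)
    best_state, best_val = None, float("inf")
    for _ in range(epochs):
        head.train()
        opt.zero_grad()
        loss = F.cross_entropy(head(train_emb), train_y)
        loss.backward()
        opt.step()
        head.eval()
        with torch.no_grad():
            val_loss = F.cross_entropy(head(val_emb), val_y).item()
        if val_loss < best_val:            # crude early-stopping surrogate
            best_val = val_loss
            best_state = {k: v.clone() for k, v in head.state_dict().items()}
    head.load_state_dict(best_state)
    return head


# Illustrative usage with 300-dim fastText sentence averages and 5 news categories:
# head = FourierKANHead(in_dim=300, num_classes=5)
# head = train_head(head, train_emb, train_y, val_emb, val_y)
```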

Results & Findings

| Embedding    | Head           | Macro F1 | Params (M) | Inference (ms) |
|--------------|----------------|----------|------------|----------------|
| fastText     | EfficientKAN   | 0.928    | 0.12       | 1.8            |
| fastText     | FasterKAN      | 0.921    | 0.09       | 1.2            |
| fastText     | MLP (baseline) | 0.914    | 0.15       | 2.3            |
| mBERT        | EfficientKAN   | 0.917    | 0.14       | 3.1            |
| mBERT        | MLP            | 0.915    | 0.16       | 3.4            |
| mBERT        | FasterKAN      | 0.910    | 0.11       | 2.8            |
| TF‑IDF       | EfficientKAN   | 0.862    | 0.08       | 1.5            |
| TF‑IDF       | MLP            | 0.858    | 0.10       | 1.7            |
| Distil‑mBERT | FasterKAN      | 0.904    | 0.12       | 2.5            |
  • Expressiveness: KAN heads consistently match or surpass the MLP baseline, especially on fastText embeddings, where the non‑linear spline basis captures subtle lexical patterns.
  • Efficiency: FasterKAN reduces inference time by roughly 30 % compared to the MLP head while staying within 0.5 % F1 of the best model (a small measurement sketch follows these bullets).
  • Robustness to Embedding Choice: Even with simple TF‑IDF vectors, KANs improve performance, indicating that the head’s functional form matters as much as the encoder.
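
As a rough guide to how numbers like those in the table could be gathered, the sketch below computes macro‑averaged F1 with scikit‑learn and averages single‑sample forward‑pass times; it assumes the hypothetical `head`, `test_emb`, and `test_y` from the earlier methodology sketch and is not the authors' evaluation code.

```python
# Hedged sketch: macro-F1, parameter count, and mean per-sample latency for one
# head/embedding pair.
import time

import torch
from sklearn.metrics import f1_score


def evaluate(head, test_emb: torch.Tensor, test_y: torch.Tensor):
    head.eval()
    with torch.no_grad():
        preds = head(test_emb).argmax(dim=-1)
    macro_f1 = f1_score(test_y.cpu().numpy(), preds.cpu().numpy(), average="macro")

    n_params = sum(p.numel() for p in head.parameters())

    # Per-sample latency: time single-example forward passes on CPU and average.
    # (On GPU you would also need torch.cuda.synchronize() around the timed region.)
    times = []
    with torch.no_grad():
        for x in test_emb[:200]:                 # a few hundred samples is enough here
            t0 = time.perf_counter()
            head(x.unsqueeze(0))
            times.append(time.perf_counter() - t0)
    latency_ms = 1000 * sum(times) / len(times)
    return macro_f1, n_params, latency_ms
```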

Practical Implications

  • Low‑Resource Deployment: Teams building classifiers for under‑represented languages can keep large multilingual encoders frozen (saving GPU memory) and swap in a lightweight KAN head for a noticeable boost.
  • Edge & Mobile Scenarios: FasterKAN’s low parameter count and fast inference make it suitable for on‑device news categorization, chat‑bot intent detection, or content moderation where bandwidth is limited.
  • Rapid Prototyping: Because only the head is trained, experiments finish in minutes on a single GPU, enabling quick A/B testing of new label sets or domain shifts.
  • Transferability: The same KAN‑head architecture can be dropped onto any frozen embedding (e.g., CLIP for images, wav2vec for audio), opening the door to cross‑modal low‑resource tasks; a short sketch of this reuse follows.
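
To make the transferability point concrete, the lines below instantiate the same hypothetical `FourierKANHead` class from the methodology sketch on top of other frozen feature extractors; the feature dimensions and class count are assumptions for illustration, not experiments from the paper.

```python
# Illustrative only: the head sees nothing but a fixed-size feature vector, so it can
# sit on top of any frozen encoder. Dimensions below are assumed, not from the paper.
image_head = FourierKANHead(in_dim=512, num_classes=10)   # e.g., frozen CLIP ViT-B/32 features
audio_head = FourierKANHead(in_dim=768, num_classes=10)   # e.g., frozen wav2vec 2.0 pooled features
# Both are trained exactly like the text head: only the head parameters receive gradients.
```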

Limitations & Future Work

  • Frozen Encoder Assumption – The study does not explore joint fine‑tuning of the transformer; gains might be larger (or smaller) when the encoder is also updated.
  • Scalability to Very Large Label Spaces – Experiments were limited to ~10 news categories; performance on hundreds of classes remains untested.
  • Interpretability – While KANs are mathematically grounded, visualizing the learned spline/Fourier basis for text remains an open research question.
  • Broader Language Coverage – The authors plan to evaluate KAN heads on other low‑resource languages (e.g., Khmer, Lao) and on multilingual multi‑task settings.

Authors

  • Thura Aung
  • Eaint Kay Khaing Kyaw
  • Ye Kyaw Thu
  • Thazin Myint Oo
  • Thepchai Supnithi

Paper Information

  • arXiv ID: 2511.21081v1
  • Categories: cs.CL, cs.AI, cs.LG
  • Published: November 26, 2025