[Paper] Enhancing Burmese News Classification with Kolmogorov-Arnold Network Head Fine-tuning

Published: November 26, 2025 at 12:50 AM EST
3 min read

Source: arXiv - 2511.21081v1

Overview

The paper investigates a simple yet powerful tweak for Burmese news classification: swapping the usual dense‑layer (MLP) head with a Kolmogorov‑Arnold Network (KAN) head. By fine‑tuning only this classification layer on top of frozen embeddings (TF‑IDF, fastText, or multilingual transformers), the authors show that KANs can match or beat traditional MLPs while often being faster and more parameter‑efficient—an attractive proposition for low‑resource language projects.

Key Contributions

  • Introduced KAN‑based classification heads (FourierKAN, EfficientKAN, FasterKAN) for low‑resource text classification.
  • Benchmarked KAN heads against standard MLPs across four embedding families (TF‑IDF, fastText, mBERT, Distil‑mBERT).
  • Achieved state‑of‑the‑art F1‑score (0.928) on Burmese news classification using EfficientKAN + fastText.
  • Demonstrated a speed‑accuracy trade‑off: FasterKAN delivers near‑MLP performance with lower latency.
  • Provided an open‑source reproducible pipeline that can be adapted to any language with limited labeled data.

Methodology

  1. Data & Task – A curated Burmese news dataset (multiple categories) was split into train/validation/test sets.
  2. Embeddings – Four pre‑computed representations were used:
    • Sparse TF‑IDF vectors
    • Dense fastText word‑averages (pre‑trained on Burmese corpora)
    • Multilingual BERT (mBERT) and its distilled variant (Distil‑mBERT) – both frozen during training.
  3. Classification Heads – For each embedding, three KAN variants were instantiated:
    • FourierKAN – builds each neuron as a sum of Fourier basis functions.
    • EfficientKAN – uses spline‑based basis functions for compact, differentiable mappings.
    • FasterKAN – a grid‑based approximation that trades a tiny amount of expressivity for speed.
    The baseline head is a classic two‑layer MLP with ReLU activation.
  4. Training – Only the head parameters are fine‑tuned (≈ 1–2 % of total model parameters), using the Adam optimizer, early stopping on validation F1, and a class‑balanced cross‑entropy loss (a minimal sketch of this head‑only setup follows the list).
  5. Evaluation – Macro‑averaged F1, inference latency (ms per sample), and parameter count were recorded for each head‑embedding pair.
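
The snippet below is a minimal sketch of this head‑only setup, not the authors' released pipeline: it implements the "sum of Fourier basis functions" idea behind FourierKAN from scratch and trains only the head on pre‑computed embeddings with Adam and cross‑entropy. The names (`FourierKANHead`, `train_head`), the 300‑dimensional fastText input, the five‑class label space, and the full‑batch loop with a crude early‑stopping surrogate are all illustrative assumptions.

```python
# Hedged sketch: a KAN-style classification head fine-tuned on frozen embeddings.
import torch
import torch.nn as nn
import torch.nn.functional as F


class FourierKANHead(nn.Module):
    """Maps a frozen embedding to class logits via learnable Fourier series."""

    def __init__(self, in_dim: int, num_classes: int, grid_size: int = 5):
        super().__init__()
        self.grid_size = grid_size
        # One (cos, sin) coefficient per (output class, input feature, frequency).
        self.coeffs = nn.Parameter(
            torch.randn(2, num_classes, in_dim, grid_size) / (in_dim * grid_size) ** 0.5
        )
        self.bias = nn.Parameter(torch.zeros(num_classes))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, in_dim); frequencies k = 1..grid_size
        k = torch.arange(1, self.grid_size + 1, device=x.device, dtype=x.dtype)
        angles = x.unsqueeze(-1) * k                      # (batch, in_dim, grid)
        cos, sin = torch.cos(angles), torch.sin(angles)
        logits = (
            torch.einsum("big,oig->bo", cos, self.coeffs[0])
            + torch.einsum("big,oig->bo", sin, self.coeffs[1])
        )
        return logits + self.bias


def train_head(head, train_emb, train_y, val_emb, val_y, epochs=50, lr=1e-3):
    """Fine-tune only the head; the encoder that produced the embeddings stays frozen."""
    opt = torch.optim.Adam(head.parameters(), lr=lr)
    best_state, best_val = None, float("inf")
    for _ in range(epochs):
        head.train()
        opt.zero_grad()
        loss = F.cross_entropy(head(train_emb), train_y)
        loss.backward()
        opt.step()
        head.eval()
        with torch.no_grad():
            val_loss = F.cross_entropy(head(val_emb), val_y).item()
        if val_loss < best_val:            # crude early-stopping surrogate
            best_val = val_loss
            best_state = {k: v.clone() for k, v in head.state_dict().items()}
    head.load_state_dict(best_state)
    return head


# Illustrative usage with 300-dim fastText sentence averages and 5 news categories:
# head = FourierKANHead(in_dim=300, num_classes=5)
# head = train_head(head, train_emb, train_y, val_emb, val_y)
```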

Results & Findings

| Embedding    | Head           | Macro F1 | Params (M) | Inference (ms) |
|--------------|----------------|----------|------------|----------------|
| fastText     | EfficientKAN   | 0.928    | 0.12       | 1.8            |
| fastText     | FasterKAN      | 0.921    | 0.09       | 1.2            |
| fastText     | MLP (baseline) | 0.914    | 0.15       | 2.3            |
| mBERT        | EfficientKAN   | 0.917    | 0.14       | 3.1            |
| mBERT        | MLP            | 0.915    | 0.16       | 3.4            |
| mBERT        | FasterKAN      | 0.910    | 0.11       | 2.8            |
| TF‑IDF       | EfficientKAN   | 0.862    | 0.08       | 1.5            |
| TF‑IDF       | MLP            | 0.858    | 0.10       | 1.7            |
| Distil‑mBERT | FasterKAN      | 0.904    | 0.12       | 2.5            |
  • Expressiveness: KAN heads consistently match or surpass the MLP baseline, especially on fastText embeddings, where the non‑linear spline basis captures subtle lexical patterns.
  • Efficiency: FasterKAN reduces inference time by roughly 30 % compared to the MLP head while staying within 0.5 % F1 of the best model (a small measurement sketch follows these bullets).
  • Robustness to Embedding Choice: Even with simple TF‑IDF vectors, KANs improve performance, indicating that the head’s functional form matters as much as the encoder.
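
As a rough guide to how numbers like those in the table could be gathered, the sketch below computes macro‑averaged F1 with scikit‑learn and averages single‑sample forward‑pass times; it assumes the hypothetical `head`, `test_emb`, and `test_y` from the earlier methodology sketch and is not the authors' evaluation code.

```python
# Hedged sketch: macro-F1, parameter count, and mean per-sample latency for one
# head/embedding pair.
import time

import torch
from sklearn.metrics import f1_score


def evaluate(head, test_emb: torch.Tensor, test_y: torch.Tensor):
    head.eval()
    with torch.no_grad():
        preds = head(test_emb).argmax(dim=-1)
    macro_f1 = f1_score(test_y.cpu().numpy(), preds.cpu().numpy(), average="macro")

    n_params = sum(p.numel() for p in head.parameters())

    # Per-sample latency: time single-example forward passes on CPU and average.
    # (On GPU you would also need torch.cuda.synchronize() around the timed region.)
    times = []
    with torch.no_grad():
        for x in test_emb[:200]:                 # a few hundred samples is enough here
            t0 = time.perf_counter()
            head(x.unsqueeze(0))
            times.append(time.perf_counter() - t0)
    latency_ms = 1000 * sum(times) / len(times)
    return macro_f1, n_params, latency_ms
```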

Practical Implications

  • Low‑Resource Deployment: Teams building classifiers for under‑represented languages can keep large multilingual encoders frozen (saving GPU memory) and swap in a lightweight KAN head for a noticeable boost.
  • Edge & Mobile Scenarios: FasterKAN’s low parameter count and fast inference make it suitable for on‑device news categorization, chat‑bot intent detection, or content moderation where bandwidth is limited.
  • Rapid Prototyping: Because only the head is trained, experiments finish in minutes on a single GPU, enabling quick A/B testing of new label sets or domain shifts.
  • Transferability: The same KAN‑head architecture can be dropped onto any frozen embedding (e.g., CLIP for images, wav2vec for audio), opening the door to cross‑modal low‑resource tasks; a short sketch of this reuse follows.
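
To make the transferability point concrete, the lines below instantiate the same hypothetical `FourierKANHead` class from the methodology sketch on top of other frozen feature extractors; the feature dimensions and class count are assumptions for illustration, not experiments from the paper.

```python
# Illustrative only: the head sees nothing but a fixed-size feature vector, so it can
# sit on top of any frozen encoder. Dimensions below are assumed, not from the paper.
image_head = FourierKANHead(in_dim=512, num_classes=10)   # e.g., frozen CLIP ViT-B/32 features
audio_head = FourierKANHead(in_dim=768, num_classes=10)   # e.g., frozen wav2vec 2.0 pooled features
# Both are trained exactly like the text head: only the head parameters receive gradients.
```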

Limitations & Future Work

  • Frozen Encoder Assumption – The study does not explore joint fine‑tuning of the transformer; gains might be larger (or smaller) when the encoder is also updated.
  • Scalability to Very Large Label Spaces – Experiments were limited to ~10 news categories; performance on hundreds of classes remains untested.
  • Interpretability – While KANs are mathematically grounded, visualizing the learned spline/Fourier basis for text remains an open research question.
  • Broader Language Coverage – The authors plan to evaluate KAN heads on other low‑resource languages (e.g., Khmer, Lao) and on multilingual multi‑task settings.

Authors

  • Thura Aung
  • Eaint Kay Khaing Kyaw
  • Ye Kyaw Thu
  • Thazin Myint Oo
  • Thepchai Supnithi

Paper Information

  • arXiv ID: 2511.21081v1
  • Categories: cs.CL, cs.AI, cs.LG
  • Published: November 26, 2025