[Paper] Exploring Fine-Tuning for Tabular Foundation Models
Source: arXiv - 2601.09654v1
Overview
The paper Exploring Fine‑Tuning for Tabular Foundation Models investigates whether the strong zero‑shot performance of tabular foundation models, large transformer models pre‑trained on structured (tabular) data, can be further improved by fine‑tuning. By systematically comparing zero‑shot inference, meta‑learning, full supervised fine‑tuning (SFT), and parameter‑efficient fine‑tuning (PEFT) across several public tabular benchmarks, the authors identify when fine‑tuning actually helps and when it does not.
Key Contributions
- First large‑scale empirical study of fine‑tuning strategies for Tabular Foundation Models (TFMs) on diverse benchmarks (TALENT, OpenML‑CC18, TabZilla).
- Comprehensive comparison of four training regimes: Zero‑Shot, Meta‑Learning, Full Supervised Fine‑Tuning (SFT), and Parameter‑Efficient Fine‑Tuning (PEFT).
- In‑depth analysis of how dataset characteristics (size, class imbalance, feature dimensionality) influence performance, calibration, and fairness after fine‑tuning.
- Practical guidelines for practitioners on when fine‑tuning is likely to yield gains and when it may hurt accuracy or model reliability.
- Open‑source evaluation framework (code and scripts) that can be reused for future TFM research.
Methodology
- Models & Pre‑training – The authors use two publicly released TFMs (a decoder‑only transformer and an encoder‑decoder variant) that were pre‑trained on massive heterogeneous tabular corpora.
- Benchmarks – Three representative suites:
  - TALENT (heterogeneous classification/regression tasks)
  - OpenML‑CC18 (a curated suite of 72 classification datasets with varying size and class imbalance)
  - TabZilla (large‑scale regression and classification tasks)
- Fine‑tuning strategies
  - Zero‑Shot – Prompt the model with a description of the task and let it predict directly, with no weight updates.
  - Meta‑Learning – Train a lightweight “adapter” across many tasks using a MAML‑style objective, then evaluate on unseen tasks.
  - Full Supervised Fine‑Tuning (SFT) – Back‑propagate through all model parameters on the target dataset.
  - Parameter‑Efficient Fine‑Tuning (PEFT) – Freeze the backbone and train only low‑rank adapters or LoRA modules (a minimal sketch of this setup follows the list).
- Evaluation metrics – Accuracy/F1 (classification), RMSE (regression), Expected Calibration Error (ECE) for confidence quality, and demographic parity / equalized odds for fairness.
- Statistical analysis – Paired bootstrap tests and regression analyses to link dataset factors (e.g., number of rows, class ratio, feature count) with observed gains or degradations.
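As a concrete illustration of the PEFT regime described above, the sketch below freezes a backbone and trains only low‑rank (LoRA‑style) factors. The LoRALinear class, the tiny MLP standing in for a pre‑trained TFM, and all dimensions are illustrative assumptions rather than the paper's actual models or code.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Wrap a frozen nn.Linear and add a trainable low-rank (LoRA-style) update."""
    def __init__(self, base: nn.Linear, rank: int = 4, alpha: float = 8.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False                      # freeze the pre-trained weights
        self.A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, rank))
        self.scale = alpha / rank

    def forward(self, x):
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)

# Stand-in "backbone": a tiny MLP playing the role of a pre-trained TFM.
backbone = nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, 2))

# Wrap every linear layer so that only the low-rank factors receive gradients.
for i, layer in enumerate(list(backbone)):
    if isinstance(layer, nn.Linear):
        backbone[i] = LoRALinear(layer, rank=4)

trainable = [p for p in backbone.parameters() if p.requires_grad]
print(f"trainable parameters: {sum(p.numel() for p in trainable)}")

# Standard supervised loop on the target table (synthetic features X, labels y).
X, y = torch.randn(256, 32), torch.randint(0, 2, (256,))
optimizer = torch.optim.AdamW(trainable, lr=1e-3)
for _ in range(20):
    optimizer.zero_grad()
    loss = nn.functional.cross_entropy(backbone(X), y)
    loss.backward()
    optimizer.step()
```

In a real TFM the same wrapping would typically target the attention and feed‑forward projections of the released checkpoint, leaving the bulk of the network untouched.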
Results & Findings
| Strategy | Typical Δ Accuracy vs. Zero‑Shot | Calibration (ECE) | Fairness impact |
|---|---|---|---|
| Meta‑Learning | +2–5 % on small‑to‑medium datasets (≤ 5 k rows) | Slightly improved | Neutral |
| PEFT | +1–3 % on high‑dimensional (> 200 features) or highly imbalanced data | Comparable to Zero‑Shot | Minor gains for under‑represented groups |
| Full SFT | −1 % to −4 % on most benchmarks; occasional +3 % on very large, balanced datasets | Often worsened (higher ECE) | Can amplify bias when data is skewed |
| Zero‑Shot | Baseline (often already near‑state‑of‑the‑art) | Best overall calibration | Serves as a stable fairness reference |
- Dataset size matters: Fine‑tuning yields consistent benefits only when the target dataset exceeds ~10 k rows and is relatively balanced.
- Feature dimensionality: PEFT shines on tasks with many columns because low‑rank adapters can capture cross‑feature interactions without over‑fitting.
- Calibration: Zero‑Shot and PEFT retain the model’s well‑calibrated confidence scores; full SFT frequently degrades them, making downstream risk‑aware decisions harder (a short ECE sketch follows this list).
- Fairness: Meta‑Learning and PEFT modestly improve parity metrics on imbalanced datasets, whereas SFT can worsen disparity.
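Because the calibration comparisons rest on Expected Calibration Error, here is a minimal sketch of the standard equal‑width‑binned ECE; the bin count and the synthetic, perfectly calibrated example are generic choices, not details taken from the paper.

```python
import numpy as np

def expected_calibration_error(probs, labels, n_bins=15):
    """Equal-width binned ECE: weighted mean |accuracy - confidence| over confidence bins."""
    confidences = probs.max(axis=1)                 # confidence of the predicted class
    predictions = probs.argmax(axis=1)              # predicted class per sample
    accuracies = (predictions == labels).astype(float)

    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        in_bin = (confidences > lo) & (confidences <= hi)
        if in_bin.any():
            gap = abs(accuracies[in_bin].mean() - confidences[in_bin].mean())
            ece += in_bin.mean() * gap              # weight by the fraction of samples in the bin
    return ece

# Toy check: labels drawn from the predicted probabilities give a near-zero ECE.
rng = np.random.default_rng(0)
probs = rng.dirichlet(alpha=[2.0, 2.0], size=1000)
labels = (rng.random(1000) < probs[:, 1]).astype(int)
print(round(expected_calibration_error(probs, labels), 3))
```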
Practical Implications
- Deployers can often skip fine‑tuning – If you have a pre‑trained TFM and a modestly sized tabular dataset, the zero‑shot prompt may already give you competitive results with less engineering overhead.
- When to fine‑tune –
  - Large, clean, balanced tables (≥ 10 k rows) – full SFT can edge out zero‑shot.
  - High‑dimensional or heavily imbalanced data – use PEFT adapters (e.g., LoRA) to gain a few percentage points without sacrificing calibration.
- Risk‑sensitive applications (credit scoring, medical triage) should prioritize calibration; the study suggests sticking with zero‑shot or PEFT rather than full SFT.
- Fairness‑first pipelines – Incorporating a meta‑learning stage or PEFT can mitigate bias amplification that sometimes occurs with naive fine‑tuning.
- Cost & latency – PEFT adds only a few thousand trainable parameters, so fine‑tuning can run on a single GPU in minutes, whereas full SFT may require multi‑GPU resources and longer training cycles (a rough parameter count is sketched below).
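To put the parameter‑count claim in perspective, here is a back‑of‑the‑envelope comparison for a single weight matrix, assuming an illustrative hidden size and LoRA rank (not figures from the paper):

```python
# Hypothetical projection matrix in a transformer block: hidden size d, LoRA rank r.
d, r = 512, 4

full_update = d * d          # parameters updated by full SFT for this one matrix
lora_update = 2 * d * r      # LoRA factors A (r x d) and B (d x r)

print(f"full SFT parameters per matrix: {full_update:,}")                # 262,144
print(f"LoRA parameters per matrix:     {lora_update:,}")                # 4,096
print(f"reduction factor:               {full_update // lora_update}x")  # 64x
```

Even multiplied across the handful of projections that actually receive adapters, the trainable budget stays tiny relative to the backbone, which is why a single GPU is typically enough.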
Limitations & Future Work
- The analysis is limited to two TFM architectures; results may differ for newer, larger models or those trained on domain‑specific corpora.
- Only three benchmark suites were examined; real‑world enterprise datasets with extreme sparsity or mixed data types (e.g., time‑series, text) remain unexplored.
- The study focuses on supervised fine‑tuning; semi‑supervised or self‑training approaches could further close the gap for low‑label regimes.
- Future research directions include: extending PEFT to multi‑task adapters, investigating continual‑learning scenarios, and developing automated tools that recommend the optimal fine‑tuning strategy based on dataset diagnostics.
Authors
- Aditya Tanna
- Pratinav Seth
- Mohamed Bouadi
- Vinay Kumar Sankarapu
Paper Information
- arXiv ID: 2601.09654v1
- Categories: cs.LG
- Published: January 14, 2026