[Paper] Exploring Fine-Tuning for Tabular Foundation Models
Source: arXiv - 2601.09654v1
Overview
The paper Exploring Fine‑Tuning for Tabular Foundation Models investigates whether the strong zero‑shot performance of tabular foundation models, large transformer models pre‑trained on structured (tabular) data, can be further improved by fine‑tuning. By systematically comparing zero‑shot inference, meta‑learning, full supervised fine‑tuning (SFT), and parameter‑efficient fine‑tuning (PEFT) across several public tabular benchmarks, the authors identify when fine‑tuning actually helps and when it does not.
Key Contributions
- First large‑scale empirical study of fine‑tuning strategies for Tabular Foundation Models (TFMs) on diverse benchmarks (TALENT, OpenML‑CC18, TabZilla).
- Comprehensive comparison of four training regimes: Zero‑Shot, Meta‑Learning, Full Supervised Fine‑Tuning (SFT), and Parameter‑Efficient Fine‑Tuning (PEFT).
- In‑depth analysis of how dataset characteristics (size, class imbalance, feature dimensionality) influence performance, calibration, and fairness after fine‑tuning.
- Practical guidelines for practitioners on when fine‑tuning is likely to yield gains and when it may hurt accuracy or model reliability.
- Open‑source evaluation framework (code and scripts) that can be reused for future TFM research.
Methodology
- Models & Pre‑training – The authors use two publicly released TFMs (a decoder‑only transformer and an encoder‑decoder variant) that were pre‑trained on massive heterogeneous tabular corpora.
- Benchmarks – Three representative suites:
  - TALENT (heterogeneous classification/regression tasks)
  - OpenML‑CC18 (a curated suite of 72 classification datasets with varying size and class imbalance)
  - TabZilla (large‑scale regression and classification tasks)
- Fine‑tuning strategies
  - Zero‑Shot – Prompt the model with a description of the task and let it predict directly, with no weight updates.
  - Meta‑Learning – Train a lightweight “adapter” across many tasks using a MAML‑style objective, then evaluate on unseen tasks.
  - Full Supervised Fine‑Tuning (SFT) – Back‑propagate through all model parameters on the target dataset.
  - Parameter‑Efficient Fine‑Tuning (PEFT) – Freeze the backbone and train only low‑rank adapters or LoRA modules (a minimal sketch of this setup follows the list).
- Evaluation metrics – Accuracy/F1 (classification), RMSE (regression), Expected Calibration Error (ECE) for confidence quality, and demographic parity / equalized odds for fairness.
- Statistical analysis – Paired bootstrap tests and regression analyses to link dataset factors (e.g., number of rows, class ratio, feature count) with observed gains or degradations.
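As a concrete illustration of the PEFT regime described above, the sketch below freezes a backbone and trains only low‑rank (LoRA‑style) factors. The LoRALinear class, the tiny MLP standing in for a pre‑trained TFM, and all dimensions are illustrative assumptions rather than the paper's actual models or code.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Wrap a frozen nn.Linear and add a trainable low-rank (LoRA-style) update."""
    def __init__(self, base: nn.Linear, rank: int = 4, alpha: float = 8.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False                      # freeze the pre-trained weights
        self.A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, rank))
        self.scale = alpha / rank

    def forward(self, x):
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)

# Stand-in "backbone": a tiny MLP playing the role of a pre-trained TFM.
backbone = nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, 2))

# Wrap every linear layer so that only the low-rank factors receive gradients.
for i, layer in enumerate(list(backbone)):
    if isinstance(layer, nn.Linear):
        backbone[i] = LoRALinear(layer, rank=4)

trainable = [p for p in backbone.parameters() if p.requires_grad]
print(f"trainable parameters: {sum(p.numel() for p in trainable)}")

# Standard supervised loop on the target table (synthetic features X, labels y).
X, y = torch.randn(256, 32), torch.randint(0, 2, (256,))
optimizer = torch.optim.AdamW(trainable, lr=1e-3)
for _ in range(20):
    optimizer.zero_grad()
    loss = nn.functional.cross_entropy(backbone(X), y)
    loss.backward()
    optimizer.step()
```

In a real TFM the same wrapping would typically target the attention and feed‑forward projections of the released checkpoint, leaving the bulk of the network untouched.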
Results & Findings
| Strategy | Typical Δ Accuracy vs. Zero‑Shot | Calibration (ECE) | Fairness impact |
|---|---|---|---|
| Meta‑Learning | +2–5 % on small‑to‑medium datasets (≤ 5 k rows) | Slightly improved | Neutral |
| PEFT | +1–3 % on high‑dimensional (> 200 features) or highly imbalanced data | Comparable to Zero‑Shot | Minor gains for under‑represented groups |
| Full SFT | −1 % to −4 % on most benchmarks; occasional +3 % on very large, balanced datasets | Often worsened (higher ECE) | Can amplify bias when data is skewed |
| Zero‑Shot | Baseline (often already near‑state‑of‑the‑art) | Best overall calibration | Serves as a stable fairness reference |
- Dataset size matters: Fine‑tuning yields consistent benefits only when the target dataset exceeds ~10 k rows and is relatively balanced.
- Feature dimensionality: PEFT shines on tasks with many columns because low‑rank adapters can capture cross‑feature interactions without over‑fitting.
- Calibration: Zero‑Shot and PEFT retain the model’s well‑calibrated confidence scores; full SFT frequently degrades them, making downstream risk‑aware decisions harder (a short ECE sketch follows this list).
- Fairness: Meta‑Learning and PEFT modestly improve parity metrics on imbalanced datasets, whereas SFT can worsen disparity.
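Because the calibration comparisons rest on Expected Calibration Error, here is a minimal sketch of the standard equal‑width‑binned ECE; the bin count and the synthetic, perfectly calibrated example are generic choices, not details taken from the paper.

```python
import numpy as np

def expected_calibration_error(probs, labels, n_bins=15):
    """Equal-width binned ECE: weighted mean |accuracy - confidence| over confidence bins."""
    confidences = probs.max(axis=1)                 # confidence of the predicted class
    predictions = probs.argmax(axis=1)              # predicted class per sample
    accuracies = (predictions == labels).astype(float)

    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        in_bin = (confidences > lo) & (confidences <= hi)
        if in_bin.any():
            gap = abs(accuracies[in_bin].mean() - confidences[in_bin].mean())
            ece += in_bin.mean() * gap              # weight by the fraction of samples in the bin
    return ece

# Toy check: labels drawn from the predicted probabilities give a near-zero ECE.
rng = np.random.default_rng(0)
probs = rng.dirichlet(alpha=[2.0, 2.0], size=1000)
labels = (rng.random(1000) < probs[:, 1]).astype(int)
print(round(expected_calibration_error(probs, labels), 3))
```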
Practical Implications
- Deployers can often skip fine‑tuning – If you have a pre‑trained TFM and a modestly sized tabular dataset, the zero‑shot prompt may already give you competitive results with less engineering overhead.
- When to fine‑tune –
  - Large, clean, balanced tables (≥ 10 k rows) – full SFT can edge out zero‑shot.
  - High‑dimensional or heavily imbalanced data – use PEFT adapters (e.g., LoRA) to gain a few percentage points without sacrificing calibration.
- Risk‑sensitive applications (credit scoring, medical triage) should prioritize calibration; the study suggests sticking with zero‑shot or PEFT rather than full SFT.
- Fairness‑first pipelines – Incorporating a meta‑learning stage or PEFT can mitigate bias amplification that sometimes occurs with naive fine‑tuning.
- Cost & latency – PEFT adds only a few thousand trainable parameters, so fine‑tuning can run on a single GPU in minutes, whereas full SFT may require multi‑GPU resources and longer training cycles (a rough parameter count is sketched below).
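To put the parameter‑count claim in perspective, here is a back‑of‑the‑envelope comparison for a single weight matrix, assuming an illustrative hidden size and LoRA rank (not figures from the paper):

```python
# Hypothetical projection matrix in a transformer block: hidden size d, LoRA rank r.
d, r = 512, 4

full_update = d * d          # parameters updated by full SFT for this one matrix
lora_update = 2 * d * r      # LoRA factors A (r x d) and B (d x r)

print(f"full SFT parameters per matrix: {full_update:,}")                # 262,144
print(f"LoRA parameters per matrix:     {lora_update:,}")                # 4,096
print(f"reduction factor:               {full_update // lora_update}x")  # 64x
```

Even multiplied across the handful of projections that actually receive adapters, the trainable budget stays tiny relative to the backbone, which is why a single GPU is typically enough.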
Limitations & Future Work
- The analysis is limited to two TFM architectures; results may differ for newer, larger models or those trained on domain‑specific corpora.
- Only three benchmark suites were examined; real‑world enterprise datasets with extreme sparsity or mixed data types (e.g., time‑series, text) remain unexplored.
- The study focuses on supervised fine‑tuning; semi‑supervised or self‑training approaches could further close the gap for low‑label regimes.
- Future research directions include: extending PEFT to multi‑task adapters, investigating continual‑learning scenarios, and developing automated tools that recommend the optimal fine‑tuning strategy based on dataset diagnostics.
Authors
- Aditya Tanna
- Pratinav Seth
- Mohamed Bouadi
- Vinay Kumar Sankarapu
Paper Information
- arXiv ID: 2601.09654v1
- Categories: cs.LG
- Published: January 14, 2026