[Paper] TabICLv2: A better, faster, scalable, and open tabular foundation model

Published: February 11, 2026 at 01:51 PM EST
5 min read

Source: arXiv - 2602.11139v1

Overview

TabICLv2 is the latest “foundation model” for tabular data, pushing the limits of what large‑scale, pre‑trained models can do on spreadsheets, CSVs, and relational tables. By combining a richer synthetic data generator, smarter architecture tweaks, and a new optimizer, the authors show that a single model can beat heavily‑tuned ensembles on both regression and classification tasks—while staying fast enough to run on a single GPU with < 50 GB memory.

Key Contributions

  • Diverse synthetic pre‑training engine – automatically creates millions of varied tabular datasets (different column types, missingness patterns, feature interactions) to expose the model to a broad “world” of tables.
  • Scalable softmax‑in‑attention – a novel attention formulation that keeps the computational cost low for long feature sequences, enabling the model to handle millions of rows without exploding memory.
  • Muon optimizer – replaces the standard AdamW during pre‑training, delivering faster convergence and better generalisation on downstream tabular tasks.
  • State‑of‑the‑art performance – on the TabArena and TALENT benchmarks, TabICLv2 outperforms RealTabPFN‑2.5 even though the latter uses hyper‑parameter tuning, ensembling, and fine‑tuning on real data.
  • Open‑source release – inference code and pretrained weights are publicly available, with the synthetic data engine and training scripts promised soon.
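The Muon optimizer mentioned above replaces AdamW's per‑coordinate gradient rescaling with an orthogonalized momentum update for weight matrices. A minimal sketch of that core idea, using a cubic Newton–Schulz iteration to approximately orthogonalize the momentum buffer (the coefficients, step count, and hyper‑parameters below are illustrative assumptions, not the paper's exact recipe):

```python
import numpy as np

def newton_schulz_orthogonalize(M, steps=15):
    """Approximate the nearest (semi-)orthogonal matrix to M via the
    cubic Newton-Schulz iteration: X <- 1.5 X - 0.5 X X^T X."""
    X = M / (np.linalg.norm(M) + 1e-8)  # Frobenius norm <= 1 ensures convergence
    for _ in range(steps):
        X = 1.5 * X - 0.5 * X @ X.T @ X
    return X

def muon_step(W, grad, momentum, lr=0.02, beta=0.95):
    """One Muon-style update: accumulate momentum, then step along its
    orthogonalized direction instead of the raw momentum itself."""
    momentum = beta * momentum + grad
    W = W - lr * newton_schulz_orthogonalize(momentum)
    return W, momentum
```

Because the orthogonalized direction has roughly uniform spectral scale across singular directions, each update moves all parts of a weight matrix at a comparable rate, which is the property usually credited for Muon's faster convergence.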

Methodology

1. Synthetic Data Generation

  • The authors built a pipeline that samples random schemas (numeric, categorical, datetime, text), injects realistic noise (missing values, outliers), and creates target variables using a mix of linear, tree‑based, and neural functions.
  • This yields a high‑diversity pre‑training corpus that mimics the heterogeneity seen in real‑world tables, reducing the need for massive labeled datasets.
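A toy version of such a pipeline can be sketched as follows; the column counts, missingness rate, and target function here are illustrative assumptions, not the paper's actual generator:

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_table(n_rows=200, n_num=4, n_cat=2, missing_rate=0.1):
    """Sample one synthetic table: numeric + categorical features,
    a noisy nonlinear target, and randomly injected missing values."""
    X_num = rng.normal(size=(n_rows, n_num))
    X_cat = rng.integers(0, 5, size=(n_rows, n_cat)).astype(float)
    X = np.concatenate([X_num, X_cat], axis=1)

    # Target: a random linear mix passed through a nonlinearity, plus noise.
    w = rng.normal(size=X.shape[1])
    y = np.tanh(X @ w) + 0.1 * rng.normal(size=n_rows)

    # Inject missingness completely at random.
    mask = rng.random(X.shape) < missing_rate
    X[mask] = np.nan
    return X, y
```

Pre‑training then amounts to drawing a fresh table like this per task and asking the model to predict held‑out `y` values from in‑context examples.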

2. Model Architecture

  • TabICLv2 is a transformer‑style encoder that treats each column as a token and each row as a “sequence”.
  • The scalable softmax‑in‑attention computes attention over rows in a chunked fashion, avoiding the quadratic blow‑up of classic self‑attention while preserving the ability to capture long‑range dependencies across rows.
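The chunked idea can be illustrated with a streaming ("online") softmax over key/value chunks, which keeps peak memory proportional to the chunk size while producing exactly the same output as full attention. This is a generic sketch of the technique, not the paper's specific formulation:

```python
import numpy as np

def full_attention(Q, K, V):
    """Reference implementation: materializes the full score matrix."""
    S = Q @ K.T / np.sqrt(Q.shape[1])
    P = np.exp(S - S.max(axis=1, keepdims=True))
    return (P / P.sum(axis=1, keepdims=True)) @ V

def chunked_attention(Q, K, V, chunk=64):
    """Process keys/values in chunks, maintaining a running softmax:
    row-wise max `m`, normalizer `l`, and weighted accumulator `acc`."""
    n_q, d = Q.shape
    m = np.full(n_q, -np.inf)          # running max of scores seen so far
    l = np.zeros(n_q)                  # running softmax normalizer
    acc = np.zeros((n_q, V.shape[1]))  # running weighted sum of values
    for s in range(0, K.shape[0], chunk):
        S = Q @ K[s:s + chunk].T / np.sqrt(d)
        m_new = np.maximum(m, S.max(axis=1))
        scale = np.exp(m - m_new)          # rescale previous accumulators
        P = np.exp(S - m_new[:, None])
        l = l * scale + P.sum(axis=1)
        acc = acc * scale[:, None] + P @ V[s:s + chunk]
        m = m_new
    return acc / l[:, None]
```

The per‑step score matrix is only `n_q × chunk`, so memory stays bounded no matter how many rows the table has, while the running max/normalizer guarantee the result matches the unchunked softmax.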

3. Training Protocol

  • Pre‑training runs for a modest number of steps (relative to earlier TabPFN models) using the Muon optimizer, which orthogonalizes the momentum update for weight matrices instead of rescaling gradients per‑coordinate as AdamW does.
  • No task‑specific fine‑tuning is performed; the model is evaluated directly via in‑context learning: a few example rows + a query row are fed to the model, which predicts the target.
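In‑context learning means the labeled examples are supplied at inference time and predictions come out of a single forward pass, with no gradient updates. A toy stand‑in for the model (a distance‑weighted vote over the context rows, not the actual transformer) makes that interface concrete:

```python
import numpy as np

def icl_predict(X_context, y_context, X_query, temperature=1.0):
    """Predict query labels from context rows in one 'forward pass':
    softmax over negative squared distances, then a weighted class vote."""
    # Pairwise squared distances, shape (n_query, n_context)
    d2 = ((X_query[:, None, :] - X_context[None, :, :]) ** 2).sum(-1)
    w = np.exp(-d2 / temperature)
    w /= w.sum(axis=1, keepdims=True)
    classes = np.unique(y_context)
    scores = np.stack([w[:, y_context == c].sum(axis=1) for c in classes], axis=1)
    return classes[scores.argmax(axis=1)]
```

Swapping this toy function for the pretrained transformer is the whole deployment story: the same `(context rows, query rows) -> predictions` call, with no fitting step.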

4. Evaluation

  • Benchmarks: TabArena (a collection of 100+ public tabular datasets) and TALENT (large‑scale, million‑row tables).
  • Metrics: standard regression (RMSE, R²) and classification (accuracy, F1) scores, plus inference latency and GPU memory footprint.
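The reported metrics are standard and easy to restate; a minimal sketch of the four scores (binary F1 shown for simplicity):

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error (lower is better)."""
    return np.sqrt(np.mean((y_true - y_pred) ** 2))

def r2(y_true, y_pred):
    """Coefficient of determination: 1 - residual SS / total SS."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)
    return 1.0 - ss_res / ss_tot

def accuracy(y_true, y_pred):
    """Fraction of exactly correct predictions."""
    return np.mean(y_true == y_pred)

def f1_binary(y_true, y_pred):
    """F1 for the positive class: harmonic mean of precision and recall."""
    tp = np.sum((y_pred == 1) & (y_true == 1))
    fp = np.sum((y_pred == 1) & (y_true == 0))
    fn = np.sum((y_pred == 0) & (y_true == 1))
    return 2 * tp / (2 * tp + fp + fn)
```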

Results & Findings

| Benchmark | Metric | TabICLv2 | RealTabPFN‑2.5 (tuned) |
| --- | --- | --- | --- |
| TabArena (avg.) | Accuracy / R² | +3.2 % over baseline | — |
| TALENT (million‑row) | Inference time | 0.42 s per 10k rows | 1.18 s |
| GPU memory | Peak usage | ≈ 45 GB | ≈ 70 GB |
  • No hyper‑parameter tuning: TabICLv2’s out‑of‑the‑box performance beats the tuned RealTabPFN‑2.5, demonstrating the strength of the synthetic pre‑training diversity.
  • Scalability: The new attention mechanism lets the model ingest tables with > 1 M rows on a single GPU, a regime where previous tabular foundation models either crashed or required multi‑GPU setups.
  • Ablation studies confirm that each pillar (synthetic engine, attention tweak, Muon optimizer) contributes a measurable boost (≈ 1–2 % each) to the final score.

Practical Implications

  • Rapid prototyping: Data scientists can drop TabICLv2 into a notebook, feed a handful of labeled rows, and obtain high‑quality predictions without spending time on feature engineering or model selection.
  • Edge‑friendly deployment: Because inference fits within 50 GB GPU memory and runs in sub‑second latency, the model can be served in SaaS platforms, internal ML APIs, or even on high‑end consumer GPUs.
  • Cost‑effective scaling: Companies dealing with massive logs, IoT telemetry, or click‑stream data can now apply a single, pre‑trained model instead of training separate gradient‑boosted trees for each dataset.
  • Open‑source ecosystem: With the code and weights released, the community can extend the synthetic generator to domain‑specific schemas (e.g., finance, healthcare) and fine‑tune TabICLv2 for niche regulatory constraints.

Limitations & Future Work

  • Synthetic‑real gap: Although the synthetic engine is diverse, certain domain‑specific quirks (e.g., time‑series autocorrelation, hierarchical categorical encodings) may still be under‑represented, potentially limiting performance on highly specialized tables.
  • Interpretability: Like most transformer‑based models, TabICLv2 offers limited insight into feature importance compared with classic tree models; integrating post‑hoc explainability tools will be essential for regulated industries.
  • Training compute: While inference is cheap, the pre‑training phase still requires several GPU‑days; future work could explore further optimizer or curriculum‑learning tricks to reduce this cost.
  • Extension to multimodal tables: The current design assumes homogeneous column types; extending the architecture to handle embedded images, free‑form text, or graph‑structured columns is an open research direction.

Bottom line: TabICLv2 demonstrates that a well‑designed synthetic pre‑training pipeline, paired with clever architectural tweaks, can deliver a “plug‑and‑play” tabular model that rivals heavily‑engineered baselines—opening the door for faster, more scalable data science workflows across the industry.

Authors

  • Jingang Qu
  • David Holzmüller
  • Gaël Varoquaux
  • Marine Le Morvan

Paper Information

  • arXiv ID: 2602.11139v1
  • Categories: cs.LG
  • Published: February 11, 2026