[Paper] TabICLv2: A better, faster, scalable, and open tabular foundation model
Source: arXiv - 2602.11139v1
Overview
TabICLv2 is the latest “foundation model” for tabular data, pushing the limits of what large‑scale, pre‑trained models can do on spreadsheets, CSVs, and relational tables. By combining a richer synthetic data generator, smarter architecture tweaks, and a new optimizer, the authors show that a single model can beat heavily‑tuned ensembles on both regression and classification tasks—while staying fast enough to run on a single GPU with < 50 GB memory.
Key Contributions
- Diverse synthetic pre‑training engine – automatically creates millions of varied tabular datasets (different column types, missingness patterns, feature interactions) to expose the model to a broad “world” of tables.
- Scalable softmax‑in‑attention – a novel attention formulation that keeps the computational cost low for long feature sequences, enabling the model to handle millions of rows without exploding memory.
- Muon optimizer – replaces the standard AdamW during pre‑training, delivering faster convergence and better generalisation on downstream tabular tasks.
- State‑of‑the‑art performance – on the TabArena and TALENT benchmarks, TabICLv2 outperforms RealTabPFN‑2.5 even though the latter uses hyper‑parameter tuning, ensembling, and fine‑tuning on real data.
- Open‑source release – inference code and pretrained weights are publicly available, with the synthetic data engine and training scripts promised soon.
Methodology
1. Synthetic Data Generation
- The authors built a pipeline that samples random schemas (numeric, categorical, datetime, text), injects realistic noise (missing values, outliers), and creates target variables using a mix of linear, tree‑based, and neural functions.
- This yields a high‑diversity pre‑training corpus that mimics the heterogeneity seen in real‑world tables, reducing the need for massive labeled datasets.
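The generation recipe described above can be sketched as a toy function (a minimal illustration only; the function name, parameters, and distributions here are invented for the example and are not the authors' actual engine):

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_synthetic_table(n_rows=256, n_num=4, n_cat=2, missing_rate=0.1):
    """Sample one toy synthetic dataset: numeric + categorical features,
    a nonlinear target, then injected outliers and missing values."""
    X_num = rng.normal(size=(n_rows, n_num))
    X_cat = rng.integers(0, 5, size=(n_rows, n_cat)).astype(float)

    # Target: random linear mix passed through a nonlinearity, plus noise
    # (stand-in for the paper's mix of linear/tree/neural target functions)
    w = rng.normal(size=n_num + n_cat)
    z = np.concatenate([X_num, X_cat], axis=1) @ w
    y = np.tanh(z) + 0.1 * rng.normal(size=n_rows)

    # Inject outliers into a few numeric cells, then random missingness
    out_idx = rng.integers(0, n_rows, size=max(1, n_rows // 50))
    X_num[out_idx, 0] *= 10.0
    mask = rng.random(X_num.shape) < missing_rate
    X_num[mask] = np.nan

    return np.concatenate([X_num, X_cat], axis=1), y
```

A pre-training loop would draw thousands of such tables, varying the schema and target function family each time.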
2. Model Architecture
- TabICLv2 is a transformer‑style encoder that treats each column as a token and each row as a “sequence”.
- The scalable softmax‑in‑attention computes attention over rows in a chunked fashion, avoiding the quadratic blow‑up of classic self‑attention while preserving the ability to capture long‑range dependencies across rows.
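The memory-saving idea can be illustrated with a streaming ("online") softmax that processes keys and values one chunk at a time, so the full n-by-n score matrix is never materialized. This is a generic memory-efficient attention sketch in the spirit of the paper's mechanism, not its exact formulation:

```python
import numpy as np

def chunked_attention(Q, K, V, chunk=128):
    """Softmax attention computed one key/value chunk at a time,
    keeping a running max and normalizer per query row."""
    n, d = Q.shape
    out = np.zeros((n, V.shape[1]))
    running_max = np.full(n, -np.inf)
    running_sum = np.zeros(n)
    for start in range(0, K.shape[0], chunk):
        Kc, Vc = K[start:start + chunk], V[start:start + chunk]
        scores = Q @ Kc.T / np.sqrt(d)                     # (n, chunk)
        new_max = np.maximum(running_max, scores.max(axis=1))
        correction = np.exp(running_max - new_max)          # rescale old state
        p = np.exp(scores - new_max[:, None])
        out = out * correction[:, None] + p @ Vc
        running_sum = running_sum * correction + p.sum(axis=1)
        running_max = new_max
    return out / running_sum[:, None]
```

Peak memory grows with the chunk size rather than with the number of rows, which is what makes million-row tables feasible on one GPU.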
3. Training Protocol
- Pre‑training runs for a modest number of steps (relative to earlier TabPFN models) using the Muon optimizer, which orthogonalizes the momentum‑based updates of weight matrices (via Newton–Schulz iterations) rather than relying on AdamW's per‑parameter adaptive learning rates.
- No task‑specific fine‑tuning is performed; the model is evaluated directly via in‑context learning: a few example rows + a query row are fed to the model, which predicts the target.
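The in-context protocol (labeled rows condition a single forward pass; no weight updates) can be mimicked with a deliberately simple similarity-weighted stand-in. The `icl_predict` function below is invented for illustration and only mirrors the interface, not the model's actual transformer predictor:

```python
import numpy as np

def icl_predict(context_X, context_y, query_X, temperature=1.0):
    """Stand-in for in-context prediction: each query row is predicted
    directly from the labeled context rows in one pass, with no training.
    A real TabICL-style transformer replaces this weighted average."""
    # Squared distances between every query row and every context row
    d2 = ((query_X[:, None, :] - context_X[None, :, :]) ** 2).sum(-1)
    w = np.exp(-d2 / temperature)
    w /= w.sum(axis=1, keepdims=True)
    return w @ context_y
```

The key property shared with the real model: swapping in a different context (different labeled rows) changes the predictions instantly, with no retraining step.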
4. Evaluation
- Benchmarks: TabArena (a collection of 100+ public tabular datasets) and TALENT (large‑scale, million‑row tables).
- Metrics: standard regression (RMSE, R²) and classification (accuracy, F1) scores, plus inference latency and GPU memory footprint.
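For reference, the two regression metrics can be computed directly from their standard definitions (this is generic code, not taken from the paper's evaluation harness):

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error."""
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return float(1.0 - ss_res / ss_tot)
```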
Results & Findings
| Benchmark / resource | Metric | TabICLv2 | RealTabPFN‑2.5 (tuned) |
|---|---|---|---|
| TabArena (avg.) | Accuracy / R² (higher is better) | +3.2 % over baseline | – |
| TALENT (million‑row) | Inference time per 10 k rows (lower is better) | 0.42 s | 1.18 s |
| GPU memory | Peak usage (lower is better) | ≈ 45 GB | ≈ 70 GB |
- No hyper‑parameter tuning: TabICLv2’s out‑of‑the‑box performance beats the tuned RealTabPFN‑2.5, demonstrating the strength of the synthetic pre‑training diversity.
- Scalability: The new attention mechanism lets the model ingest tables with > 1 M rows on a single GPU, a regime where previous tabular foundation models either crashed or required multi‑GPU setups.
- Ablation studies confirm that each pillar (synthetic engine, attention tweak, Muon optimizer) contributes a measurable boost (≈ 1–2 % each) to the final score.
Practical Implications
- Rapid prototyping: Data scientists can drop TabICLv2 into a notebook, feed a handful of labeled rows, and obtain high‑quality predictions without spending time on feature engineering or model selection.
- Single‑GPU deployment: Because inference fits within 50 GB of GPU memory and completes with sub‑second latency, the model can be served from SaaS platforms, internal ML APIs, or (for smaller tables) high‑end workstation GPUs.
- Cost‑effective scaling: Companies dealing with massive logs, IoT telemetry, or click‑stream data can now apply a single, pre‑trained model instead of training separate gradient‑boosted trees for each dataset.
- Open‑source ecosystem: With the code and weights released, the community can extend the synthetic generator to domain‑specific schemas (e.g., finance, healthcare) and fine‑tune TabICLv2 for niche regulatory constraints.
Limitations & Future Work
- Synthetic‑real gap: Although the synthetic engine is diverse, certain domain‑specific quirks (e.g., time‑series autocorrelation, hierarchical categorical encodings) may still be under‑represented, potentially limiting performance on highly specialized tables.
- Interpretability: Like most transformer‑based models, TabICLv2 offers limited insight into feature importance compared with classic tree models; integrating post‑hoc explainability tools will be essential for regulated industries.
- Training compute: While inference is cheap, the pre‑training phase still requires several GPU‑days; future work could explore further optimizer or curriculum‑learning tricks to reduce this cost.
- Extension to multimodal tables: The current design assumes homogeneous column types; extending the architecture to handle embedded images, free‑form text, or graph‑structured columns is an open research direction.
Bottom line: TabICLv2 demonstrates that a well‑designed synthetic pre‑training pipeline, paired with clever architectural tweaks, can deliver a “plug‑and‑play” tabular model that rivals heavily‑engineered baselines—opening the door for faster, more scalable data science workflows across the industry.
Authors
- Jingang Qu
- David Holzmüller
- Gaël Varoquaux
- Marine Le Morvan
Paper Information
- arXiv ID: 2602.11139v1
- Categories: cs.LG
- Published: February 11, 2026