Why do tree-based models still outperform deep learning on tabular data?
Source: Dev.to
Introduction
Deep neural networks have revolutionized image and text processing, but when it comes to spreadsheet‑style tabular data, classic tree‑based methods often still come out on top.
Empirical Findings
A large benchmark covering many datasets showed that tree‑based models such as XGBoost and Random Forests consistently outperform deep learning models on medium‑sized tables (roughly 10,000 rows), even after extensive hyper‑parameter tuning of the neural networks. The gap persisted across a wide range of settings and robustness checks.
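To build intuition for why a tree can beat a smooth model on table‑like targets, here is a minimal, self‑contained sketch (not from the benchmark itself; the toy data and models are illustrative). A depth‑1 decision tree fits a sharp step exactly, while an ordinary least‑squares line, the kind of smooth building block a neural network is composed of, cannot:

```python
# Toy "irregular" tabular target: a sharp step at x = 0.5.
# A single tree split captures the discontinuity; a smooth linear fit cannot.
X = [i / 100 for i in range(100)]
y = [0.0 if x < 0.5 else 1.0 for x in X]

def mse(pred, truth):
    return sum((p - t) ** 2 for p, t in zip(pred, truth)) / len(truth)

def stump_predict(X, y):
    """Depth-1 'tree': try every threshold, predict the mean on each side."""
    best = None
    for t in X:
        left = [yi for xi, yi in zip(X, y) if xi < t]
        right = [yi for xi, yi in zip(X, y) if xi >= t]
        if not left or not right:
            continue
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        pred = [lm if xi < t else rm for xi in X]
        err = mse(pred, y)
        if best is None or err < best[0]:
            best = (err, pred)
    return best[1]

def linear_predict(X, y):
    """Ordinary least-squares line fit (closed form)."""
    n = len(X)
    mx, my = sum(X) / n, sum(y) / n
    b = sum((xi - mx) * (yi - my) for xi, yi in zip(X, y)) \
        / sum((xi - mx) ** 2 for xi in X)
    a = my - b * mx
    return [a + b * xi for xi in X]

print("stump MSE :", mse(stump_predict(X, y), y))   # exactly 0.0
print("linear MSE:", mse(linear_predict(X, y), y))  # clearly > 0
```

The stump reaches zero error with one split, while the best‑fitting line leaves substantial residual error; real gradient‑boosted ensembles stack thousands of such splits.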
Why Trees Perform Better
- Robustness to irrelevant features – trees can ignore useless columns without harming performance.
- Preservation of data shape – tree algorithms work directly with the original tabular structure, avoiding the need for extensive preprocessing.
- Ability to capture irregular patterns – decision trees can model heterogeneous feature interactions and sharp non‑linearities that standard feed‑forward networks, which are biased toward smoother functions, struggle to learn on tabular data.
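The first point, robustness to irrelevant features, falls directly out of how a tree chooses its splits. The sketch below (a simplified, CART‑style greedy split search written for illustration, not a production implementation) shows that when one column determines the label and another is pure noise, the impurity criterion selects the informative column, so the noise column simply never influences predictions:

```python
import random

random.seed(42)

# Two-column table: column 0 determines the label, column 1 is pure noise.
rows = [[random.random(), random.random()] for _ in range(200)]
labels = [1 if r[0] > 0.5 else 0 for r in rows]

def gini(groups):
    """Weighted Gini impurity of a candidate split (labels are 0/1)."""
    n = sum(len(g) for g in groups)
    score = 0.0
    for g in groups:
        if not g:
            continue
        p1 = sum(g) / len(g)
        score += (1 - p1 ** 2 - (1 - p1) ** 2) * len(g) / n
    return score

def best_split(rows, labels):
    """Greedy search over every column and threshold; the column whose
    split most reduces impurity wins, so noise columns are skipped."""
    best = (None, None, float("inf"))
    for col in range(len(rows[0])):
        for r in rows:
            t = r[col]
            left = [l for row, l in zip(rows, labels) if row[col] < t]
            right = [l for row, l in zip(rows, labels) if row[col] >= t]
            g = gini([left, right])
            if g < best[2]:
                best = (col, t, g)
    return best

col, threshold, impurity = best_split(rows, labels)
print(col)  # 0: the informative column wins; the noise column is ignored
```

Adding more noise columns would only enlarge the search, not change the chosen split, which is one reason trees need so little feature selection up front.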
Implications
These results highlight that deep learning is not a universal solution; specialized approaches are still needed for tabular problems. The authors released the full suite of experiments, raw results, and configuration details to enable reproducibility and further research.
Takeaway
When your dataset is organized in rows and columns, don’t automatically assume a deep neural network will be optimal—tree‑based models may still be the smarter choice.