Your Model Choice Doesn't Matter Nearly as Much as You Think...And That's Actually Good News

Published: 1 month ago (January 9, 2026 at 10:51 AM EST)

5 min read

Source: Dev.to

Introduction

I read about this study on Twitter and couldn’t stop thinking about it.

In 2009, neuroscientists put a dead Atlantic salmon in an fMRI scanner, showed it pictures of humans in social situations, and asked it to determine what emotion the people were feeling. The scanner detected brain activity, and the salmon appeared to be thinking.

Obviously, the fish wasn’t thinking—the “activity” was random noise. The point is that without proper statistical controls, your tools will find patterns where none exist.

Null Models in LLM Benchmarks

This problem is happening in machine learning right now. We celebrate model improvements that vanish when we add proper baselines. It’s the same as finding brain activity in a dead fish, except now we call it architectural innovation.

Researchers submitted null models to LLM benchmarks. These models output constant responses regardless of input; they don’t read the question, they just generate formatted text that looks good.
These null models achieved 80‑90 % win rates on AlpacaEval.

“A model that completely ignores your input can hit 90 %. That’s not measuring intelligence; that’s measuring how well you format markdown.”

The paper “Cheating Automatic LLM Benchmarks: Null Models Achieve High Win Rates” (arXiv: 2410.07137) should terrify anyone making decisions based on leaderboard positions.

Shortcut Learning in Vision

The issue isn’t isolated. The paper “Shortcut Learning in Deep Neural Networks” (arXiv: 2004.07780) shows ImageNet models learn texture instead of shape. Show them an elephant with cat texture, and they confidently say “cat.” They learned the wrong thing entirely, but the benchmark never caught it.

Simple Baselines Beat Complex Methods

There’s a whole genre of papers with “An Embarrassingly Simple Approach” in the title. They keep beating state‑of‑the‑art by just not doing the complex thing.

Task	What “simple” did	Result
Zero‑shot learning	Linear regression beats fancy meta‑learning architectures	New records
One‑shot learning	Prune irrelevant features from a pretrained model	Beats all complex meta‑learning networks on miniImageNet & tieredImageNet
Imbalanced semi‑supervised learning	Basic resampling	12‑16 % improvement over complex balancing techniques

The pattern is clear: these papers didn’t discover new techniques; they simply implemented the baseline that everyone else skipped.

Tabular Data: Deep Learning Isn’t Always Best

The most damning evidence comes from tabular data.

“Tabular Data: Deep Learning Is Not All You Need” (arXiv: 2106.03253) compared fancy deep‑learning models against XGBoost, a 2016 algorithm most practitioners already know.
XGBoost won on most datasets, trained significantly faster, and only the deep models that originated each dataset performed best on their own “home turf.”

When the researchers tested models from four recent papers across eleven new datasets, every “novel architecture” dominated only its original dataset and failed everywhere else.

“That’s not innovation. That’s p‑hacking with neural nets.”

When Deep Learning Does Help

Deep learning can pull ahead on tabular data in specific cases:

Massive datasets (≥ 1 million rows) where manual feature engineering is infeasible.
Situations requiring learning of complex feature interactions automatically.

But those scenarios are rarer than the hype suggests. For most tabular problems, XGBoost with good features beats any deep model with bad ones.

Andrew Ng: “Improving data quality often beats developing a better model architecture.”

Microsoft’s Phi models demonstrated this: a tiny model trained on high‑quality synthetic textbooks outperformed massive models trained on noisy web scrapes—not because of architecture, but because of data.

The Bigger Picture

The pattern holds everywhere:

XGBoost + good features beats any deep model with bad features.
Good prompts on GPT‑3.5 beat bad prompts on GPT‑4.
Clean data beats novel architecture.

Why do we ignore this? Because “we cleaned our data better” doesn’t win Best Paper awards, while “novel attention mechanism with architectural innovations” does.

“Troubling Trends in Machine Learning Scholarship” (arXiv: 1807.03341) documents the problem:

Papers claim architectural innovations that are really just better hyper‑parameter tuning.
Authors compare tuned models to untuned baselines and declare victory.
They cherry‑pick datasets where their approach works.
They skip simple baselines that would expose the weakness.
They use math to make trivial ideas sound profound.

We’re drowning in “innovations” that don’t replicate outside their original paper.

Practical Takeaways for AI Practitioners

Start with a strong baseline before reaching for the latest transformer.
- Tabular data → XGBoost.
- Text classification → TF‑IDF + Logistic Regression.
- Code search → Cosine similarity.
Own what you control. Model choice is temporary; data pipelines and evaluation frameworks last.
Invest in data quality.
- Clean, consistent labeling.
- Remove duplicates.
- Fix class imbalance.
- Add proper null handling.
Master prompt engineering. It’s model‑agnostic and transfers across Claude, GPT, Gemini, etc.
- Break problems into steps.
- Provide clear examples.
- Use structured outputs.
- Iterate based on failures.
Add proper controls. The dead‑salmon study taught neuroscientists to test null hypotheses—do the same for your ML experiments.

Final Thought

If your model can’t convincingly beat a simple, well‑engineered baseline, you’re essentially looking at a dead fish. Focus on data, baselines, and rigorous evaluation, and you’ll avoid the pitfalls of “architectural hype.”

Is your improvement large enough to matter?
Does it beat the simple baseline?
Are you comparing tuned vs tuned, or tuned vs vanilla?
Does it work beyond your training distribution?

If your gains disappear when you add these controls, you’re celebrating noise.

That interpretability paper makes the same point. If your interpretability tool finds convincing patterns in randomly initialized, untrained models, you’re not finding meaning—you’re finding statistical noise. (arXiv:2512.18792)

Saliency maps look plausible on random networks. Sparse autoencoders find “interpretable features” in random transformers. Benchmark scores improve with null models. Architectures beat baselines that were never properly tuned.

Before you chase the latest model release, try the simple baseline. Fix your data. Invest in prompts that transfer. Add controls to your evals.

You don’t need insider access or the newest model. You need to own your data, your prompts, your retrieval, and your evaluation. The model is often the least interesting part of the system, which is exactly why it gets the most hype.

Your model choice doesn’t matter nearly as much as you think. Once you accept that, you can focus on the things that do.

Your Model Choice Doesn't Matter Nearly as Much as You Think...And That's Actually Good News

Introduction

Null Models in LLM Benchmarks

Shortcut Learning in Vision

Simple Baselines Beat Complex Methods

Tabular Data: Deep Learning Isn’t Always Best

When Deep Learning Does Help

The Bigger Picture

Practical Takeaways for AI Practitioners

Final Thought

Related posts

[Paper] Assessing and Improving the Representativeness of Code Generation Benchmarks Using Knowledge Units (KUs) of Programming Languages -- An Empirical Study

Measuring What Matters with NeMo Agent Toolkit

Artificial Analysis overhauls its AI Intelligence Index, replacing popular benchmarks with 'real-world' tests

Mark Zuckerberg says Meta is launching its own AI infrastructure initiative