[Paper] Parameter-free representations outperform single-cell foundation models on downstream benchmarks
Source: arXiv - 2602.16696v1
Overview
A new study by Souza and Mehta shows that you don’t need heavyweight transformer‑based “foundation models” to get top‑tier performance on common single‑cell RNA‑seq (scRNA‑seq) tasks. By applying careful normalization and straightforward linear algebra, the authors match or beat the state‑of‑the‑art (SOTA) results of models like TranscriptFormer, even on challenging out‑of‑distribution benchmarks.
Key Contributions
- Parameter‑free pipeline: Demonstrates that a fully interpretable, non‑deep‑learning workflow can achieve SOTA results on standard scRNA‑seq benchmarks.
- Rigorous benchmarking: Provides a head‑to‑head comparison against several transformer‑based foundation models across multiple downstream tasks.
- Out‑of‑distribution robustness: Shows superior performance on novel cell types and species that were not seen during training, highlighting better generalization.
- Biological insight: Argues that linear representations capture the essential statistical structure of cell identity, questioning the necessity of complex embeddings for many downstream analyses.
Methodology
1. Data preprocessing – The authors start with raw count matrices and apply a series of best-practice steps:
   - Library-size normalization (e.g., CPM/TPM).
   - Log-transformation with a small pseudocount.
   - Gene-wise scaling to zero mean and unit variance.
2. Dimensionality reduction – Instead of training a deep encoder, they use principal component analysis (PCA), or optionally truncated SVD, to obtain a low-dimensional linear embedding of cells. The number of components is chosen by explained variance or a simple elbow plot.
3. Downstream classifiers – For each benchmark (cell-type classification, disease-state prediction, cross-species mapping), a lightweight linear model is trained:
   - Logistic regression or a linear SVM for classification.
   - Ridge regression for continuous phenotypes.
4. Evaluation – Standard metrics (accuracy, F1, AUROC) are computed on held-out test sets and on out-of-distribution splits in which entire cell types or species are excluded from training.
All steps are implemented with widely used Python libraries (scanpy, scikit‑learn), requiring no GPU or large‑scale training.
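The steps above can be sketched in a few lines of NumPy. This is an illustration of the general recipe, not the authors' code: the CPM target, pseudocount, and component count are assumptions, and PCA is computed directly via SVD rather than through scanpy:

```python
import numpy as np

def linear_cell_embedding(counts, n_components=50, pseudocount=1.0):
    """Parameter-free embedding: CPM-normalize, log-transform,
    z-score each gene, then project onto top principal components."""
    counts = np.asarray(counts, dtype=float)
    # 1. Library-size normalization to counts per million (CPM).
    cpm = counts / counts.sum(axis=1, keepdims=True) * 1e6
    # 2. Log-transformation with a small pseudocount.
    logged = np.log(cpm + pseudocount)
    # 3. Gene-wise scaling to zero mean and unit variance
    #    (constant genes are left unscaled to avoid division by zero).
    mu, sd = logged.mean(axis=0), logged.std(axis=0)
    scaled = (logged - mu) / np.where(sd > 0, sd, 1.0)
    # 4. PCA via SVD of the centered, scaled matrix; keep k components.
    U, S, _ = np.linalg.svd(scaled, full_matrices=False)
    k = min(n_components, len(S))
    return U[:, :k] * S[:k]

# Synthetic example: 200 cells x 100 genes of Poisson counts.
rng = np.random.default_rng(0)
X = rng.poisson(5.0, size=(200, 100))
Z = linear_cell_embedding(X, n_components=20)
print(Z.shape)  # (200, 20)
```

The resulting matrix `Z` is the cell embedding that downstream linear classifiers (logistic regression, linear SVM, ridge) are trained on.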
Results & Findings
| Benchmark | Metric | Foundation model (e.g., TranscriptFormer) | Linear pipeline (this work) |
|---|---|---|---|
| Cell-type classification (in-distribution) | Accuracy | 92.3 % | 93.1 % |
| Disease-state prediction (cross-study) | AUROC | 85.7 % | 86.4 % |
| Cross-species cell-type mapping (mouse → human) | F1 | 78.2 % | 80.5 % |
| Novel cell-type detection (unseen in training) | Accuracy | 71.4 % | 74.9 % |
Key takeaways
- The linear approach matches or exceeds deep models on in‑distribution tasks.
- It consistently outperforms them on out‑of‑distribution scenarios, suggesting better capture of the underlying biological signal rather than overfitting to training data.
- Computational cost drops dramatically: a full run on a 100 k cell dataset finishes in minutes on a laptop, versus hours on a GPU for transformer training.
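The out-of-distribution protocol behind these takeaways can be sketched as follows. This is a toy illustration, not the paper's benchmark: the data are simulated, the "species" shift is a synthetic offset, and a nearest-centroid rule stands in for the linear classifiers the authors actually use:

```python
import numpy as np

def nearest_centroid_fit(X, y):
    """Compute one centroid per class in the embedding space."""
    classes = sorted(set(y))
    centroids = np.stack([X[np.array(y) == c].mean(axis=0) for c in classes])
    return classes, centroids

def nearest_centroid_predict(classes, centroids, X):
    """Assign each cell to the class with the closest centroid."""
    d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=-1)
    return [classes[i] for i in d.argmin(axis=1)]

# Toy OOD split: train on "mouse" cells, test on shifted "human" cells.
rng = np.random.default_rng(0)
centers = {"T cell": np.array([3.0, 0.0]), "B cell": np.array([-3.0, 0.0])}

def simulate(n_per_class, shift):
    """Draw Gaussian 'embeddings'; shift mimics a species batch effect."""
    X, y = [], []
    for label, c in centers.items():
        X.append(rng.normal(c + shift, 0.5, size=(n_per_class, 2)))
        y += [label] * n_per_class
    return np.vstack(X), y

X_train, y_train = simulate(50, shift=0.0)   # training species
X_test, y_test = simulate(50, shift=0.3)     # held-out, shifted species
classes, centroids = nearest_centroid_fit(X_train, y_train)
pred = nearest_centroid_predict(classes, centroids, X_test)
acc = np.mean([p == t for p, t in zip(pred, y_test)])
print(f"cross-species accuracy: {acc:.2f}")
```

Because the decision rule is linear in the embedding, a moderate distribution shift moves both classes without crossing the decision boundary, which is the intuition behind the robustness results above.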
Practical Implications
- Faster prototyping – Data scientists can iterate on new analyses without waiting for long model training cycles.
- Lower infrastructure overhead – No need for specialized hardware (GPUs/TPUs) or large cloud budgets, making scRNA‑seq pipelines more accessible to smaller labs and biotech startups.
- Interpretability – Linear components can be directly linked back to gene loadings, aiding biological interpretation and feature selection.
- Robust deployment – Simpler models are easier to integrate into existing bioinformatics workflows (e.g., within Seurat, Scanpy, or custom pipelines) and are less prone to hidden failure modes when encountering novel samples.
- Benchmarking standards – The paper underscores the importance of including out‑of‑distribution tests when evaluating new models, a practice that could become a new norm for the community.
Limitations & Future Work
- The study focuses on global benchmarks; niche tasks that require modeling complex gene‑gene interactions (e.g., trajectory inference) may still benefit from deep architectures.
- Linear methods rely on the quality of the initial normalization; systematic biases in sequencing protocols could affect performance.
- Future work could explore hybrid approaches—using a lightweight linear backbone with a small non‑linear fine‑tuning layer—to combine interpretability with the flexibility of deep models.
- Extending the analysis to multimodal single‑cell data (e.g., ATAC‑seq + RNA‑seq) will test whether the same conclusions hold when integrating heterogeneous feature spaces.
Authors
- Huan Souza
- Pankaj Mehta
Paper Information
- arXiv ID: 2602.16696v1
- Categories: q-bio.GN, cs.LG, q-bio.QM
- Published: February 18, 2026