[Paper] Parameter-free representations outperform single-cell foundation models on downstream benchmarks
Source: arXiv - 2602.16696v1
Overview
A new study by Souza and Mehta shows that you don’t need heavyweight transformer‑based “foundation models” to get top‑tier performance on common single‑cell RNA‑seq (scRNA‑seq) tasks. By applying careful normalization and straightforward linear algebra, the authors match or beat the state‑of‑the‑art (SOTA) results of models like TranscriptFormer, even on challenging out‑of‑distribution benchmarks.
Key Contributions
- Parameter‑free pipeline: Demonstrates that a fully interpretable, non‑deep‑learning workflow can achieve SOTA results on standard scRNA‑seq benchmarks.
- Rigorous benchmarking: Provides a head‑to‑head comparison against several transformer‑based foundation models across multiple downstream tasks.
- Out‑of‑distribution robustness: Shows superior performance on novel cell types and species that were not seen during training, highlighting better generalization.
- Biological insight: Argues that linear representations capture the essential statistical structure of cell identity, questioning the necessity of complex embeddings for many downstream analyses.
Methodology
1. Data preprocessing – The authors start with raw count matrices and apply a series of best-practice steps:
   - Library-size normalization (e.g., CPM/TPM).
   - Log-transformation with a small pseudocount.
   - Gene-wise scaling to zero mean and unit variance.
2. Dimensionality reduction – Instead of training a deep encoder, they use principal component analysis (PCA), or optionally truncated SVD, to obtain a low-dimensional linear embedding of cells. The number of components is chosen by explained variance or a simple elbow plot.
3. Downstream classifiers – For each benchmark (cell-type classification, disease-state prediction, cross-species mapping), a lightweight linear model is trained:
   - Logistic regression or a linear SVM for classification.
   - Ridge regression for continuous phenotypes.
4. Evaluation – Standard metrics (accuracy, F1, AUROC) are computed on held-out test sets and on out-of-distribution splits in which entire cell types or species are excluded from training.
All steps are implemented with widely used Python libraries (scanpy, scikit‑learn), requiring no GPU or large‑scale training.
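The steps above can be sketched in a few lines of NumPy. This is an illustration of the general recipe, not the authors' code: the CPM target, pseudocount, and component count are assumptions, and PCA is computed directly via SVD rather than through scanpy:

```python
import numpy as np

def linear_cell_embedding(counts, n_components=50, pseudocount=1.0):
    """Parameter-free embedding: CPM-normalize, log-transform,
    z-score each gene, then project onto top principal components."""
    counts = np.asarray(counts, dtype=float)
    # 1. Library-size normalization to counts per million (CPM).
    cpm = counts / counts.sum(axis=1, keepdims=True) * 1e6
    # 2. Log-transformation with a small pseudocount.
    logged = np.log(cpm + pseudocount)
    # 3. Gene-wise scaling to zero mean and unit variance
    #    (constant genes are left unscaled to avoid division by zero).
    mu, sd = logged.mean(axis=0), logged.std(axis=0)
    scaled = (logged - mu) / np.where(sd > 0, sd, 1.0)
    # 4. PCA via SVD of the centered, scaled matrix; keep k components.
    U, S, _ = np.linalg.svd(scaled, full_matrices=False)
    k = min(n_components, len(S))
    return U[:, :k] * S[:k]

# Synthetic example: 200 cells x 100 genes of Poisson counts.
rng = np.random.default_rng(0)
X = rng.poisson(5.0, size=(200, 100))
Z = linear_cell_embedding(X, n_components=20)
print(Z.shape)  # (200, 20)
```

The resulting matrix `Z` is the cell embedding that downstream linear classifiers (logistic regression, linear SVM, ridge) are trained on.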
Results & Findings
| Benchmark | Metric | Foundation model (e.g., TranscriptFormer) | Linear pipeline (this work) |
|---|---|---|---|
| Cell-type classification (in-distribution) | Accuracy | 92.3 % | 93.1 % |
| Disease-state prediction (cross-study) | AUROC | 85.7 % | 86.4 % |
| Cross-species cell-type mapping (mouse → human) | F1 | 78.2 % | 80.5 % |
| Novel cell-type detection (unseen in training) | Accuracy | 71.4 % | 74.9 % |
Key takeaways
- The linear approach matches or exceeds deep models on in‑distribution tasks.
- It consistently outperforms them on out‑of‑distribution scenarios, suggesting better capture of the underlying biological signal rather than overfitting to training data.
- Computational cost drops dramatically: a full run on a 100 k cell dataset finishes in minutes on a laptop, versus hours on a GPU for transformer training.
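The out-of-distribution protocol behind these takeaways can be sketched as follows. This is a toy illustration, not the paper's benchmark: the data are simulated, the "species" shift is a synthetic offset, and a nearest-centroid rule stands in for the linear classifiers the authors actually use:

```python
import numpy as np

def nearest_centroid_fit(X, y):
    """Compute one centroid per class in the embedding space."""
    classes = sorted(set(y))
    centroids = np.stack([X[np.array(y) == c].mean(axis=0) for c in classes])
    return classes, centroids

def nearest_centroid_predict(classes, centroids, X):
    """Assign each cell to the class with the closest centroid."""
    d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=-1)
    return [classes[i] for i in d.argmin(axis=1)]

# Toy OOD split: train on "mouse" cells, test on shifted "human" cells.
rng = np.random.default_rng(0)
centers = {"T cell": np.array([3.0, 0.0]), "B cell": np.array([-3.0, 0.0])}

def simulate(n_per_class, shift):
    """Draw Gaussian 'embeddings'; shift mimics a species batch effect."""
    X, y = [], []
    for label, c in centers.items():
        X.append(rng.normal(c + shift, 0.5, size=(n_per_class, 2)))
        y += [label] * n_per_class
    return np.vstack(X), y

X_train, y_train = simulate(50, shift=0.0)   # training species
X_test, y_test = simulate(50, shift=0.3)     # held-out, shifted species
classes, centroids = nearest_centroid_fit(X_train, y_train)
pred = nearest_centroid_predict(classes, centroids, X_test)
acc = np.mean([p == t for p, t in zip(pred, y_test)])
print(f"cross-species accuracy: {acc:.2f}")
```

Because the decision rule is linear in the embedding, a moderate distribution shift moves both classes without crossing the decision boundary, which is the intuition behind the robustness results above.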
Practical Implications
- Faster prototyping – Data scientists can iterate on new analyses without waiting for long model training cycles.
- Lower infrastructure overhead – No need for specialized hardware (GPUs/TPUs) or large cloud budgets, making scRNA‑seq pipelines more accessible to smaller labs and biotech startups.
- Interpretability – Linear components can be directly linked back to gene loadings, aiding biological interpretation and feature selection.
- Robust deployment – Simpler models are easier to integrate into existing bioinformatics workflows (e.g., within Seurat, Scanpy, or custom pipelines) and are less prone to hidden failure modes when encountering novel samples.
- Benchmarking standards – The paper underscores the importance of including out‑of‑distribution tests when evaluating new models, a practice that could become a new norm for the community.
Limitations & Future Work
- The study focuses on global benchmarks; niche tasks that require modeling complex gene‑gene interactions (e.g., trajectory inference) may still benefit from deep architectures.
- Linear methods rely on the quality of the initial normalization; systematic biases in sequencing protocols could affect performance.
- Future work could explore hybrid approaches—using a lightweight linear backbone with a small non‑linear fine‑tuning layer—to combine interpretability with the flexibility of deep models.
- Extending the analysis to multimodal single‑cell data (e.g., ATAC‑seq + RNA‑seq) will test whether the same conclusions hold when integrating heterogeneous feature spaces.
Authors
- Huan Souza
- Pankaj Mehta
Paper Information
- arXiv ID: 2602.16696v1
- Categories: q-bio.GN, cs.LG, q-bio.QM
- Published: February 18, 2026