[Paper] Context-aware Skin Cancer Epithelial Cell Classification with Scalable Graph Transformers
Source: arXiv - 2602.15783v1
Overview
This paper tackles a long‑standing bottleneck in computational pathology: how to preserve the tissue‑level context of whole‑slide images (WSIs) while classifying individual epithelial cells in cutaneous squamous cell carcinoma (cSCC). By representing an entire slide as a graph of cells and applying scalable Graph Transformer architectures, the authors achieve higher accuracy than state‑of‑the‑art image‑based models, especially when healthy and tumor cells look morphologically alike.
Key Contributions
- Full‑WSI cell‑graph representation: Converts every detected cell into a graph node, linking neighboring cells to capture spatial relationships.
- Scalable Graph Transformers (SGFormer & DIFFormer): Adapted transformer attention mechanisms to operate efficiently on graphs with tens of thousands of nodes.
- Empirical superiority over image‑based baselines: On both single‑slide and multi‑slide experiments, graph‑based models reach ~85 % balanced accuracy vs. ~78–81 % for the best convolutional/ViT approaches.
- Feature ablation study: Demonstrates that combining morphology, texture, and the class of surrounding non‑epithelial cells yields the most discriminative node embeddings.
- Practical pipeline for large WSIs: Shows how to split massive slides into manageable patches, build graphs, and still retain the benefits of global context.
Methodology
- Cell detection & feature extraction – A pretrained detector (e.g., HoVer‑Net) identifies every cell in a WSI. For each cell, the authors compute:
- Morphological descriptors (area, perimeter, shape factors)
- Texture descriptors (local intensity statistics)
- One‑hot encoding of the cell’s broader class (e.g., immune, stromal).
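The per-cell feature assembly can be sketched as a simple concatenation. This is a minimal illustration, not the paper's exact descriptor set: the specific shape factor, texture statistics, and class vocabulary below are assumptions.

```python
import numpy as np

# Hypothetical descriptor set; the paper's exact features may differ.
def node_features(area, perimeter, intensity_mean, intensity_std,
                  cell_class, n_classes=4):
    """Concatenate morphology, texture, and a one-hot cell-class encoding
    into a single node-feature vector."""
    circularity = 4.0 * np.pi * area / perimeter ** 2  # a common shape factor
    one_hot = np.zeros(n_classes)
    one_hot[cell_class] = 1.0
    return np.concatenate([
        [area, perimeter, circularity, intensity_mean, intensity_std],
        one_hot,
    ])

feat = node_features(area=120.0, perimeter=42.0,
                     intensity_mean=0.6, intensity_std=0.1, cell_class=2)
# 9-dimensional vector: 3 morphology + 2 texture + 4 class entries
```

Keeping the one-hot class at the end of the vector makes the context-ablation experiments easy to reproduce: dropping those entries removes the neighbourhood-class signal while leaving morphology and texture intact.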
- Graph construction – Cells become nodes; edges connect each cell to its k nearest neighbors (k‑NN) based on Euclidean distance, forming a spatial graph that mirrors tissue architecture.
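A k-NN spatial graph over cell centroids can be built with a KD-tree in a few lines; this is a generic sketch of the construction described above, with k=5 chosen arbitrarily for illustration.

```python
import numpy as np
from scipy.spatial import cKDTree

def build_knn_edges(centroids, k=5):
    """Connect each cell to its k nearest neighbours by Euclidean distance.
    Returns a (2, N*k) COO-style edge index (source row, target row)."""
    tree = cKDTree(centroids)
    # Query k+1 neighbours: the nearest neighbour of any point is itself,
    # so the first column is dropped to avoid self-loops.
    _, idx = tree.query(centroids, k=k + 1)
    src = np.repeat(np.arange(len(centroids)), k)
    dst = idx[:, 1:].reshape(-1)
    return np.stack([src, dst])

rng = np.random.default_rng(0)
centroids = rng.random((100, 2))  # e.g. 100 detected cells in a patch
edges = build_knn_edges(centroids, k=5)
```

The resulting edge index is the format most graph libraries (e.g. PyTorch Geometric) consume directly, which keeps this step decoupled from the choice of downstream model.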
- Graph Transformer models –
- SGFormer (Simplified Graph Transformer) pairs a single-layer global attention component with a lightweight GNN, keeping the attention cost linear in the number of nodes.
- DIFFormer (diffusion-based Transformer) derives scalable all‑pair attention from an energy‑constrained diffusion process, letting every cell exchange information with every other cell.
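Both models are, at their core, attention mechanisms adapted to graphs. The sketch below is a didactic neighbourhood-masked self-attention head, not SGFormer or DIFFormer themselves (which use more scalable, global formulations), but it shows the basic idea of context-aware node embeddings: each cell's representation is re-weighted by its graph neighbours.

```python
import numpy as np

def masked_attention(X, adj):
    """Single attention head where each node attends only to its graph
    neighbours and itself. Didactic sketch only; the paper's models scale
    attention beyond local neighbourhoods."""
    n, d = X.shape
    scores = X @ X.T / np.sqrt(d)             # pairwise similarity logits
    mask = (adj + np.eye(n)) > 0              # neighbourhood plus self
    scores = np.where(mask, scores, -np.inf)  # block non-neighbours
    w = np.exp(scores - scores.max(axis=1, keepdims=True))
    w /= w.sum(axis=1, keepdims=True)         # row-wise softmax
    return w @ X, w                           # mixed embeddings, weights

rng = np.random.default_rng(0)
X = rng.standard_normal((5, 8))               # 5 cells, 8 features each
adj = np.zeros((5, 5))
adj[0, 1] = adj[1, 0] = 1                     # a single edge: cells 0 and 1
out, w = masked_attention(X, adj)
# Isolated cells attend only to themselves, so their embedding is unchanged.
```

The returned weight matrix `w` is also the natural hook for the interpretability analysis discussed later: large off-diagonal weights mark the neighbours that most influenced a cell's embedding.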
- Training & evaluation – 3‑fold cross‑validation on a single slide, then on a multi‑slide dataset (four 2560 × 2560 patches per slide). Balanced accuracy (average of sensitivity and specificity) is the primary metric.
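The primary metric generalizes cleanly as the mean of per-class recalls; for two classes this reduces to the average of sensitivity and specificity. A minimal reference implementation (equivalent to scikit-learn's `balanced_accuracy_score`):

```python
import numpy as np

def balanced_accuracy(y_true, y_pred):
    """Mean per-class recall. For binary labels this equals
    (sensitivity + specificity) / 2."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    recalls = [np.mean(y_pred[y_true == c] == c) for c in np.unique(y_true)]
    return float(np.mean(recalls))

# Tumour (1) recall = 3/4, healthy (0) recall = 1/2 → 0.625
score = balanced_accuracy([1, 1, 1, 1, 0, 0], [1, 1, 1, 0, 0, 1])
```

Balancing per-class recall matters here because tumour and healthy epithelial cells are typically imbalanced in a slide, so plain accuracy would reward always predicting the majority class.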
The pipeline is deliberately modular: any cell detector or feature set can be swapped in, and the graph transformer can be replaced with other GNN variants.
Results & Findings
| Setting | Model | Balanced Accuracy |
|---|---|---|
| Single‑slide (3‑fold CV) | SGFormer | 85.2 ± 1.5 % |
| Single‑slide (3‑fold CV) | DIFFormer | 85.1 ± 2.5 % |
| Single‑slide (3‑fold CV) | Best image‑based baseline | 81.2 ± 3.0 % |
| Multi‑slide patches (3‑fold CV) | DIFFormer | 83.6 ± 1.9 % |
| Multi‑slide patches (3‑fold CV) | CellViT256 (state‑of‑the‑art image ViT) | 78.1 ± 0.5 % |
Key takeaways
- Graph Transformers consistently outperform the strongest convolutional/ViT baselines, even when the latter are given the same patch size.
- Adding the class of neighboring non‑epithelial cells improves performance, confirming that cellular context matters for distinguishing subtle morphological differences.
- The approach scales to realistic WSI sizes by processing a handful of large patches rather than the entire slide at once.
Practical Implications
- Pathology workflow integration – Labs can embed the graph‑based pipeline into existing digital pathology platforms to flag suspicious epithelial cells for a second look, reducing manual review time.
- Generalizable to other cancers – The graph formulation is agnostic to tissue type; any disease where micro‑environment cues are diagnostic (e.g., breast, lung) could benefit.
- Resource‑efficient inference – Because inference runs on compact cell graphs rather than raw gigapixel images, it requires far fewer FLOPs than deep CNN or Vision Transformer pipelines, making it suitable for on‑premise deployment or edge devices in low‑resource settings.
- Explainability – Attention weights over graph edges highlight which neighboring cells most influence a classification decision, offering a natural avenue for model interpretability that aligns with pathologists’ reasoning.
- Data‑centric development – The study underscores the value of richer node features (texture + context) over raw pixel patches, encouraging developers to invest in robust feature engineering pipelines.
Limitations & Future Work
- Cell detection dependency – Errors in the upstream cell detector propagate to the graph, potentially limiting performance on low‑quality slides.
- Scalability ceiling – While the authors handle ~10k nodes per patch, whole‑slide graphs with >100k nodes still pose memory challenges; further hierarchical or sampling strategies are needed.
- Limited clinical validation – Experiments are confined to a modest dataset (few patients, specific cancer type). Larger, multi‑center studies are required to confirm generalizability.
- Future directions suggested by the authors include: integrating multi‑modal data (e.g., immunohistochemistry), exploring self‑supervised pretraining on cell graphs, and extending the framework to multi‑class tissue segmentation.
Authors
- Lucas Sancéré
- Noémie Moreau
- Katarzyna Bozek
Paper Information
- arXiv ID: 2602.15783v1
- Categories: cs.CV
- Published: February 17, 2026