[Paper] Context-aware Skin Cancer Epithelial Cell Classification with Scalable Graph Transformers
Source: arXiv - 2602.15783v1
Overview
This paper tackles a long‑standing bottleneck in computational pathology: how to preserve the tissue‑level context of whole‑slide images (WSIs) while classifying individual epithelial cells in cutaneous squamous cell carcinoma (cSCC). By representing an entire slide as a graph of cells and applying scalable Graph Transformer architectures, the authors achieve higher accuracy than state‑of‑the‑art image‑based models, especially when healthy and tumor cells look morphologically alike.
Key Contributions
- Full‑WSI cell‑graph representation: Converts every detected cell into a graph node, linking neighboring cells to capture spatial relationships.
- Scalable Graph Transformers (SGFormer & DIFFormer): Adapted transformer attention mechanisms to operate efficiently on graphs with tens of thousands of nodes.
- Empirical superiority over image‑based baselines: On both single‑slide and multi‑slide experiments, graph‑based models reach ~85 % balanced accuracy vs. ~78–81 % for the best convolutional/ViT approaches.
- Feature ablation study: Demonstrates that combining morphology, texture, and the class of surrounding non‑epithelial cells yields the most discriminative node embeddings.
- Practical pipeline for large WSIs: Shows how to split massive slides into manageable patches, build graphs, and still retain the benefits of global context.
Methodology
- Cell detection & feature extraction – A pretrained detector (e.g., HoVer‑Net) identifies every cell in a WSI. For each cell, the authors compute:
- Morphological descriptors (area, perimeter, shape factors)
- Texture descriptors (local intensity statistics)
- One‑hot encoding of the cell’s broader class (e.g., immune, stromal).
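The per-cell feature assembly can be sketched as a simple concatenation. This is a minimal illustration, not the paper's exact descriptor set: the specific shape factor, texture statistics, and class vocabulary below are assumptions.

```python
import numpy as np

# Hypothetical descriptor set; the paper's exact features may differ.
def node_features(area, perimeter, intensity_mean, intensity_std,
                  cell_class, n_classes=4):
    """Concatenate morphology, texture, and a one-hot cell-class encoding
    into a single node-feature vector."""
    circularity = 4.0 * np.pi * area / perimeter ** 2  # a common shape factor
    one_hot = np.zeros(n_classes)
    one_hot[cell_class] = 1.0
    return np.concatenate([
        [area, perimeter, circularity, intensity_mean, intensity_std],
        one_hot,
    ])

feat = node_features(area=120.0, perimeter=42.0,
                     intensity_mean=0.6, intensity_std=0.1, cell_class=2)
# 9-dimensional vector: 3 morphology + 2 texture + 4 class entries
```

Keeping the one-hot class at the end of the vector makes the context-ablation experiments easy to reproduce: dropping those entries removes the neighbourhood-class signal while leaving morphology and texture intact.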
- Graph construction – Cells become nodes; edges connect each cell to its k nearest neighbors (k‑NN) based on Euclidean distance, forming a spatial graph that mirrors tissue architecture.
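A k-NN spatial graph over cell centroids can be built with a KD-tree in a few lines; this is a generic sketch of the construction described above, with k=5 chosen arbitrarily for illustration.

```python
import numpy as np
from scipy.spatial import cKDTree

def build_knn_edges(centroids, k=5):
    """Connect each cell to its k nearest neighbours by Euclidean distance.
    Returns a (2, N*k) COO-style edge index (source row, target row)."""
    tree = cKDTree(centroids)
    # Query k+1 neighbours: the nearest neighbour of any point is itself,
    # so the first column is dropped to avoid self-loops.
    _, idx = tree.query(centroids, k=k + 1)
    src = np.repeat(np.arange(len(centroids)), k)
    dst = idx[:, 1:].reshape(-1)
    return np.stack([src, dst])

rng = np.random.default_rng(0)
centroids = rng.random((100, 2))  # e.g. 100 detected cells in a patch
edges = build_knn_edges(centroids, k=5)
```

The resulting edge index is the format most graph libraries (e.g. PyTorch Geometric) consume directly, which keeps this step decoupled from the choice of downstream model.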
- Graph Transformer models –
- SGFormer (Simplified Graph Transformer) pairs a single-layer global attention component with a lightweight GNN, keeping the attention cost linear in the number of nodes.
- DIFFormer (diffusion-based Transformer) derives scalable all‑pair attention from an energy‑constrained diffusion process, letting every cell exchange information with every other cell.
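Both models are, at their core, attention mechanisms adapted to graphs. The sketch below is a didactic neighbourhood-masked self-attention head, not SGFormer or DIFFormer themselves (which use more scalable, global formulations), but it shows the basic idea of context-aware node embeddings: each cell's representation is re-weighted by its graph neighbours.

```python
import numpy as np

def masked_attention(X, adj):
    """Single attention head where each node attends only to its graph
    neighbours and itself. Didactic sketch only; the paper's models scale
    attention beyond local neighbourhoods."""
    n, d = X.shape
    scores = X @ X.T / np.sqrt(d)             # pairwise similarity logits
    mask = (adj + np.eye(n)) > 0              # neighbourhood plus self
    scores = np.where(mask, scores, -np.inf)  # block non-neighbours
    w = np.exp(scores - scores.max(axis=1, keepdims=True))
    w /= w.sum(axis=1, keepdims=True)         # row-wise softmax
    return w @ X, w                           # mixed embeddings, weights

rng = np.random.default_rng(0)
X = rng.standard_normal((5, 8))               # 5 cells, 8 features each
adj = np.zeros((5, 5))
adj[0, 1] = adj[1, 0] = 1                     # a single edge: cells 0 and 1
out, w = masked_attention(X, adj)
# Isolated cells attend only to themselves, so their embedding is unchanged.
```

The returned weight matrix `w` is also the natural hook for the interpretability analysis discussed later: large off-diagonal weights mark the neighbours that most influenced a cell's embedding.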
- Training & evaluation – 3‑fold cross‑validation on a single slide, then on a multi‑slide dataset (four 2560 × 2560 patches per slide). Balanced accuracy (average of sensitivity and specificity) is the primary metric.
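The primary metric generalizes cleanly as the mean of per-class recalls; for two classes this reduces to the average of sensitivity and specificity. A minimal reference implementation (equivalent to scikit-learn's `balanced_accuracy_score`):

```python
import numpy as np

def balanced_accuracy(y_true, y_pred):
    """Mean per-class recall. For binary labels this equals
    (sensitivity + specificity) / 2."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    recalls = [np.mean(y_pred[y_true == c] == c) for c in np.unique(y_true)]
    return float(np.mean(recalls))

# Tumour (1) recall = 3/4, healthy (0) recall = 1/2 → 0.625
score = balanced_accuracy([1, 1, 1, 1, 0, 0], [1, 1, 1, 0, 0, 1])
```

Balancing per-class recall matters here because tumour and healthy epithelial cells are typically imbalanced in a slide, so plain accuracy would reward always predicting the majority class.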
The pipeline is deliberately modular: any cell detector or feature set can be swapped in, and the graph transformer can be replaced with other GNN variants.
Results & Findings
| Setting | Model | Balanced Accuracy |
|---|---|---|
| Single‑slide (3‑fold CV) | SGFormer | 85.2 ± 1.5 % |
| Single‑slide (3‑fold CV) | DIFFormer | 85.1 ± 2.5 % |
| Single‑slide (3‑fold CV) | Best image‑based baseline | 81.2 ± 3.0 % |
| Multi‑slide patches (3‑fold CV) | DIFFormer | 83.6 ± 1.9 % |
| Multi‑slide patches (3‑fold CV) | CellViT256 (state‑of‑the‑art image ViT) | 78.1 ± 0.5 % |
Key takeaways
- Graph Transformers consistently outperform the strongest convolutional/ViT baselines, even when the latter are given the same patch size.
- Adding the class of neighboring non‑epithelial cells improves performance, confirming that cellular context matters for distinguishing subtle morphological differences.
- The approach scales to realistic WSI sizes by processing a handful of large patches rather than the entire slide at once.
Practical Implications
- Pathology workflow integration – Labs can embed the graph‑based pipeline into existing digital pathology platforms to flag suspicious epithelial cells for a second look, reducing manual review time.
- Generalizable to other cancers – The graph formulation is agnostic to tissue type; any disease where micro‑environment cues are diagnostic (e.g., breast, lung) could benefit.
- Resource‑efficient inference – Because inference runs on compact cell graphs rather than raw gigapixel images, it requires far fewer FLOPs than deep CNN or Vision Transformer pipelines, making it suitable for on‑premise deployment or edge devices in low‑resource settings.
- Explainability – Attention weights over graph edges highlight which neighboring cells most influence a classification decision, offering a natural avenue for model interpretability that aligns with pathologists’ reasoning.
- Data‑centric development – The study underscores the value of richer node features (texture + context) over raw pixel patches, encouraging developers to invest in robust feature engineering pipelines.
Limitations & Future Work
- Cell detection dependency – Errors in the upstream cell detector propagate to the graph, potentially limiting performance on low‑quality slides.
- Scalability ceiling – While the authors handle ~10k nodes per patch, whole‑slide graphs with >100k nodes still pose memory challenges; further hierarchical or sampling strategies are needed.
- Limited clinical validation – Experiments are confined to a modest dataset (few patients, specific cancer type). Larger, multi‑center studies are required to confirm generalizability.
- Future directions suggested by the authors include: integrating multi‑modal data (e.g., immunohistochemistry), exploring self‑supervised pretraining on cell graphs, and extending the framework to multi‑class tissue segmentation.
Authors
- Lucas Sancéré
- Noémie Moreau
- Katarzyna Bozek
Paper Information
- arXiv ID: 2602.15783v1
- Categories: cs.CV
- Published: February 17, 2026