[Paper] ProteinPNet: Prototypical Part Networks for Concept Learning in Spatial Proteomics
Source: arXiv - 2512.02983v1
Overview
The paper introduces ProteinPNet, a prototype‑based deep learning framework that learns interpretable “spatial motifs” directly from high‑dimensional spatial proteomics data of tumor microenvironments (TMEs). By building prototype learning directly into the network’s architecture and training objective, the model discovers biologically meaningful patterns that differentiate tumor subtypes, offering a bridge between black‑box AI and actionable insights for precision oncology.
Key Contributions
- Prototype‑driven architecture: Extends prototypical part networks (originally used for image classification) to handle multiplexed spatial proteomics, learning discriminative spatial prototypes end‑to‑end.
- Faithful interpretability: Unlike post‑hoc explainers, ProteinPNet’s prototypes are part of the model’s decision process, guaranteeing that the highlighted patterns truly drive predictions.
- Synthetic benchmark with ground truth: Provides a controlled dataset where true spatial motifs are known, enabling quantitative evaluation of prototype recovery.
- Real‑world validation on lung cancer: Applies the method to a large‑scale spatial proteomics cohort, uncovering prototypes linked to immune infiltration and tissue modularity that align with known tumor subtypes.
- Graph‑ and morphology‑based analysis pipeline: Introduces tools to visualize and quantify the spatial arrangement of cells contributing to each prototype, making the results accessible to biologists and clinicians.
Methodology
- Data Representation – Each tissue section is modeled as a graph: nodes correspond to individual cells (or spots) with high‑dimensional protein expression vectors, and edges encode spatial proximity (e.g., Delaunay triangulation).
- Feature Extraction – A graph neural network (GNN) learns a latent embedding for every node, capturing both molecular and spatial context.
- Prototype Layer – A set of learnable prototype vectors lives in the same embedding space. For each node, the network computes a similarity score to every prototype (e.g., cosine similarity); a minimal sketch of this layer and the training loss follows this list.
- Prototype Activation Maps – Nodes with high similarity to a prototype form a spatial “activation map.” The model aggregates these maps (e.g., max‑pooling) to produce a global representation used for downstream classification (tumor subtype).
- Supervised Training with Prototype Regularization – The loss combines standard cross‑entropy with regularizers that (a) push prototypes toward real data patches (prototype‑coverage loss) and (b) encourage sparsity/compactness of activation maps (interpretability loss).
- Evaluation – On synthetic data, the recovered prototypes are compared to ground‑truth motifs using IoU and clustering metrics. On real data, prototypes are inspected visually and correlated with known biological markers (e.g., CD8+ T‑cell density).
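As a concrete illustration of the prototype layer and training objective described above, the following PyTorch‑style sketch compares GNN node embeddings to learnable prototypes via cosine similarity, max‑pools the per‑prototype activations into a graph‑level vector, and adds coverage and sparsity regularizers to the cross‑entropy loss. Names such as `PrototypeHead` and the exact regularizer forms are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class PrototypeHead(nn.Module):
    """Illustrative prototype layer: compares node embeddings from a GNN
    backbone to K learnable prototypes and pools the similarities into a
    graph-level representation for subtype classification."""

    def __init__(self, embed_dim: int, num_prototypes: int, num_classes: int):
        super().__init__()
        # Learnable prototype vectors living in the node-embedding space.
        self.prototypes = nn.Parameter(torch.randn(num_prototypes, embed_dim))
        # Linear classifier on top of the pooled prototype activations.
        self.classifier = nn.Linear(num_prototypes, num_classes)

    def forward(self, node_embeddings: torch.Tensor):
        # node_embeddings: (num_nodes, embed_dim) for one tissue graph.
        # Cosine similarity between every node and every prototype -> (N, K).
        sims = F.cosine_similarity(
            node_embeddings.unsqueeze(1), self.prototypes.unsqueeze(0), dim=-1
        )
        # Each column of `sims` is a spatial activation map; max-pool over
        # nodes to get one activation value per prototype.
        graph_repr, _ = sims.max(dim=0)        # (K,)
        logits = self.classifier(graph_repr)   # (num_classes,)
        return logits, sims


def prototype_losses(sims: torch.Tensor, lam_cov: float = 0.1,
                     lam_sparse: float = 0.01) -> torch.Tensor:
    """Illustrative regularizers: (a) coverage pushes each prototype toward at
    least one real node embedding, (b) sparsity keeps activation maps compact."""
    coverage = (1.0 - sims.max(dim=0).values).mean()  # each prototype near some node
    sparsity = sims.clamp(min=0).mean()               # discourage diffuse activations
    return lam_cov * coverage + lam_sparse * sparsity


# Training step, assuming `backbone` is any GNN producing node embeddings and
# `label` is a LongTensor of shape (1,) holding the tumor-subtype index:
#   logits, sims = head(backbone(x, edge_index))
#   loss = F.cross_entropy(logits.unsqueeze(0), label) + prototype_losses(sims)
```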
Results & Findings
- Synthetic data: ProteinPNet recovers >90 % of ground‑truth motifs (IoU ≈ 0.85) while maintaining classification accuracy comparable to a vanilla GNN.
- Lung cancer cohort: The model achieves ~84 % accuracy in distinguishing major histological subtypes (adenocarcinoma vs. squamous cell carcinoma).
- Biologically meaningful prototypes:
- Prototype A highlights dense clusters of immune cells (high CD45, CD8) surrounding tumor nests, correlating with “immune‑inflamed” tumors.
- Prototype B captures stromal regions rich in fibroblast markers (α‑SMA) and low immune presence, matching “immune‑desert” phenotypes.
- Prototype C isolates micro‑vascular structures (VE‑Cadherin) that differ between subtypes.
- Graph‑level insights: Network analysis shows that prototypes correspond to distinct community structures (modularity scores) within the cell‑cell interaction graph, suggesting that spatial organization itself is a predictive biomarker.
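To illustrate the kind of graph‑level analysis described above (not the authors' pipeline), the snippet below builds a Delaunay‑based cell‑cell proximity graph from 2D cell centroids and computes the modularity of its detected communities. It uses NumPy, SciPy, and NetworkX, with a random point cloud standing in for real cell‑segmentation output.

```python
import numpy as np
import networkx as nx
from scipy.spatial import Delaunay
from networkx.algorithms import community


def cell_graph_modularity(coords: np.ndarray) -> float:
    """coords: (num_cells, 2) centroid positions for one tissue section.
    Returns the modularity of greedily detected communities in the
    Delaunay-based cell-cell proximity graph."""
    tri = Delaunay(coords)
    graph = nx.Graph()
    graph.add_nodes_from(range(len(coords)))
    # Every edge of every Delaunay triangle becomes a proximity edge.
    for simplex in tri.simplices:
        for i in range(3):
            graph.add_edge(int(simplex[i]), int(simplex[(i + 1) % 3]))
    communities = community.greedy_modularity_communities(graph)
    return community.modularity(graph, communities)


# Example with random centroids standing in for segmented cells:
rng = np.random.default_rng(0)
print(cell_graph_modularity(rng.uniform(0, 1000, size=(500, 2))))
```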
Practical Implications
- Rapid biomarker discovery: Researchers can train ProteinPNet on new spatial omics datasets to surface candidate spatial signatures without manual region‑of‑interest annotation.
- Explainable AI for clinicians: Because prototypes are visualizable as cell‑level heatmaps, pathologists can validate model reasoning against histopathology slides, fostering trust in AI‑assisted diagnostics.
- Integration into pipelines: The prototype layer can be swapped into existing GNN‑based pipelines (e.g., for single‑cell RNA‑seq spatial data), offering a plug‑and‑play interpretability module (see the sketch after this list).
- Targeted therapy design: Identified immune‑rich or stromal‑rich motifs could guide patient stratification for immunotherapy vs. anti‑fibrotic strategies.
- Regulatory friendliness: Models that provide built‑in, faithful explanations align better with emerging AI‑in‑medicine regulations that demand transparency.
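A hedged sketch of such an integration, assuming PyTorch Geometric as the GNN library and reusing the hypothetical `PrototypeHead` from the Methodology sketch above; the backbone shown is a generic two‑layer GCN, not the architecture used in the paper.

```python
import torch
import torch.nn.functional as F
from torch_geometric.nn import GCNConv

# Assumes the illustrative PrototypeHead class from the Methodology sketch
# above is in scope.


class GNNWithPrototypes(torch.nn.Module):
    """Illustrative drop-in: any node-embedding GNN followed by a prototype head."""

    def __init__(self, in_channels, hidden, num_prototypes, num_classes):
        super().__init__()
        self.conv1 = GCNConv(in_channels, hidden)
        self.conv2 = GCNConv(hidden, hidden)
        self.head = PrototypeHead(hidden, num_prototypes, num_classes)

    def forward(self, x, edge_index):
        h = F.relu(self.conv1(x, edge_index))
        h = self.conv2(h, edge_index)
        # Returns (logits, per-node prototype similarities) for one graph.
        return self.head(h)
```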
Limitations & Future Work
- Scalability: Prototype learning adds overhead; training on whole‑slide images with millions of cells may require graph sampling or hierarchical pooling.
- Prototype count selection: The number of prototypes is a hyper‑parameter; too few may miss subtle patterns, too many can dilute interpretability. Automated selection strategies are not explored.
- Cross‑modality validation: The study focuses on a single lung cancer proteomics platform; extending to multiplexed imaging (e.g., CODEX) or spatial transcriptomics will test generality.
- Causal inference: While prototypes correlate with biological processes, the framework does not establish causality; integrating perturbation data (e.g., CRISPR screens) could strengthen mechanistic claims.
ProteinPNet demonstrates that prototype‑based deep learning can turn the “black box” of spatial omics into a toolbox of interpretable, biologically grounded patterns—an advance that could accelerate both research discovery and clinical decision‑making.
Authors
- Louis McConnell
- Jieran Sun
- Theo Maffei
- Raphael Gottardo
- Marianna Rapsomaniki
Paper Information
- arXiv ID: 2512.02983v1
- Categories: cs.LG
- Published: December 2, 2025