[Paper] Keypoint Counting Classifiers: Turning Vision Transformers into Self-Explainable Models Without Training
Source: arXiv - 2512.17891v1
Overview
The paper introduces Keypoint Counting Classifiers (KCCs), a technique that can turn any pre‑trained Vision Transformer (ViT) into a self‑explainable model without any additional training. By leveraging the ViT’s innate ability to locate matching keypoints across images, KCCs produce decisions that are directly visualizable on the input, bridging the gap between powerful foundation models and the transparency that developers and end‑users demand.
Key Contributions
- Training‑free self‑explainability: Converts a frozen ViT into an interpretable classifier without retraining or architectural changes.
- Keypoint‑based decision rule: Uses the count of matched keypoints between a test image and class‑specific prototype patches to drive predictions.
- Human‑readable explanations: Generates visual overlays that show exactly which image regions contributed to the final class vote.
- Comprehensive evaluation: Demonstrates superior human‑machine communication metrics compared to recent self‑explainable baselines on standard vision benchmarks.
- Broad applicability: Works with any well‑trained ViT (e.g., ViT‑B/16, DeiT, CLIP vision encoder), making it a drop‑in transparency layer for existing foundation models.
Methodology
- Extract patch embeddings: A frozen ViT processes an input image, producing a set of token embeddings—one per image patch.
- Identify keypoints: For each token, the method computes similarity scores against a small set of class prototypes (representative patches collected from the training set). High similarity indicates a “keypoint” that matches a known visual pattern for that class.
- Count matches per class: The number of keypoints that exceed a similarity threshold is tallied for every class.
- Decision rule: The class with the highest keypoint count wins. Because the count is derived from explicit patch matches, the reasoning is transparent.
- Visualization: The matched patches are highlighted on the original image, giving developers a clear, pixel‑level explanation of why the model chose a particular label.
The entire pipeline runs inference‑only; the only extra data needed are the prototype patches, which can be extracted once from the original training set.
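This summary does not include the authors' code, so the following is a minimal sketch of the counting rule and overlay described above, assuming cosine similarity against a pre‑built prototype bank and a fixed threshold tau. The function names, the dictionary layout of the prototype bank, the threshold value, and the grid/patch sizes are illustrative assumptions, not the paper's implementation.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def keypoint_count_classify(patch_tokens, prototypes, tau=0.8):
    """Keypoint-counting decision rule (illustrative sketch, not the authors' code).

    patch_tokens: (num_patches, dim) patch embeddings from a frozen ViT for one image.
    prototypes:   dict mapping class_id -> (num_protos, dim) prototype patch embeddings.
    tau:          cosine-similarity threshold above which a patch counts as a keypoint.
    Returns the predicted class, per-class keypoint counts, and the matched patch
    indices per class (used for the visual overlay).
    """
    tokens = F.normalize(patch_tokens, dim=-1)
    counts, matches = {}, {}
    for cls, protos in prototypes.items():
        protos = F.normalize(protos, dim=-1)
        sim = tokens @ protos.T          # (num_patches, num_protos) cosine similarities
        best = sim.max(dim=1).values     # best prototype match per patch
        hit = best > tau                 # a patch is a keypoint if it matches any prototype
        counts[cls] = int(hit.sum())
        matches[cls] = hit.nonzero(as_tuple=True)[0]
    pred = max(counts, key=counts.get)   # class with the most matched keypoints wins
    return pred, counts, matches


def patch_indices_to_mask(indices, grid_size=14, patch_size=16):
    """Map matched patch indices to a binary pixel mask for overlaying on the input
    (a 14x14 grid of 16x16 patches corresponds to a 224x224 ViT-B/16 input)."""
    mask = torch.zeros(grid_size * patch_size, grid_size * patch_size)
    for idx in indices.tolist():
        r, c = divmod(idx, grid_size)
        mask[r * patch_size:(r + 1) * patch_size,
             c * patch_size:(c + 1) * patch_size] = 1.0
    return mask
```

In practice the patch tokens could come from, e.g., timm's `forward_features` with the CLS token dropped, and the mask from the second helper is what would be alpha‑blended onto the input to produce the pixel‑level overlays described in the visualization step.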
Results & Findings
- Accuracy trade‑off: KCCs retain ≈95 % of the original ViT’s top‑1 accuracy on ImageNet‑1k, while providing explanations.
- Explanation quality: Human studies show a 30 % increase in trust and faster decision verification compared to prior self‑explainable methods (e.g., ProtoPNet, Attention Rollout).
- Speed: The counting step adds < 5 ms per image on a single RTX 3090, keeping the system suitable for real‑time applications.
- Robustness: The keypoint counts are stable under common corruptions (noise, blur), indicating that explanations are not overly sensitive to minor perturbations.
Practical Implications
- Deployable transparency: Companies can wrap existing ViT‑based services (image classification, content moderation, medical imaging) with KCCs to satisfy regulatory or internal audit requirements without costly model retraining.
- Debugging & data quality: Visual keypoint maps help engineers spot mislabeled data or systematic biases (e.g., a model relying on background textures).
- Interactive tools: Front‑end UIs can overlay keypoint explanations, enabling end‑users to understand predictions in domains such as e‑commerce (why a product was categorized a certain way) or autonomous driving (which visual cues triggered a detection).
- Foundation model integration: Since KCCs work with CLIP’s vision encoder, multimodal systems can inherit explainability for the visual branch while keeping the language side untouched.
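As one illustration of the drop‑in claim, here is a hedged sketch of extracting patch‑token embeddings from CLIP's vision encoder with the Hugging Face transformers API. The checkpoint name and the choice to take `last_hidden_state` minus the CLS token are assumptions about how one might wire the encoder into the counting sketch above; they are not taken from the paper.

```python
import torch
from PIL import Image
from transformers import CLIPVisionModel, CLIPImageProcessor

# Assumed checkpoint; any ViT-based CLIP vision encoder should behave similarly.
model = CLIPVisionModel.from_pretrained("openai/clip-vit-base-patch32").eval()
processor = CLIPImageProcessor.from_pretrained("openai/clip-vit-base-patch32")

@torch.no_grad()
def clip_patch_tokens(image: Image.Image) -> torch.Tensor:
    """Return (num_patches, dim) patch embeddings for one image, dropping the CLS token."""
    inputs = processor(images=image, return_tensors="pt")
    hidden = model(**inputs).last_hidden_state  # (1, 1 + num_patches, dim)
    return hidden[0, 1:]                        # patch tokens only

# These tokens can then be fed to keypoint_count_classify from the sketch above,
# giving the visual branch of a CLIP-based system a KCC-style explanation layer.
```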
Limitations & Future Work
- Prototype selection: The quality of explanations depends on the representativeness of the stored prototype patches; suboptimal prototypes can lead to noisy keypoint counts.
- Scalability to many classes: Counting keypoints for thousands of classes may increase memory overhead; the authors suggest hierarchical prototype clustering as a mitigation (a simple flat‑clustering starting point is sketched after this list).
- Beyond classification: The current formulation handles image‑level labels; extending KCCs to detection, segmentation, or video tasks remains an open challenge.
- Adversarial robustness: While more stable than some baselines, the method’s reliance on similarity thresholds could be exploited; future work could explore certified bounds for keypoint counting.
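Prototype quality is the main knob here. One plausible, not author‑specified, way to build a compact prototype bank is to cluster each class's training patch embeddings and keep the cluster centroids, which is also a natural flat starting point for the hierarchical clustering suggested above; the value of k and the use of scikit‑learn's KMeans are assumptions for illustration.

```python
import torch
from sklearn.cluster import KMeans

def build_prototypes(class_patch_embeddings, k=10):
    """Cluster each class's patch embeddings and keep centroids as prototypes.

    class_patch_embeddings: dict mapping class_id -> (num_patches_total, dim) tensor
        of patch embeddings gathered from that class's training images.
    Returns dict class_id -> (k, dim) prototype tensor, compatible with the counting
    sketch above. Illustrative only; the paper's selection procedure may differ.
    """
    prototypes = {}
    for cls, embeds in class_patch_embeddings.items():
        n_clusters = min(k, len(embeds))
        km = KMeans(n_clusters=n_clusters, n_init="auto", random_state=0)
        km.fit(embeds.cpu().numpy())
        prototypes[cls] = torch.from_numpy(km.cluster_centers_).float()
    return prototypes
```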
Overall, KCCs provide a pragmatic path to make today’s powerful ViT foundation models both high‑performing and self‑explainable, opening the door for wider adoption in safety‑critical and compliance‑driven industries.
Authors
- Kristoffer Wickstrøm
- Teresa Dorszewski
- Siyan Chen
- Michael Kampffmeyer
- Elisabeth Wetzer
- Robert Jenssen
Paper Information
- arXiv ID: 2512.17891v1
- Categories: cs.CV
- Published: December 19, 2025