[Paper] Keypoint Counting Classifiers: Turning Vision Transformers into Self-Explainable Models Without Training
Source: arXiv - 2512.17891v1
Overview
The paper introduces Keypoint Counting Classifiers (KCCs), a technique that can turn any pre‑trained Vision Transformer (ViT) into a self‑explainable model without any additional training. By leveraging the ViT’s innate ability to locate matching keypoints across images, KCCs produce decisions that are directly visualizable on the input, bridging the gap between powerful foundation models and the transparency that developers and end‑users demand.
Key Contributions
- Training‑free self‑explainability: Converts a frozen ViT into an interpretable classifier without retraining or architectural changes.
- Keypoint‑based decision rule: Uses the count of matched keypoints between a test image and class‑specific prototype patches to drive predictions.
- Human‑readable explanations: Generates visual overlays that show exactly which image regions contributed to the final class vote.
- Comprehensive evaluation: Demonstrates superior human‑machine communication metrics compared to recent self‑explainable baselines on standard vision benchmarks.
- Broad applicability: Works with any well‑trained ViT (e.g., ViT‑B/16, DeiT, CLIP vision encoder), making it a drop‑in transparency layer for existing foundation models.
Methodology
- Extract patch embeddings: A frozen ViT processes an input image, producing a set of token embeddings—one per image patch.
- Identify keypoints: For each token, the method computes similarity scores against a small set of class prototypes (representative patches collected from the training set). High similarity indicates a “keypoint” that matches a known visual pattern for that class.
- Count matches per class: The number of keypoints that exceed a similarity threshold is tallied for every class.
- Decision rule: The class with the highest keypoint count wins. Because the count is derived from explicit patch matches, the reasoning is transparent.
- Visualization: The matched patches are highlighted on the original image, giving developers a clear, pixel‑level explanation of why the model chose a particular label.
The entire pipeline runs inference‑only; the only extra data needed are the prototype patches, which can be extracted once from the original training set.
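This summary does not include the authors' code, so the following is a minimal sketch of the counting rule and overlay described above, assuming cosine similarity against a pre‑built prototype bank and a fixed threshold tau. The function names, the dictionary layout of the prototype bank, the threshold value, and the grid/patch sizes are illustrative assumptions, not the paper's implementation.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def keypoint_count_classify(patch_tokens, prototypes, tau=0.8):
    """Keypoint-counting decision rule (illustrative sketch, not the authors' code).

    patch_tokens: (num_patches, dim) patch embeddings from a frozen ViT for one image.
    prototypes:   dict mapping class_id -> (num_protos, dim) prototype patch embeddings.
    tau:          cosine-similarity threshold above which a patch counts as a keypoint.
    Returns the predicted class, per-class keypoint counts, and the matched patch
    indices per class (used for the visual overlay).
    """
    tokens = F.normalize(patch_tokens, dim=-1)
    counts, matches = {}, {}
    for cls, protos in prototypes.items():
        protos = F.normalize(protos, dim=-1)
        sim = tokens @ protos.T          # (num_patches, num_protos) cosine similarities
        best = sim.max(dim=1).values     # best prototype match per patch
        hit = best > tau                 # a patch is a keypoint if it matches any prototype
        counts[cls] = int(hit.sum())
        matches[cls] = hit.nonzero(as_tuple=True)[0]
    pred = max(counts, key=counts.get)   # class with the most matched keypoints wins
    return pred, counts, matches


def patch_indices_to_mask(indices, grid_size=14, patch_size=16):
    """Map matched patch indices to a binary pixel mask for overlaying on the input
    (a 14x14 grid of 16x16 patches corresponds to a 224x224 ViT-B/16 input)."""
    mask = torch.zeros(grid_size * patch_size, grid_size * patch_size)
    for idx in indices.tolist():
        r, c = divmod(idx, grid_size)
        mask[r * patch_size:(r + 1) * patch_size,
             c * patch_size:(c + 1) * patch_size] = 1.0
    return mask
```

In practice the patch tokens could come from, e.g., timm's `forward_features` with the CLS token dropped, and the mask from the second helper is what would be alpha‑blended onto the input to produce the pixel‑level overlays described in the visualization step.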
Results & Findings
- Accuracy trade‑off: KCCs retain ≈95 % of the original ViT’s top‑1 accuracy on ImageNet‑1k, while providing explanations.
- Explanation quality: Human studies show a 30 % increase in trust and faster decision verification compared to prior self‑explainable methods (e.g., ProtoPNet, Attention Rollout).
- Speed: The counting step adds < 5 ms per image on a single RTX 3090, keeping the system suitable for real‑time applications.
- Robustness: The keypoint counts are stable under common corruptions (noise, blur), indicating that explanations are not overly sensitive to minor perturbations.
Practical Implications
- Deployable transparency: Companies can wrap existing ViT‑based services (image classification, content moderation, medical imaging) with KCCs to satisfy regulatory or internal audit requirements without costly model retraining.
- Debugging & data quality: Visual keypoint maps help engineers spot mislabeled data or systematic biases (e.g., a model relying on background textures).
- Interactive tools: Front‑end UIs can overlay keypoint explanations, enabling end‑users to understand predictions in domains such as e‑commerce (why a product was categorized a certain way) or autonomous driving (which visual cues triggered a detection).
- Foundation model integration: Since KCCs work with CLIP’s vision encoder, multimodal systems can inherit explainability for the visual branch while keeping the language side untouched.
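As one illustration of the drop‑in claim, here is a hedged sketch of extracting patch‑token embeddings from CLIP's vision encoder with the Hugging Face transformers API. The checkpoint name and the choice to take `last_hidden_state` minus the CLS token are assumptions about how one might wire the encoder into the counting sketch above; they are not taken from the paper.

```python
import torch
from PIL import Image
from transformers import CLIPVisionModel, CLIPImageProcessor

# Assumed checkpoint; any ViT-based CLIP vision encoder should behave similarly.
model = CLIPVisionModel.from_pretrained("openai/clip-vit-base-patch32").eval()
processor = CLIPImageProcessor.from_pretrained("openai/clip-vit-base-patch32")

@torch.no_grad()
def clip_patch_tokens(image: Image.Image) -> torch.Tensor:
    """Return (num_patches, dim) patch embeddings for one image, dropping the CLS token."""
    inputs = processor(images=image, return_tensors="pt")
    hidden = model(**inputs).last_hidden_state  # (1, 1 + num_patches, dim)
    return hidden[0, 1:]                        # patch tokens only

# These tokens can then be fed to keypoint_count_classify from the sketch above,
# giving the visual branch of a CLIP-based system a KCC-style explanation layer.
```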
Limitations & Future Work
- Prototype selection: The quality of explanations depends on the representativeness of the stored prototype patches; suboptimal prototypes can lead to noisy keypoint counts.
- Scalability to many classes: Counting keypoints for thousands of classes may increase memory overhead; the authors suggest hierarchical prototype clustering as a mitigation (a simple flat‑clustering starting point is sketched after this list).
- Beyond classification: The current formulation handles image‑level labels; extending KCCs to detection, segmentation, or video tasks remains an open challenge.
- Adversarial robustness: While more stable than some baselines, the method’s reliance on similarity thresholds could be exploited; future work could explore certified bounds for keypoint counting.
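Prototype quality is the main knob here. One plausible, not author‑specified, way to build a compact prototype bank is to cluster each class's training patch embeddings and keep the cluster centroids, which is also a natural flat starting point for the hierarchical clustering suggested above; the value of k and the use of scikit‑learn's KMeans are assumptions for illustration.

```python
import torch
from sklearn.cluster import KMeans

def build_prototypes(class_patch_embeddings, k=10):
    """Cluster each class's patch embeddings and keep centroids as prototypes.

    class_patch_embeddings: dict mapping class_id -> (num_patches_total, dim) tensor
        of patch embeddings gathered from that class's training images.
    Returns dict class_id -> (k, dim) prototype tensor, compatible with the counting
    sketch above. Illustrative only; the paper's selection procedure may differ.
    """
    prototypes = {}
    for cls, embeds in class_patch_embeddings.items():
        n_clusters = min(k, len(embeds))
        km = KMeans(n_clusters=n_clusters, n_init="auto", random_state=0)
        km.fit(embeds.cpu().numpy())
        prototypes[cls] = torch.from_numpy(km.cluster_centers_).float()
    return prototypes
```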
Overall, KCCs provide a pragmatic path to make today’s powerful ViT foundation models both high‑performing and self‑explainable, opening the door for wider adoption in safety‑critical and compliance‑driven industries.
Authors
- Kristoffer Wickstrøm
- Teresa Dorszewski
- Siyan Chen
- Michael Kampffmeyer
- Elisabeth Wetzer
- Robert Jenssen
Paper Information
- arXiv ID: 2512.17891v1
- Categories: cs.CV
- Published: December 19, 2025