[Paper] CanKD: Cross-Attention-based Non-local operation for Feature-based Knowledge Distillation

Published: November 26, 2025 at 10:38 AM EST
3 min read
Source: arXiv - 2511.21503v1

Overview

The paper introduces CanKD, a cross‑attention‑based knowledge‑distillation framework that lets a student network “look at” every pixel of a teacher’s feature map when learning its own representations. By turning the distillation process into a non‑local, pixel‑wise interaction, the authors achieve noticeably better performance on downstream vision tasks such as object detection and segmentation—while only adding a single loss term to the training pipeline.

Key Contributions

  • Cross‑attention distillation: Replaces the usual self‑attention alignment with a true cross‑attention mechanism, allowing each student pixel to attend to all teacher pixels.
  • Non‑local knowledge transfer: Captures long‑range spatial relationships that are often missed by conventional feature‑level distillation.
  • Lightweight integration: The method adds only one extra loss term, keeping the training overhead minimal compared with more complex attention‑guided approaches.
  • State‑of‑the‑art results: Empirically outperforms leading feature‑based and hybrid distillation techniques on standard object detection (e.g., COCO) and semantic segmentation (e.g., ADE20K) benchmarks.
  • Open‑source implementation: Code released on GitHub, facilitating reproducibility and rapid adoption.

Methodology

Traditional feature‑based distillation aligns teacher and student feature maps channel‑wise or via simple spatial pooling, treating each pixel independently. CanKD flips this paradigm:

  1. Feature extraction: The teacher and student networks produce feature maps of the same spatial resolution (or are resized to match).
  2. Cross‑attention module: For every location in the student map, a query vector is formed. This query attends to all locations in the teacher map (which supply the keys and values) using the standard scaled dot‑product attention formula, written out just after this list.
  3. Non‑local loss: The attention‑weighted teacher features are compared to the original student features using an L₂ (or cosine) loss, encouraging the student to mimic the teacher’s global context.
  4. Training objective: The overall loss is the sum of the task‑specific loss (e.g., detection or segmentation loss) and the new cross‑attention distillation loss. No extra classifiers or adapters are required.
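For reference, the attention in step 2 takes the standard scaled dot‑product form, with queries drawn from the student features and keys/values drawn from the teacher features (subscripts s and t mark student and teacher; d is the shared embedding dimension of the queries and keys):

\[
\operatorname{Attn}(Q_s, K_t, V_t) = \operatorname{softmax}\!\left(\frac{Q_s K_t^{\top}}{\sqrt{d}}\right) V_t
\]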

Because the attention operation is fully differentiable and can be implemented with existing deep‑learning primitives, the approach integrates seamlessly into typical training loops.
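The following is a minimal PyTorch‑style sketch of such a cross‑attention distillation loss. The 1×1 convolution projections, single attention head, bilinear resizing, and MSE comparison are illustrative assumptions rather than the paper's exact design; consult the official implementation for details.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class CrossAttentionDistillLoss(nn.Module):
    """Sketch of a CanKD-style loss: each student location attends to all teacher locations."""

    def __init__(self, student_channels: int, teacher_channels: int, embed_dim: int = 256):
        super().__init__()
        # Queries come from the student; keys and values come from the teacher.
        self.q_proj = nn.Conv2d(student_channels, embed_dim, kernel_size=1)
        self.k_proj = nn.Conv2d(teacher_channels, embed_dim, kernel_size=1)
        self.v_proj = nn.Conv2d(teacher_channels, student_channels, kernel_size=1)
        self.scale = embed_dim ** -0.5

    def forward(self, f_student: torch.Tensor, f_teacher: torch.Tensor) -> torch.Tensor:
        # Teacher features are assumed to come from a frozen teacher
        # (computed under torch.no_grad()). Resize if resolutions differ.
        if f_teacher.shape[-2:] != f_student.shape[-2:]:
            f_teacher = F.interpolate(f_teacher, size=f_student.shape[-2:],
                                      mode="bilinear", align_corners=False)

        b, c, h, w = f_student.shape
        q = self.q_proj(f_student).flatten(2).transpose(1, 2)  # (B, HW, D)
        k = self.k_proj(f_teacher).flatten(2)                  # (B, D, HW)
        v = self.v_proj(f_teacher).flatten(2).transpose(1, 2)  # (B, HW, C)

        # Non-local step: every student location attends to all teacher locations.
        attn = torch.softmax(q @ k * self.scale, dim=-1)        # (B, HW, HW)
        target = (attn @ v).transpose(1, 2).reshape(b, c, h, w)

        # Compare the attention-weighted teacher features with the raw student
        # features (an L2 loss here; a cosine distance is another option).
        return F.mse_loss(f_student, target)
```

The attention matrix `attn` has shape (B, HW, HW), which is the source of the quadratic memory cost discussed under Limitations below.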

Results & Findings

| Task | Teacher (large) | Student (baseline) | Gain with CanKD (Δ vs. baseline) |
| --- | --- | --- | --- |
| Object detection (COCO) | Faster R‑CNN ResNet‑101 | Faster R‑CNN ResNet‑50 | +2.3 AP |
| Semantic segmentation (ADE20K) | DeepLabV3+ X‑101 | DeepLabV3+ X‑50 | +1.8 mIoU |
| Classification (ImageNet) | ResNet‑152 | ResNet‑50 | +1.5 % top‑1 |

  • CanKD consistently beats prior attention‑guided distillation methods (e.g., AT, SPKD) by 0.5–1.0 AP/mIoU.
  • Training time overhead stays under 10 % because only one additional loss term is computed; memory usage grows modestly due to the attention matrix.
  • Ablation studies confirm that the cross‑attention (teacher‑to‑student) direction is the primary driver of gains, while self‑attention on the student side adds little benefit.

Practical Implications

  • Sharper lightweight models: Deployers can compress a high‑performing backbone (teacher) into a faster, smaller student without sacrificing much accuracy—critical for edge devices, AR/VR, and real‑time inference.
  • Plug‑and‑play distillation: Since CanKD only adds a loss term, it can be dropped into existing pipelines (detectron2, mmsegmentation, etc.) with minimal code changes; a training‑loop sketch follows this list.
  • Improved transfer learning: The richer, globally‑aware student features make fine‑tuning on downstream tasks more effective, potentially reducing the amount of labeled data needed.
  • Potential for multimodal extensions: The cross‑attention formulation naturally generalizes to scenarios where teacher and student operate on different modalities (e.g., RGB vs. depth), opening doors for cross‑modal distillation.
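To make the plug‑and‑play point concrete, here is a hypothetical training step that adds the distillation term from the sketch above to an existing task loss. All names (student, teacher, distill_loss, compute_task_loss, lambda_kd) are placeholders; where features are tapped and how the task loss is computed depend on the framework.

```python
def training_step(images, targets, student, teacher, distill_loss,
                  compute_task_loss, optimizer, lambda_kd=1.0):
    # Frozen teacher: forward pass without gradients.
    with torch.no_grad():
        f_teacher = teacher.backbone(images)

    f_student = student.backbone(images)
    task_loss = compute_task_loss(student, f_student, targets)  # detection/segmentation loss
    kd_loss = distill_loss(f_student, f_teacher)                # CanKD term from the sketch above

    loss = task_loss + lambda_kd * kd_loss  # single extra term added to the objective
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.detach()
```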

Limitations & Future Work

  • Scalability of attention: The full cross‑attention matrix scales quadratically with the number of spatial locations (an H×W feature map yields an (H·W)×(H·W) matrix, roughly 16.8 M entries for a 64×64 map), which can become a bottleneck for very high‑resolution feature maps. The authors suggest exploring sparse or hierarchical attention to mitigate this.
  • Teacher‑student architecture mismatch: The method assumes comparable spatial dimensions; large mismatches may require additional resizing or projection layers, which could dilute the non‑local signal.
  • Broader task evaluation: Experiments focus on detection and segmentation; applying CanKD to video tasks, generative models, or reinforcement learning remains an open question.

Future research directions include efficient attention approximations, curriculum‑style distillation schedules, and extending the framework to multi‑teacher or self‑supervised settings.

Authors

  • Shizhe Sun
  • Wataru Ohyama

Paper Information

  • arXiv ID: 2511.21503v1
  • Categories: cs.CV
  • Published: November 26, 2025