[Paper] Multi-Level Feature Fusion for Continual Learning in Visual Quality Inspection
Source: arXiv - 2601.00725v1
Overview
The paper tackles a real‑world bottleneck in automated visual quality inspection: how to keep a deep‑learning model up to date when products, defect types, or manufacturing lines keep changing. By framing this as a continual‑learning problem, the authors propose a Multi‑Level Feature Fusion (MLFF) technique that reuses a frozen pretrained backbone and trains only lightweight adapters that combine features from several depths. The result is a system that adapts quickly, needs far fewer trainable parameters, and mitigates catastrophic forgetting, which makes it far more practical for production environments.
Key Contributions
- Multi‑Level Feature Fusion (MLFF) architecture that aggregates representations from shallow, intermediate, and deep layers of a frozen pretrained CNN.
- Parameter‑efficient adaptation: only a small set of fusion weights and task‑specific heads is trained, cutting trainable parameters by roughly 90 % compared with full fine‑tuning.
- Robust continual‑learning pipeline: demonstrated reduced forgetting and better generalization when new product lines or defect patterns are introduced.
- Empirical validation on multiple inspection datasets (e.g., surface‑defect detection, component misalignment) showing parity with end‑to‑end training while being far more computationally lightweight.
- Open‑source implementation (released alongside the paper) that integrates with popular frameworks such as PyTorch and TensorFlow.
Methodology
- Pretrained Backbone – A standard CNN (e.g., ResNet‑50) is trained once on a large, generic visual dataset and then frozen.
- Feature Extraction at Multiple Depths – The network’s output after selected blocks (early, middle, late) is taken as a set of feature maps.
- Fusion Layer – A lightweight trainable module (typically a 1×1 convolution followed by global‑average pooling) learns to weight and combine these multi‑scale features into a single descriptor.
- Task‑Specific Head – For each inspection task (new product type or defect class) a small classifier/regressor is attached to the fused descriptor.
- Continual‑Learning Loop – When a new batch of labeled images arrives, only the fusion layer and the new head are optimized (a few epochs at a modest learning rate); one possible implementation is sketched after this list. The frozen backbone stays untouched, which prevents the drift that causes catastrophic forgetting.
- Regularization – An optional knowledge‑distillation loss between the current fused representation and the previous one further stabilizes performance across tasks (a sketch follows the pipeline example below).
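To make the pipeline concrete, here is a minimal PyTorch sketch of the idea described above. It is not the authors' released implementation: the tapped layers (layer1/layer2/layer4 of a ResNet‑50), the common 256‑dimensional projection width, and the softmax‑weighted mixing are all illustrative assumptions.

```python
import torch
import torch.nn as nn
from torchvision.models import resnet50, ResNet50_Weights
from torchvision.models.feature_extraction import create_feature_extractor

class MLFFAdapter(nn.Module):
    """Lightweight fusion of multi-depth features from a frozen backbone."""
    def __init__(self, in_channels=(256, 512, 2048), fused_dim=256, num_classes=3):
        super().__init__()
        # One 1x1 conv per tapped depth, projecting to a common width.
        self.projections = nn.ModuleList(
            [nn.Conv2d(c, fused_dim, kernel_size=1) for c in in_channels]
        )
        # Learnable scalars that weight the pooled per-level descriptors.
        self.level_weights = nn.Parameter(torch.ones(len(in_channels)))
        self.head = nn.Linear(fused_dim, num_classes)  # task-specific head

    def forward(self, feats):
        # 1x1 conv, then global average pooling -> one vector per level.
        pooled = [proj(f).mean(dim=(2, 3)) for proj, f in zip(self.projections, feats)]
        w = torch.softmax(self.level_weights, dim=0)
        fused = sum(wi * p for wi, p in zip(w, pooled))
        return self.head(fused)

# Frozen backbone, tapped after early / middle / late blocks.
backbone = resnet50(weights=ResNet50_Weights.DEFAULT).eval()
for p in backbone.parameters():
    p.requires_grad_(False)
extractor = create_feature_extractor(
    backbone, return_nodes={"layer1": "early", "layer2": "mid", "layer4": "late"}
)

adapter = MLFFAdapter()
optimizer = torch.optim.Adam(adapter.parameters(), lr=1e-3)

# One training step: gradients reach only the adapter's parameters.
images = torch.randn(4, 3, 224, 224)
labels = torch.randint(0, 3, (4,))
with torch.no_grad():
    feats = extractor(images)
logits = adapter([feats["early"], feats["mid"], feats["late"]])
loss = nn.functional.cross_entropy(logits, labels)
optimizer.zero_grad()
loss.backward()
optimizer.step()
```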
The whole pipeline can be run on a single GPU in minutes, even for datasets with tens of thousands of images.
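Continuing the sketch above, the optional distillation regularizer can be approximated by penalizing the distance between the fused descriptor of the current adapter and that of a frozen copy saved before the new task. The MSE penalty and the 0.5 weight are illustrative assumptions, not the paper's exact loss.

```python
import copy
import torch.nn.functional as F

# Frozen snapshot of the adapter taken before training on the new task.
prev_adapter = copy.deepcopy(adapter).eval()
for p in prev_adapter.parameters():
    p.requires_grad_(False)

def fused_descriptor(a, feats):
    """The adapter's fused vector, i.e., its forward pass minus the head."""
    pooled = [proj(f).mean(dim=(2, 3)) for proj, f in zip(a.projections, feats)]
    w = torch.softmax(a.level_weights, dim=0)
    return sum(wi * p for wi, p in zip(w, pooled))

feats_list = [feats["early"], feats["mid"], feats["late"]]
task_loss = F.cross_entropy(adapter(feats_list), labels)
distill_loss = F.mse_loss(fused_descriptor(adapter, feats_list),
                          fused_descriptor(prev_adapter, feats_list))
loss = task_loss + 0.5 * distill_loss  # 0.5 is an assumed weighting
```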
Results & Findings
| Scenario | Baseline accuracy (full fine‑tune) | MLFF accuracy (fusion only) | Trainable‑parameter reduction | Forgetting reduction (Δ mAP vs. baseline) |
|---|---|---|---|---|
| Surface‑defect detection (3 product types) | 94.2 % | 93.8 % | ~92 % | +2.1 % |
| Component misalignment (5 successive batches) | 88.5 % | 88.1 % | ~89 % | +3.4 % |
| Cross‑product generalization (unseen product) | 81.0 % | 80.7 % | — | +4.0 % |
- Performance parity: Across all benchmarks, MLFF stays within 0.5 percentage points of full‑network fine‑tuning accuracy.
- Speed & compute: Training a new task takes ~5 minutes on an RTX 3080 vs. ~45 minutes for full fine‑tuning.
- Catastrophic forgetting: The drop in mean average precision (mAP) after adding new tasks is consistently lower for MLFF, confirming its stability.
- Robustness to domain shift: When evaluated on completely new product families, the fused features generalize better than a single deep layer, likely because shallow layers retain texture‑level cues that are invariant across products.
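For concreteness, the Δ mAP column can be read as a standard forgetting measure; this interpretation is our assumption, and the paper may define it slightly differently. With $F_i$ the drop in mAP on task $i$ between the moment it was learned and the end of the task sequence,

$$
F_i = \text{mAP}_i^{\,\text{after task } i} - \text{mAP}_i^{\,\text{after final task}},
\qquad
\Delta\text{mAP} = F_i^{\,\text{baseline}} - F_i^{\,\text{MLFF}},
$$

so a positive entry (e.g., +2.1 %) means MLFF forgets that much less than full fine‑tuning.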
Practical Implications
- Rapid model rollout – Factories can deploy a base inspection model and then plug in new defect detectors in hours rather than days, keeping line downtime minimal.
- Edge‑friendly deployment – Since the backbone stays frozen, only a small set of fusion weights and heads needs to be stored on edge devices, reducing memory footprints and OTA update sizes (see the sketch after this list).
- Cost‑effective scaling – Companies can maintain a single shared backbone across many product lines, avoiding the need to train and store a full model per line.
- Regulatory compliance – Freezing the backbone keeps its behavior deterministic and simplifies audit trails; only the lightweight adapters change, making version control and validation easier.
- Cross‑domain reuse – The same pretrained backbone can be reused for other visual tasks (e.g., surface‑roughness measurement, bin‑picking) by simply adding new fusion heads, accelerating R&D cycles.
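As a rough illustration of the edge‑deployment point above, only the adapter's weights need to ship in an update; the file name, size estimate, and workflow here are hypothetical.

```python
import torch

# Ship only the lightweight adapter; the frozen backbone already lives on-device.
torch.save(adapter.state_dict(), "mlff_adapter_line3.pt")  # typically a few MB

# On the edge device: load the shared backbone once, then hot-swap adapters.
adapter.load_state_dict(torch.load("mlff_adapter_line3.pt", map_location="cpu"))
```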
Limitations & Future Work
- Dependence on a strong pretrained backbone – If the initial backbone is not well‑aligned with the target domain (e.g., highly specialized materials), the fused features may still lack discriminative power.
- Limited exploration of non‑CNN backbones – The study focuses on ResNet‑style architectures; applying MLFF to Vision Transformers or hybrid models remains an open question.
- Scalability of many heads – While each new task adds only a small head, a very large number of tasks could eventually strain memory on constrained edge hardware.
Future directions suggested by the authors include:
- Dynamic selection of which layers to fuse based on task similarity.
- Integration with unsupervised domain adaptation to further reduce labeling effort.
- Extending the approach to multi‑modal inspection (e.g., combining visual and thermal data).
Authors
- Johannes C. Bauer
- Paul Geng
- Stephan Trattnig
- Petr Dokládal
- Rüdiger Daub
Paper Information
- arXiv ID: 2601.00725v1
- Categories: cs.CV
- Published: January 2, 2026
- PDF: https://arxiv.org/pdf/2601.00725v1