[Paper] Deep Learning-Based Multiclass Classification of Oral Lesions with Stratified Augmentation

Published: November 26, 2025 at 11:56 AM EST

Source: arXiv - 2511.21582v1

Overview

The paper presents a deep‑learning pipeline that can automatically differentiate 16 distinct oral lesion types—ranging from harmless ulcers to malignant cancers—using only photographic images. By tackling the classic problems of tiny, highly imbalanced medical datasets with a clever mix of stratified splitting, aggressive augmentation, and oversampling, the authors push classification accuracy to 83 %, outperforming existing computer‑aided diagnosis (CAD) solutions.

Key Contributions

  • Multi‑class oral lesion classifier covering 16 categories, the most granular set reported to date.
  • Stratified data split that preserves the original class distribution across training/validation/test sets, reducing evaluation bias from under‑represented classes.
  • Hybrid augmentation & oversampling pipeline (rotation, scaling, color jitter, SMOTE‑like synthetic sampling) specifically tuned for minority lesion classes.
  • Empirical benchmark showing 83.33 % accuracy, 89.12 % precision, and 77.31 % recall—substantially higher than prior state‑of‑the‑art CNN baselines.
  • Open‑source implementation (code and trained weights) to accelerate reproducibility and downstream research.

Methodology

  1. Dataset preparation – A curated collection of intra‑oral photographs (≈2 k images) labeled into 16 lesion categories. The authors first performed a stratified split (70 %/15 %/15 % for train/val/test) to keep each class’s proportion consistent across sets. Code sketches for steps 1–6 follow this list.
  2. Pre‑processing & augmentation – Standard resizing to 224 × 224 px, followed by a heavy augmentation suite: random rotations (±30°), horizontal/vertical flips, brightness/contrast jitter, and elastic deformations. This inflates the effective training set by ~10×.
  3. Oversampling of minority classes – After augmentation, the authors applied a SMOTE‑style synthetic oversampling on feature embeddings to further balance the class frequencies without over‑fitting to duplicated images.
  4. Model architecture – A pretrained ResNet‑50 backbone (ImageNet weights) fine‑tuned with a custom fully‑connected head (softmax over 16 classes). Transfer learning speeds convergence and leverages generic visual features.
  5. Training regime – Cross‑entropy loss with class‑weighted penalties (higher weight for rare lesions), Adam optimizer, learning‑rate cosine annealing, and early stopping based on validation loss.
  6. Evaluation – Standard metrics (accuracy, precision, recall, F1) computed per‑class and macro‑averaged, plus confusion‑matrix analysis to spot systematic misclassifications.
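To make steps 1 and 2 concrete, below is a minimal sketch of a 70/15/15 stratified split plus a torchvision augmentation pipeline covering the listed transforms. The variable names (`paths`, `labels`, `train_transform`) and parameter values such as jitter strength are illustrative assumptions, not the authors' exact settings.

```python
from sklearn.model_selection import train_test_split
from torchvision import transforms

def stratified_split(paths, labels, seed=42):
    # Hold out 30 % for validation + test while preserving class proportions.
    train_p, rest_p, train_y, rest_y = train_test_split(
        paths, labels, test_size=0.30, stratify=labels, random_state=seed)
    # Split the remainder in half: 15 % validation, 15 % test.
    val_p, test_p, val_y, test_y = train_test_split(
        rest_p, rest_y, test_size=0.50, stratify=rest_y, random_state=seed)
    return (train_p, train_y), (val_p, val_y), (test_p, test_y)

# Training-time augmentation mirroring step 2: rotation, flips,
# brightness/contrast jitter, and elastic deformation after tensor conversion.
train_transform = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.RandomRotation(degrees=30),
    transforms.RandomHorizontalFlip(),
    transforms.RandomVerticalFlip(),
    transforms.ColorJitter(brightness=0.2, contrast=0.2),
    transforms.ToTensor(),
    transforms.ElasticTransform(alpha=50.0),  # requires a recent torchvision
])
```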
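Step 3's oversampling operates on feature embeddings rather than raw images; one way to realize a SMOTE-style balancer is with imbalanced-learn, as in this hedged sketch. The paper does not name a specific library, and `features` is assumed to be an (N, D) array of backbone embeddings.

```python
import numpy as np
from imblearn.over_sampling import SMOTE

def oversample_embeddings(features: np.ndarray, y: np.ndarray, seed: int = 0):
    # Synthesize new minority-class points by interpolating between neighbors.
    # k_neighbors must stay below the rarest class's sample count; the default
    # of 5 may need lowering for the smallest lesion classes.
    sampler = SMOTE(k_neighbors=5, random_state=seed)
    features_res, y_res = sampler.fit_resample(features, y)
    return features_res, y_res
```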
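For steps 4 and 5, the sketch below assembles an ImageNet-pretrained ResNet-50 with a new 16-way head, class-weighted cross-entropy, Adam, and cosine learning-rate annealing. The inverse-frequency weighting scheme and all hyperparameter values here are placeholders rather than the paper's reported settings, and the early-stopping check on validation loss would wrap the (omitted) training loop.

```python
import torch
import torch.nn as nn
from torchvision import models

NUM_CLASSES = 16

def build_model():
    # ImageNet weights for the backbone, fresh fully-connected head for 16 classes.
    model = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V2)
    model.fc = nn.Linear(model.fc.in_features, NUM_CLASSES)
    return model

def build_training(model, class_counts, epochs=50, lr=1e-4):
    # Inverse-frequency weights so rare lesion classes contribute more to the loss.
    counts = torch.tensor(class_counts, dtype=torch.float)
    class_weights = counts.sum() / (len(counts) * counts)
    criterion = nn.CrossEntropyLoss(weight=class_weights)
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=epochs)
    return criterion, optimizer, scheduler
```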
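Step 6's per-class and macro-averaged metrics, plus the confusion matrix used to spot systematic errors, map directly onto scikit-learn calls; `y_true` and `y_pred` below are hypothetical arrays of test-set labels and model predictions.

```python
from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             precision_score, recall_score)

def evaluate(y_true, y_pred):
    # Macro averaging weights every lesion class equally, regardless of size.
    return {
        "accuracy": accuracy_score(y_true, y_pred),
        "macro_precision": precision_score(y_true, y_pred, average="macro", zero_division=0),
        "macro_recall": recall_score(y_true, y_pred, average="macro", zero_division=0),
        "macro_f1": f1_score(y_true, y_pred, average="macro", zero_division=0),
        # Row i / column j counts class-i samples predicted as class j;
        # off-diagonal mass exposes confusions between similar lesions.
        "confusion_matrix": confusion_matrix(y_true, y_pred),
    }
```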

Results & Findings

| Metric | Value |
| --- | --- |
| Overall Accuracy | 83.33 % |
| Macro‑averaged Precision | 89.12 % |
| Macro‑averaged Recall | 77.31 % |
| F1‑score (average) | 0.82 |
  • Minority class boost: Recall for the three rarest lesions jumped from <50 % (baseline CNN) to >70 % after the augmentation‑oversampling combo.
  • Confusion patterns: Most errors occurred between visually similar precancerous lesions (e.g., leukoplakia vs. erythroplakia), suggesting that further domain‑specific cues (e.g., texture descriptors) could help.
  • Ablation study: Removing stratified splitting reduced test accuracy by ~4 %; dropping oversampling cut recall for minority classes by ~12 %, confirming each component’s necessity.

Practical Implications

  • Early screening tools: Dentists could use a smartphone‑based app that runs the model locally to flag suspicious lesions during routine exams, prompting timely biopsies.
  • Tele‑medicine triage: Remote clinics in low‑resource settings can upload images to a cloud service powered by this model, receiving a quick multi‑class risk assessment without needing a specialist on‑site.
  • Dataset‑agnostic workflow: The stratified augmentation framework is transferable to other medical imaging domains (skin lesions, retinal disease) where class imbalance is a chronic hurdle.
  • Regulatory pathway: By demonstrating robust performance across 16 classes and providing transparent preprocessing steps, the work lays groundwork for FDA/CE‑marked CAD devices that require explainability and reproducibility.

Limitations & Future Work

  • Dataset size & diversity: The study relies on a single, relatively small image collection sourced from a limited number of clinics; broader geographic and demographic sampling is needed to ensure generalization.
  • Clinical validation: No prospective trial was conducted; real‑world sensitivity/specificity may differ when images are captured by non‑expert users or under suboptimal lighting.
  • Explainability: The current model outputs only class probabilities; integrating saliency maps or attention mechanisms would help clinicians trust the AI’s decisions.
  • Extension to multimodal data: Combining visual cues with patient metadata (age, smoking status) could further boost diagnostic accuracy, a direction the authors plan to explore.

Authors

  • Joy Naoum
  • Revana Salama
  • Ali Hamdi

Paper Information

  • arXiv ID: 2511.21582v1
  • Categories: cs.CV
  • Published: November 26, 2025
  • PDF: https://arxiv.org/pdf/2511.21582v1