[Paper] Adapting Natural Language Processing Models Across Jurisdictions: A Pilot Study in Canadian Cancer Registries

Published: January 2, 2026 at 01:46 PM EST
3 min read
Source: arXiv - 2601.00787v1

Overview

This study investigates whether transformer‑based natural language processing (NLP) models trained on pathology reports in one Canadian province can be efficiently adapted for use in another province with different reporting styles. By fine‑tuning two state‑of‑the‑art models on a modest amount of local data, the authors show that cross‑jurisdictional deployment is feasible and can dramatically cut the number of missed cancer cases in registry workflows.

Key Contributions

  • First cross‑provincial benchmark of transformer NLP models for cancer‑registry tasks in Canada.
  • Adaptation pipeline that fine‑tunes a province‑specific model (BCCRTron) and a generic biomedical model (GatorTron) using only a few thousand de‑identified reports.
  • Dual‑task evaluation: Tier 1 (cancer vs. non‑cancer) and Tier 2 (reportable vs. non‑reportable) classification.
  • Conservative OR‑ensemble that merges predictions from both models, boosting recall to 0.99 on both tiers and roughly halving missed cancers on Tier 1 compared with either model alone.
  • Privacy‑preserving sharing of model weights only (no raw patient text), paving the way for a pan‑Canadian foundation model for pathology NLP.

Methodology

  1. Data collection – The Newfoundland & Labrador Cancer Registry (NLCR) contributed ~104 k pathology reports for Tier 1 and ~22 k for Tier 2, all de‑identified.
  2. Model selection
    • BCCRTron: a transformer already fine‑tuned on British Columbia Cancer Registry data.
    • GatorTron: a large clinical-domain transformer pretrained on de‑identified clinical notes and biomedical text.
  3. Input pipelines – Two parallel preprocessing streams were built: one that extracts the synoptic (structured) sections of reports, and another that focuses on the free‑text diagnosis narrative.
  4. Fine‑tuning – Each model was further trained on the NLCR data for a small number of epochs (≈ 3–5), using standard cross‑entropy loss and early stopping (a minimal fine‑tuning sketch follows this list).
  5. Ensembling – A simple OR‑logic was applied: a report is flagged as cancer (or reportable) if either model predicts the positive class. This conservative strategy maximizes sensitivity (see the ensemble sketch after this list).
  6. Evaluation – Performance was measured on held‑out NLCR test sets using recall, precision, and F1‑score, with a particular focus on missed cancer cases (false negatives).
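
As a concrete illustration of step 4, here is a minimal fine‑tuning sketch using Hugging Face Transformers. The GatorTron checkpoint name is the public one on the Hugging Face Hub; the toy data, hyperparameters, and variable names are illustrative assumptions, since the paper does not publish its training code.

```python
# Minimal localization sketch; everything here is illustrative, not the
# authors' actual pipeline.
from datasets import Dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          EarlyStoppingCallback, Trainer, TrainingArguments)

MODEL_NAME = "UFNLP/gatortron-base"  # public GatorTron; BCCRTron is not public

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(
    MODEL_NAME, num_labels=2)  # Tier 1: cancer vs. non-cancer

# Toy stand-ins for the de-identified NLCR reports.
train_ds = Dataset.from_dict({
    "text": ["Invasive ductal carcinoma, grade 2 ...",
             "Benign fibroadenoma, no atypia ..."],
    "label": [1, 0],
})
val_ds = train_ds

def tokenize(batch):
    # Pathology reports are long; truncate to the encoder's 512-token window.
    return tokenizer(batch["text"], truncation=True,
                     padding="max_length", max_length=512)

train_ds = train_ds.map(tokenize, batched=True)
val_ds = val_ds.map(tokenize, batched=True)

args = TrainingArguments(
    output_dir="nlcr-tier1",
    num_train_epochs=5,               # paper reports roughly 3-5 epochs
    eval_strategy="epoch",            # `evaluation_strategy` on older versions
    save_strategy="epoch",
    load_best_model_at_end=True,      # required for early stopping
    metric_for_best_model="eval_loss",
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=train_ds,
    eval_dataset=val_ds,
    # Cross-entropy loss is the Trainer default for classification heads.
    callbacks=[EarlyStoppingCallback(early_stopping_patience=2)],
)
trainer.train()
```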
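And a hedged sketch of steps 5–6, the OR‑ensemble plus the recall‑focused evaluation. The prediction arrays are toy stand‑ins for the two fine‑tuned models' outputs on the held‑out NLCR test set.

```python
# Conservative OR-ensemble: flag a report positive if EITHER model
# predicts the positive class. Arrays below are toy stand-ins.
import numpy as np
from sklearn.metrics import f1_score, precision_score, recall_score

labels  = np.array([1, 1, 1, 0, 0, 1])  # gold: 1 = cancer / reportable
preds_a = np.array([1, 0, 1, 0, 0, 1])  # e.g. fine-tuned BCCRTron
preds_b = np.array([0, 1, 1, 0, 1, 1])  # e.g. fine-tuned GatorTron

ensemble = np.logical_or(preds_a == 1, preds_b == 1).astype(int)

print("recall:   ", recall_score(labels, ensemble))
print("precision:", precision_score(labels, ensemble))
print("f1:       ", f1_score(labels, ensemble))
# "Missed cancers" in the results table = the ensemble's false negatives.
print("missed:", int(np.sum((labels == 1) & (ensemble == 0))))
```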

Results & Findings

Task                                      Model         Recall   Missed cases
Tier 1 (cancer vs. non‑cancer)            BCCRTron      0.95     48
Tier 1                                    GatorTron     0.96     54
Tier 1                                    OR‑Ensemble   0.99     24
Tier 2 (reportable vs. non‑reportable)    BCCRTron      0.96     54
Tier 2                                    GatorTron     0.95     46
Tier 2                                    OR‑Ensemble   0.99     33

(Missed cases are false negatives: missed cancers in Tier 1, missed reportable reports in Tier 2.)

  • Both models retained high performance after only modest fine‑tuning, confirming that a transformer pretrained in one jurisdiction can be localized elsewhere.
  • The ensemble consistently outperformed each individual model, especially in recall, which is critical for cancer surveillance where missing a case can have serious downstream effects.

Practical Implications

  • Rapid deployment: Health jurisdictions can adopt an existing transformer (e.g., a provincial model) and achieve near‑state‑of‑the‑art performance with a few thousand local reports, avoiding the need to train from scratch.
  • Reduced manual workload: Higher recall means fewer cancers slip past the automated screen, allowing registry staff to focus on edge cases rather than re‑checking obvious cancers.
  • Inter‑provincial collaboration: Sharing only model weights respects privacy regulations while enabling a shared NLP infrastructure, potentially leading to a national foundation model for pathology extraction.
  • Ensemble pattern: The conservative OR‑ensemble is a low‑cost, high‑impact technique that can be applied to any multi‑model setup where missing a positive case is costly.
  • Integration hooks: The dual‑pipeline (synoptic + diagnosis) design maps cleanly onto existing ETL workflows in hospital information systems, making integration straightforward for developers.
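
As a rough illustration of that last integration hook, the routing step might look like the sketch below. The section markers ("SYNOPTIC REPORT", "DIAGNOSIS") are hypothetical; real headers vary by province and laboratory information system.

```python
# Hypothetical routing of one report into the two preprocessing streams.
import re

SYNOPTIC_RE = re.compile(r"SYNOPTIC REPORT:(.*?)(?=\n[A-Z ]+:|\Z)", re.S)
DIAGNOSIS_RE = re.compile(r"DIAGNOSIS:(.*?)(?=\n[A-Z ]+:|\Z)", re.S)

def split_report(report_text: str) -> dict:
    """Route one pathology report into the two model input streams."""
    synoptic = SYNOPTIC_RE.search(report_text)
    diagnosis = DIAGNOSIS_RE.search(report_text)
    return {
        "synoptic": synoptic.group(1).strip() if synoptic else "",
        "diagnosis": diagnosis.group(1).strip() if diagnosis else "",
    }

report = ("DIAGNOSIS: Invasive ductal carcinoma.\n"
          "SYNOPTIC REPORT: Tumour size: 2.1 cm.")
print(split_report(report))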

Limitations & Future Work

  • Data diversity: The study focused on two provinces; additional jurisdictions with more heterogeneous report formats may expose edge cases not captured here.
  • Model size vs. latency: Large biomedical transformers can be computationally heavy; future work should explore distillation or quantization for real‑time deployment (a generic quantization sketch follows this list).
  • Explainability: While recall improved, the paper does not delve into model interpretability, which is important for clinical trust.
  • Pan‑Canadian foundation model: The authors propose a shared model but have not yet demonstrated training at that scale; future research will need to address federated learning or secure multi‑party computation to truly unify data across provinces.
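
On the latency point, one generic mitigation (not evaluated in the paper) is post‑training dynamic quantization, shown here with PyTorch; the checkpoint is a placeholder for whichever fine‑tuned model is being deployed.

```python
# Generic dynamic-quantization sketch; not from the paper.
import torch
from transformers import AutoModelForSequenceClassification

# Placeholder fp32 model; in practice, load the fine-tuned checkpoint.
model = AutoModelForSequenceClassification.from_pretrained(
    "UFNLP/gatortron-base", num_labels=2)

# Store Linear-layer weights as int8; activations stay fp32.
quantized = torch.ao.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8)

torch.save(quantized.state_dict(), "tier1_int8.pt")
```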

Authors

  • Jonathan Simkin
  • Lovedeep Gondara
  • Zeeshan Rizvi
  • Gregory Doyle
  • Jeff Dowden
  • Dan Bond
  • Desmond Martin
  • Raymond Ng

Paper Information

  • arXiv ID: 2601.00787v1
  • Categories: cs.CL
  • Published: January 2, 2026