When Two npm Packages Fight Over pdfjs-dist: Drop to System Binaries

Published: March 16, 2026 at 03:13 PM EDT
4 min read
Source: Dev.to

The Problem

I was adding OCR support for scanned PDFs to a Next.js app. The plan was simple:

  1. Use pdf-to-img to rasterize PDF pages.
  2. Pipe the images to Tesseract.
  3. Concatenate the extracted text.

After installing pdf-to-img and deploying to pre‑prod, uploading a scanned PDF produced the error:

Error: API version does not match Worker version

Why It Happened

The project already used unpdf for extracting text from digital PDFs. Both pdf-to-img and unpdf bundle their own copies of pdfjs-dist, but they use different versions:

Package       Bundled pdfjs-dist version
pdf-to-img    ~5.4.624
unpdf         ~5.4.296

When both packages are loaded in the same Node.js process, each tries to register its own PDF.js worker. The workers clash, and PDF.js throws the “API version does not match Worker version” error. Because the bundled copies are not peer dependencies, npm deduplication cannot resolve the conflict.
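
One way to confirm this kind of duplication is to scan node_modules for every installed copy of a package. The sketch below is illustrative only: findInstalledCopies is a hypothetical helper, it handles unscoped package names only, and it assumes the copies exist as separate folders on disk. A copy that has been inlined into another package's build output will not show up this way.

```typescript
import * as fs from "fs";
import * as path from "path";

interface InstalledCopy {
  version: string;
  location: string;
}

// Walk a node_modules folder and report every installed copy of `pkgName`,
// including copies nested inside other packages' own node_modules folders.
function findInstalledCopies(nodeModulesDir: string, pkgName: string): InstalledCopy[] {
  const found: InstalledCopy[] = [];
  if (!fs.existsSync(nodeModulesDir)) return found;

  for (const entry of fs.readdirSync(nodeModulesDir, { withFileTypes: true })) {
    if (!entry.isDirectory()) continue;
    const full = path.join(nodeModulesDir, entry.name);

    if (entry.name.startsWith("@")) {
      // Scope folder (e.g. @types): its children are packages, so treat it
      // like another node_modules level.
      found.push(...findInstalledCopies(full, pkgName));
      continue;
    }

    const manifestPath = path.join(full, "package.json");
    if (entry.name === pkgName && fs.existsSync(manifestPath)) {
      const manifest = JSON.parse(fs.readFileSync(manifestPath, "utf8")) as { version?: string };
      found.push({ version: manifest.version ?? "unknown", location: full });
    }

    // Packages can carry their own nested node_modules with private copies.
    found.push(...findInstalledCopies(path.join(full, "node_modules"), pkgName));
  }
  return found;
}

// Usage: list every pdfjs-dist copy under the current project.
for (const copy of findInstalledCopies(path.join(process.cwd(), "node_modules"), "pdfjs-dist")) {
  console.log(`${copy.version}  ${copy.location}`);
}
```

More than one distinct version in the output is a hint that two dependencies each ship their own PDF.js.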

Dead‑End Alternatives

I explored other JavaScript‑based PDF‑to‑image solutions:

  • pdfjs-dist directly (still locked to the version required by unpdf)
  • canvas + manual PDF.js rendering (requires native bindings and a complex Docker setup)
  • sharp (cannot rasterize PDFs)
  • pdf-poppler (poorly maintained wrapper)

Each option either re‑introduced the same pdfjs‑dist conflict, required heavy native builds, could not rasterize PDFs at all, or was poorly maintained.

The Better Solution: Use System Binaries

The task of converting PDFs to images and performing OCR is a solved problem at the OS level. Tools like poppler-utils (pdftoppm) and tesseract-ocr are stable, fast, and battle‑tested.

Install the binaries

RUN apt-get update && apt-get install -y \
    poppler-utils \
    tesseract-ocr \
    tesseract-ocr-eng \
    && rm -rf /var/lib/apt/lists/*
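
Since the pipeline now depends on binaries outside node_modules, it can help to fail fast at startup when one is missing. The following is a minimal sketch, not part of the original setup: isBinaryAvailable and missingBinaries are hypothetical helpers, and the check only tests whether the command can be spawned, not whether it exits cleanly.

```typescript
import { spawnSync } from "child_process";

// Returns true when `command` can be started at all. spawnSync sets `error`
// when the process cannot be spawned (for example ENOENT when the binary is
// not on PATH) or when the timeout is hit.
function isBinaryAvailable(command: string, versionFlag = "-v"): boolean {
  const result = spawnSync(command, [versionFlag], { stdio: "ignore", timeout: 5000 });
  return result.error === undefined;
}

// Returns the subset of `commands` that could not be started.
function missingBinaries(commands: string[]): string[] {
  return commands.filter((command) => !isBinaryAvailable(command));
}

// Usage: report anything the OCR pipeline needs that is not installed.
const missing = missingBinaries(["pdftoppm", "tesseract"]);
if (missing.length > 0) {
  console.error(`Missing system binaries: ${missing.join(", ")}`);
}
```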

OCR pipeline implementation

import { execSync } from "child_process";
import * as fs from "fs";
import * as path from "path";
import * as os from "os";

async function ocrScannedPdf(pdfPath: string): Promise<string> {
  const tmpDir = fs.mkdtempSync(path.join(os.tmpdir(), "ocr-"));
  const outputPrefix = path.join(tmpDir, "page");

  try {
    // Convert PDF pages to PNG images (300 DPI, good for OCR accuracy)
    execSync(`pdftoppm -png -r 300 "${pdfPath}" "${outputPrefix}"`, {
      timeout: 60000,
    });

    // Collect generated images
    const images = fs
      .readdirSync(tmpDir)
      .filter((f) => f.endsWith(".png"))
      .sort()
      .map((f) => path.join(tmpDir, f));

    if (images.length === 0) {
      throw new Error("pdftoppm produced no output");
    }

    // Run Tesseract on each page
    const texts = images.map((imgPath) => {
      const result = execSync(`tesseract "${imgPath}" stdout -l eng`, {
        timeout: 30000,
      });
      return result.toString().trim();
    });

    return texts.filter(Boolean).join("\n\n");
  } finally {
    // Clean up temporary files
    fs.rmSync(tmpDir, { recursive: true, force: true });
  }
}
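
The function above is declared async but uses execSync, which blocks the event loop while each binary runs, and it interpolates paths into a shell string, which breaks on file names containing double quotes or `$`. A possible variant, sketched here under the assumption that the same pdftoppm and tesseract flags apply, uses execFile with argument arrays so no shell is involved. The names ocrScannedPdfAsync, pdftoppmArgs, and tesseractArgs are hypothetical.

```typescript
import { execFile } from "child_process";
import { promisify } from "util";
import * as fs from "fs/promises";
import * as os from "os";
import * as path from "path";

const run = promisify(execFile);

// Argument lists are built by small pure functions so they can be checked
// without the binaries installed.
function pdftoppmArgs(pdfPath: string, outputPrefix: string, dpi = 300): string[] {
  return ["-png", "-r", String(dpi), pdfPath, outputPrefix];
}

function tesseractArgs(imagePath: string, lang = "eng"): string[] {
  return [imagePath, "stdout", "-l", lang];
}

async function ocrScannedPdfAsync(pdfPath: string): Promise<string> {
  const tmpDir = await fs.mkdtemp(path.join(os.tmpdir(), "ocr-"));
  try {
    // Rasterize every page; arguments are passed as an array, not a shell string.
    await run("pdftoppm", pdftoppmArgs(pdfPath, path.join(tmpDir, "page")), {
      encoding: "utf8",
      timeout: 60000,
    });

    const images = (await fs.readdir(tmpDir)).filter((f) => f.endsWith(".png")).sort();
    if (images.length === 0) {
      throw new Error("pdftoppm produced no output");
    }

    // OCR pages one at a time to keep CPU and memory use predictable.
    const texts: string[] = [];
    for (const name of images) {
      const { stdout } = await run("tesseract", tesseractArgs(path.join(tmpDir, name)), {
        encoding: "utf8",
        timeout: 30000,
        maxBuffer: 16 * 1024 * 1024,
      });
      texts.push(stdout.trim());
    }
    return texts.filter(Boolean).join("\n\n");
  } finally {
    // Clean up temporary files even when a step fails.
    await fs.rm(tmpDir, { recursive: true, force: true });
  }
}
```

Pages are still processed sequentially here; whether to run several tesseract processes in parallel depends on how many cores the container has.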

Key points

  • No npm packages are required for the conversion or OCR steps.
  • No version conflicts because the system binaries are independent of Node.js modules.
  • The entire pipeline fits in a single function of roughly 35 lines.

Why Prefer System Binaries Over npm Wrappers

When an npm package merely wraps a system binary (e.g., ImageMagick, FFmpeg, Ghostscript, Poppler, Tesseract, wkhtmltopdf), ask three questions before adopting it:

  1. Check maintenance – Is the wrapper well‑maintained, or is it a thin shim?
  2. Watch for transitive conflicts – Does the wrapper bundle its own copy of a library that could clash with other dependencies?
  3. Consider Docker simplicity – Installing the binary directly often results in a cleaner Dockerfile.

The npm ecosystem shines for pure‑JavaScript problems. For tasks that have long‑standing, high‑performance native implementations, invoking the binary directly is usually the more reliable choice.

Takeaways

  • Avoid bundled pdfjs‑dist conflicts by not loading multiple packages that embed different versions.
  • Leverage OS‑level tools (pdftoppm, tesseract) for PDF rasterization and OCR.
  • Keep the Node.js layer thin: a few execSync calls and minimal code can replace heavyweight, conflict‑prone npm wrappers.
  • Simplify Docker images by installing the required binaries directly rather than pulling in large, fragile npm wrappers.