Behind the scenes at Perl School Publishing
Source: Dev.to
We’ve just published a new Perl School book: Design Patterns in Modern Perl by Mohammad Sajid Anwar.
It’s been a while since we last released a new title, and in the meantime the world of eBooks has moved on – Amazon no longer uses .mobi, tools have changed, and my old “it mostly works if you squint” build pipeline was starting to creak.
On top of that we had a hard deadline: we wanted the book ready in time for the London Perl Workshop. As the date loomed, last‑minute fixes and manual tweaks became more and more terrifying. We really needed a reliable, reproducible way to go from manuscript to “good quality PDF + EPUB” every time.
So over the last couple of weeks I’ve been rebuilding the Perl School book pipeline from the ground up. This post is the story of that process, the tools I ended up using, and how you can steal it for your own books.
The old world, and why it wasn’t good enough
The original Perl School pipeline dates back to a very different era:
- Amazon wanted .mobi files.
- EPUB support was patchy.
- I was happy to glue things together with shell scripts and hope for the best.
It worked… until it didn’t. Each book had slightly different scripts, slightly different assumptions, and a slightly different set of last‑minute manual tweaks. It certainly wasn’t something I’d hand to a new author and say, “trust this”.
Coming back to it for Design Patterns in Modern Perl made that painfully obvious. The book itself is modern and well‑structured; the pipeline that produced it shouldn’t feel like a relic.
Choosing tools: Pandoc and wkhtmltopdf (and no LaTeX, thanks)
The new pipeline is built around two main tools:
- Pandoc – the Swiss Army knife of document conversion. It can take Markdown/Markua plus metadata and produce HTML, EPUB, and much, much more.
- wkhtmltopdf – which turns HTML into a print‑ready PDF using a headless browser engine.
Why not LaTeX? Because I’m allergic. LaTeX is enormously powerful, but every time I’ve tried to use it seriously I end up debugging page breaks in a language I don’t enjoy. HTML + CSS I can live with; browsers I can reason about.
Conversion flow
PDF route
Markdown → HTML (via Pandoc) → PDF (via wkhtmltopdf)
EPUB route
Markdown → EPUB (via Pandoc) → validated with epubcheck
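As a rough illustration of the two routes, here is a minimal Python sketch that only builds the command lines involved. It is not the real make_book script: the function name, the file names and the output directory are assumptions made for the example, and the real pipeline will pass more options (CSS, metadata, templates). The epubcheck jar path is the one used later in this post.

```python
def conversion_commands(manuscript, out_dir="built"):
    """Return the command lines for each route as lists of arguments.

    Hypothetical helper: file names and layout are illustrative only.
    """
    html = f"{out_dir}/book.html"
    pdf = f"{out_dir}/book.pdf"
    epub = f"{out_dir}/book.epub"
    return {
        # PDF route: Markdown -> standalone HTML -> PDF
        "pdf": [
            ["pandoc", manuscript, "--standalone", "-o", html],
            ["wkhtmltopdf", html, pdf],
        ],
        # EPUB route: Markdown -> EPUB, then validate the result
        "epub": [
            ["pandoc", manuscript, "-o", epub],
            ["java", "-jar", "/usr/local/epubcheck/epubcheck.jar", epub],
        ],
    }


for route, commands in conversion_commands("manuscript.md").items():
    for argv in commands:
        print(route, "->", " ".join(argv))
```

Each inner list can be handed to subprocess.run as is, which avoids shell quoting problems with file names.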
The front matter (cover page, title page, copyright, etc.) is generated with Template Toolkit from a simple book-metadata.yml file, then stitched together with the chapters to produce a nice, consistent book.
That got us a long way… but then a reader found a bug.
The iBooks bug report
Shortly after publication, a reader who bought the Leanpub EPUB and was reading it in Apple Books (iBooks) saw a big pink error box:
There’s something wrong with the XHTML in this EPUB.
Apple Books is quite strict about the “X” in XHTML: it expects well‑formed XML, not just “kind of valid HTML”. When working with EPUB you need to forget the HTML5 flexibility you’ve grown used to.
Discovering epubcheck
epubcheck is the reference validator for EPUB files. Point it at an .epub and it will unpack it, parse all the XML/XHTML, check the metadata and manifest, and tell you exactly what’s wrong.
Running it on the book immediately produced:
Fatal Error while parsing file: The element type `br` must be terminated by the matching end-tag `</br>`.
In HTML <br> is fine; in XHTML (which is XML) you must use <br/> (self‑closing) or <br></br>. A number of these appeared across a few chapters. Pandoc had passed raw HTML straight through into the EPUB, but that HTML was not strictly valid XHTML, so Apple Books rejected it.
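The difference is easy to reproduce with any strict XML parser. The sketch below uses Python's standard library purely as a stand-in for the parser inside an EPUB reader; the helper name is made up for the example.

```python
import xml.etree.ElementTree as ET


def is_well_formed(fragment):
    """Return True if the fragment parses as XML, False otherwise."""
    try:
        ET.fromstring(fragment)
        return True
    except ET.ParseError:
        return False


print(is_well_formed("<p>line one<br>line two</p>"))       # False: <br> is never closed
print(is_well_formed("<p>line one<br/>line two</p>"))      # True: self-closing
print(is_well_formed("<p>line one<br></br>line two</p>"))  # True: explicit end tag
```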
A quick (but not scalable) fix
Under time pressure the quickest way to confirm the diagnosis was:
- Unzip the generated EPUB.
- Open the offending XHTML file.
- Manually change <br> to <br/> in a couple of places.
- Re‑zip the EPUB.
- Run epubcheck again.
- Try it in Apple Books.
The errors vanished, epubcheck was happy, and the reader confirmed the fixed file opened fine. However, “open the EPUB in a text editor and fix the XHTML by hand” is not a sustainable publishing strategy.
HTML vs XHTML, and why linters matter
The underlying issue is straightforward:
- HTML is very forgiving; browsers will fix broken markup.
- XHTML is XML, so it’s not forgiving. EPUB 3 content files are XHTML; sloppy HTML will cause some readers (like Apple Books) to refuse to load the chapter.
To catch this kind of problem earlier, I added a manuscript HTML linter to the toolchain. It runs before we ever get to Pandoc or epubcheck.
Roughly, the linter:
- Reads the manuscript (ignoring fenced code blocks so it doesn’t complain about < in Perl examples).
- Extracts any raw HTML chunks.
- Wraps those chunks in a temporary root element.
- Uses XML::LibXML to check they’re well‑formed XML.
- Reports any errors with file and line number.
It’s not a full HTML validator; it simply asks, “If this HTML ends up in an EPUB, will the XML parser choke?” This would have caught the <br> problem before the book ever left my machine.
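The steps above can be sketched in a few lines. This is not the real check_ms_html (which is Perl and uses XML::LibXML). It is a simplified Python approximation with assumptions of its own: it treats each paragraph as one chunk, keeps only the tags it finds there, ignores inline code spans, and reports the line where the paragraph starts rather than the exact line of the error. All names are hypothetical.

```python
import re
import xml.etree.ElementTree as ET

FENCE = re.compile(r"^\s*(```|~~~)")     # start or end of a fenced code block
INLINE_CODE = re.compile(r"`[^`]*`")     # inline code spans are not raw HTML
TAG = re.compile(r"</?[A-Za-z][^<>]*>")  # anything that looks like an HTML tag


def lint_manuscript(text):
    """Return (line_number, message) pairs for raw HTML that is not well-formed XML."""
    problems = []
    in_fence = False
    block_tags = []     # tags collected from the current paragraph
    block_start = None  # 1-based line number where that paragraph starts

    def flush():
        nonlocal block_tags, block_start
        if block_tags:
            try:
                # Wrap the tags in a temporary root element and parse strictly.
                ET.fromstring("<root>" + "".join(block_tags) + "</root>")
            except ET.ParseError as err:
                problems.append((block_start, str(err)))
        block_tags, block_start = [], None

    for number, line in enumerate(text.splitlines(), start=1):
        if FENCE.match(line):
            in_fence = not in_fence
            flush()
            continue
        if in_fence:
            continue  # code examples may contain < and > freely
        if not line.strip():
            flush()   # a blank line ends the paragraph
            continue
        if block_start is None:
            block_start = number
        block_tags.extend(TAG.findall(INLINE_CODE.sub("", line)))
    flush()
    return problems


sample = "\n".join([
    "A line with a break<br>",
    "that continues here.",
    "",
    "```perl",
    'print "<br>" if $x < $y;',
    "```",
    "",
    "This one is fine: <br/> and <em>emphasis</em>.",
])

for line_number, message in lint_manuscript(sample):
    print(f"manuscript line {line_number}: {message}")
```

On the sample, only the paragraph starting at line 1 is reported. The `<br>` inside the fenced Perl example and the well-formed tags in the last paragraph pass untouched.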
Hardening the pipeline: epubcheck in the loop
The linter catches obvious issues in the manuscript; epubcheck remains the final authority on the finished EPUB.
The pipeline now looks like this:
- Lint the manuscript HTML – catch broken raw HTML/XHTML before conversion.
- Build PDF + EPUB via make_book.
- Run epubcheck on the EPUB – ensure the final file is standards‑compliant.
- Only then upload to Leanpub and Amazon.
Any future changes (new CSS, new template, different metadata) still go through the same gauntlet, and the pipeline shouts at me long before a reader has to.
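The important property is that each stage gates the next: if the linter or epubcheck fails, nothing later runs and nothing is uploaded. Here is a minimal sketch of that gating in Python, assuming each stage is an external command that signals failure with a non-zero exit code. The script names come from this post, but the argument lists and the EPUB file name are hypothetical.

```python
import subprocess
import sys


def run_pipeline(steps):
    """Run (name, argv) steps in order and stop at the first non-zero exit code."""
    for name, argv in steps:
        print(f"==> {name}")
        result = subprocess.run(argv)
        if result.returncode != 0:
            print(f"{name} failed (exit code {result.returncode}); stopping",
                  file=sys.stderr)
            return False
    return True


# Hypothetical command lines; adjust paths and arguments to the real repo.
STEPS = [
    ("lint manuscript", ["perl", "check_ms_html"]),
    ("build PDF + EPUB", ["perl", "make_book"]),
    ("validate EPUB", ["java", "-jar", "/usr/local/epubcheck/epubcheck.jar",
                       "built/book.epub"]),
]


def main():
    # Exit non-zero on failure so CI marks the build as failed.
    sys.exit(0 if run_pipeline(STEPS) else 1)
```

Calling main() from a script entry point gives CI a single command whose exit status reflects the whole gauntlet. The upload step would only run after it succeeds.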
Docker and GitHub Actions: making it reproducible
Having a nice Perl script and a list of tools installed on my laptop is fine for a solo project; it’s not great if:
- other authors might want to build their own drafts, or
- I want the build to happen automatically in CI.
The next step was to package everything into a Docker image and wire it into GitHub Actions.
Docker image contents
- Perl + cpanm + all CPAN modules from the repo’s cpanfile
- pandoc
- wkhtmltopdf
- Java + epubcheck
- The Perl School utility scripts themselves (make_book, check_ms_html, etc.)
Typical workflow in a book repo
# Mount the book’s Git repo into /work
docker run --rm -v "$(pwd)":/work perl-school-builder \
    perl check_ms_html    # lint the manuscript
docker run --rm -v "$(pwd)":/work perl-school-builder \
    perl make_book        # build built/*.pdf and built/*.epub
docker run --rm -v "$(pwd)":/work perl-school-builder \
    java -jar /usr/local/epubcheck/epubcheck.jar built/*.epub
With everything containerized and automated, any author can reproduce the same build, and the CI pipeline stops any EPUB that fails the linter or epubcheck before it is published.