I built a CLI to fix the encoding/newline/whitespace noise that pollutes your diffs

Published: March 10, 2026 at 03:31 PM EDT
7 min read
Source: Dev.to

code‑normalizer‑pro

Every team I have worked on eventually hits the same invisible problem:

  • Someone on Windows commits a file.
  • Someone on macOS pulls it.
  • The diff shows hundreds of changed lines.

Nothing actually changed – it was trailing spaces, CRLF endings, a BOM, or a file that got re‑saved in a different encoding. The code review becomes useless because the real changes are buried in whitespace noise.

I got tired of fixing this manually on every project, so I built code‑normalizer‑pro – a CLI that handles all of it in one pass.


What it does

One command normalizes an entire directory:

  • Converts encoding to UTF‑8 (handles UTF‑16, UTF‑8‑BOM, Windows‑1252, Latin‑1, and more)
  • Fixes line endings – CRLF → LF
  • Strips trailing whitespace from every line
  • Ensures a single newline at end of file
  • Skips binary files automatically

It works on Python, JavaScript, TypeScript, Go, Rust, C, C++, and Java files.
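
The text-level steps above can be sketched in a few lines of Python. This is a minimal illustration of the idea, not the tool's actual code: the function name normalize_text is hypothetical, and treating a whitespace-only file as empty is my assumption.

```python
def normalize_text(text: str) -> str:
    """Return text with LF endings, no trailing whitespace, one final newline."""
    # CRLF -> LF
    text = text.replace("\r\n", "\n")
    # Strip trailing spaces and tabs from every line.
    lines = [line.rstrip(" \t") for line in text.split("\n")]
    # Drop empty lines at the end so exactly one newline terminates the file.
    while lines and lines[-1] == "":
        lines.pop()
    if not lines:
        return ""  # assumption: an empty or whitespace-only file stays empty
    return "\n".join(lines) + "\n"
```

For example, normalize_text("a  \r\nb\t\r\n\r\n") returns "a\nb\n". Encoding conversion and binary detection happen before this step, on the raw bytes.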


Install

pip install code-normalizer-pro

Requires Python 3.10+. The only external dependency is tqdm, which provides the progress bars.


Basic usage

# See what would change without touching anything
code-normalizer-pro /path/to/project --dry-run

# Fix everything in‑place
code-normalizer-pro /path/to/project --in-place

# Specific extensions only
code-normalizer-pro /path/to/project -e .py -e .js --in-place

Dry‑run output example

Scanning /path/to/project...
  [changed] src/utils.py      -- trailing whitespace (34 chars), CRLF endings
  [changed] src/main.js       -- encoding: windows-1252 → utf-8
  [skip]    assets/logo.png   -- binary
  [ok]      tests/test_core.py

Total: 47 files | 2 changed | 1 skipped | 44 already clean

Nothing is written in dry‑run mode; you see exactly what would happen.


Parallel processing for large codebases

Sequential mode processes about 20–30 files per second – fine for a typical repo.
For anything over a few thousand files, use parallel mode:

code-normalizer-pro /path/to/project --parallel --in-place

It uses all available CPU cores by default. You can cap the workers:

code-normalizer-pro /path/to/project --parallel --workers 4 --in-place

Benchmarks (Python codebase, ~200 lines/file)

Mode                  100 files  500 files  1000 files
Sequential            3.2 s      16.8 s     33.5 s
Parallel (4 workers)  1.1 s      4.3 s      7.1 s
Parallel (8 workers)  0.8 s      2.9 s      4.8 s

SHA‑256 caching for repeat runs

  • First run processes everything and writes a .normalize-cache.json file.
  • Subsequent runs skip unchanged files entirely.

code-normalizer-pro /path/to/project --cache --in-place --parallel

Second‑run output example

All discovered files were unchanged and skipped by cache.
Cached hits: 1000
Total runtime: 0.8s

This speeds up CI dramatically – a normalisation step that touches only a handful of files can finish in under a second instead of tens of seconds.
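
To make the caching idea concrete, here is a rough sketch of a hash-based skip list. Only the file name .normalize-cache.json comes from the tool; the JSON layout (path mapped to digest) and every function name below are assumptions made for illustration.

```python
import hashlib
import json
from pathlib import Path

CACHE_NAME = ".normalize-cache.json"  # file name mentioned above


def file_digest(path: Path) -> str:
    """SHA-256 hex digest of the file's raw bytes."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def load_cache(cache_file: Path) -> dict[str, str]:
    """Read the path -> digest mapping, or start empty on the first run."""
    if cache_file.exists():
        return json.loads(cache_file.read_text(encoding="utf-8"))
    return {}


def files_to_process(paths: list[Path], cache: dict[str, str]) -> list[Path]:
    """Keep only files whose current digest differs from the cached one."""
    return [p for p in paths if cache.get(str(p)) != file_digest(p)]


def record_clean(paths: list[Path], cache: dict[str, str], cache_file: Path) -> None:
    """Store digests taken after normalization and persist the cache."""
    for p in paths:
        cache[str(p)] = file_digest(p)
    cache_file.write_text(json.dumps(cache, indent=2), encoding="utf-8")
```

Because the digest is recorded after normalization, a file that is already clean hashes to the stored value on the next run and is skipped without being decoded at all.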


Pre‑commit hook

Enforce standards across the team with a Git hook:

# Run once inside any git repo
code-normalizer-pro --install-hook

The command writes a pre‑commit hook that checks staged files before every commit. If any file needs normalisation, the commit is blocked and the fix command is printed:

Checking 5 staged file(s)...

Files that need normalization:
  src/feature.py
  src/utils.js

Run: code-normalizer-pro src/feature.py src/utils.js --in-place
Or:  git commit --no-verify  (to skip this check)

The developer fixes the files, re‑stages, and commits. No extra config file is required. The hook uses the Python interpreter that installed the package, so it works in virtualenvs without additional setup.


CI integration

Add a normalisation check to your pipeline in about a dozen lines.

GitHub Actions example

name: Code hygiene check
on: [push, pull_request]

jobs:
  normalize-check:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.11"
      - run: pip install code-normalizer-pro
      - run: code-normalizer-pro . --dry-run --parallel

The --dry-run flag currently exits 0 regardless of violations. A --fail-on-changes flag is on the roadmap. For now you can grep the output for the “[changed]” marker and exit 1 accordingly; match the bracketed marker rather than the bare word, because the summary line also contains “changed”. I will document the workaround in the README until the flag ships.
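
One possible shape for that workaround is a small Python wrapper. It is a sketch under two assumptions: that the dry-run report is printed to stdout, and that changed files are marked with “[changed]” at the start of a line, as in the example output earlier. The function names are hypothetical.

```python
import subprocess


def has_violations(dry_run_output: str) -> bool:
    """True if any report line starts with the [changed] marker."""
    return any(
        line.strip().startswith("[changed]")
        for line in dry_run_output.splitlines()
    )


def main() -> int:
    """Run the dry-run check and turn its report into an exit code."""
    result = subprocess.run(
        ["code-normalizer-pro", ".", "--dry-run", "--parallel"],
        capture_output=True,
        text=True,
    )
    print(result.stdout, end="")
    return 1 if has_violations(result.stdout) else 0


# In a real script, finish with: raise SystemExit(main())
```

Matching on the line prefix keeps the summary line (“... | 2 changed | ...”) from triggering a false failure.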


Interactive mode

If you are normalising a codebase for the first time and want to review each change before it is written:

code-normalizer-pro /path/to/project --interactive

It shows a diff for each file and waits for y / n / d (show full diff) / q (quit). This is useful when you are unsure what you are about to change in a legacy codebase.


What I learned building this

  • Encoding detection is hard. UTF‑16 files without a BOM are indistinguishable from binary garbage unless you apply heuristic analysis. I ended up with a layered approach – check for a BOM first, then try a candidate list in order, then fall back to binary detection. There are still edge cases.

  • ProcessPoolExecutor and in‑place writes need careful handling. When you spawn worker processes that rewrite files, you must avoid race conditions and ensure that a file is not being read while another worker is writing it. I solved this by having each worker write to a temporary file and then atomically replace the original after the worker finishes.

  • Progress reporting across processes. A single tqdm bar in the parent process, fed by a multiprocessing.Manager queue that the workers post to, gives a smooth unified progress bar.

  • Cache invalidation. Using a SHA‑256 hash of the file’s raw bytes (after normalisation) provides a reliable cache key. If the file changes, the hash changes and the cache entry is refreshed automatically.
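
The layered detection from the first point above might look roughly like this. It is a simplified sketch, not the tool's real logic: the candidate list, the control-character check, and both function names are assumptions, and UTF-32 is left out.

```python
import codecs
from typing import Optional

# Layer 1: byte-order marks and the codec each one implies.
BOMS = [
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
]

# Layer 2: candidates tried in order (assumed list). latin-1 goes last
# because it can decode any byte sequence.
CANDIDATES = ("utf-8", "cp1252", "latin-1")


def looks_like_text(text: str) -> bool:
    """Reject decodings that contain control characters other than tab, LF, CR, FF."""
    return not any(ord(ch) < 32 and ch not in "\t\n\r\f" for ch in text)


def detect_encoding(data: bytes) -> Optional[str]:
    """Return a codec name, or None when the data looks binary."""
    for bom, name in BOMS:
        if data.startswith(bom):
            return name
    for name in CANDIDATES:
        try:
            text = data.decode(name)
        except UnicodeDecodeError:
            continue
        if looks_like_text(text):
            return name
    # Layer 3: nothing decoded to plausible text, so treat it as binary.
    return None
```

With this sketch, BOM-less UTF-16 such as "hi".encode("utf-16-le") comes back as None because of its NUL bytes, which is exactly the edge case described above: without extra heuristics it is indistinguishable from binary data.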


Feel free to try it out, open issues, or contribute! 🎉

Workers, Backup Creation, and Cache Path

  • Backup creation must happen before dispatch – not inside the worker.
    Otherwise, parallel mode silently skips backups. This is a known bug in the current release that I am fixing next.

  • Cache path matters. The cache file should live next to the target directory, not in the CWD.
    If you run the tool from a different working directory each time, the cache never hits. This is also on the fix list.
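
For the backup point, the fix I have in mind follows the pattern below: make the backups in the parent process before any work is dispatched, and let each worker only do a temp-file write followed by an atomic swap. This is a sketch rather than the shipped code. The .bak suffix, the function names, and the CRLF-only stand-in for the real normalization are all assumptions.

```python
import os
import shutil
import tempfile
from concurrent.futures import ProcessPoolExecutor
from pathlib import Path


def make_backup(path: Path) -> Path:
    """Copy path to '<name>.bak' (suffix is an assumption) and return the copy."""
    backup = path.with_name(path.name + ".bak")
    shutil.copy2(path, backup)
    return backup


def rewrite_atomically(path_str: str) -> str:
    """Worker: write cleaned bytes to a temp file, then swap it into place."""
    path = Path(path_str)
    # Stand-in for the real normalization step.
    cleaned = path.read_bytes().replace(b"\r\n", b"\n")
    # Same directory as the target, so the final rename stays on one filesystem.
    fd, tmp_name = tempfile.mkstemp(dir=path.parent, prefix=path.name + ".")
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(cleaned)
        shutil.copymode(path, tmp_name)  # keep the original permission bits
        os.replace(tmp_name, path)       # atomic on the same filesystem
    except BaseException:
        os.unlink(tmp_name)
        raise
    return path_str


def normalize_parallel(paths: list[Path], workers: int = 4) -> list[str]:
    """Back up every file in the parent, then dispatch the rewrites."""
    for p in paths:
        make_backup(p)
    with ProcessPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(rewrite_atomically, [str(p) for p in paths]))
```

On platforms that start workers with spawn (Windows, recent macOS), normalize_parallel has to be called from under an if __name__ == "__main__": guard.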


Current State and Roadmap

Version: v3.0.1‑alpha.1
It works and I use it on my own projects daily.

Rough edges I’m actively fixing

  • --parallel --in-place skips backups (data‑safety issue – high priority)
  • Cache file lands in CWD instead of the target directory
  • --dry-run exits 0 even when violations are found (need --fail-on-changes)
  • No --version flag yet

Coming next

  • .gitignore pattern support (skip files the project already ignores)
  • --git-staged mode (normalize only what is staged, like the pre‑commit hook does)
  • --fail-on-changes for CI
  • --version flag

Try It and Tell Me What Is Missing

pip install code-normalizer-pro
code-normalizer-pro . --dry-run

The two things I want to know from anyone who tries it

  1. CI compatibility: Does it work in your CI setup? If it broke something, I want to know exactly how.
  2. Missing language or workflow: What language or workflow is missing that would make this useful to you?

Source is on GitHub:

If you hit a bug or have a feature request, open an issue. I respond to everything.

Built with Python 3.10+. The only required external dependency is tqdm.
MIT license.
