I built a CLI to fix the encoding/newline/whitespace noise that pollutes your diffs

Published: March 10, 2026 at 03:31 PM EDT
7 min read
Source: Dev.to

code‑normalizer‑pro

Every team I have worked on eventually hits the same invisible problem:

  • Someone on Windows commits a file.
  • Someone on macOS pulls it.
  • The diff shows hundreds of changed lines.

Nothing actually changed – it was trailing spaces, CRLF endings, a BOM, or a file that got re‑saved in a different encoding. The code review becomes useless because the real changes are buried in whitespace noise.

I got tired of fixing this manually on every project, so I built code‑normalizer‑pro – a CLI that handles all of it in one pass.

What It Does

A single command normalizes an entire directory:

  • Convert encoding to UTF‑8 (supports UTF‑16, UTF‑8‑BOM, Windows‑1252, Latin‑1, etc.)
  • Convert line endings from CRLF → LF
  • Strip trailing whitespace from every line
  • Ensure a single newline at the end of each file
  • Automatically skip binary files

Supported file types include Python, JavaScript, TypeScript, Go, Rust, C, C++, and Java.
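
To make the whitespace and line-ending rules above concrete, here is a minimal sketch of what they amount to on already-decoded text. It is an illustration only, not the tool's source: the function name `normalize_text` is hypothetical, and returning an empty string for an empty file is an assumption.

```python
def normalize_text(text: str) -> str:
    """Illustrative only: apply the newline and whitespace rules to decoded text."""
    # Convert CRLF line endings to LF.
    text = text.replace("\r\n", "\n")
    # Strip trailing spaces and tabs from every line.
    lines = [line.rstrip(" \t") for line in text.split("\n")]
    # Drop trailing blank lines so the file ends with exactly one newline.
    while lines and lines[-1] == "":
        lines.pop()
    # Assumption: an empty file stays empty rather than gaining a newline.
    return "\n".join(lines) + "\n" if lines else ""
```

For example, `normalize_text("a  \r\nb\t\r\n\r\n")` yields `"a\nb\n"`.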


Installation

pip install code-normalizer-pro
  • Requires Python 3.10+.
  • Core has zero dependencies beyond tqdm for progress bars.

Basic usage

# See what would change without touching anything
code-normalizer-pro /path/to/project --dry-run

# Fix everything in‑place
code-normalizer-pro /path/to/project --in-place

# Specific extensions only
code-normalizer-pro /path/to/project -e .py -e .js --in-place

Dry‑run output example

Scanning /path/to/project...
  [changed] src/utils.py      -- trailing whitespace (34 chars), CRLF endings
  [changed] src/main.js       -- encoding: windows-1252 → utf-8
  [skip]    assets/logo.png   -- binary
  [ok]      tests/test_core.py

Total: 47 files | 2 changed | 1 skipped | 44 already clean

In dry‑run mode nothing is written; the output shows exactly what would happen.

Parallel Processing for Large Codebases

Sequential mode processes about 20–30 files per second – fine for a typical repository.
For anything over a few thousand files, switch to parallel mode:

code-normalizer-pro /path/to/project --parallel --in-place

By default it uses all available CPU cores. To limit the number of workers:

code-normalizer-pro /path/to/project --parallel --workers 4 --in-place

Benchmarks (Python codebase, ~200 lines/file)

Mode                   100 files   500 files   1 000 files
Sequential             3.2 s       16.8 s      33.5 s
Parallel (4 workers)   1.1 s       4.3 s       7.1 s
Parallel (8 workers)   0.8 s       2.9 s       4.8 s

SHA‑256 Caching for Repeat Runs

With --cache enabled, the tool records a hash for each file it processes:

  • First run – processes every file and creates a .normalize-cache.json file.
  • Subsequent runs – skip unchanged files entirely.

code-normalizer-pro /path/to/project --cache --in-place --parallel

Second‑Run Output Example

All discovered files were unchanged and skipped by cache.
Cached hits: 1000
Total runtime: 0.8s

This dramatically speeds up CI: a normalization step that touches only a handful of files can finish in under a second instead of tens of seconds.
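
A hash-based skip like this one can be sketched as follows. Only the file name `.normalize-cache.json` and the use of SHA-256 come from this post; the JSON layout (relative path mapped to hex digest), the function names, and storing the cache inside the target directory are assumptions made for the example.

```python
import hashlib
import json
from pathlib import Path

CACHE_NAME = ".normalize-cache.json"  # name from the post; the layout is assumed


def file_digest(path: Path) -> str:
    """SHA-256 of the file's raw bytes."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def load_cache(root: Path) -> dict[str, str]:
    """Read the cache stored in the target directory, or start empty."""
    cache_file = root / CACHE_NAME
    if cache_file.exists():
        return json.loads(cache_file.read_text(encoding="utf-8"))
    return {}


def files_to_process(root: Path, paths: list[Path], cache: dict[str, str]) -> list[Path]:
    """Keep only files whose current hash differs from the cached one."""
    return [
        p for p in paths
        if cache.get(p.relative_to(root).as_posix()) != file_digest(p)
    ]


def save_cache(root: Path, paths: list[Path]) -> None:
    """Record each file's hash as it stands after normalization."""
    entries = {p.relative_to(root).as_posix(): file_digest(p) for p in paths}
    (root / CACHE_NAME).write_text(json.dumps(entries, indent=2), encoding="utf-8")
```

Keying entries by path relative to the target directory means the lookup does not depend on where the command was launched from.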

Pre‑commit Hook

Enforce standards across the team with a Git hook:

# Run once inside any git repo
code-normalizer-pro --install-hook

The command writes a pre‑commit hook that checks staged files before every commit. If any file needs normalization, the commit is blocked and the fix command is printed:

Checking 5 staged file(s)...

Files that need normalization:
  src/feature.py
  src/utils.js

Run: code-normalizer-pro src/feature.py src/utils.js --in-place
Or:  git commit --no-verify  (to skip this check)

From there:

  1. Fix the files with the printed command.
  2. Re‑stage the changes.
  3. Commit again.

No extra configuration file is required. The hook uses the Python interpreter that installed the package, so it works in virtual environments without additional setup.
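
For readers curious what a hook of this kind does, here is a simplified sketch. It is not the hook that `--install-hook` writes: `staged_files`, `is_clean`, and `main` are hypothetical names, the check reads the working-tree copy rather than the staged blob, and it does not skip binary files. The `git diff --cached --name-only --diff-filter=ACM -z` call is standard git.

```python
import subprocess
from pathlib import Path


def staged_files() -> list[str]:
    """Ask git for added, copied, or modified files in the index."""
    out = subprocess.run(
        ["git", "diff", "--cached", "--name-only", "--diff-filter=ACM", "-z"],
        check=True, capture_output=True, text=True,
    ).stdout
    return [name for name in out.split("\0") if name]


def is_clean(data: bytes) -> bool:
    """Simplified check: no CRLF, no trailing spaces or tabs, one final newline."""
    if b"\r\n" in data:
        return False
    if any(line != line.rstrip(b" \t") for line in data.split(b"\n")):
        return False
    return data == b"" or (data.endswith(b"\n") and not data.endswith(b"\n\n"))


def main() -> int:
    """Return 1 (which blocks the commit) if any staged file needs fixing."""
    files = staged_files()
    print(f"Checking {len(files)} staged file(s)...")
    dirty = [f for f in files if Path(f).is_file() and not is_clean(Path(f).read_bytes())]
    if not dirty:
        return 0
    print("\nFiles that need normalization:")
    for name in dirty:
        print(f"  {name}")
    print(f"\nRun: code-normalizer-pro {' '.join(dirty)} --in-place")
    return 1
```

To use something like this as a real hook, the script would end with `raise SystemExit(main())`, since git treats a non-zero exit status from a pre-commit hook as a refusal.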

CI Integration

Add a normalization check to your pipeline in about 10 lines.

GitHub Actions Example

name: Code hygiene check
on: [push, pull_request]

jobs:
  normalize-check:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.11"
      - run: pip install code-normalizer-pro
      - run: code-normalizer-pro . --dry-run --parallel
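
Two caveats:

  • The --dry-run flag currently exits 0 regardless of violations, so the job above reports changes but never fails.
  • A --fail-on-changes flag is on the roadmap – for now you can grep the output for the “[changed]” marker (the bare word “changed” also appears in the summary line) and exit 1 accordingly.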

I will document this workaround in the README until the flag ships.
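
One way to script that workaround is a small Python wrapper. It is a sketch that assumes the dry-run report keeps the format shown in the sample output earlier in this post; `has_violations` and `main` are hypothetical names. It looks for the `[changed]` marker at the start of a line, so the summary line does not trigger a false failure.

```python
import subprocess


def has_violations(report: str) -> bool:
    """True if any report line starts with the "[changed]" marker."""
    return any(line.strip().startswith("[changed]") for line in report.splitlines())


def main(target: str = ".") -> int:
    """Run the dry-run and turn its report into an exit status for CI."""
    result = subprocess.run(
        ["code-normalizer-pro", target, "--dry-run"],
        capture_output=True, text=True,
    )
    print(result.stdout, end="")
    # Assumption: the report format matches the sample output in this post.
    return 1 if has_violations(result.stdout) else 0
```

Saved as a script that ends with `raise SystemExit(main())`, it could replace the last `run:` step in the workflow above.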

Interactive mode

If you are normalizing a codebase for the first time and want to review each change before it is written, run:

code-normalizer-pro /path/to/project --interactive

The tool shows a diff for each file and waits for your input:

  • y – accept the change
  • n – reject the change
  • d – show the full diff
  • q – quit

This is useful when you are unsure what you are about to change in a legacy codebase.
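
A review loop along these lines can be built on the standard library's `difflib`. The sketch below is an assumption about the general shape, not the tool's implementation: `render_diff` and `review` are hypothetical names, and the prompt function is injectable so the loop can be exercised without a terminal. Only the y / n / d / q keys come from the list above.

```python
import difflib
from collections.abc import Callable


def render_diff(name: str, before: str, after: str) -> str:
    """Unified diff between the current and the normalized text."""
    return "".join(difflib.unified_diff(
        before.splitlines(keepends=True),
        after.splitlines(keepends=True),
        fromfile=f"{name} (current)",
        tofile=f"{name} (normalized)",
    ))


def review(changes: dict[str, tuple[str, str]],
           ask: Callable[[str], str] = input) -> list[str]:
    """Return the names of the files whose change was accepted."""
    accepted = []
    for name, (before, after) in changes.items():
        answer = ""
        # Keep asking until we get a decision; "d" prints the diff and asks again.
        while answer not in {"y", "n", "q"}:
            answer = ask(f"{name}: apply change? [y/n/d/q] ").strip().lower()
            if answer == "d":
                print(render_diff(name, before, after), end="")
        if answer == "q":
            break
        if answer == "y":
            accepted.append(name)
    return accepted
```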


What I learned building this

  • Encoding detection is hard. UTF‑16 files without a BOM are indistinguishable from binary garbage unless you apply heuristic analysis. I ended up with a layered approach: check for a BOM first, then try a candidate list in order, then fall back to binary detection. There are still edge cases.

  • ProcessPoolExecutor and in‑place writes need careful handling. When you spawn worker processes that rewrite files, you must avoid race conditions and ensure that a file is not being read while another worker is writing it. I solved this by having each worker write to a temporary file and then atomically replace the original after the worker finishes.

  • Progress reporting across processes. tqdm works nicely when a single shared instance is driven through a multiprocessing.Manager queue, giving a smooth unified progress bar.

  • Cache invalidation. Using a SHA‑256 hash of the file’s raw bytes (after normalization) provides a reliable cache key. If the file changes, the hash changes and the cache entry is refreshed automatically.

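The write-to-a-temp-file-then-replace approach from the second point is a common pattern and can be sketched with `tempfile.mkstemp` and `os.replace`. The helper name `atomic_write` is hypothetical, and this is not the tool's code.

```python
import os
import tempfile
from pathlib import Path


def atomic_write(path: Path, data: bytes) -> None:
    """Write data to a temp file in the same directory, then swap it into place."""
    fd, tmp_name = tempfile.mkstemp(
        dir=path.parent, prefix=path.name + ".", suffix=".tmp"
    )
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(data)
            tmp.flush()
            os.fsync(tmp.fileno())  # push the bytes to disk before the swap
        # Readers see either the old file or the new one, never a partial write.
        os.replace(tmp_name, path)
    except BaseException:
        # Clean up the temp file if anything failed before the swap.
        if os.path.exists(tmp_name):
            os.unlink(tmp_name)
        raise
```

The temp file lives next to the target so that `os.replace` stays on one filesystem. One caveat of this sketch: `mkstemp` creates the file with owner-only permissions, so a real implementation would likely copy the original file's mode across before replacing it.
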

Feel free to try it out, open issues, or contribute! 🎉

Workers, Backup Creation, and Cache Path

  • Backup creation must occur before dispatch – not inside the worker.
    If it’s done inside the worker, parallel mode silently skips backups. This is a known bug in the current release that I’m fixing next.

  • Cache path matters. The cache file should reside next to the target directory, not in the current working directory (CWD).
    Running the tool from different working directories prevents the cache from being hit. This is also on the fix list.
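
The first point, taking backups in the parent process before any work is handed to the pool, could be sketched like this. The function name `create_backups` and the `.bak` suffix are assumptions for the example; the post does not say how the tool names its backups.

```python
import shutil
from pathlib import Path


def create_backups(paths: list[Path], suffix: str = ".bak") -> dict[Path, Path]:
    """Copy each file next to itself before any worker is allowed to touch it."""
    backups = {}
    for path in paths:
        backup = path.with_name(path.name + suffix)
        # copy2 also tries to preserve timestamps and permissions.
        shutil.copy2(path, backup)
        backups[path] = backup
    return backups
```

Calling this once in the parent, and only then submitting the paths to the executor, means sequential and parallel runs go through the same backup step.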

Current State and Roadmap

Version: v3.0.1‑alpha.1
It works and I use it on my own projects daily.

Rough edges I’m actively fixing

  • --parallel --in-place skips backups (data‑safety issue – high priority)
  • Cache file lands in the current working directory instead of the target directory
  • --dry-run exits with status 0 even when violations are found (need --fail-on-changes)
  • No --version flag yet

Coming next

  • .gitignore pattern support (skip files the project already ignores)
  • --git-staged mode (normalize only what is staged, like the pre‑commit hook does)
  • --fail-on-changes for CI
  • --version flag

Try It and Tell Me What’s Missing

pip install code-normalizer-pro
code-normalizer-pro . --dry-run

The two things I’d like to hear from anyone who tries it

  1. CI compatibility – Does it work in your CI pipeline? If it breaks anything, please describe exactly what happened.
  2. Missing language or workflow – Which language or workflow isn’t supported yet that would make this tool useful for you?

The source code is on GitHub. If you encounter a bug or have a feature request, open an issue – I respond to everything.

Built with Python 3.10+. Zero external dependencies except tqdm.
MIT license.
