I built a CLI to fix the encoding/newline/whitespace noise that pollutes your diffs
Source: Dev.to
code‑normalizer‑pro
Every team I have worked on eventually hits the same invisible problem:
- Someone on Windows commits a file.
- Someone on macOS pulls it.
- The diff shows hundreds of changed lines.
Nothing actually changed – it was trailing spaces, CRLF endings, a BOM, or a file that got re‑saved in a different encoding. The code review becomes useless because the real changes are buried in whitespace noise.
I got tired of fixing this manually on every project, so I built code‑normalizer‑pro – a CLI that handles all of it in one pass.
What it does
One command normalizes an entire directory:
- Converts encoding to UTF‑8 (handles UTF‑16, UTF‑8‑BOM, Windows‑1252, Latin‑1, and more)
- Fixes line endings – CRLF → LF
- Strips trailing whitespace from every line
- Ensures a single newline at end of file
- Skips binary files automatically
It works on Python, JavaScript, TypeScript, Go, Rust, C, C++, and Java files.
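To make the whitespace and line-ending steps concrete, here is a minimal sketch of what such a pass could look like on already-decoded text. It is an illustration, not the tool's actual code: the function name `normalize_text` is hypothetical, and collapsing extra blank lines at the end of the file and leaving empty files empty are assumptions about what "a single newline at end of file" means.

```python
def normalize_text(text: str) -> str:
    """Return text with LF endings, no trailing whitespace, one final newline.

    Hypothetical helper for illustration; not taken from code-normalizer-pro.
    """
    # CRLF -> LF, then work line by line.
    lines = text.replace("\r\n", "\n").split("\n")
    # Strip trailing spaces and tabs from every line.
    lines = [line.rstrip(" \t") for line in lines]
    # Assumption: trailing blank lines are dropped so that exactly one
    # newline terminates the file.
    while lines and lines[-1] == "":
        lines.pop()
    # Assumption: an empty (or whitespace-only) file stays empty.
    return "\n".join(lines) + "\n" if lines else ""
```

Keeping this step as a pure string-to-string function makes it easy to test separately from encoding detection and file I/O.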
Install
pip install code-normalizer-pro
Requires Python 3.10+. Core has zero dependencies beyond tqdm for progress bars.
Basic usage
# See what would change without touching anything
code-normalizer-pro /path/to/project --dry-run
# Fix everything in‑place
code-normalizer-pro /path/to/project --in-place
# Specific extensions only
code-normalizer-pro /path/to/project -e .py -e .js --in-place
Dry‑run output example
Scanning /path/to/project...
[changed] src/utils.py -- trailing whitespace (34 chars), CRLF endings
[changed] src/main.js -- encoding: windows-1252 → utf-8
[skip] assets/logo.png -- binary
[ok] tests/test_core.py
Total: 47 files | 2 changed | 1 skipped | 44 already clean
Nothing is written in dry‑run mode; you see exactly what would happen.
Parallel processing for large codebases
Sequential mode processes about 20–30 files per second – fine for a typical repo.
For anything over a few thousand files, use parallel mode:
code-normalizer-pro /path/to/project --parallel --in-place
It uses all available CPU cores by default. You can cap the workers:
code-normalizer-pro /path/to/project --parallel --workers 4 --in-place
Benchmarks (Python codebase, ~200 lines/file)
| Mode | 100 files | 500 files | 1000 files |
|---|---|---|---|
| Sequential | 3.2 s | 16.8 s | 33.5 s |
| Parallel (4 workers) | 1.1 s | 4.3 s | 7.1 s |
| Parallel (8 workers) | 0.8 s | 2.9 s | 4.8 s |
SHA‑256 caching for repeat runs
- First run processes everything and writes a .normalize-cache.json file.
- Subsequent runs skip unchanged files entirely.
code-normalizer-pro /path/to/project --cache --in-place --parallel
Second‑run output example
All discovered files were unchanged and skipped by cache.
Cached hits: 1000
Total runtime: 0.8s
This speeds up CI dramatically – a normalisation step that touches only a handful of files can finish in under a second instead of tens of seconds.
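The idea behind a hash-based skip can be sketched in a few lines. This is a sketch under assumptions, not the tool's real cache format: only the file name `.normalize-cache.json` comes from this post, while the JSON layout (resolved path mapped to a hex digest) and every helper name below are hypothetical.

```python
import hashlib
import json
from pathlib import Path

CACHE_NAME = ".normalize-cache.json"  # file name mentioned in the post


def file_digest(path: Path) -> str:
    """SHA-256 of the file's raw bytes."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def load_cache(cache_file: Path) -> dict[str, str]:
    """Read the cache, or start empty if it does not exist yet."""
    if cache_file.exists():
        return json.loads(cache_file.read_text(encoding="utf-8"))
    return {}


def is_cached(path: Path, cache: dict[str, str]) -> bool:
    """True if the file still has the digest recorded after its last run."""
    return cache.get(str(path.resolve())) == file_digest(path)


def remember(path: Path, cache: dict[str, str]) -> None:
    """Record the digest of a file that has just been normalized."""
    cache[str(path.resolve())] = file_digest(path)


def save_cache(cache_file: Path, cache: dict[str, str]) -> None:
    cache_file.write_text(json.dumps(cache, indent=2), encoding="utf-8")
```

A run would call `is_cached` before doing any work on a file and `remember` after writing it, so an edited file misses the cache simply because its bytes, and therefore its digest, have changed.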
Pre‑commit hook
Enforce standards across the team with a Git hook:
# Run once inside any git repo
code-normalizer-pro --install-hook
The command writes a pre‑commit hook that checks staged files before every commit. If any file needs normalisation, the commit is blocked and the fix command is printed:
Checking 5 staged file(s)...
Files that need normalization:
src/feature.py
src/utils.js
Run: code-normalizer-pro src/feature.py src/utils.js --in-place
Or: git commit --no-verify (to skip this check)
The developer fixes the files, re‑stages, and commits. No extra config file is required. The hook uses the Python interpreter that installed the package, so it works in virtualenvs without additional setup.
CI integration
Add a normalisation check to your pipeline in about 10 lines.
GitHub Actions example
name: Code hygiene check
on: [push, pull_request]
jobs:
  normalize-check:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.11"
      - run: pip install code-normalizer-pro
      - run: code-normalizer-pro . --dry-run --parallel
The --dry-run flag currently exits 0 regardless of violations. A --fail-on-changes flag is on the roadmap – for now you can grep the output for the “[changed]” prefix (the bare word “changed” also appears in the summary line) and exit 1 accordingly. I will document the workaround in the README until the flag ships.
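Until that flag exists, the workaround can be a small wrapper script. The sketch below assumes the dry-run report looks like the example earlier in this post, with one `[changed]` line per affected file. The function names are hypothetical, and the only CLI flags used are ones shown above.

```python
import subprocess
import sys


def has_changes(report: str) -> bool:
    """True if any report line starts with the [changed] marker.

    Matching the marker rather than the bare word avoids a false positive
    on the summary line, which also contains "changed".
    """
    return any(
        line.lstrip().startswith("[changed]") for line in report.splitlines()
    )


def exit_code_for(report: str) -> int:
    """Map a dry-run report to a CI exit code: 1 on violations, else 0."""
    return 1 if has_changes(report) else 0


def run_check(command: list[str]) -> int:
    """Run the dry-run command, echo its output, and return the exit code."""
    result = subprocess.run(command, capture_output=True, text=True)
    sys.stdout.write(result.stdout)
    return exit_code_for(result.stdout)
```

In a pipeline step, ending the script with `sys.exit(run_check(["code-normalizer-pro", ".", "--dry-run", "--parallel"]))` would fail the job whenever a file needs normalization.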
Interactive mode
If you are normalising a codebase for the first time and want to review each change before it is written:
code-normalizer-pro /path/to/project --interactive
It shows a diff for each file and waits for y / n / d (show full diff) / q (quit). This is useful when you are unsure what you are about to change in a legacy codebase.
What I learned building this
- Encoding detection is hard. UTF‑16 files without a BOM are indistinguishable from binary garbage unless you apply heuristic analysis. I ended up with a layered approach – check for a BOM first, then try a candidate list in order, then fall back to binary detection. There are still edge cases.
- ProcessPoolExecutor and in‑place writes need careful handling. When you spawn worker processes that rewrite files, you must avoid race conditions and ensure that a file is not being read while another worker is writing it. I solved this by having each worker write to a temporary file and then atomically replace the original after the worker finishes.
- Progress reporting across processes. tqdm works nicely when a single tqdm instance is fed by a multiprocessing.Manager queue, giving a smooth unified progress bar.
- Cache invalidation. Using a SHA‑256 hash of the file’s raw bytes (after normalisation) provides a reliable cache key. If the file changes, the hash changes and the cache entry is refreshed automatically.
Feel free to try it out, open issues, or contribute! 🎉
Workers, Backup Creation, and Cache Path
- Backup creation must happen before dispatch – not inside the worker. Otherwise, parallel mode silently skips backups. This is a known bug in the current release that I am fixing next.
- Cache path matters. The cache file should live next to the target directory, not in the CWD. If you run the tool from a different working directory each time, the cache never hits. This is also on the fix list.
Current State and Roadmap
Version: v3.0.1‑alpha.1
It works and I use it on my own projects daily.
Rough edges I’m actively fixing
- --parallel --in-place skips backups (data-safety issue – high priority)
- Cache file lands in CWD instead of the target directory
- --dry-run exits 0 even when violations are found (need --fail-on-changes)
- No --version flag yet
Coming next
- .gitignore pattern support (skip files the project already ignores)
- --git-staged mode (normalize only what is staged, like the pre-commit hook does)
- --fail-on-changes for CI
- --version flag
Try It and Tell Me What Is Missing
pip install code-normalizer-pro
code-normalizer-pro . --dry-run
The two things I want to know from anyone who tries it
- CI compatibility: Does it work in your CI setup? If it broke something, I want to know exactly how.
- Missing language or workflow: What language or workflow is missing that would make this useful to you?
Source is on GitHub:
If you hit a bug or have a feature request, open an issue. I respond to everything.
Built with Python 3.10+. Zero required external dependencies except tqdm.
MIT license.