I built a CLI to fix the encoding/newline/whitespace noise that pollutes your diffs
Source: Dev.to
code‑normalizer‑pro
Every team I have worked on eventually hits the same invisible problem:
- Someone on Windows commits a file.
- Someone on macOS pulls it.
- The diff shows hundreds of changed lines.
Nothing actually changed – it was trailing spaces, CRLF endings, a BOM, or a file that got re‑saved in a different encoding. The code review becomes useless because the real changes are buried in whitespace noise.
I got tired of fixing this manually on every project, so I built code‑normalizer‑pro – a CLI that handles all of it in one pass.
What It Does
A single command normalizes an entire directory:
- Convert encoding to UTF‑8 (supports UTF‑16, UTF‑8‑BOM, Windows‑1252, Latin‑1, etc.)
- Convert line endings from CRLF → LF
- Strip trailing whitespace from every line
- Ensure a single newline at the end of each file
- Automatically skip binary files
Supported file types include Python, JavaScript, TypeScript, Go, Rust, C, C++, and Java.
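The three text-level steps in the list above can be illustrated with a few lines of Python. This is only a minimal sketch, not the tool's actual source: the function name `normalize_text` is made up for illustration, "trailing whitespace" is assumed to mean spaces and tabs, and an empty file is assumed to stay empty.

```python
def normalize_text(text: str) -> str:
    """Illustrative only: CRLF -> LF, strip trailing whitespace, one final newline."""
    # Convert Windows line endings to Unix line endings.
    text = text.replace("\r\n", "\n")
    # Strip trailing spaces and tabs from every line (assumed definition).
    lines = [line.rstrip(" \t") for line in text.split("\n")]
    # Drop any run of blank lines at the end, then add exactly one newline.
    body = "\n".join(lines).rstrip("\n")
    return body + "\n" if body else ""
```

Encoding conversion and binary detection happen on raw bytes before a step like this one, so they are left out here.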
Installation
pip install code-normalizer-pro
- Requires Python 3.10+.
- Core has zero dependencies beyond tqdm for progress bars.
Basic usage
# See what would change without touching anything
code-normalizer-pro /path/to/project --dry-run
# Fix everything in‑place
code-normalizer-pro /path/to/project --in-place
# Specific extensions only
code-normalizer-pro /path/to/project -e .py -e .js --in-place
Dry‑run output example
Scanning /path/to/project...
[changed] src/utils.py -- trailing whitespace (34 chars), CRLF endings
[changed] src/main.js -- encoding: windows-1252 → utf-8
[skip] assets/logo.png -- binary
[ok] tests/test_core.py
Total: 47 files | 2 changed | 1 skipped | 44 already clean
In dry‑run mode nothing is written; the output shows exactly what would happen.
Parallel Processing for Large Codebases
Sequential mode processes about 20–30 files per second – fine for a typical repository.
For anything over a few thousand files, switch to parallel mode:
code-normalizer-pro /path/to/project --parallel --in-place
By default it uses all available CPU cores. To limit the number of workers:
code-normalizer-pro /path/to/project --parallel --workers 4 --in-place
Benchmarks (Python codebase, ~200 lines/file)
| Mode | 100 files | 500 files | 1 000 files |
|---|---|---|---|
| Sequential | 3.2 s | 16.8 s | 33.5 s |
| Parallel (4 workers) | 1.1 s | 4.3 s | 7.1 s |
| Parallel (8 workers) | 0.8 s | 2.9 s | 4.8 s |
SHA‑256 Caching for Repeat Runs
- First run – processes every file and creates a .normalize-cache.json file.
- Subsequent runs – skip unchanged files entirely.
code-normalizer-pro /path/to/project --cache --in-place --parallel
Second‑Run Output Example
All discovered files were unchanged and skipped by cache.
Cached hits: 1000
Total runtime: 0.8s
This dramatically speeds up CI: a normalization step that touches only a handful of files can finish in under a second instead of tens of seconds.
Pre‑commit Hook
Enforce standards across the team with a Git hook:
# Run once inside any git repo
code-normalizer-pro --install-hook
The command writes a pre‑commit hook that checks staged files before every commit. If any file needs normalization, the commit is blocked and the fix command is printed:
Checking 5 staged file(s)...
Files that need normalization:
src/feature.py
src/utils.js
Run: code-normalizer-pro src/feature.py src/utils.js --in-place
Or: git commit --no-verify (to skip this check)
- The developer fixes the files.
- Re‑stage the changes.
- Commit again.
No extra configuration file is required. The hook uses the Python interpreter that installed the package, so it works in virtual environments without additional setup.
CI Integration
Add a normalization check to your pipeline in about 10 lines.
GitHub Actions Example
name: Code hygiene check
on: [push, pull_request]
jobs:
  normalize-check:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.11"
      - run: pip install code-normalizer-pro
      - run: code-normalizer-pro . --dry-run --parallel
- The --dry-run flag currently exits 0 regardless of violations.
- A --fail-on-changes flag is on the roadmap – for now you can grep the output for the word “changed” and exit 1 accordingly.
I will document this workaround in the README until the flag ships.
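Until that flag exists, the workaround could look something like the sketch below. It assumes the dry-run output keeps the format shown in the example earlier, with one `[changed]` line per file that needs fixing. The helper names `count_changed` and `main` are made up for illustration. Note that the sample summary line also contains the word "changed", so a plain grep for that word may match even on a clean run; matching the `[changed]` prefix avoids that.

```python
import subprocess


def count_changed(output: str) -> int:
    """Count lines that start with the [changed] marker (format assumed from the example)."""
    return sum(
        1 for line in output.splitlines() if line.lstrip().startswith("[changed]")
    )


def main() -> int:
    """Run the dry run and turn its report into an exit code for CI."""
    result = subprocess.run(
        ["code-normalizer-pro", ".", "--dry-run", "--parallel"],
        capture_output=True,
        text=True,
        check=False,
    )
    print(result.stdout, end="")
    changed = count_changed(result.stdout)
    if changed:
        print(f"{changed} file(s) need normalization")
        return 1
    return 0
```

Saved as a small script that ends with `raise SystemExit(main())`, this can replace the last `run:` step in the workflow.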
Interactive mode
If you are normalising a codebase for the first time and want to review each change before it is written, run:
code-normalizer-pro /path/to/project --interactive
The tool shows a diff for each file and waits for your input:
- y – accept the change
- n – reject the change
- d – show the full diff
- q – quit
This is useful when you are unsure what you are about to change in a legacy codebase.
What I learned building this
Encoding detection is hard. UTF‑16 files without a BOM are indistinguishable from binary garbage unless you apply heuristic analysis. I ended up with a layered approach: check for a BOM first, then try a candidate list in order, then fall back to binary detection. There are still edge cases.
ProcessPoolExecutor and in‑place writes need careful handling. When you spawn worker processes that rewrite files, you must avoid race conditions and ensure that a file is not being read while another worker is writing it. I solved this by having each worker write to a temporary file and then atomically replace the original after the worker finishes.
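The write-to-temp-then-replace pattern can be sketched like this. It is a generic version of the technique, not the tool's own code; `atomic_write` is a hypothetical name. The temporary file is created in the same directory as the target so that `os.replace` stays on one filesystem.

```python
import os
import shutil
import tempfile
from pathlib import Path


def atomic_write(path: Path, data: bytes) -> None:
    """Write data to a sibling temp file, then swap it into place."""
    fd, tmp_name = tempfile.mkstemp(
        dir=str(path.parent), prefix=path.name + ".", suffix=".tmp"
    )
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(data)
            tmp.flush()
            os.fsync(tmp.fileno())  # push the bytes to disk before the swap
        if path.exists():
            # mkstemp creates the file with restrictive permissions; keep the original mode.
            shutil.copymode(path, tmp_name)
        # Readers see either the old file or the new one, never a partial write.
        os.replace(tmp_name, path)
    except BaseException:
        # Clean up the temp file if anything went wrong before the swap.
        if os.path.exists(tmp_name):
            os.unlink(tmp_name)
        raise
```

Because each worker only ever replaces whole files, a crash in the middle of a run leaves every file either fully normalized or untouched.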
Progress reporting across processes. tqdm works nicely with a shared tqdm instance wrapped by a multiprocessing.Manager queue, giving a smooth unified progress bar.
Cache invalidation. Using a SHA‑256 hash of the file’s raw bytes (after normalisation) provides a reliable cache key. If the file changes, the hash changes and the cache entry is refreshed automatically.
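The cache idea can be illustrated with a short sketch. The on-disk layout used here (a JSON object mapping file paths to hex digests) is an assumption, and all function names are made up for the example; only the SHA‑256-of-raw-bytes key comes from the description above.

```python
import hashlib
import json
from pathlib import Path


def file_digest(path: Path) -> str:
    """SHA-256 of the file's raw bytes, as a hex string."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def load_cache(cache_file: Path) -> dict[str, str]:
    """Read the cache, or start with an empty one if the file does not exist yet."""
    if cache_file.exists():
        return json.loads(cache_file.read_text(encoding="utf-8"))
    return {}


def needs_processing(path: Path, cache: dict[str, str]) -> bool:
    """True if the file is new or its bytes differ from the cached digest."""
    return cache.get(str(path)) != file_digest(path)


def record(path: Path, cache: dict[str, str]) -> None:
    """Store the digest; call this after the file has been normalized."""
    cache[str(path)] = file_digest(path)


def save_cache(cache_file: Path, cache: dict[str, str]) -> None:
    """Write the cache back to disk as JSON."""
    cache_file.write_text(json.dumps(cache, indent=2), encoding="utf-8")
```

Recording the digest after normalization means a clean file hashes to the stored value on the next run and is skipped, while any later edit produces a different digest and triggers reprocessing.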
Feel free to try it out, open issues, or contribute! 🎉
Workers, Backup Creation, and Cache Path
Backup creation must occur before dispatch – not inside the worker. If it’s done inside the worker, parallel mode silently skips backups. This is a known bug in the current release that I’m fixing next.
Cache path matters. The cache file should reside next to the target directory, not in the current working directory (CWD). Running the tool from different working directories prevents the cache from being hit. This is also on the fix list.
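Both fixes can be sketched in a few lines. This is a hypothetical shape for the fix, not the code that will ship: the `.bak` suffix and the function names are assumptions, and the cache is assumed to live inside the target directory, which is how the bug list below describes it.

```python
import shutil
from concurrent.futures import ProcessPoolExecutor
from pathlib import Path

CACHE_NAME = ".normalize-cache.json"  # file name as given in the article


def cache_path_for(target_dir: Path) -> Path:
    """Anchor the cache to the target directory rather than the CWD."""
    return target_dir.resolve() / CACHE_NAME


def backup(path: Path) -> Path:
    """Copy the file to a sibling with a .bak suffix (suffix is an assumption)."""
    dest = path.with_name(path.name + ".bak")
    shutil.copy2(path, dest)
    return dest


def run_parallel(paths, worker, max_workers=None):
    """Back everything up in the parent process, then dispatch to workers."""
    for path in paths:
        backup(path)  # happens before any task is submitted
    with ProcessPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(worker, paths))
```

`worker` has to be a picklable, module-level function for `ProcessPoolExecutor` to run it. Doing the backups in the parent costs a little sequential time but makes them independent of what happens inside the pool.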
Current State and Roadmap
Version: v3.0.1‑alpha.1
It works and I use it on my own projects daily.
Rough edges I’m actively fixing
- --parallel --in-place skips backups (data‑safety issue – high priority)
- Cache file lands in the current working directory instead of the target directory
- --dry-run exits with status 0 even when violations are found (need --fail-on-changes)
- No --version flag yet
Coming next
- .gitignore pattern support (skip files the project already ignores)
- --git-staged mode (normalize only what is staged, like the pre‑commit hook does)
- --fail-on-changes for CI
- --version flag
Try It and Tell Me What’s Missing
pip install code-normalizer-pro
code-normalizer-pro . --dry-run
The two things I’d like to hear from anyone who tries it:
- CI compatibility – Does it work in your CI pipeline? If it breaks anything, please describe exactly what happened.
- Missing language or workflow – Which language or workflow isn’t supported yet that would make this tool useful for you?
The source code is on GitHub. If you encounter a bug or have a feature request, open an issue – I respond to everything.
Built with Python 3.10+. Zero external dependencies except tqdm.
MIT license.