Back Up Multiple Drives to Backblaze with Deduplication – Introducing b2-dedup
Introduction
Backing up terabytes of data across multiple drives, NAS boxes, or different computers? You want everything in one safe, off‑site place without paying a fortune for redundant uploads or duplicate storage.
That’s exactly why I built b2-dedup — a parallel, streaming deduplicating uploader tailored for Backblaze B2.
What You’ll Learn
- Why Backblaze B2 is (in my opinion) the smartest choice for personal or large‑scale backups vs. AWS S3, Azure Blob, etc.
- How deduplication across drives saves you time, bandwidth, and money
- Step‑by‑step setup and usage of b2-dedup
Backblaze B2 vs. Other Cloud Storages (2025‑2026)
| Service | Storage Cost (≈) | Egress Cost |
|---|---|---|
| B2 | ~$6 /TB / month (recently adjusted from $5/TB) | Free egress up to 3× your monthly stored average (unlimited free to many CDNs/partners like Cloudflare, Fastly, etc.) |
| AWS S3 Standard | ~$23 /TB / month (first 50 TB tier) | $0.08–$0.09 / GB after free tier (restoring 1 TB ≈ $80) |
| Azure Blob Hot | Similar ballpark to S3 (~$18–$23 /TB) | Same as S3 |
Bottom line: B2 is roughly 1/4 to 1/5 the price of the other always‑hot, instantly accessible options.
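To put rough numbers on that, here's a back‑of‑the‑envelope comparison using the approximate list prices above (estimates, not quotes — exact figures vary by region and tier):

```python
# Back-of-the-envelope comparison for 2 TB stored, using the rough
# list prices from the table above (estimates, not quotes).
stored_tb = 2
b2_monthly = stored_tb * 6    # ~$6 / TB / month on B2       -> $12
s3_monthly = stored_tb * 23   # ~$23 / TB / month on S3 Std  -> $46

print(f"B2: ${b2_monthly}/mo (~${b2_monthly * 12}/yr)")
print(f"S3: ${s3_monthly}/mo (~${s3_monthly * 12}/yr)")

# One full 2 TB restore:
#   B2: free (well within 3x the monthly stored average)
#   S3: roughly 2 x $80 = $160 in egress at the rates above
```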
Other Gotchas
- No upload fees, delete penalties, minimum file sizes, or hidden API‑call charges that sting on large backups (B2 keeps Class A calls free).
- S3‑compatible API → works with rclone, restic, Veeam, etc.
- No complex storage tiers/classes to accidentally get stuck in (unless you deliberately use Glacier/Archive for cold data).
For personal users, homelab hoarders, photographers/videographers, or small businesses doing off‑site backups, B2 wins on predictable low cost + sane egress.
Why a New Tool? – The Need for Cross‑Drive Deduplication
When you back up multiple drives (e.g., main PC SSD, external HDDs, media NAS), you often have tons of duplicate files — same photos, movies, installers, OS images across machines.
Standard tools (rclone, Duplicati, etc.) usually deduplicate within one backup job, but not across entirely separate sources.
b2-dedup fixes that:
- Uses a local SQLite DB (~/b2_dedup.db) to remember SHA‑256 hashes of every file it has ever seen.
- When you point it at Drive #2, it skips anything already uploaded from Drive #1 (see the sketch after this list).
- Parallel uploads (default 10 workers, tunable) + streaming chunked uploads → low memory, high speed.
- Resumable — interrupted jobs pick up where they left off.
- Scan‑only / dry‑run modes for safety.
Result: One B2 bucket, many “drive‑name” prefixes (e.g., PC2025/, MediaNAS/, Laptop/) — but real storage usage is minimized because duplicates aren’t re‑uploaded.
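The core of this is small enough to sketch. Here's a simplified illustration of the hash‑DB check — it is not the actual b2_dedup.py source; the schema and function names are made up for the example:

```python
import hashlib
import sqlite3
from pathlib import Path

# Simplified sketch of the dedup check (not the real b2-dedup code).
# Table and column names here are illustrative only.
db = sqlite3.connect(Path.home() / "b2_dedup.db")
db.execute("CREATE TABLE IF NOT EXISTS files (sha256 TEXT PRIMARY KEY, b2_path TEXT)")

def file_sha256(path: Path, chunk_size: int = 1024 * 1024) -> str:
    """Hash the file in 1 MiB chunks so memory stays flat even for huge files."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        while chunk := f.read(chunk_size):
            h.update(chunk)
    return h.hexdigest()

def should_upload(path: Path, drive_name: str) -> bool:
    """Return False if a file with identical content was already uploaded from any drive."""
    digest = file_sha256(path)
    if db.execute("SELECT 1 FROM files WHERE sha256 = ?", (digest,)).fetchone():
        return False  # duplicate content: skip it, no matter which drive it came from
    db.execute(
        "INSERT INTO files (sha256, b2_path) VALUES (?, ?)",
        (digest, f"{drive_name}/{path.name}"),
    )
    db.commit()
    return True
```

That's the whole idea: the content hash is the identity, the drive name is just a prefix.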
Prerequisites
- Python 3.8+
- A Backblaze B2 account + bucket created
- B2 Application Key (KeyID + Application Key) — generate one with Read + Write access to your bucket
Installation
# Clone the repository
git clone https://github.com/n0nag0n/b2-dedup.git
cd b2-dedup
# Install Python dependencies
pip install -r requirements.txt
# Install the official B2 CLI (optional but recommended)
pip install b2
b2 account authorize # follow prompts with your KeyID + App Key
b2-dedup will automatically use those credentials.
(Alternatively, export environment variables: B2_KEY_ID and B2_APPLICATION_KEY.)
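If you ever want to script against the same bucket yourself, the b2sdk library (which the official CLI is built on) can pick up those variables like this. Purely illustrative — b2-dedup does its own credential handling:

```python
import os
from b2sdk.v2 import B2Api, InMemoryAccountInfo

# Illustrative only: read the env vars named above and authorize with b2sdk.
key_id = os.environ["B2_KEY_ID"]
app_key = os.environ["B2_APPLICATION_KEY"]

api = B2Api(InMemoryAccountInfo())
api.authorize_account("production", key_id, app_key)

bucket = api.get_bucket_by_name("my-backup-bucket-123")  # example bucket name from the usage section
print("Authorized, bucket:", bucket.name)
```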
Usage Examples
1️⃣ First drive (baseline)
# Optional: just scan & hash everything first (no upload)
python b2_dedup.py /mnt/primary-drive \
--drive-name PrimaryPC \
--bucket my-backup-bucket-123 \
--scan-only
# Then do the real upload
python b2_dedup.py /mnt/primary-drive \
--drive-name PrimaryPC \
--bucket my-backup-bucket-123
2️⃣ Second (or Nth) drive — duplicates are skipped!
python b2_dedup.py /mnt/media-drive \
--drive-name MediaNAS \
--bucket my-backup-bucket-123 \
--workers 20
Pro tip: Dry‑run first to preview
python b2_dedup.py /mnt/media-drive \
--drive-name MediaNAS \
--bucket my-backup-bucket-123 \
--dry-run
Useful Flags
| Flag | Description |
|---|---|
| --workers N | Number of parallel upload workers (default 10). Increase if your internet/upload can handle it (sketch below). |
| --dry-run | Show what would be uploaded without actually sending data. |
| --scan-only | Build/populate the hash DB without touching B2. |
| --refresh-count | Force a re‑count of files (useful if the source changed a lot). |
| --drive-name NAME | Prefix used inside the bucket (e.g., PrimaryPC/). Change it when you rename or reorganize drives. |
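For what it's worth, the --workers behaviour maps onto the standard Python worker‑pool pattern. A minimal sketch (not the tool's actual code; upload_one is a hypothetical stand‑in for the hash‑check‑then‑upload step):

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

def upload_one(path):
    # Stand-in: in the real tool, the hash check and B2 upload happen here.
    return path

def upload_all(paths, workers=10):
    # One thread per in-flight upload; more workers means more parallel transfers.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [pool.submit(upload_one, p) for p in paths]
        for fut in as_completed(futures):
            print("done:", fut.result())

# Example with made-up paths:
upload_all(["/mnt/media-drive/a.mkv", "/mnt/media-drive/b.iso"], workers=20)
```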
How Deduplication Works
- Prefix = --drive-name (e.g., PrimaryPC/Documents/report.docx).
- Deduplication happens on content hash: identical files are stored only once, regardless of path or name.
- The DB lives at ~/b2_dedup.db. Back it up! (It’s tiny, but losing it means re‑hashing everything.)
For very large initial scans, start with --scan-only overnight, then run the upload.
Combining with Other Backup Tools
b2-dedup is purely for initial / incremental deduped uploads.
You can combine it with rclone, Borg, restic, etc., for versioning or additional features.
Benefits Recap
- One cheap, durable off‑site location
- Cross‑drive deduplication to slash upload time & storage bills
- Parallel, resumable, low‑memory operation
I’ve been running similar setups for years — it’s rock‑solid for hoarding photos, videos, ISOs, and irreplaceable documents.
Get Started
- Repository: https://github.com/n0nag0n/b2-dedup
- Star the repo if you find it useful.
- Open an issue or PR for questions, bugs, or contributions.
Happy (deduplicated) backing up! 🚀