From 3+ Days to 3.8 Hours: Scaling a .NET CSV Importer for SQL Server
The “Good Enough” Solution That Wasn’t
Every project has that one task: “Just load this massive CSV into the database once, and we’re done.”
That was my starting point. I had a 40 GB CSV file containing complex nested JSON structures (models, diagrams, hotspots, elements) that needed to be parsed and saved into a normalized SQL Server schema.
The first version of the importer was straightforward:
- Read a row.
- Parse the JSON.
- Create entities.
- Save to the DB.
It worked, but it took over 3 days (~92 hours) to finish. For a one‑time migration this was painful but acceptable—you launch it on Friday and hope it’s done by Monday.
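Roughly, the core of that first version was a loop like the following sketch (simplified for the post: `ImportDbContext` and `ModelEntity` are illustrative names, and the CSV handling is compressed into one JSON parse per line):

```csharp
// Simplified sketch of the original importer. ImportDbContext, ModelEntity,
// and the single Payload column are stand-ins for the real, richer schema.
using System.Text.Json;

public static async Task ImportSequentiallyAsync(string csvPath, ImportDbContext db)
{
    using var reader = new StreamReader(csvPath);

    string? line;
    while ((line = await reader.ReadLineAsync()) != null)
    {
        // Parse the nested JSON carried in the row.
        using JsonDocument doc = JsonDocument.Parse(line);

        // Create the entity and save it immediately: one transaction
        // and one network round-trip per row.
        db.Models.Add(new ModelEntity { Payload = doc.RootElement.GetRawText() });
        await db.SaveChangesAsync();
    }
}
```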
Then the requirements changed
The business decided this wasn’t a one‑off event. We needed to load several more files of similar size and potentially update them regularly. Suddenly, a 3‑day runtime became a blocker. Loading a queue of files would take weeks, paralyzing analytics and development. The naïve sequential importer was no longer just slow—it was unusable for the new workflow.
The Challenge: Why Was It Slow?
Parsing and inserting data sounds simple, but at scale (40 GB, millions of complex objects) the “standard” approach hits hard limits:
- Sequential processing – reading lines and parsing JSON one by one left the CPU idle while waiting for the DB, and vice‑versa.
- Database round‑trips – saving entities individually (or in tiny groups) caused massive overhead. The DB spent more time managing transactions and network calls than actually storing data.
- Memory pressure – loading full `JsonDocument` objects for every row created huge GC pressure.
- Fragility – a single error after 2 days of processing could crash the whole pipeline, forcing a restart.
The Solution: High‑Performance Architecture
To meet the new “multi‑file” requirement, I redesigned the system to be parallel, batched, and resilient.
1. Controlled Parallelism with SemaphoreSlim
Instead of a single thread, I implemented a producer‑consumer pattern using `SemaphoreSlim` to limit concurrency to 8 parallel workers.
- Why: It saturates the CPU and DB connection pool just enough to be fast without choking the server. Unbounded parallelism (`Parallel.ForEach`) would have killed database performance.
- Safety: Each worker gets its own `DbContext` via `IDbContextFactory`, ensuring thread safety without lock contention (see the sketch below).
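A minimal sketch of the pattern, assuming the factory is registered with `AddDbContextFactory`; `ProcessChunkAsync` is a placeholder for the actual parse‑and‑save work:

```csharp
using Microsoft.EntityFrameworkCore;

public sealed class ParallelImporter
{
    // At most 8 workers run at once; the rest wait at the gate.
    private readonly SemaphoreSlim _gate = new(initialCount: 8, maxCount: 8);
    private readonly IDbContextFactory<ImportDbContext> _contextFactory;

    public ParallelImporter(IDbContextFactory<ImportDbContext> contextFactory)
        => _contextFactory = contextFactory;

    public async Task RunAsync(IEnumerable<IReadOnlyList<string>> chunks)
    {
        var tasks = new List<Task>();

        foreach (var chunk in chunks)          // producer: pre-read chunks of CSV rows
        {
            await _gate.WaitAsync();           // throttle before spawning a consumer
            tasks.Add(Task.Run(async () =>
            {
                try
                {
                    // Each worker gets its own DbContext, so no locking is needed.
                    await using var db = await _contextFactory.CreateDbContextAsync();
                    await ProcessChunkAsync(chunk, db);
                }
                finally
                {
                    _gate.Release();
                }
            }));
        }

        await Task.WhenAll(tasks);
    }

    // Placeholder: the real body parses the JSON and batch-saves the entities.
    private Task ProcessChunkAsync(IReadOnlyList<string> rows, ImportDbContext db)
        => Task.CompletedTask;
}
```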
2. Batch Inserts via EF Core (The Big Win)
This was the most critical change. Instead of `context.Add(entity); context.SaveChanges();` inside the loop, the new system accumulates entities in memory and flushes them in batches of 100+ rows (sketched below).
- Impact: Reduces network round‑trips by ~100× and drastically lowers transaction‑log overhead.
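A sketch of the batching loop; `MapToEntities`, the batch size of 100, and the `ChangeTracker.Clear()` call are illustrative details rather than the project's exact code:

```csharp
// Buffer entities and flush once per batch instead of once per row.
public static async Task SaveInBatchesAsync(IEnumerable<string> rows, ImportDbContext db)
{
    const int batchSize = 100;
    var buffer = new List<ModelEntity>(batchSize);

    foreach (var row in rows)
    {
        buffer.AddRange(MapToEntities(row));   // hypothetical row -> entities mapping

        if (buffer.Count >= batchSize)
        {
            db.Models.AddRange(buffer);
            await db.SaveChangesAsync();       // one round-trip for ~100 rows
            db.ChangeTracker.Clear();          // drop tracked entities to keep memory flat
            buffer.Clear();
        }
    }

    if (buffer.Count > 0)                      // flush the tail
    {
        db.Models.AddRange(buffer);
        await db.SaveChangesAsync();
        db.ChangeTracker.Clear();
    }
}
```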
3. Architecture & SOLID Principles
To keep the code maintainable, I split the parsing logic into independent Processors, each responsible for a specific part of the JSON (e.g., `ModelProcessor`, `DiagramProcessor`).
- SRP (Single Responsibility): Each processor handles only its slice of the domain.
- DIP (Dependency Inversion): High‑level services depend on abstractions (`IEntityFactory`, `IUnitOfWork`), making the system easy to test and extend (illustrated below).
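Roughly, the abstractions take this shape (the signatures here are simplified for the post; only the names come from the design above):

```csharp
using System.Text.Json;

// Simplified signatures; the real factory builds concrete entity types.
public interface IEntityFactory
{
    object CreateDiagram(JsonElement source);
}

public interface ISectionProcessor
{
    // Each processor extracts entities for one slice of the JSON document (SRP).
    IEnumerable<object> Process(JsonElement root);
}

public sealed class DiagramProcessor : ISectionProcessor
{
    private readonly IEntityFactory _factory;   // DIP: depend on the abstraction

    public DiagramProcessor(IEntityFactory factory) => _factory = factory;

    public IEnumerable<object> Process(JsonElement root)
    {
        if (!root.TryGetProperty("diagrams", out var diagrams))
            yield break;                        // this slice is simply absent in some rows

        foreach (var diagram in diagrams.EnumerateArray())
            yield return _factory.CreateDiagram(diagram);
    }
}
```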
4. Reliability Features
- Retry policies: Up to 25 retries for transient DB errors (deadlocks, timeouts).
- Graceful degradation: If one processor fails on bad data, it logs the error and continues, rather than crashing the entire import.
- Optimized parsing: Switched to `JsonElement` and `TryGetProperty` for faster, low‑allocation JSON traversal (example below).
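An example of that traversal style; the property names (`"diagrams"`, `"name"`) are placeholders for the real schema:

```csharp
using System.Text.Json;

// Low-allocation traversal: no DTO deserialization and no exceptions
// for missing fields — absent sections are simply skipped.
public static List<string> ReadDiagramNames(string json)
{
    var names = new List<string>();

    using JsonDocument doc = JsonDocument.Parse(json);
    if (doc.RootElement.TryGetProperty("diagrams", out JsonElement diagrams) &&
        diagrams.ValueKind == JsonValueKind.Array)
    {
        foreach (JsonElement diagram in diagrams.EnumerateArray())
        {
            if (diagram.TryGetProperty("name", out JsonElement name))
                names.Add(name.GetString() ?? string.Empty);
        }
    }

    return names;
}
```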
The Results: 24× Faster
The performance jump was massive, turning a “weekend task” into a “lunch‑break task”.
| Metric | Original Version | Optimized Version | Improvement |
|---|---|---|---|
| Total Time (40 GB) | ~92 hours (3.8 days) | ~3.8 hours | ~24× |
| Throughput | 8–12 rows/sec | 192–300 rows/sec | ~25× |
| Time per 1 000 rows | 83–125 sec | 3–5 sec | ~25× |
| Parallelism | 1 thread | 8 workers | 8× |
| Memory Usage | 2 GB+ | ~400 MB | ~5× |

Key Takeaways
- Context matters: A 3‑day script is fine once. It’s fatal if you need to run it repeatedly. Always ask, “How often will we run this?”
- Batching is king: In EF Core, moving from single inserts to batching is often the single most effective performance upgrade you can make.
- Parallelism needs limits: Throwing 100 threads at SQL Server will just slow it down. Finding the “sweet spot” (e.g., 8 workers) is key.
- Resilience is a feature: When running for hours, networks will blink and deadlocks will happen. Retry policies turn crashes into minor log warnings.
Future Plans
- Add comprehensive tests (xUnit + Moq) with 85%+ coverage for all processors.
- Profile individual pipeline stages to find the next bottleneck (likely JSON‑parsing CPU time).
- Expose configuration (batch size, thread count) to adapt to different server specs dynamically.
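For the configuration item, one possible shape is the standard .NET options pattern (the names and defaults below are placeholders, not the project's actual config):

```csharp
// Planned knobs, bound from appsettings.json via the options pattern.
public sealed class ImporterOptions
{
    public int BatchSize { get; set; } = 100;
    public int MaxParallelWorkers { get; set; } = 8;
    public int MaxRetries { get; set; } = 25;
}

// Program.cs:
// builder.Services.Configure<ImporterOptions>(
//     builder.Configuration.GetSection("Importer"));
```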
Code Repository
Check out the code: [GitHub repository](https://github.com/belochka1-04/ParsCsvSaveInDb)