How to Size a Spark Cluster. And How Not To.
Source: Dev.to
Interview Question
You need to process 1 TB of data in Spark. How do you size the cluster?
Most interview answers start with a simple division:
1 TB → choose 128 MB partitions → ~8 000 partitions → map to cores → decide number of nodes
That approach is clean and logical, but it is also incomplete.
Cluster size is not derived from the raw data size alone; it is driven by the workload behavior.
1. What does “1 TB” really mean?
| Aspect | Why it matters |
|---|---|
| Compressed Parquet in object storage | Tells you only about storage efficiency, not the runtime footprint. |
| Partition pruning | Skips whole directory partitions based on filters. |
| Predicate push‑down | Pushes filters to the storage layer so only matching row groups are read. |
| Column pruning | Reads only the columns that are needed. |
Example: A 1 TB table partitioned by date may result in a 150‑300 GB scan for a single‑day query.
Cluster sizing must be based on the actual scan size, not the table size.
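As a back-of-the-envelope sketch, the effective scan size can be estimated by multiplying the table size by pruning factors. The fractions below are illustrative assumptions (a query over one quarter, reading 80% of the columns), not measurements:

```python
def effective_scan_size_gb(table_size_gb, partition_fraction,
                           column_fraction, rowgroup_fraction=1.0):
    """Estimate the bytes Spark actually reads after partition pruning,
    column pruning, and predicate push-down (row-group skipping)."""
    return table_size_gb * partition_fraction * column_fraction * rowgroup_fraction

# Hypothetical query against a 1 TB date-partitioned table: the filter
# keeps one quarter of the partitions and the projection keeps 80% of columns.
scan = effective_scan_size_gb(1000, partition_fraction=0.25, column_fraction=0.8)
```

This lands in the 150-300 GB range mentioned above; the point is that the sizing input is `scan`, not the 1 TB table size.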
2. Runtime expansion of columnar data
- On‑disk: Parquet is compressed & encoded.
- In‑memory: Decompressed, decoded, and materialised into Spark’s internal row format.
A 1 TB compressed dataset can inflate to 2‑4 TB across executors during processing, affecting:
- Executor memory sizing
- Spill probability
- GC pressure
- Memory‑overhead configuration
Note: Disk size is rarely the right sizing anchor; the in-memory footprint is.
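A rough way to translate the expansion factor into per-executor memory demand. All numbers here are assumptions for illustration (expansion factor, executor count, the fraction of data live at once, and the overhead fraction):

```python
def min_executor_memory_gb(inflated_gb, num_executors,
                           live_fraction=1.0, overhead_fraction=0.10):
    """Rough per-executor memory needed to hold its share of the inflated
    data that is live at once, plus a memory-overhead allowance."""
    share = inflated_gb * live_fraction / num_executors
    return share * (1 + overhead_fraction)

# 1 TB on disk inflating 3x -> 3 TB in memory; 50 executors;
# assume ~20% of the inflated data is live at any moment.
mem = min_executor_memory_gb(3000, 50, live_fraction=0.2)
```

The useful part is the shape of the calculation: executor memory is driven by the inflated, concurrently-live footprint, not by the on-disk bytes.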
3. Spark’s execution model
Spark runs a DAG of stages separated by shuffles.
A 1 TB job might look like:
- Filter → 400 GB
- Join & expand → 2.5 TB shuffle
- Aggregate → 50 GB
Spark's resource needs are driven not by the input size but by the largest intermediate state it must shuffle, sort, or spill.
If a join explodes to 2.5 TB, that is your sizing baseline.
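The pipeline above reduces to a simple rule: take the max over stage sizes, then derive shuffle partitions from that. A sketch with the hypothetical stage sizes from the example:

```python
import math

# Intermediate state per stage in GB, from the hypothetical 1 TB job above.
stages = {"scan": 1000, "filter": 400, "join_shuffle": 2500, "aggregate": 50}

baseline_gb = max(stages.values())  # 2500 GB: the join explosion dominates
target_partition_mib = 128          # common target partition size
shuffle_partitions = math.ceil(baseline_gb * 1024 / target_partition_mib)
```

Sizing from the 1 TB input would undershoot the join stage by 2.5x.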
4. Variability & tail‑risk
- Stable: 1 TB every day.
- Fluctuating: 800 GB on normal days, 1.4 TB at quarter-end.
Production systems fail at the 95th‑percentile load, not the mean.
Size for the tail, not the average.
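The tail is easy to compute from historical input sizes. A sketch with illustrative daily volumes (the quarter-end spike is the assumption being tested):

```python
import statistics

# Daily input sizes in GB over one quarter: mostly steady, spiking at quarter-end.
daily_gb = [800] * 85 + [1000, 1100, 1200, 1300, 1400]

p95 = statistics.quantiles(daily_gb, n=100)[94]  # 95th-percentile load
mean = statistics.mean(daily_gb)
```

A cluster sized for `mean` (~822 GB here) fails every quarter-end; sizing should target `p95` or the observed peak.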
5. Spark’s design assumptions (and when they break)
| Assumption | What happens when it fails |
|---|---|
| Data can be evenly partitioned | Skew creates hot partitions → bottlenecks |
| Most transformations are narrow | Wide shuffles become expensive |
| Network is slower than CPU | Adding cores to a network‑saturated node yields no benefit |
| Memory is finite | OOM, excessive spilling, GC pressure |
When these assumptions hold, Spark scales predictably.
When they don’t, adding nodes does not fix the root cause.
6. Workload‑centric bottleneck classification
6.1 CPU‑bound (heavy UDFs, encryption, compression)
| Signals | Action |
|---|---|
| High CPU utilisation, low spill, minimal shuffle wait | Scale cores & use compute‑optimised instances |
6.2 Memory‑bound (large joins, wide aggregations, caching)
| Signals | Action |
|---|---|
| High spill metrics, high GC time, executor OOM | Increase executor memory or reduce per‑task footprint |
6.3 I/O‑bound (object‑store reads, small files, slow disks)
| Signals | Action |
|---|---|
| Low CPU utilisation, high file‑open overhead, high task deserialization time | Fix file layout & compaction before scaling compute |
6.4 Shuffle‑heavy
| Signals | Action |
|---|---|
| High shuffle‑read fetch wait, low CPU during reduce stage, executors waiting on remote blocks | Network bandwidth per node is fixed – add nodes (not cores) or reduce shuffle volume upstream. |
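The four signal tables can be collapsed into a decision sketch. The thresholds below are illustrative assumptions, not Spark defaults; in practice you would read these metrics from the Spark UI or the event log:

```python
def classify_bottleneck(cpu_util, spill_gb, gc_fraction, fetch_wait_fraction):
    """Toy classifier mirroring the signal tables above.
    Thresholds are illustrative, not prescriptive."""
    if fetch_wait_fraction > 0.3:        # tasks mostly waiting on remote shuffle blocks
        return "shuffle"
    if spill_gb > 0 or gc_fraction > 0.15:  # spilling or GC-dominated executors
        return "memory"
    if cpu_util > 0.8:                   # busy cores, little waiting
        return "cpu"
    return "io"                          # low CPU with none of the above: likely I/O

kind = classify_bottleneck(cpu_util=0.35, spill_gb=0,
                           gc_fraction=0.05, fetch_wait_fraction=0.45)
```

The order matters: shuffle wait masquerades as low CPU, so it must be ruled out before concluding the job is I/O-bound.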
7. Common “1 TB” pitfalls
- Ignoring shuffle multiplier – Shuffle volume can be 2‑3× the input size.
- Hot keys – A single hot key can create a 200 GB partition, turning one executor into a bottleneck.
- Skew – Uneven data distribution collapses parallelism.
How to detect skew in Spark UI
- Compare max task duration vs. median.
- Check shuffle read size per task.
- Look for a reducer processing a disproportionate amount of data.
If one task runs 10× longer than the others, you have a distribution problem, not a cluster‑size problem.
Mitigation strategies
- Salting hot keys
- Pre‑aggregation before join
- Broadcast joins when feasible
- Increase shuffle partitions for moderate skew
- Redesign data model
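Salting is the least obvious item on that list, so here is a language-agnostic simulation of the idea (pure Python, no Spark required): appending a random salt to a hot key spreads it across shuffle partitions. In a real join, the other side must be expanded over the same salt range:

```python
import random
from collections import Counter

def partition_counts(keys, num_partitions, salt_buckets=1):
    """Simulate hash partitioning with optional key salting.
    With salt_buckets > 1, a hot key is split across several partitions."""
    counts = Counter()
    for k in keys:
        salted = (k, random.randrange(salt_buckets))
        counts[hash(salted) % num_partitions] += 1
    return counts

random.seed(0)
keys = ["hot"] * 9000 + [f"k{i}" for i in range(1000)]  # one dominant key
skewed = partition_counts(keys, 200)                    # no salting
salted = partition_counts(keys, 200, salt_buckets=32)   # hot key spread out
```

Without salting, all 9 000 "hot" rows land in one partition (the 10×-slower task from above); with 32 salt buckets, the largest partition shrinks by roughly an order of magnitude.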
8. Spill & disk throughput
When execution memory fills during shuffle or sort, Spark spills to local disk, making disk throughput the new bottleneck.
Symptoms of slow local disks
- Increased task duration
- Longer executor lifetimes
- Higher GC pressure
- Non‑linear stage slowdown
How to identify
- High spill metrics
- Growing task duration during shuffle stages
- Elevated GC time
Mitigation
- Increase executor memory
- Reduce per‑task partition size
- Increase shuffle partitions
- Use faster local disks (NVMe)
- Reduce shuffle footprint upstream
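Most of these levers map to a handful of Spark properties. The values below are illustrative starting points under the assumptions of this article, not recommendations:

```properties
# Per-executor memory and off-heap overhead: tune to the inflated
# in-memory footprint, not the on-disk size
spark.executor.memory=24g
spark.executor.memoryOverhead=4g

# More, smaller shuffle partitions reduce per-task footprint and spill
spark.sql.shuffle.partitions=2000

# Let AQE coalesce partitions when the static setting overshoots
spark.sql.adaptive.enabled=true
```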
9. Data layout matters
| Question | Impact |
|---|---|
| Where does the 1 TB live? | 5 large Parquet files vs. 800 k small files vs. correctly partitioned data |
| Is the data clustered on join keys? | Affects shuffle cost dramatically |
- Small files → higher task‑scheduling overhead, file‑listing latency, driver pressure.
- Poor partitioning → larger scan size.
- Wrong clustering → higher shuffle cost.
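A rough model of why file count matters as much as byte count. This is a simplification (Spark SQL can pack small files into splits, but listing, scheduling, and file-open costs still scale with file count), and the numbers are the hypothetical ones from the table above:

```python
import math

def read_tasks(total_gb, file_count, target_split_mib=128):
    """Approximate read-task count: splits cannot be larger than a file,
    so tiny files mean tiny tasks (illustrative model, ignores file packing)."""
    avg_file_mib = total_gb * 1024 / file_count
    if avg_file_mib >= target_split_mib:
        return math.ceil(total_gb * 1024 / target_split_mib)
    return file_count  # roughly one task per small file

well_laid_out = read_tasks(1000, 5_000)    # ~200 MiB files -> split-sized tasks
small_files = read_tasks(1000, 800_000)    # ~1.3 MiB files -> 800k tiny tasks
```

Same 1 TB, a 100× difference in task count: that overhead hits the driver and the scheduler before any executor does useful work.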
10. The “right” answer to “How big should the cluster be?”
Sometimes the correct answer is: Fix the data layout first.
Only after the data is well‑partitioned, compacted, and appropriately clustered does it make sense to talk about the number of nodes, cores, and memory per executor.
TL;DR Checklist
- Determine actual scan size (pruning, push‑down, column selection).
- Estimate runtime data expansion (typically 2‑4× the compressed size).
- Identify the largest intermediate state (shuffle, join, aggregation).
- Classify the bottleneck (CPU, memory, I/O, shuffle).
- Validate data layout (file size, partitioning, clustering).
- Size memory & cores based on steps 1‑4, not on raw 1 TB.
- Plan for tail‑risk (95th‑percentile load).
By following this systematic approach you’ll give a complete, production‑ready answer that goes far beyond “1 TB ÷ 128 MB = 8 000 partitions”.
Cluster Sizing – A Structured Approach
1. Why SLA‑first sizing matters
- SLA drives the math – you cannot size a cluster for a 20‑minute completion if the SLA is 2 hours, and vice versa.
- Throughput‑centric equations:
  - `Required throughput = Peak data volume / SLA window`
  - `Node count = Required throughput / Per‑node effective throughput`
Key point: Cluster sizing is a throughput problem, not a storage problem.
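A worked instance of the two equations. The peak volume comes from the tail-risk discussion earlier; the per-node throughput is an assumption you would measure on a benchmark run:

```python
import math

peak_gb = 1400               # quarter-end peak (tail, not mean)
sla_minutes = 60             # the job must finish within one hour
per_node_gb_per_min = 2.5    # measured effective throughput per node (assumption)

required_gb_per_min = peak_gb / sla_minutes               # ~23.3 GB/min
node_count = math.ceil(required_gb_per_min / per_node_gb_per_min)
```

Halving the SLA doubles the required throughput and (to first order) the node count; data size alone never tells you that.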
2. Shared‑cluster realities
| Resource | Reality on a shared cluster |
|---|---|
| CPU cores | You do not get full cores |
| Memory | You do not get full memory |
| Shuffle service | Shared among tenants |
| Concurrency | Directly impacts availability |
Ignoring these factors makes isolated “cluster math” inaccurate in practice.
3. The full set of questions to answer
- Peak intermediate size
- Bottleneck type (CPU, memory, shuffle, I/O, network)
- Shuffle volume
- Storage throughput
- SLA target
- Input variance (size, schema, skew)
- Isolation model (dedicated vs. shared)
4. From answers to concrete numbers
| Calculation | What you derive |
|---|---|
| Target partition size | Desired size per partition (e.g., 128 MiB) |
| Required partitions | ceil(Peak intermediate size / Target partition size) |
| Required concurrent tasks | Required partitions / acceptable number of task waves |
| Executors per node | Based on CPU cores & memory per node |
| Memory per executor | (Node memory – overhead) / Executors per node |
| Node count | ceil(Required concurrent tasks / (Executors per node × Cores per executor)) |
Result: A mathematically grounded cluster configuration.
Without answering the questions above, any sizing is pure guesswork.
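The table's calculations, as one sketch function. Every default below is an assumption chosen for illustration (partition size, wave count, node shape); the peak intermediate size of 2.5 TB is the join explosion from the earlier example:

```python
import math

def size_cluster(peak_intermediate_gb, target_partition_mib=128,
                 waves=3, cores_per_executor=4, executors_per_node=2,
                 node_memory_gb=128, overhead_gb=16):
    """Turn the sizing table into concrete numbers (all inputs are assumptions)."""
    required_partitions = math.ceil(peak_intermediate_gb * 1024 / target_partition_mib)
    concurrent_tasks = math.ceil(required_partitions / waves)
    memory_per_executor_gb = (node_memory_gb - overhead_gb) / executors_per_node
    cores_per_node = cores_per_executor * executors_per_node
    node_count = math.ceil(concurrent_tasks / cores_per_node)
    return {"partitions": required_partitions,
            "concurrent_tasks": concurrent_tasks,
            "memory_per_executor_gb": memory_per_executor_gb,
            "nodes": node_count}

plan = size_cluster(peak_intermediate_gb=2500)
```

Change any assumption (smaller partitions, more waves, fatter nodes) and the node count moves, which is exactly why the questions in section 3 must be answered first.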
5. “How do you size a cluster for 1 TB?” – the right answer
I don’t size clusters on raw data size.
I size them on peak intermediate state, the dominant bottleneck, and the SLA constraints.
Data size is only the starting point; workload behavior determines the final cluster.
6. Modern Databricks Runtime (Spark 4.x) – what changes?
| Databricks feature | What it does |
|---|---|
| Adaptive Query Execution (AQE) (enabled by default) | Coalesces shuffle partitions, mitigates moderate skew |
| Photon | Reduces CPU pressure for SQL/DataFrame workloads |
| Delta Lake layout strategies | Reduce scan inefficiency & small‑file overhead |
| OPTIMIZE | Compacts small files |
| Z‑ORDER | Improves multi‑column data locality |
| Liquid Clustering | Replaces static partitioning & Z‑ORDER with dynamic clustering |
| Predictive Optimization | Automates compaction & maintenance |
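The layout levers in the table are single SQL commands on Databricks. The table and column names below are hypothetical placeholders:

```sql
-- Compact small files in a Delta table
OPTIMIZE events;

-- Co-locate data on common filter/join columns (for tables not on Liquid Clustering)
OPTIMIZE events ZORDER BY (user_id, event_date);
```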
Benefits
- File compaction
- Data skipping
- Faster reads
- Lower metadata overhead
What still remains a challenge
- Shuffle cost
- Skew
- Network ceilings
- Spill behavior
- Peak intermediate pressure
Takeaway: On Databricks, cluster sizing is often the last lever, not the first. Abstractions help, but they don’t erase the underlying distributed‑systems physics.
7. Looking ahead
In the next post we’ll explore serverless Spark – what changes when the cluster “disappears,” and how responsibility shifts without altering the fundamental constraints.