Day 10: Partitioning vs Bucketing - The Spark Optimization Guide Every Data Engineer Needs
Source: Dev.to
Why Partitioning Matters in Spark
Partitioning splits a dataset into separate folders on disk based on the values of one or more columns, so Spark only reads the folders a query actually needs.
Example
df.write.partitionBy("year", "month").parquet("/sales")
This creates folders such as:
year=2024/month=01/
Benefits
- Queries that filter on partition columns can skip entire folders
- Reduced I/O
- Faster scans
- Lower compute cost
Partitioning is a core technique in all Lakehouse architectures.
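The pruning idea above can be shown with a minimal pure-Python sketch (not Spark's actual implementation): data sits in Hive-style folders, and a filter on the partition columns lets us skip whole folders without opening a single file.

```python
# Minimal sketch of partition pruning. The folder names mirror the
# year=.../month=.../ layout that partitionBy produces on disk.

folders = [
    "year=2023/month=11/",
    "year=2023/month=12/",
    "year=2024/month=01/",
    "year=2024/month=02/",
]

def prune(folders, year, month):
    """Keep only folders whose path matches the filter values."""
    wanted = f"year={year}/month={month:02d}/"
    return [f for f in folders if f == wanted]

# A filter like WHERE year = 2024 AND month = 1 touches one folder:
print(prune(folders, 2024, 1))
```

This is exactly why partition columns should match your most common filter predicates: the pruning happens before any file I/O.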
Repartition() vs Coalesce()
Repartition()
df = df.repartition("customer_id")
- Hash-partitions rows by customer_id, giving a roughly even distribution across partitions.
- Triggers a full shuffle, which is expensive on large datasets.
Coalesce()
df = df.coalesce(5)
- Reduces the number of partitions by merging existing ones, avoiding a full shuffle.
- Useful when you want fewer partitions (e.g., before writing output) but can tolerate uneven sizes.
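The difference can be illustrated with a pure-Python sketch (an assumed simplification, not Spark's internals): repartition reassigns every row by hash, while coalesce only merges whole existing partitions, so rows never move individually.

```python
# Each inner list stands in for one Spark partition.
partitions = [[1, 2], [3], [4, 5, 6], [7], [8, 9]]

def repartition(parts, n):
    """Full shuffle: every row is reassigned by hash(value) % n."""
    out = [[] for _ in range(n)]
    for part in parts:
        for row in part:
            out[hash(row) % n].append(row)
    return out

def coalesce(parts, n):
    """No full shuffle: existing partitions are merged into n groups."""
    out = [[] for _ in range(n)]
    for i, part in enumerate(parts):
        out[i % n].extend(part)  # rows move only as whole partitions
    return out

print(coalesce(partitions, 2))  # [[1, 2, 4, 5, 6, 8, 9], [3, 7]]
```

Note the uneven result of coalesce: that is the trade-off the bullet points describe, cheap reduction in partition count at the cost of balance.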
When Should You Partition Your Data?
Partition when:
- You frequently filter on the same column.
- You have time‑based data (e.g., year, month).
- You need faster analytics on large datasets.
Avoid partitioning when:
- The column has millions of unique values (high cardinality).
- Resulting files become extremely small (e.g., < 1 MB each), leading to overhead.
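A quick back-of-the-envelope sketch (with assumed, illustrative numbers) shows why high cardinality causes the small-files problem: one folder per distinct value shrinks the average file far below the 1 MB threshold mentioned above.

```python
# Illustrative numbers only: a 10 GB dataset partitioned by a column
# with 1,000,000 distinct values (e.g., a user_id-like column).

dataset_mb = 10 * 1024        # 10 GB in MB
distinct_values = 1_000_000   # cardinality of the partition column

files = distinct_values       # at least one folder/file per value
avg_file_mb = dataset_mb / files

print(f"{files} folders, ~{avg_file_mb:.3f} MB per file")
```

Millions of tiny files mean per-file open/close and metadata overhead dominates, which is exactly when bucketing (a fixed file count) is the better tool.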
What Is Bucketing, and Why Is It Powerful?
Bucketing reduces shuffle for large‑table joins by deterministically assigning rows to a fixed number of buckets.
Example
df.write.bucketBy(20, "id").sortBy("id").saveAsTable("bucketed_users")
- Hashes id into 20 buckets, so the table is written as a fixed set of bucket files.
- Enables faster joins because rows with the same key are guaranteed to land in the same bucket, letting Spark skip the shuffle when both tables are bucketed the same way.
Benefits
- Faster joins, especially on high‑cardinality columns.
- Deterministic data distribution.
- Improves performance for repeated join operations.
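Why matching rows always share a bucket can be shown with a pure-Python sketch (an assumed simplification of Spark's hash bucketing): both tables compute bucket = hash(key) % num_buckets, so a join only ever needs to look inside one bucket per key.

```python
# Both tables use the same bucketing function, as bucketBy enforces.
NUM_BUCKETS = 4

def bucket_of(key):
    return hash(key) % NUM_BUCKETS

users  = [(1, "alice"), (2, "bob"), (5, "carol")]
orders = [(1, "book"), (5, "lamp"), (2, "pen"), (1, "mug")]

# Pre-bucket the users table, as Spark does at write time.
user_buckets = {}
for key, name in users:
    user_buckets.setdefault(bucket_of(key), []).append((key, name))

# Join each order against only its own bucket -- no cross-bucket lookups.
joined = []
for key, item in orders:
    for ukey, name in user_buckets.get(bucket_of(key), []):
        if ukey == key:
            joined.append((key, name, item))

print(sorted(joined))
```

Because the bucket assignment is deterministic, repeated joins on the same key reuse the layout for free, which is the "repeated join operations" benefit above.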
Partitioning vs Bucketing: Which One Should You Use?
- Use Partitioning when you need to prune data based on filter predicates (e.g., time‑based queries).
- Use Bucketing when you want to optimize join performance on large tables, particularly when the join key has high cardinality.
Real‑World Use Case (E‑Commerce)
Sales data
- Partition by:
year, month, country
User table
- Bucket by:
user_id
When joining
- Bucketed tables enable fast joins because matching rows reside in the same bucket.
This pattern is commonly employed in Databricks Lakehouse architectures.
Summary
- Partitioning organizes data on disk to enable predicate push‑down and reduce I/O.
- Bucketing groups data into a fixed number of files to minimize shuffle during joins.
- Repartition provides even distribution with a shuffle; coalesce reduces partitions without a full shuffle.
- Choose partition keys based on query filter patterns and data volume.
- Use bucketing for high‑cardinality join keys to accelerate large joins.