Day 10: Partitioning vs Bucketing - The Spark Optimization Guide Every Data Engineer Needs

Published: December 9, 2025 at 01:48 PM EST
2 min read
Source: Dev.to

Why Partitioning Matters in Spark

Partitioning writes data into separate directories on disk based on the values of one or more columns, so queries that filter on those columns can skip everything else.

Example

df.write.partitionBy("year", "month").parquet("/sales")

This creates folders such as:

year=2024/month=01/

Benefits

  • Queries that filter on partition columns can skip entire folders (partition pruning)
  • Reduced I/O
  • Faster scans
  • Lower compute cost

Partitioning is a core technique in Lakehouse architectures.
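
A minimal end-to-end sketch of the pattern (the path, schema, and values here are illustrative assumptions, not from the original post):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("partitioning-demo").getOrCreate()

# Toy sales data; columns mirror the partitioning example above.
sales = spark.createDataFrame(
    [(2024, 1, "US", 100.0), (2024, 2, "DE", 250.0)],
    ["year", "month", "country", "amount"],
)

# One directory per (year, month) combination, e.g. year=2024/month=1/.
sales.write.mode("overwrite").partitionBy("year", "month").parquet("/tmp/sales")

# A filter on partition columns lets Spark skip whole directories;
# the pruning shows up as PartitionFilters in the physical plan.
jan = spark.read.parquet("/tmp/sales").filter("year = 2024 AND month = 1")
jan.explain()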

Repartition() vs Coalesce()

Repartition()

df = df.repartition("customer_id")
  • Hash-distributes rows across partitions by the given column(s), or to an explicit partition count, producing a roughly even distribution.
  • Always triggers a full shuffle.

Coalesce()

df = df.coalesce(5)
  • Reduces the number of partitions by merging existing ones, without a full shuffle; it can only decrease the partition count.
  • Useful when you want fewer partitions and can tolerate uneven distribution.
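
A small sketch contrasting the two (df and customer_id are carried over from the snippets above):

# repartition: full shuffle, hash-distributing rows by customer_id
# into an explicit 200 partitions (without a count,
# spark.sql.shuffle.partitions is used).
df_even = df.repartition(200, "customer_id")
print(df_even.rdd.getNumPartitions())  # 200

# coalesce: merges existing partitions without a full shuffle;
# it can only lower the count, and partition sizes may end up uneven.
df_few = df_even.coalesce(5)
print(df_few.rdd.getNumPartitions())  # 5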

When Should You Partition Your Data?

Partition when:

  • You frequently filter on the same column.
  • You have time‑based data (e.g., year, month).
  • You need faster analytics on large datasets.

Avoid partitioning when:

  • The column has millions of unique values (high cardinality).
  • Resulting files become extremely small (e.g., < 1 MB each), leading to overhead.
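
One way to sanity-check a candidate column before partitioning (customer_id is an illustrative assumption):

# A distinct count in the millions means millions of directories,
# each holding tiny files, which makes a poor partition key.
cardinality = df.select("customer_id").distinct().count()
print(f"customer_id cardinality: {cardinality}")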

What Is Bucketing, and Why Is It Powerful?

Bucketing reduces shuffle for large‑table joins by deterministically assigning rows to a fixed number of buckets.

Example

df.write.bucketBy(20, "id").sortBy("id").saveAsTable("bucketed_users")
  • Hashes rows on id into 20 buckets (each write task can emit a file per bucket, so the total file count may exceed 20).
  • Enables faster joins because matching rows are guaranteed to land in the same bucket.

Benefits

  • Faster joins, especially on high‑cardinality columns.
  • Deterministic data distribution.
  • Improves performance for repeated join operations.
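
A sketch of a shuffle-free bucketed join; bucketed_users follows the example above, while the users/orders DataFrames and the orders table are assumptions:

# Both sides bucketed on the join key with the same bucket count (20),
# so Spark can join bucket-to-bucket without an Exchange.
users.write.bucketBy(20, "id").sortBy("id").mode("overwrite").saveAsTable("bucketed_users")
orders.write.bucketBy(20, "user_id").sortBy("user_id").mode("overwrite").saveAsTable("bucketed_orders")

users_t = spark.table("bucketed_users")
orders_t = spark.table("bucketed_orders")
joined = users_t.join(orders_t, users_t.id == orders_t.user_id)

# The physical plan should show no Exchange under the SortMergeJoin.
joined.explain()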

Partition vs Bucket — Which One Should You Use?

  • Use Partitioning when you need to prune data based on filter predicates (e.g., time‑based queries).
  • Use Bucketing when you want to optimize join performance on large tables, particularly when the join key has high cardinality.

Real‑World Use Case (E‑Commerce)

Sales data

  • Partition by: year, month, country

User table

  • Bucket by: user_id

When joining

  • Bucketed tables enable fast joins because matching rows reside in the same bucket.

This pattern is commonly employed in Databricks Lakehouse architectures.
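
A compact sketch of this layout (paths, schemas, and the 32-bucket count are illustrative assumptions):

# Sales: partitioned so time- and geo-filtered queries prune directories.
sales.write.mode("overwrite") \
    .partitionBy("year", "month", "country") \
    .parquet("/tmp/sales_partitioned")

# Users: bucketed on the join key so repeated joins against other
# tables bucketed on user_id skip the shuffle.
users.write.mode("overwrite") \
    .bucketBy(32, "user_id").sortBy("user_id") \
    .saveAsTable("users_bucketed")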

Summary

  • Partitioning organizes data on disk to enable partition pruning and reduce I/O.
  • Bucketing groups data into a fixed number of files to minimize shuffle during joins.
  • Repartition provides even distribution with a shuffle; coalesce reduces partitions without a full shuffle.
  • Choose partition keys based on query filter patterns and data volume.
  • Use bucketing for high‑cardinality join keys to accelerate large joins.