🔥 Day 7: PySpark Joins, Unions, and GroupBy Guide
Source: Dev.to
Joins in PySpark — The Heart of ETL Pipelines
A join merges two DataFrames based on keys, similar to SQL.
Basic Join
df.join(df2, df.id == df2.id, "inner")
Join on Same Column Name
df.join(df2, ["id"], "left")
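One subtle difference between the two forms: joining on an expression keeps both key columns in the result, while passing the column name in a list deduplicates it. A minimal sketch, assuming an active SparkSession named spark and made-up data:

df = spark.createDataFrame([(1, "Alice"), (2, "Bob")], ["id", "name"])
df2 = spark.createDataFrame([(1, "HR"), (3, "IT")], ["id", "dept"])

df.join(df2, df.id == df2.id, "inner").columns   # ['id', 'name', 'id', 'dept'] -- duplicate key
df.join(df2, ["id"], "inner").columns            # ['id', 'name', 'dept'] -- key appears once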
Types of Joins in Spark
- inner
- left / left_outer
- right / right_outer
- full / full_outer
- left_semi – existence check (returns rows from the left DataFrame that have a match in the right)
- left_anti – anti-join (returns rows from the left DataFrame that do not have a match)
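The last two are easy to confuse; a small sketch reusing the hypothetical df and df2 from above:

# left_semi: left rows that have a match; no columns from df2 are added
df.join(df2, "id", "left_semi").show()   # keeps id=1 (Alice)

# left_anti: left rows that have no match in df2
df.join(df2, "id", "left_anti").show()   # keeps id=2 (Bob)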
Union — Stack DataFrames Vertically
Union (same schema, same order)
df.union(df2)
Union by Column Names (different order)
df.unionByName(df2)
Use unions to combine monthly files, daily ingestion datasets, or partitioned data.
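If the two DataFrames do not share the exact same set of columns, Spark 3.1+ can fill in the gaps with nulls:

# Spark 3.1+: columns missing from either side are filled with nulls
df.unionByName(df2, allowMissingColumns=True)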
GroupBy + Aggregation — Business Logic Layer
Example
from pyspark.sql.functions import sum, avg  # note: shadows Python's built-in sum

df.groupBy("dept").agg(
    sum("salary").alias("total_salary"),
    avg("age").alias("avg_age")
)
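A quick sketch of what this produces, with hypothetical employee data (emp and its rows are invented for illustration):

emp = spark.createDataFrame(
    [("IT", 90000, 30), ("IT", 70000, 40), ("HR", 60000, 35)],
    ["dept", "salary", "age"],
)
emp.groupBy("dept").agg(
    sum("salary").alias("total_salary"),
    avg("age").alias("avg_age"),
).show()
# +----+------------+-------+
# |dept|total_salary|avg_age|
# +----+------------+-------+
# |  IT|      160000|   35.0|
# |  HR|       60000|   35.0|
# +----+------------+-------+
# (row order may vary)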
count vs countDistinct
from pyspark.sql.functions import count, countDistinct

df.select(count("id"))
df.select(countDistinct("id"))
Approximate Distinct Count (faster)
from pyspark.sql.functions import approx_count_distinct

df.select(approx_count_distinct("id"))
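Under the hood this uses the HyperLogLog++ algorithm; you can trade speed for accuracy through the optional maximum relative standard deviation (rsd, default 0.05):

# Tighter error bound (slower, more memory) -- rsd value is illustrative
df.select(approx_count_distinct("id", rsd=0.01))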
Real ETL Example — Sales Aggregation
Join Sales and Products
df_joined = sales.join(products, "product_id", "left")
Aggregate Revenue by Category
from pyspark.sql.functions import sum, count

df_agg = df_joined.groupBy("category").agg(
    sum("amount").alias("total_revenue"),
    count("*").alias("transactions")
)
This pattern is exactly how business dashboards are built.
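For reference, here is some made-up input data that makes the pipeline above runnable end to end (create these first, then run the join and aggregation):

# Hypothetical input data, invented for illustration
sales = spark.createDataFrame(
    [(1, 101, 250.0), (2, 102, 120.0), (3, 101, 80.0)],
    ["sale_id", "product_id", "amount"],
)
products = spark.createDataFrame(
    [(101, "Electronics"), (102, "Books")],
    ["product_id", "category"],
)

df_agg.show()
# +-----------+-------------+------------+
# |   category|total_revenue|transactions|
# +-----------+-------------+------------+
# |Electronics|        330.0|           2|
# |      Books|        120.0|           1|
# +-----------+-------------+------------+
# (row order may vary)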
Join Performance Optimization
Broadcast Join for Small Lookup Tables
from pyspark.sql.functions import broadcast

df.join(broadcast(df_small), "id")
Broadcasting the small table to every executor avoids shuffling the large one, which makes the join much faster.
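Spark also broadcasts automatically when a table's estimated size is below spark.sql.autoBroadcastJoinThreshold (10 MB by default); the explicit broadcast() hint is useful when the optimizer's size estimate is off. A sketch of tuning the threshold (the 50 MB value is an arbitrary example):

# Raise the automatic broadcast threshold to 50 MB (value is illustrative)
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", 50 * 1024 * 1024)

# Or disable automatic broadcasting entirely
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", -1)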
Summary
- Joins (including left_semi checks)
- Union / unionByName
- groupBy with aggregations: count, countDistinct, approx_count_distinct
- Broadcast join optimization
Feel free to comment if anything is missing. Thank you!