🔥 Day 7: PySpark Joins, Unions, and GroupBy Guide

Published: December 8, 2025 at 05:46 AM EST
1 min read
Source: Dev.to

Joins in PySpark — The Heart of ETL Pipelines

A join merges two DataFrames based on keys, similar to SQL.

Basic Join

df.join(df2, df.id == df2.id, "inner")

Join on Same Column Name

df.join(df2, ["id"], "left")
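
A minimal, runnable sketch of both forms (the employee data here is made up for illustration):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame([(1, "Alice"), (2, "Bob"), (3, "Cara")], ["id", "name"])
df2 = spark.createDataFrame([(1, "HR"), (2, "Engineering")], ["id", "dept"])

# Expression form: both id columns survive the join, so drop one to avoid ambiguity
df.join(df2, df.id == df2.id, "inner").drop(df2.id).show()

# List-of-names form: Spark collapses the join key into a single id column
df.join(df2, ["id"], "left").show()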

Types of Joins in Spark

  • inner
  • left / left_outer
  • right / right_outer
  • full / full_outer
  • left_semi – existence check: keeps rows from the left DataFrame that have a match in the right, returning only the left columns
  • left_anti – the opposite: keeps rows from the left DataFrame that have no match in the right (see the sketch below)
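
A short sketch contrasting the two, reusing the spark session from above with hypothetical orders/customers data:

orders = spark.createDataFrame(
    [(1, 100), (2, 101), (3, 999)], ["order_id", "customer_id"]
)
customers = spark.createDataFrame([(100,), (101,)], ["customer_id"])

# left_semi: orders whose customer exists (order_ids 1 and 2)
orders.join(customers, "customer_id", "left_semi").show()

# left_anti: orders with no matching customer (order_id 3)
orders.join(customers, "customer_id", "left_anti").show()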

Union — Stack DataFrames Vertically

Union (same schema, same order)

df.union(df2)

Union by Column Names (different order)

df.unionByName(df2)

Use unions to combine monthly files, daily ingestion datasets, or partitioned data.
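
A small sketch of the difference, using hypothetical monthly files (since Spark 3.1, unionByName also accepts allowMissingColumns=True when the schemas don't fully overlap):

jan = spark.createDataFrame([(1, "a")], ["id", "val"])
feb = spark.createDataFrame([("b", 2)], ["val", "id"])

jan.union(feb).show()        # positional: silently pairs id with val -- wrong here
jan.unionByName(feb).show()  # matches on column names: correct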

GroupBy + Aggregation — Business Logic Layer

Example

df.groupBy("dept").agg(
    sum("salary").alias("total_salary"),
    avg("age").alias("avg_age")
)

count vs countDistinct

from pyspark.sql.functions import count, countDistinct

df.select(count("id"))
df.select(countDistinct("id"))

Approximate Distinct Count (faster)

from pyspark.sql.functions import approx_count_distinct

df.select(approx_count_distinct("id"))
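
A quick sketch on duplicated ids (approx_count_distinct is backed by HyperLogLog++, trading a small, tunable error for speed on large data):

from pyspark.sql.functions import approx_count_distinct, count, countDistinct

ids = spark.createDataFrame([(1,), (1,), (2,), (3,)], ["id"])

ids.select(
    count("id").alias("rows"),                              # 4
    countDistinct("id").alias("distinct"),                  # 3
    approx_count_distinct("id", rsd=0.05).alias("approx"),  # ~3
).show()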

Real ETL Example — Sales Aggregation

Join Sales and Products

df_joined = sales.join(products, "product_id", "left")

Aggregate Revenue by Category

from pyspark.sql.functions import count, sum

df_agg = df_joined.groupBy("category").agg(
    sum("amount").alias("total_revenue"),
    count("*").alias("transactions")
)

This join-then-aggregate pattern is the backbone of most business dashboards.
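
Put together as one runnable sketch (the sales and product rows are invented for illustration):

from pyspark.sql import SparkSession
from pyspark.sql.functions import count, sum

spark = SparkSession.builder.getOrCreate()

sales = spark.createDataFrame(
    [(1, 101, 20.0), (2, 102, 35.5), (3, 101, 20.0)],
    ["sale_id", "product_id", "amount"],
)
products = spark.createDataFrame(
    [(101, "Books"), (102, "Games")],
    ["product_id", "category"],
)

df_joined = sales.join(products, "product_id", "left")

df_agg = df_joined.groupBy("category").agg(
    sum("amount").alias("total_revenue"),
    count("*").alias("transactions"),
)
df_agg.show()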

Join Performance Optimization

Broadcast Join for Small Lookup Tables

from pyspark.sql.functions import broadcast

df.join(broadcast(df_small), "id")

Broadcasting the small table to every executor avoids shuffling the large one, which makes the join much faster.
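
To verify the hint took effect (assuming the join above is bound to a variable):

result = df.join(broadcast(df_small), "id")
result.explain()  # the physical plan should show BroadcastHashJoin

Spark also broadcasts automatically when it estimates one side to be smaller than spark.sql.autoBroadcastJoinThreshold (10 MB by default); the explicit broadcast() hint is useful when that size estimate is off.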

Summary

  • Joins (including left_semi checks)
  • Union / unionByName
  • groupBy with aggregations
  • count, countDistinct, approx_count_distinct
  • Broadcast join optimization

Feel free to comment if anything is missing. Thank you!
