🔥 Day 7: PySpark Joins, Unions, and GroupBy Guide
Source: Dev.to
Joins in PySpark — The Heart of ETL Pipelines
A join merges two DataFrames based on keys, similar to SQL.
Basic Join
df.join(df2, df.id == df2.id, "inner")
Join on Same Column Name
df.join(df2, ["id"], "left")
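One subtle difference between the two forms: joining on an expression keeps both key columns in the result, while passing the column name in a list deduplicates it. A minimal sketch, assuming an active SparkSession named spark and made-up data:

df = spark.createDataFrame([(1, "Alice"), (2, "Bob")], ["id", "name"])
df2 = spark.createDataFrame([(1, "HR"), (3, "IT")], ["id", "dept"])

df.join(df2, df.id == df2.id, "inner").columns   # ['id', 'name', 'id', 'dept'] -- duplicate key
df.join(df2, ["id"], "inner").columns            # ['id', 'name', 'dept'] -- key appears once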
Types of Joins in Spark
- inner
- left / left_outer
- right / right_outer
- full / full_outer
- left_semi – existence check (returns rows from the left DataFrame that have a match in the right)
- left_anti – anti-join (returns rows from the left DataFrame that do not have a match)
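The last two are easy to confuse; a small sketch reusing the hypothetical df and df2 from above:

# left_semi: left rows that have a match; no columns from df2 are added
df.join(df2, "id", "left_semi").show()   # keeps id=1 (Alice)

# left_anti: left rows that have no match in df2
df.join(df2, "id", "left_anti").show()   # keeps id=2 (Bob)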
Union — Stack DataFrames Vertically
Union (same schema, same order)
df.union(df2)
Union by Column Names (different order)
df.unionByName(df2)
Use unions to combine monthly files, daily ingestion datasets, or partitioned data.
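If the two DataFrames do not share the exact same set of columns, Spark 3.1+ can fill in the gaps with nulls:

# Spark 3.1+: columns missing from either side are filled with nulls
df.unionByName(df2, allowMissingColumns=True)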
GroupBy + Aggregation — Business Logic Layer
Example
from pyspark.sql.functions import sum, avg  # note: shadows Python's built-in sum

df.groupBy("dept").agg(
    sum("salary").alias("total_salary"),
    avg("age").alias("avg_age")
)
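A quick sketch of what this produces, with hypothetical employee data (emp and its rows are invented for illustration):

emp = spark.createDataFrame(
    [("IT", 90000, 30), ("IT", 70000, 40), ("HR", 60000, 35)],
    ["dept", "salary", "age"],
)
emp.groupBy("dept").agg(
    sum("salary").alias("total_salary"),
    avg("age").alias("avg_age"),
).show()
# +----+------------+-------+
# |dept|total_salary|avg_age|
# +----+------------+-------+
# |  IT|      160000|   35.0|
# |  HR|       60000|   35.0|
# +----+------------+-------+
# (row order may vary)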
count vs countDistinct
from pyspark.sql.functions import count, countDistinct

df.select(count("id"))
df.select(countDistinct("id"))
Approximate Distinct Count (faster)
from pyspark.sql.functions import approx_count_distinct

df.select(approx_count_distinct("id"))
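Under the hood this uses the HyperLogLog++ algorithm; you can trade speed for accuracy through the optional maximum relative standard deviation (rsd, default 0.05):

# Tighter error bound (slower, more memory) -- rsd value is illustrative
df.select(approx_count_distinct("id", rsd=0.01))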
Real ETL Example — Sales Aggregation
Join Sales and Products
df_joined = sales.join(products, "product_id", "left")
Aggregate Revenue by Category
from pyspark.sql.functions import sum, count

df_agg = df_joined.groupBy("category").agg(
    sum("amount").alias("total_revenue"),
    count("*").alias("transactions")
)
This pattern is exactly how business dashboards are built.
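For reference, here is some made-up input data that makes the pipeline above runnable end to end (create these first, then run the join and aggregation):

# Hypothetical input data, invented for illustration
sales = spark.createDataFrame(
    [(1, 101, 250.0), (2, 102, 120.0), (3, 101, 80.0)],
    ["sale_id", "product_id", "amount"],
)
products = spark.createDataFrame(
    [(101, "Electronics"), (102, "Books")],
    ["product_id", "category"],
)

df_agg.show()
# +-----------+-------------+------------+
# |   category|total_revenue|transactions|
# +-----------+-------------+------------+
# |Electronics|        330.0|           2|
# |      Books|        120.0|           1|
# +-----------+-------------+------------+
# (row order may vary)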
Join Performance Optimization
Broadcast Join for Small Lookup Tables
from pyspark.sql.functions import broadcast

df.join(broadcast(df_small), "id")
Broadcasting the small table to every executor avoids shuffling the large one, which makes the join much faster.
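Spark also broadcasts automatically when a table's estimated size is below spark.sql.autoBroadcastJoinThreshold (10 MB by default); the explicit broadcast() hint is useful when the optimizer's size estimate is off. A sketch of tuning the threshold (the 50 MB value is an arbitrary example):

# Raise the automatic broadcast threshold to 50 MB (value is illustrative)
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", 50 * 1024 * 1024)

# Or disable automatic broadcasting entirely
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", -1)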
Summary
- Joins (including left_semi checks)
- Union / unionByName
- groupBy with aggregations: count, countDistinct, approx_count_distinct
- Broadcast join optimization
Feel free to comment if anything is missing. Thank you!