๐Ÿ”ฅ 7์ผ ์ฐจ: PySpark Joins, Unions, ๊ทธ๋ฆฌ๊ณ  GroupBy ๊ฐ€์ด๋“œ

๋ฐœํ–‰: (2025๋…„ 12์›” 8์ผ ์˜คํ›„ 07:46 GMT+9)
2 min read
์›๋ฌธ: Dev.to

Source: Dev.to

PySpark์—์„œ์˜ Join โ€” ETL ํŒŒ์ดํ”„๋ผ์ธ์˜ ํ•ต์‹ฌ

Join์€ ๋‘ DataFrame์„ ํ‚ค๋ฅผ ๊ธฐ์ค€์œผ๋กœ ๋ณ‘ํ•ฉํ•˜๋Š” ๊ฒƒ์œผ๋กœ, SQL์˜ JOIN๊ณผ ์œ ์‚ฌํ•ฉ๋‹ˆ๋‹ค.

๊ธฐ๋ณธ Join

df.join(df2, df.id == df2.id, "inner")

๊ฐ™์€ ์ปฌ๋Ÿผ ์ด๋ฆ„์œผ๋กœ Join

df.join(df2, ["id"], "left")

Spark์—์„œ ์ง€์›ํ•˜๋Š” Join ์œ ํ˜•

  • inner
  • left / left_outer
  • right / right_outer
  • full / full_outer
  • left_semi โ€“ ์กด์žฌ ์—ฌ๋ถ€ ํ™•์ธ (์˜ค๋ฅธ์ชฝ์— ๋งค์น˜๊ฐ€ ์žˆ๋Š” ์™ผ์ชฝ DataFrame์˜ ํ–‰์„ ๋ฐ˜ํ™˜)
  • left_anti โ€“ ์•ˆํ‹ฐโ€‘์กฐ์ธ (์˜ค๋ฅธ์ชฝ์— ๋งค์น˜๊ฐ€ ์—†๋Š” ์™ผ์ชฝ DataFrame์˜ ํ–‰์„ ๋ฐ˜ํ™˜)

Union โ€” DataFrame์„ ์ˆ˜์ง์œผ๋กœ ์Œ“๊ธฐ

๋™์ผ ์Šคํ‚ค๋งˆ, ๋™์ผ ์ˆœ์„œ์ธ ๊ฒฝ์šฐ Union

df.union(df2)

์ปฌ๋Ÿผ ์ด๋ฆ„์œผ๋กœ Union (์ˆœ์„œ๊ฐ€ ๋‹ค๋ฅธ ๊ฒฝ์šฐ)

df.unionByName(df2)

Union์€ ์›”๋ณ„ ํŒŒ์ผ, ์ผ์ผ ์ˆ˜์ง‘ ๋ฐ์ดํ„ฐ์…‹, ํ˜น์€ ํŒŒํ‹ฐ์…˜๋œ ๋ฐ์ดํ„ฐ๋ฅผ ๊ฒฐํ•ฉํ•  ๋•Œ ์‚ฌ์šฉํ•ฉ๋‹ˆ๋‹ค.

GroupBy + Aggregation โ€” ๋น„์ฆˆ๋‹ˆ์Šค ๋กœ์ง ๋ ˆ์ด์–ด

์˜ˆ์‹œ

df.groupBy("dept").agg(
    sum("salary").alias("total_salary"),
    avg("age").alias("avg_age")
)

count์™€ countDistinct

df.select(count("id"))
df.select(countDistinct("id"))

Approximate Distinct Count (๋น ๋ฅธ ๋ฐฉ๋ฒ•)

df.select(approx_count_distinct("id"))

์‹ค์ œ ETL ์˜ˆ์‹œ โ€” ๋งค์ถœ ์ง‘๊ณ„

Sales์™€ Products ์กฐ์ธ

df_joined = sales.join(products, "product_id", "left")

์นดํ…Œ๊ณ ๋ฆฌ๋ณ„ ๋งค์ถœ ์ง‘๊ณ„

df_agg = df_joined.groupBy("category").agg(
    sum("amount").alias("total_revenue"),
    count("*").alias("transactions")
)

์ด ํŒจํ„ด์ด ๋ฐ”๋กœ ๋น„์ฆˆ๋‹ˆ์Šค ๋Œ€์‹œ๋ณด๋“œ๊ฐ€ ๊ตฌ์ถ•๋˜๋Š” ๋ฐฉ์‹์ž…๋‹ˆ๋‹ค.

Join ์„ฑ๋Šฅ ์ตœ์ ํ™”

์ž‘์€ ์กฐํšŒ ํ…Œ์ด๋ธ”์— ๋Œ€ํ•œ Broadcast Join

df.join(broadcast(df_small), "id")

Broadcast Join์€ ์…”ํ”Œ์„ ํ”ผํ•˜๋ฏ€๋กœ ์ž‘์—…์ด ํ›จ์”ฌ ๋น ๋ฅด๊ฒŒ ์ˆ˜ํ–‰๋ฉ๋‹ˆ๋‹ค.

์š”์•ฝ

  • Join (left_semi ์ฒดํฌ ํฌํ•จ)
  • Union / unionByName
  • groupBy์™€ ์ง‘๊ณ„ ํ•จ์ˆ˜๋“ค
  • count, countDistinct, approx_count_distinct
  • Broadcast Join ์ตœ์ ํ™”

๋น ์ง„ ๋‚ด์šฉ์ด ์žˆ์œผ๋ฉด ๋Œ“๊ธ€๋กœ ์•Œ๋ ค ์ฃผ์„ธ์š”. ๊ฐ์‚ฌํ•ฉ๋‹ˆ๋‹ค!

Back to Blog

๊ด€๋ จ ๊ธ€

๋” ๋ณด๊ธฐ ยป

Data Pipeline Tools ๋น„๊ต: ์˜ฌ๋ฐ”๋ฅธ ์„ ํƒ์„ ์œ„ํ•œ ํ•ต์‹ฌ ๊ธฐ์ค€

๋ฐ์ดํ„ฐ๋Š” ์šฐ๋ฆฌ ์ฃผ๋ณ€์— ๊ฐ€๋“ํ•ฉ๋‹ˆ๋‹ค โ€” CRM ์‹œ์Šคํ…œ๊ณผ ํด๋ผ์šฐ๋“œ ์•ฑ๋ถ€ํ„ฐ ์Šคํ”„๋ ˆ๋“œ์‹œํŠธ์™€ ๋ฐ์ดํ„ฐ ์›จ์–ดํ•˜์šฐ์Šค๊นŒ์ง€. ํŒ€์ด 15๊ฐœ ์ด์ƒ์˜ ํ”Œ๋žซํผ์—์„œ ์ˆซ์ž๋ฅผ ๋‹ค๋ฃจ๊ณ  ์‹œ๊ฐ„์„ ๋งŽ์ด ์†Œ๋น„ํ•  ๋•Œ...

ETL์—์„œ ํด๋ฆฐ ์ฝ”๋“œ: Python, Go, SQL์ด ๊ฐ๊ฐ ์—ฌ๋Ÿฌ๋ถ„์—๊ฒŒ ๋‹ค๋ฅธ ์‚ฌ๊ณ  ๋ฐฉ์‹์„ ๊ฐ€๋ฅด์ณ ์ค๋‹ˆ๋‹ค

Clean Code in ETL ํ‘œ์ง€ ์ด๋ฏธ์ง€: Python, Go, SQL์ด ๊ฐ๊ฐ ์—ฌ๋Ÿฌ๋ถ„์—๊ฒŒ ๋‹ค๋ฅธ ์‚ฌ๊ณ  ๋ฐฉ์‹์„ ๊ฐ€๋ฅด์ณ ์ค๋‹ˆ๋‹ค https://media2.dev.to/dynamic/image/width=1000,height=420,fit=cove...

๋น„๊ต3

sql SELECT 'ONLY_IN_US_1' AS location, t1.table_name FROM dba_tables t1 LEFT JOIN dba_tables t2 ON t1.table_name = t2.table_name AND t2.owner = 'GL...

WTF๋Š” ๋ถ„์‚ฐ ๋ฐ์ดํ„ฐ ์›จ์–ดํ•˜์šฐ์ง•์ด๋ž€?

Distributed Data Warehousing์ด๋ž€ ๋ฌด์—‡์ธ๊ฐ€? ๋ฐ์ดํ„ฐ ์›จ์–ดํ•˜์šฐ์Šค๋Š” ์กฐ์ง์ด ๋ฐ์ดํ„ฐ๋ฅผ ์ €์žฅํ•˜๊ณ , ์กฐ์งํ™”ํ•˜๋ฉฐ, ๋ฐ์ดํ„ฐ๋ฅผ ์‰ฝ๊ฒŒ ํ™œ์šฉํ•  ์ˆ˜ ์žˆ๋„๋ก ํ•˜๋Š” ์ค‘์•™ ์ง‘์ค‘์‹ ์ €์žฅ์†Œ์ด๋‹ค.