Day 12: UDF vs Pandas UDF
Welcome to Day 12 of the Spark Mastery Series!
UDFs (User Defined Functions) can dramatically slow a Spark job—adding a single UDF may increase runtime by up to 10×. Understanding why and how to avoid this pitfall is essential.
UDFs (User Defined Functions)
A UDF wraps an ordinary Python function so it can be applied, row by row, to the columns of a Spark DataFrame.
```python
from pyspark.sql.functions import udf

@udf("string")
def reverse_name(name):
    # Runs in a Python worker for every single row; nulls pass through unchanged
    return name[::-1] if name is not None else None
```
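Applying it looks like any other column expression (a minimal sketch, assuming a DataFrame `df` with a string column `name`):

```python
from pyspark.sql.functions import col

# Each row's value is serialized to a Python worker, reversed, and sent back
df = df.withColumn("reversed_name", reverse_name(col("name")))
```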
When a normal UDF is used, Spark must:
- Ship each record to Python
- Execute the Python code
- Convert the result back to the JVM
- Merge the result with the DataFrame
Each record crosses the Python ↔ JVM boundary, which is slow.
Built‑in Functions — ALWAYS Preferred
Spark’s built-in functions run natively in the JVM and are fully transparent to the Catalyst optimizer. Because Catalyst can see inside them (a Python UDF is a black box), optimizations such as predicate push-down and column pruning still apply.
```python
from pyspark.sql.functions import col, upper

# Runs entirely in the JVM; no Python round-trip
df = df.withColumn("upper_name", upper(col("name")))
```
Rule: If Spark provides a built‑in function, never write a UDF.
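In fact, the `reverse_name` UDF above is already unnecessary: Spark ships a native `reverse` function that works on strings. A minimal sketch:

```python
from pyspark.sql.functions import col, reverse

# Same result as the reverse_name UDF, but executed natively in the JVM
df = df.withColumn("reversed_name", reverse(col("name")))
```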
Pandas UDF — The Best Alternative to Normal UDFs
A regular UDF processes rows one‑by‑one in Python.
A Pandas UDF uses Apache Arrow to operate on whole batches (vectorized), delivering a large speed boost.
```python
import pandas as pd
from pyspark.sql.functions import pandas_udf

@pandas_udf("double")
def multiply_by_two(s: pd.Series) -> pd.Series:
    # Operates on a whole pandas Series (one Arrow batch) at a time
    return s * 2
```
Spark sends data in batches rather than row‑by‑row, resulting in huge performance improvements.
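Using it is no different from any other column expression (a sketch, assuming a numeric column `value`):

```python
from pyspark.sql.functions import col

# Data crosses the JVM/Python boundary as Arrow batches, not single rows
df = df.withColumn("doubled", multiply_by_two(col("value")))
```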
Types of Pandas UDFs
Scalar Pandas UDF
```python
import pandas as pd
from pyspark.sql.functions import pandas_udf

@pandas_udf("double")
def add_one(s: pd.Series) -> pd.Series:
    # Series in, Series of the same length out
    return s + 1
```
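A scalar Pandas UDF can also be registered for use from Spark SQL, the same way as a regular UDF (a sketch; the view name is illustrative):

```python
# Register the vectorized UDF under a SQL-callable name
spark.udf.register("add_one", add_one)

df.createOrReplaceTempView("numbers")  # illustrative view name
spark.sql("SELECT add_one(value) AS plus_one FROM numbers").show()
```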
Grouped Map UDF
```python
from pyspark.sql.functions import pandas_udf, PandasUDFType

# `schema` is the output schema, e.g. a DDL string like "user_id long, value double"
@pandas_udf(schema, PandasUDFType.GROUPED_MAP)
def my_grouped_map(pdf):
    # pdf is a pandas DataFrame containing one entire group
    return pdf
```
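Note that this decorator style is deprecated as of Spark 3.0; the modern equivalent is `DataFrame.groupBy().applyInPandas()` with a plain Python function. A sketch (the `user_id`/`value` column names are illustrative):

```python
import pandas as pd

def subtract_group_mean(pdf: pd.DataFrame) -> pd.DataFrame:
    # Center each group's values around the group mean
    pdf["value"] = pdf["value"] - pdf["value"].mean()
    return pdf

# One pandas DataFrame per user_id group; output schema given as a DDL string
result = df.groupBy("user_id").applyInPandas(
    subtract_group_mean, schema="user_id long, value double"
)
```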
Typical use cases:
- Time‑series transformation
- Per‑user model training
- Per‑group cleaning
Grouped Aggregate UDF
```python
from pyspark.sql.functions import pandas_udf, PandasUDFType

@pandas_udf("double", PandasUDFType.GROUPED_AGG)
def my_grouped_agg(v):
    # v is one column of the group as a pandas Series;
    # return a single aggregated value per group
    return v.mean()
```
Good for: statistical aggregation, ML metrics, etc.
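A grouped aggregate UDF plugs into `groupBy().agg()` like any built-in aggregate (a sketch; column names are illustrative):

```python
# One aggregated double per user_id group
df.groupBy("user_id").agg(my_grouped_agg(df["value"]).alias("value_mean"))
```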
When Should You Use a Normal UDF?
Only consider a normal UDF when:
- No suitable built‑in function exists
- The operation cannot be vectorized
- You need extensive custom Python logic
These cases are rare in typical ETL pipelines.
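A legitimate case is row-level logic that only exists in plain Python, such as classifying IP addresses with the standard library's `ipaddress` module (a sketch with a hypothetical string column `addr`):

```python
import ipaddress
from pyspark.sql.functions import udf

@udf("boolean")
def is_private_ip(addr):
    # No Spark built-in covers this, and the logic is not easily vectorizable
    try:
        return ipaddress.ip_address(addr).is_private
    except (ValueError, TypeError):  # malformed or null address
        return None
```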
Real Example: Performance Difference
| Approach | Runtime |
|---|---|
| Normal UDF | 50 seconds |
| Pandas UDF | 8 seconds |
| Built‑in function | 1 second |
The stark contrast explains why senior engineers avoid normal UDFs unless absolutely necessary.
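If you want to reproduce this kind of comparison on your own data, a rough timing harness looks something like the sketch below (assuming Spark 3.0+ for the `noop` sink; absolute numbers depend entirely on your cluster and data size):

```python
import time
from pyspark.sql.functions import col, reverse

def time_it(label, transformed_df):
    start = time.perf_counter()
    # The "noop" sink materializes every column without writing anywhere,
    # so the optimizer cannot prune the transformation away
    transformed_df.write.format("noop").mode("overwrite").save()
    print(f"{label}: {time.perf_counter() - start:.2f}s")

time_it("normal UDF", df.withColumn("r", reverse_name(col("name"))))
time_it("built-in", df.withColumn("r", reverse(col("name"))))
```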
Summary Guidelines
- Prefer built‑in functions whenever possible.
- Use Pandas UDFs for vectorizable custom logic.
- Reserve normal UDFs for truly exceptional cases.
By following these practices you’ll achieve much better performance in your Spark jobs.