Day 12: UDF vs Pandas UDF
Welcome to Day 12 of the Spark Mastery Series!
UDFs (User Defined Functions) can dramatically slow a Spark job—adding a single UDF may increase runtime by up to 10×. Understanding why and how to avoid this pitfall is essential.
UDFs (User Defined Functions)
A UDF wraps an ordinary Python function so it can be applied, row by row, to the columns of a Spark DataFrame.
```python
from pyspark.sql.functions import udf

@udf("string")
def reverse_name(name):
    # Runs in a Python worker for every single row; nulls pass through unchanged
    return name[::-1] if name is not None else None
```
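Applying it looks like any other column expression (a minimal sketch, assuming a DataFrame `df` with a string column `name`):

```python
from pyspark.sql.functions import col

# Each row's value is serialized to a Python worker, reversed, and sent back
df = df.withColumn("reversed_name", reverse_name(col("name")))
```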
When a normal UDF is used, Spark must:
- Ship each record to Python
- Execute the Python code
- Convert the result back to the JVM
- Merge the result with the DataFrame
Each record crosses the Python ↔ JVM boundary, which is slow.
Built‑in Functions — ALWAYS Preferred
Spark’s built-in functions run natively in the JVM and are fully transparent to the Catalyst optimizer. Because Catalyst can see inside them (a Python UDF is a black box), optimizations such as predicate push-down and column pruning still apply.
```python
from pyspark.sql.functions import col, upper

# Runs entirely in the JVM; no Python round-trip
df = df.withColumn("upper_name", upper(col("name")))
```
Rule: If Spark provides a built‑in function, never write a UDF.
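In fact, the `reverse_name` UDF above is already unnecessary: Spark ships a native `reverse` function that works on strings. A minimal sketch:

```python
from pyspark.sql.functions import col, reverse

# Same result as the reverse_name UDF, but executed natively in the JVM
df = df.withColumn("reversed_name", reverse(col("name")))
```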
Pandas UDF — The Best Alternative to Normal UDFs
A regular UDF processes rows one‑by‑one in Python.
A Pandas UDF uses Apache Arrow to operate on whole batches (vectorized), delivering a large speed boost.
```python
import pandas as pd
from pyspark.sql.functions import pandas_udf

@pandas_udf("double")
def multiply_by_two(s: pd.Series) -> pd.Series:
    # Operates on a whole pandas Series (one Arrow batch) at a time
    return s * 2
```
Spark sends data in batches rather than row‑by‑row, resulting in huge performance improvements.
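Using it is no different from any other column expression (a sketch, assuming a numeric column `value`):

```python
from pyspark.sql.functions import col

# Data crosses the JVM/Python boundary as Arrow batches, not single rows
df = df.withColumn("doubled", multiply_by_two(col("value")))
```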
Types of Pandas UDFs
Scalar Pandas UDF
```python
import pandas as pd
from pyspark.sql.functions import pandas_udf

@pandas_udf("double")
def add_one(s: pd.Series) -> pd.Series:
    # Series in, Series of the same length out
    return s + 1
```
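A scalar Pandas UDF can also be registered for use from Spark SQL, the same way as a regular UDF (a sketch; the view name is illustrative):

```python
# Register the vectorized UDF under a SQL-callable name
spark.udf.register("add_one", add_one)

df.createOrReplaceTempView("numbers")  # illustrative view name
spark.sql("SELECT add_one(value) AS plus_one FROM numbers").show()
```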
Grouped Map UDF
```python
from pyspark.sql.functions import pandas_udf, PandasUDFType

# `schema` is the output schema, e.g. a DDL string like "user_id long, value double"
@pandas_udf(schema, PandasUDFType.GROUPED_MAP)
def my_grouped_map(pdf):
    # pdf is a pandas DataFrame containing one entire group
    return pdf
```
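Note that this decorator style is deprecated as of Spark 3.0; the modern equivalent is `DataFrame.groupBy().applyInPandas()` with a plain Python function. A sketch (the `user_id`/`value` column names are illustrative):

```python
import pandas as pd

def subtract_group_mean(pdf: pd.DataFrame) -> pd.DataFrame:
    # Center each group's values around the group mean
    pdf["value"] = pdf["value"] - pdf["value"].mean()
    return pdf

# One pandas DataFrame per user_id group; output schema given as a DDL string
result = df.groupBy("user_id").applyInPandas(
    subtract_group_mean, schema="user_id long, value double"
)
```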
Typical use cases:
- Time‑series transformation
- Per‑user model training
- Per‑group cleaning
Grouped Aggregate UDF
```python
from pyspark.sql.functions import pandas_udf, PandasUDFType

@pandas_udf("double", PandasUDFType.GROUPED_AGG)
def my_grouped_agg(v):
    # v is one column of the group as a pandas Series;
    # return a single aggregated value per group
    return v.mean()
```
Good for: statistical aggregation, ML metrics, etc.
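A grouped aggregate UDF plugs into `groupBy().agg()` like any built-in aggregate (a sketch; column names are illustrative):

```python
# One aggregated double per user_id group
df.groupBy("user_id").agg(my_grouped_agg(df["value"]).alias("value_mean"))
```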
When Should You Use a Normal UDF?
Only consider a normal UDF when:
- No suitable built‑in function exists
- The operation cannot be vectorized
- You need extensive custom Python logic
These cases are rare in typical ETL pipelines.
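A legitimate case is row-level logic that only exists in plain Python, such as classifying IP addresses with the standard library's `ipaddress` module (a sketch with a hypothetical string column `addr`):

```python
import ipaddress
from pyspark.sql.functions import udf

@udf("boolean")
def is_private_ip(addr):
    # No Spark built-in covers this, and the logic is not easily vectorizable
    try:
        return ipaddress.ip_address(addr).is_private
    except (ValueError, TypeError):  # malformed or null address
        return None
```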
Real Example: Performance Difference
| Approach | Runtime |
|---|---|
| Normal UDF | 50 seconds |
| Pandas UDF | 8 seconds |
| Built‑in function | 1 second |
The stark contrast explains why senior engineers avoid normal UDFs unless absolutely necessary.
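If you want to reproduce this kind of comparison on your own data, a rough timing harness looks something like the sketch below (assuming Spark 3.0+ for the `noop` sink; absolute numbers depend entirely on your cluster and data size):

```python
import time
from pyspark.sql.functions import col, reverse

def time_it(label, transformed_df):
    start = time.perf_counter()
    # The "noop" sink materializes every column without writing anywhere,
    # so the optimizer cannot prune the transformation away
    transformed_df.write.format("noop").mode("overwrite").save()
    print(f"{label}: {time.perf_counter() - start:.2f}s")

time_it("normal UDF", df.withColumn("r", reverse_name(col("name"))))
time_it("built-in", df.withColumn("r", reverse(col("name"))))
```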
Summary Guidelines
- Prefer built‑in functions whenever possible.
- Use Pandas UDFs for vectorizable custom logic.
- Reserve normal UDFs for truly exceptional cases.
By following these practices you’ll achieve much better performance in your Spark jobs.