Day 12: UDF vs Pandas UDF

Published: December 12, 2025, 04:44 GMT+9
2 min read
Source: Dev.to

Welcome to Day 12 of the Spark Mastery Series!

UDFs (User Defined Functions) can dramatically slow a Spark job—adding a single UDF may increase runtime by up to 10×. Understanding why and how to avoid this pitfall is essential.

UDFs (User Defined Functions)

A UDF is a Python function applied to a Spark DataFrame.

from pyspark.sql.functions import udf

@udf("string")
def reverse_name(name):
    # Runs row-by-row in Python; nulls arrive as None and must be handled
    return name[::-1] if name is not None else None
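
Applying it then looks like any other column expression. A minimal usage sketch, assuming a DataFrame df with a string column "name":

df.withColumn("reversed", reverse_name("name")).show()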

When a normal UDF is used, Spark must:

  1. Ship each record to Python
  2. Execute the Python code
  3. Convert the result back to the JVM
  4. Merge the result with the DataFrame

Each record crosses the Python ↔ JVM boundary, which is slow.
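
This boundary is visible in the physical plan: a plain Python UDF shows up as a BatchEvalPython node, which Catalyst cannot optimize through. A quick way to check, assuming the df and reverse_name from above:

df.withColumn("reversed", reverse_name("name")).explain()
# Look for a BatchEvalPython step in the printed plan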

Built‑in Functions — ALWAYS Preferred

Spark’s native functions are implemented in Scala, vectorized, and optimized by Catalyst. They also support predicate push‑down and column pruning.

df.withColumn("upper_name", upper(col("name")))

Rule: If Spark provides a built‑in function, never write a UDF.
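
For example, the reverse_name UDF above has a direct built-in replacement, so the whole computation stays in the JVM:

from pyspark.sql.functions import col, reverse

# Same result as reverse_name, with no Python round trip
df.withColumn("reversed", reverse(col("name")))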

Pandas UDF — The Best Alternative to Normal UDFs

A regular UDF processes rows one‑by‑one in Python.
A Pandas UDF uses Apache Arrow to operate on whole batches (vectorized), delivering a large speed boost.

import pandas as pd
from pyspark.sql.functions import pandas_udf

@pandas_udf("double")
def multiply_by_two(col: pd.Series) -> pd.Series:
    return col * 2

Spark sends data in batches rather than row‑by‑row, resulting in huge performance improvements.
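
A minimal usage sketch, assuming a numeric column "value"; each call to multiply_by_two receives a pandas Series covering one Arrow batch:

df.withColumn("doubled", multiply_by_two("value")).show()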

Types of Pandas UDFs

Scalar Pandas UDF

@pandas_udf("double")
def add_one(col):
    return col + 1

Grouped Map UDF

from pyspark.sql.functions import pandas_udf, PandasUDFType

# Legacy API, deprecated since Spark 3.0 in favor of applyInPandas (sketched below);
# the schema string describes the output columns
@pandas_udf("user_id long, value double", PandasUDFType.GROUPED_MAP)
def my_grouped_map(pdf):
    # custom transformation on a pandas DataFrame per group
    return pdf

Typical use cases:

  • Time‑series transformation
  • Per‑user model training
  • Per‑group cleaning
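
In Spark 3.x the same grouped-map pattern is written with applyInPandas instead of PandasUDFType.GROUPED_MAP. A minimal sketch; the "user_id"/"value" columns and the centering logic are illustrative:

import pandas as pd

def center_values(pdf: pd.DataFrame) -> pd.DataFrame:
    # Per-group transformation: subtract each group's mean from "value"
    pdf["value"] = pdf["value"] - pdf["value"].mean()
    return pdf

df.groupBy("user_id").applyInPandas(center_values, schema="user_id long, value double")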

Grouped Aggregate UDF

@pandas_udf("double", PandasUDFType.GROUPED_AGG)
def my_grouped_agg(pdf):
    # return a single aggregated value per group
    return pdf.mean()

Good for: statistical aggregation, ML metrics, etc.
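
A grouped-aggregate UDF plugs into groupBy().agg() like any built-in aggregate. A minimal usage sketch, assuming "user_id" and "value" columns:

df.groupBy("user_id").agg(my_grouped_agg(df["value"]).alias("avg_value")).show()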

When Should You Use a Normal UDF?

Only consider a normal UDF when:

  • No suitable built‑in function exists
  • The operation cannot be vectorized
  • You need extensive custom Python logic

These cases are rare in typical ETL pipelines.
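
One illustrative case, assuming no built-in or vectorized equivalent fits: fuzzy string similarity via Python's difflib, a pure-Python algorithm that only exists as row-at-a-time logic:

from difflib import SequenceMatcher
from pyspark.sql.functions import udf

@udf("double")
def similarity(a, b):
    # No Spark built-in counterpart; nulls must be handled explicitly
    if a is None or b is None:
        return None
    return SequenceMatcher(None, a, b).ratio()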

Real Example: Performance Difference

Approach             Runtime
Normal UDF           50 seconds
Pandas UDF           8 seconds
Built-in function    1 second

The stark contrast explains why senior engineers avoid normal UDFs unless absolutely necessary.
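
Exact numbers depend on data size and cluster, but a comparison like this can be reproduced by timing a full evaluation of each variant. A minimal sketch; the noop sink requires Spark 3.0+:

import time

def time_full_evaluation(df):
    # The noop format forces computation of every row without writing output
    start = time.time()
    df.write.format("noop").mode("overwrite").save()
    return time.time() - start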

Summary Guidelines

  • Prefer built‑in functions whenever possible.
  • Use Pandas UDFs for vectorizable custom logic.
  • Reserve normal UDFs for truly exceptional cases.

By following these practices you’ll achieve much better performance in your Spark jobs.
