🔥 Day 3: RDDs - The Foundation of Spark

Published: December 3, 2025 at 12:55 PM EST
2 min read
Source: Dev.to


What Exactly Is an RDD?

An RDD (Resilient Distributed Dataset) is an immutable collection of records, partitioned across the nodes of the cluster.

Key properties

  • Immutable
  • Lazily evaluated
  • Distributed
  • Fault‑tolerant
  • Parallelized

RDDs were the first abstraction in Spark before DataFrames and Datasets existed.

Why Should You Learn RDDs?

Even though DataFrames are recommended now, RDDs remain crucial for:

  • Understanding execution plans
  • Debugging shuffles
  • Improving partition strategies
  • Designing performance‑efficient pipelines
  • Handling non‑structured data

How to Create RDDs

From Python Lists

rdd = spark.sparkContext.parallelize([1, 2, 3, 4])

From File

rdd = spark.sparkContext.textFile("sales.txt")

RDD Transformations (Lazy)

Transformations build the DAG. Common examples:

rdd.map(lambda x: x * 2)
rdd.filter(lambda x: x > 10)
rdd.flatMap(lambda x: x.split(","))

You can chain transformations; Spark will not execute anything until an action is called.

RDD Actions (Execute Plan)

Actions trigger job execution:

rdd.collect()
rdd.count()
rdd.take(5)
rdd.saveAsTextFile("output")

Narrow vs Wide Transformations

Narrow (No Shuffle)

Each output partition depends on only one input partition, so no data moves between executors – fast.

  • map
  • filter
  • union

Wide (Shuffle Required)

Each output partition can depend on many input partitions, so data must be shuffled across the network – slower, and each shuffle creates a new stage.

  • groupByKey
  • join
  • reduceByKey

Shuffles are a major cause of slow Spark jobs.

RDD Lineage — Fault Tolerance in Action

Each RDD records the chain of transformations that produced it – its lineage.

rdd1 = rdd.map(...)
rdd2 = rdd1.filter(...)
rdd3 = rdd2.reduceByKey(...)

If a node dies, Spark reconstructs the lost data using this lineage information.

Persistence and Caching

When you reuse an RDD, persist it to avoid recomputation:

processed = rdd.map(...)
processed.persist()
processed.count()
processed.filter(...)

Spark will read the cached partitions from memory instead of recomputing them from the lineage.

Summary

  • What RDDs are
  • Why they matter
  • Transformations vs. actions
  • Narrow vs. wide transformations
  • Lineage for fault tolerance
  • Caching and persistence

These concepts form the foundation of Spark internals.
