🔥 Day 3: RDDs - The Foundation of Spark

Published: December 3, 2025 at 12:55 PM EST
2 min read
Source: Dev.to


What Exactly Is an RDD?

An RDD (Resilient Distributed Dataset) is an immutable, distributed collection of records partitioned across the cluster.

Key properties

  • Immutable
  • Lazily evaluated
  • Distributed
  • Fault‑tolerant
  • Parallelized

RDDs were the first abstraction in Spark before DataFrames and Datasets existed.

Why Should You Learn RDDs?

Even though DataFrames are the recommended API today, RDDs remain crucial for:

  • Understanding execution plans
  • Debugging shuffles
  • Improving partition strategies
  • Designing performance‑efficient pipelines
  • Handling non‑structured data

How to Create RDDs

From Python Lists

rdd = spark.sparkContext.parallelize([1, 2, 3, 4])

From File

rdd = spark.sparkContext.textFile("sales.txt")

RDD Transformations (Lazy)

Transformations build the DAG. Common examples:

rdd.map(lambda x: x * 2)
rdd.filter(lambda x: x > 10)
rdd.flatMap(lambda x: x.split(","))

You can chain transformations; Spark will not execute anything until an action is called.

RDD Actions (Execute Plan)

Actions trigger job execution:

rdd.collect()
rdd.count()
rdd.take(5)
rdd.saveAsTextFile("output")

Narrow vs Wide Transformations

Narrow (No Shuffle)

Each output partition depends on only one input partition, so no data moves between executors – fast.

  • map
  • filter
  • union

Wide (Shuffle Required)

Each output partition can depend on many input partitions, so data must be shuffled across the network – slower, and it forces a new stage boundary.

  • groupByKey
  • join
  • reduceByKey

Shuffles are a major cause of slow Spark jobs.

RDD Lineage — Fault Tolerance in Action

Each RDD tracks how it was created.

rdd1 = rdd.map(lambda x: (x, 1))             # narrow
rdd2 = rdd1.filter(lambda kv: kv[0] > 0)     # narrow
rdd3 = rdd2.reduceByKey(lambda a, b: a + b)  # wide

If a node dies, Spark reconstructs the lost data using this lineage information.

Persistence and Caching

When you reuse an RDD across multiple actions, persist it to avoid recomputing its lineage each time:

processed = rdd.map(lambda x: x * 2)
processed.persist()                          # or processed.cache()
processed.count()                            # first action computes and caches it
processed.filter(lambda x: x > 10).count()   # later actions read from memory

Spark will read the cached partitions from memory instead of recomputing the RDD from its lineage.

Summary

  • What RDDs are
  • Why they matter
  • Transformations vs. actions
  • Narrow vs. wide transformations
  • Lineage for fault tolerance
  • Caching and persistence

These concepts form the foundation of Spark internals.
