🔥 Day 3: RDDs - The Foundation of Spark

Published: December 3, 2025 at 12:55 PM EST
2 min read
Source: Dev.to


What Exactly Is an RDD?

An RDD (Resilient Distributed Dataset) is an immutable collection of records, partitioned across the nodes of the cluster.

Key properties

  • Immutable
  • Lazily evaluated
  • Distributed
  • Fault‑tolerant
  • Parallelized

RDDs were the first abstraction in Spark before DataFrames and Datasets existed.

Why Should You Learn RDDs?

Even though DataFrames are recommended now, RDDs remain crucial for:

  • Understanding execution plans
  • Debugging shuffles
  • Improving partition strategies
  • Designing performance‑efficient pipelines
  • Handling non‑structured data

How to Create RDDs

From Python Lists

rdd = spark.sparkContext.parallelize([1, 2, 3, 4])

From File

rdd = spark.sparkContext.textFile("sales.txt")

RDD Transformations (Lazy)

Transformations build the DAG. Common examples:

rdd.map(lambda x: x * 2)
rdd.filter(lambda x: x > 10)
rdd.flatMap(lambda x: x.split(","))

You can chain transformations; Spark will not execute anything until an action is called.

RDD Actions (Execute Plan)

Actions trigger job execution:

rdd.collect()
rdd.count()
rdd.take(5)
rdd.saveAsTextFile("output")

Narrow vs Wide Transformations

Narrow (No Shuffle)

Each output partition depends on only one input partition, so no data moves between executors – fast.

  • map
  • filter
  • union

Wide (Shuffle Required)

Each output partition can depend on many input partitions, so data must be shuffled across the network – slower, and each shuffle creates a new stage.

  • groupByKey
  • join
  • reduceByKey

Shuffles are a major cause of slow Spark jobs.

RDD Lineage — Fault Tolerance in Action

Each RDD records the chain of transformations that produced it – its lineage.

rdd1 = rdd.map(...)
rdd2 = rdd1.filter(...)
rdd3 = rdd2.reduceByKey(...)

If a node dies, Spark reconstructs the lost data using this lineage information.

Persistence and Caching

When you reuse an RDD, persist it to avoid recomputation:

processed = rdd.map(...)
processed.persist()
processed.count()
processed.filter(...)

Spark will read the cached partitions from memory instead of recomputing them from the lineage.

Summary

  • What RDDs are
  • Why they matter
  • Transformations vs. actions
  • Narrow vs. wide transformations
  • Lineage for fault tolerance
  • Caching and persistence

These concepts form the foundation of Spark internals.
