🔥 Day 3: RDDs - The Foundation of Spark

What Exactly Is an RDD?
An RDD (Resilient Distributed Dataset) is an immutable, distributed collection of data partitioned across the nodes of a cluster.
Key properties:
- Immutable: once created, an RDD cannot be modified; transformations produce new RDDs (illustrated below)
- Lazily evaluated: nothing runs until an action is called
- Distributed: the data is split into partitions spread across the cluster
- Fault‑tolerant: lost partitions can be recomputed from lineage
- Parallelized: operations run on the partitions in parallel
RDDs were the first abstraction in Spark before DataFrames and Datasets existed.
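For instance, a transformation never modifies an RDD in place; it returns a new one. A minimal sketch (assuming a SparkSession named spark, as in the rest of the examples in this post):
numbers = spark.sparkContext.parallelize([1, 2, 3, 4])
doubled = numbers.map(lambda x: x * 2)   # map returns a brand-new RDD
print(numbers.collect())   # [1, 2, 3, 4] -- the original RDD is unchanged
print(doubled.collect())   # [2, 4, 6, 8]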
Why Should You Learn RDDs?
Even though the DataFrame API is the recommended choice for most workloads today, RDDs remain crucial for:
- Understanding execution plans
- Debugging shuffles
- Improving partition strategies
- Designing performance‑efficient pipelines
- Handling unstructured data
How to Create RDDs
From Python Lists
rdd = spark.sparkContext.parallelize([1, 2, 3, 4])
From a File
rdd = spark.sparkContext.textFile("sales.txt")
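Both snippets assume a SparkSession already exists. A minimal, self-contained setup might look like this (sales.txt is just a placeholder file name):
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("rdd-basics").getOrCreate()
numbers_rdd = spark.sparkContext.parallelize([1, 2, 3, 4])   # driver-side list -> distributed RDD
lines_rdd = spark.sparkContext.textFile("sales.txt")         # one RDD element per line of the file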
RDD Transformations (Lazy)
Transformations build the DAG. Common examples:
rdd.map(lambda x: x * 2)
rdd.filter(lambda x: x > 10)
rdd.flatMap(lambda x: x.split(","))
You can chain transformations; Spark will not execute anything until an action is called.
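A small sketch of that laziness (using the same spark session as above): the chained pipeline below is only recorded until the final count() forces it to run.
nums = spark.sparkContext.parallelize(range(1, 21))
pipeline = (nums
            .map(lambda x: x * 2)        # transformation: only recorded in the DAG
            .filter(lambda x: x > 10))   # still lazy -- nothing has executed yet
print(pipeline.count())  # action: the whole chain runs here (result: 15)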
RDD Actions (Execute Plan)
Actions trigger job execution:
rdd.collect()
rdd.count()
rdd.take(5)
rdd.saveAsTextFile("output")
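One practical note, shown in the hypothetical sketch below: take() scans only as many partitions as it needs, while collect() pulls the entire RDD back to the driver, which can overwhelm it on large datasets.
big = spark.sparkContext.parallelize(range(1_000_000))
print(big.take(5))    # [0, 1, 2, 3, 4] -- reads only the partitions it needs
print(big.count())    # 1000000 -- runs a full job over every partition
# big.collect()       # would ship all one million elements to the driver -- avoid on large RDDs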
Narrow vs Wide Transformations
Narrow (No Shuffle)
Each output partition depends on only one input partition, so no data moves across the network and these operations stay fast.
Examples: map, filter, union
Wide (Shuffle Required)
Each output partition depends on multiple input partitions, so data must be shuffled across the network; this is slower and forces Spark to start a new stage.
Examples: groupByKey, join, reduceByKey
Shuffles are a major cause of slow Spark jobs.
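To make the contrast concrete, here is a small sketch with a hypothetical pair RDD: map stays inside each partition, while reduceByKey has to bring all values for a key together, which means a shuffle and a new stage.
pairs = spark.sparkContext.parallelize(
    [("a", 1), ("b", 1), ("a", 1), ("c", 1)], numSlices=2)
upper = pairs.map(lambda kv: (kv[0].upper(), kv[1]))   # narrow: each partition is processed independently
totals = pairs.reduceByKey(lambda x, y: x + y)         # wide: values for each key are shuffled together
print(totals.collect())  # [('a', 2), ('b', 1), ('c', 1)] (ordering may vary)
When a shuffle is unavoidable, reduceByKey is generally preferable to groupByKey because it combines values inside each partition before shuffling, so less data crosses the network.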
RDD Lineage — Fault Tolerance in Action
Each RDD keeps track of the chain of transformations that produced it (its lineage).
rdd1 = rdd.map(...)
rdd2 = rdd1.filter(...)
rdd3 = rdd2.reduceByKey(...)
If a node dies, Spark reconstructs the lost data using this lineage information.
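You can inspect that lineage yourself with toDebugString(). A small sketch (PySpark may return the result as bytes, hence the decode):
base = spark.sparkContext.parallelize([("a", 1), ("b", 2), ("a", 3)])
result = base.map(lambda kv: (kv[0], kv[1] * 10)).reduceByKey(lambda x, y: x + y)
# Prints the chain of parent RDDs (and stage boundaries) Spark would use to rebuild lost partitions.
debug = result.toDebugString()
print(debug.decode("utf-8") if isinstance(debug, bytes) else debug)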
Persistence and Caching
When you reuse an RDD, persist it to avoid recomputation:
processed = rdd.map(...)
processed.persist()
processed.count()
processed.filter(...).collect()
The first action materializes the RDD and stores it in memory; the second action reuses the cached partitions instead of recomputing processed from scratch.
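A fuller sketch of the same pattern (with a placeholder input file assumed to be comma-separated): persist before the first action, reuse the cached result, then unpersist to free the memory.
from pyspark import StorageLevel
logs = spark.sparkContext.textFile("sales.txt")             # placeholder input file
parsed = logs.map(lambda line: line.split(","))
parsed.persist(StorageLevel.MEMORY_ONLY)   # equivalent to parsed.cache() for RDDs
print(parsed.count())                      # first action: computes the RDD and fills the cache
sample = parsed.filter(lambda cols: len(cols) > 1).take(5)  # served from the cached partitions
parsed.unpersist()                         # release the memory when you are done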
Summary
- What RDDs are
- Why they matter
- Transformations vs. actions
- Narrow vs. wide transformations
- Lineage for fault tolerance
- Caching and persistence
These concepts form the foundation of Spark internals.