Day 5: Introduction to DataFrames - The Most Important Spark API
What is a DataFrame?
A DataFrame in Spark is a distributed, column-based, table-like structure optimized for efficient data processing.
- Feels like SQL (see the snippet after this list)
- Works like Pandas
- Scales to terabytes effortlessly
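To make the "feels like SQL" point concrete, here is a minimal sketch, assuming an active SparkSession named spark (Spark shells and notebooks provide one automatically; the next section shows how the DataFrame itself is built). It registers a DataFrame as a temporary view and queries it with plain SQL:
df = spark.createDataFrame([(1, "A"), (2, "B")], ["id", "name"])
df.createOrReplaceTempView("people")  # expose the DataFrame to SQL
spark.sql("SELECT name FROM people WHERE id = 1").show()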
Why DataFrames are better than RDDs
- Use the Catalyst optimizer, which rewrites queries for speed
- Use the Tungsten execution engine for memory-efficient processing
- Support automatic code generation
- Allow SQL-like expressions
- Support file formats such as Parquet, ORC, JSON, Avro
This is why almost every production Spark job uses DataFrames. You can even watch the optimizer work, as the sketch below shows.
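Calling explain() on any DataFrame prints the query plans Catalyst produced. A minimal sketch, assuming a DataFrame df with an integer id column:
df.filter(df.id > 5).explain(True)  # prints the parsed, analyzed, and optimized logical plans plus the physical plan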
Creating Your First DataFrame
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()  # reuse or create a session
df = spark.createDataFrame([(1, "A"), (2, "B")], ["id", "name"])
df.show()
From CSV
df = spark.read.csv("sales.csv", header=True, inferSchema=True)
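Note that inferSchema makes Spark do an extra pass over the file to guess column types. The same read can also be expressed with option() calls, which leaves room for extra settings; a sketch (the DROPMALFORMED mode shown here, which silently drops unparseable rows, is just one illustrative option):
df = (spark.read
      .option("header", True)
      .option("inferSchema", True)
      .option("mode", "DROPMALFORMED")  # skip rows that fail to parse
      .csv("sales.csv"))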
From JSON
df = spark.read.json("users.json")
From Parquet (fastest!)
df = spark.read.parquet("events.parquet")
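Parquet is fast largely because it is columnar and stores the schema with the data, so Spark reads only the columns a query needs and never has to infer types. A sketch of the round trip, assuming an existing DataFrame df you want to persist:
df.write.mode("overwrite").parquet("events.parquet")  # write, replacing any previous output
events = spark.read.parquet("events.parquet")         # schema comes back for free
events.printSchema()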
Understanding Schema
Every DataFrame has a schema (column name + data type).
df.printSchema()
Example output
root
|-- id: integer (nullable = true)
|-- name: string (nullable = true)
The schema matters because Spark enforces column types at runtime, and the Catalyst optimizer relies on them to plan queries.
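Instead of letting Spark infer types, you can declare the schema up front, which avoids the extra inference pass and catches type mismatches early. A minimal sketch matching the two-column example above:
from pyspark.sql.types import StructType, StructField, IntegerType, StringType

schema = StructType([
    StructField("id", IntegerType(), True),   # True means nullable
    StructField("name", StringType(), True),
])
df = spark.createDataFrame([(1, "A"), (2, "B")], schema)
df.printSchema()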
DataFrame Operations You'll Use Daily
Select columns
df.select("name", "id").show()
Filter rows
from pyspark.sql.functions import col

df.filter(col("id") > 5).show()
Add new columns
df = df.withColumn("new_value", col("id") * 100)
Drop columns
df = df.drop("unwanted_column")
Rename columns
df = df.withColumnRenamed("id", "user_id")
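Each of these operations returns a new DataFrame, so they chain naturally into a pipeline. A sketch combining the steps above, assuming a fresh df with id and name columns:
from pyspark.sql.functions import col

result = (df.select("id", "name")
            .filter(col("id") > 5)
            .withColumn("new_value", col("id") * 100)
            .withColumnRenamed("id", "user_id"))
result.show()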
DataFrame Actions: These Trigger Execution
df.count()    # number of rows
df.show()     # print the first 20 rows
df.collect()  # pull every row to the driver (careful with large data!)
df.take(5)    # return the first 5 rows as a list
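The distinction matters because operations like filter and select are lazy transformations: Spark only records a plan until an action forces execution. A small sketch:
from pyspark.sql.functions import col

filtered = df.filter(col("id") > 5)  # transformation: builds a plan, runs nothing
filtered.count()                     # action: the plan actually executes now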
Wrapping Up
- Follow for more such content.
- Let me know if I missed anything in the comments.
- Thank you!