Day 5: Introduction to DataFrames - The Most Important Spark API
What is a DataFrame?
A DataFrame in Spark is a distributed, column-based, table-like structure optimized for efficient data processing.
- Feels like SQL (see the snippet after this list)
- Works like Pandas
- Scales to terabytes effortlessly
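To make the "feels like SQL" point concrete, here is a minimal sketch, assuming an active SparkSession named spark (Spark shells and notebooks provide one automatically; the next section shows how the DataFrame itself is built). It registers a DataFrame as a temporary view and queries it with plain SQL:
df = spark.createDataFrame([(1, "A"), (2, "B")], ["id", "name"])
df.createOrReplaceTempView("people")  # expose the DataFrame to SQL
spark.sql("SELECT name FROM people WHERE id = 1").show()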
Why DataFrames are better than RDDs
- Use the Catalyst optimizer, which rewrites queries for speed
- Use the Tungsten execution engine for memory-efficient processing
- Support automatic code generation
- Allow SQL-like expressions
- Support file formats such as Parquet, ORC, JSON, Avro
This is why almost every production Spark job uses DataFrames. You can even watch the optimizer work, as the sketch below shows.
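Calling explain() on any DataFrame prints the query plans Catalyst produced. A minimal sketch, assuming a DataFrame df with an integer id column:
df.filter(df.id > 5).explain(True)  # prints the parsed, analyzed, and optimized logical plans plus the physical plan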
Creating Your First DataFrame
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()  # reuse or create a session
df = spark.createDataFrame([(1, "A"), (2, "B")], ["id", "name"])
df.show()
From CSV
df = spark.read.csv("sales.csv", header=True, inferSchema=True)
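Note that inferSchema makes Spark do an extra pass over the file to guess column types. The same read can also be expressed with option() calls, which leaves room for extra settings; a sketch (the DROPMALFORMED mode shown here, which silently drops unparseable rows, is just one illustrative option):
df = (spark.read
      .option("header", True)
      .option("inferSchema", True)
      .option("mode", "DROPMALFORMED")  # skip rows that fail to parse
      .csv("sales.csv"))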
From JSON
df = spark.read.json("users.json")
From Parquet (fastest!)
df = spark.read.parquet("events.parquet")
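Parquet is fast largely because it is columnar and stores the schema with the data, so Spark reads only the columns a query needs and never has to infer types. A sketch of the round trip, assuming an existing DataFrame df you want to persist:
df.write.mode("overwrite").parquet("events.parquet")  # write, replacing any previous output
events = spark.read.parquet("events.parquet")         # schema comes back for free
events.printSchema()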
Understanding Schema
Every DataFrame has a schema (column name + data type).
df.printSchema()
Example output
root
|-- id: integer (nullable = true)
|-- name: string (nullable = true)
The schema matters because Spark enforces column types at runtime, and the Catalyst optimizer relies on them to plan queries.
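Instead of letting Spark infer types, you can declare the schema up front, which avoids the extra inference pass and catches type mismatches early. A minimal sketch matching the two-column example above:
from pyspark.sql.types import StructType, StructField, IntegerType, StringType

schema = StructType([
    StructField("id", IntegerType(), True),   # True means nullable
    StructField("name", StringType(), True),
])
df = spark.createDataFrame([(1, "A"), (2, "B")], schema)
df.printSchema()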
DataFrame Operations You'll Use Daily
Select columns
df.select("name", "id").show()
Filter rows
from pyspark.sql.functions import col

df.filter(col("id") > 5).show()
Add new columns
df = df.withColumn("new_value", col("id") * 100)
Drop columns
df = df.drop("unwanted_column")
Rename columns
df = df.withColumnRenamed("id", "user_id")
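Each of these operations returns a new DataFrame, so they chain naturally into a pipeline. A sketch combining the steps above, assuming a fresh df with id and name columns:
from pyspark.sql.functions import col

result = (df.select("id", "name")
            .filter(col("id") > 5)
            .withColumn("new_value", col("id") * 100)
            .withColumnRenamed("id", "user_id"))
result.show()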
DataFrame Actions: These Trigger Execution
df.count()    # number of rows
df.show()     # print the first 20 rows
df.collect()  # pull every row to the driver (careful with large data!)
df.take(5)    # return the first 5 rows as a list
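The distinction matters because operations like filter and select are lazy transformations: Spark only records a plan until an action forces execution. A small sketch:
from pyspark.sql.functions import col

filtered = df.filter(col("id") > 5)  # transformation: builds a plan, runs nothing
filtered.count()                     # action: the plan actually executes now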
Wrapping Up
- Follow for more such content.
- Let me know if I missed anything in the comments.
- Thank you!