Kafka Ingestion & Processing at Scale | Rajamohan Jabbala
Source: Dev.to

Most Kafka failures don’t happen because Kafka can’t scale.
They happen because teams never did the math.
What “Good” Looks Like (SLOs First)
A production Kafka pipeline should:
- Handle N msgs/sec per topic with low latency
- Scale linearly via partitions, consumers, brokers
- Guarantee at‑least‑once (or exactly‑once) semantics
- Support fan‑out via consumer groups
- Stay within clear lag, throughput, durability SLOs
Example SLOs
- Produce latency p99 ≤ X ms
- Consumer lag ≤ Y sec (steady state)
- Recovery ≤ Z min after a 2× spike
- Availability ≥ 99.9 %
If you don’t define these, Kafka tuning becomes superstition.
Kafka Mechanics (The Parts That Actually Matter)
- Topics scale via partitions.
- One partition → one consumer per group.
- Multiple consumer groups = independent re‑reads (fan‑out).
Rule:
Max useful consumers in one group = number of partitions.
Adding more consumers than partitions does not increase throughput.
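A minimal sketch of putting that rule into practice at topic-creation time, assuming confluent-kafka-python and a broker at localhost:9092 (topic name and counts are illustrative):

```python
from confluent_kafka.admin import AdminClient, NewTopic

admin = AdminClient({"bootstrap.servers": "localhost:9092"})

# 12 partitions caps any single consumer group at 12 active consumers;
# extra group members beyond that sit idle.
futures = admin.create_topics(
    [NewTopic("orders", num_partitions=12, replication_factor=3)]
)
for topic, future in futures.items():
    future.result()  # raises if topic creation failed
```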
Logical Architecture
Producers → orders topic (P partitions, RF=3)
Kafka cluster distributes:
• One leader per partition
• Replicas across brokers
Downstream consumer groups:
• Fraud consumer group
• Analytics consumer group
• ML feature consumer group
Each group owns its own offsets and scales independently.
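Independent group.ids are all it takes to get that fan-out. A sketch, assuming confluent-kafka-python (broker address and group names are illustrative):

```python
from confluent_kafka import Consumer

def make_consumer(group_id: str) -> Consumer:
    # Each group.id keeps its own committed offsets, so every group
    # re-reads the full orders topic independently of the others.
    consumer = Consumer({
        "bootstrap.servers": "localhost:9092",
        "group.id": group_id,
        "auto.offset.reset": "earliest",
    })
    consumer.subscribe(["orders"])
    return consumer

fraud = make_consumer("fraud-detection")
analytics = make_consumer("analytics")

msg = fraud.poll(1.0)
if msg is not None and msg.error() is None:
    print(msg.partition(), msg.offset(), msg.value())
```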
Capacity Planning (The Non‑Negotiable Step)
Inputs
| Symbol | Meaning |
|---|---|
| T | msgs/sec |
| S | avg msg size (bytes, compressed) |
| R | replication factor |
| C | msgs/sec per consumer (measured) |
| H | headroom (1.3–2×) |
| RetentionDays | retention period in days |
Core Formulas
- Partitions: P = ceil((T / C) × H)
- Ingress: Ingress = T × S × R
- Egress (per group): Egress = T × S
- Storage per day (leaders): T × S × 86,400
Multiply by R and by retention days for total storage.
Example (1 M msgs/sec)
| Parameter | Value |
|---|---|
| T (msgs/sec) | 1,000,000 |
| S (bytes) | 200 |
| C (msgs/sec/consumer) | 25,000 |
| H (headroom) | 1.5 |
| R (replication factor) | 3 |
Results
- Partitions: 60
- Ingress: ~572 MB/sec
- Egress per consumer group: ~191 MB/sec
- Storage for 3‑day retention (with replicas): ~155 TB
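The same arithmetic as a quick Python sanity check (pure math, nothing Kafka-specific):

```python
import math

def plan(T, S, C, H, R, retention_days):
    partitions = math.ceil((T / C) * H)
    ingress = T * S * R            # bytes/sec written, all replicas
    egress = T * S                 # bytes/sec read per consumer group
    storage = T * S * 86_400 * R * retention_days
    return partitions, ingress, egress, storage

p, ingress, egress, storage = plan(
    T=1_000_000, S=200, C=25_000, H=1.5, R=3, retention_days=3
)
print(p)                  # 60 partitions
print(ingress / 2**20)    # ~572 MB/sec ingress
print(egress / 2**20)     # ~191 MB/sec per consumer group
print(storage / 10**12)   # ~155 TB with replicas, 3-day retention
```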
This is why “just add brokers” isn’t a strategy.
Partitioning That Doesn’t Backfire
- Use high‑cardinality keys (e.g., order_id, not country); a keyed-producer sketch follows this list.
- Monitor skew aggressively.
- Slightly over‑partition early to avoid costly re‑sharding later.
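The sketch, assuming confluent-kafka-python (broker address and the order payload are illustrative):

```python
import json
from confluent_kafka import Producer

producer = Producer({"bootstrap.servers": "localhost:9092"})

def publish(order: dict) -> None:
    # Keying by order_id spreads writes evenly across partitions;
    # keying by country would funnel most traffic into a handful of them.
    producer.produce(
        "orders",
        key=str(order["order_id"]),
        value=json.dumps(order).encode("utf-8"),
    )

publish({"order_id": 123456, "country": "US", "amount": 42.0})
producer.flush()
```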
Consumer Group Scaling
- Scale consumers up to the number of partitions (P).
- Use separate groups for separate pipelines.
- Autoscale on lag growth, not on raw lag values (see the sketch after this list).
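A sketch of that last point (thresholds and sampling interval are illustrative, not prescriptive): scale out when the lag trend is upward, not merely when lag is large.

```python
def should_scale_out(lag_samples: list[int], min_growth: float = 1_000) -> bool:
    """Decide from a window of lag samples whether to add consumers."""
    if len(lag_samples) < 2:
        return False
    deltas = [b - a for a, b in zip(lag_samples, lag_samples[1:])]
    return sum(deltas) / len(deltas) > min_growth

print(should_scale_out([50_000, 48_000, 45_000]))  # False: high lag, but draining
print(should_scale_out([5_000, 9_000, 14_000]))    # True: low lag, but growing
```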
Reliability Defaults That Work
acks=all
min.insync.replicas=2 # with RF=3
enable.idempotence=true
unclean.leader.election.enable=false
# Rack/AZ‑aware replica placement
- Apply exactly‑once semantics only where business semantics demand it.
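The producer-side half of these defaults, sketched with confluent-kafka-python (min.insync.replicas and unclean.leader.election.enable are broker/topic-level settings, not client config):

```python
from confluent_kafka import Producer

producer = Producer({
    "bootstrap.servers": "localhost:9092",
    "acks": "all",               # wait for the full in-sync replica set
    "enable.idempotence": True,  # retries cannot introduce duplicates
})
```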
Observability > Tuning
Watch
- Lag growth per partition
- p95/p99 produce & consume latency
- Under‑replicated partitions
- Disk, NIC, controller health
Scale
- Consumers → when lag rises
- Partitions → when consumers saturate
- Brokers → when disk/NIC pressure increases
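A per-partition lag snapshot is the raw input for most of the watch list above. A sketch, assuming confluent-kafka-python (group/topic names and the partition count are illustrative):

```python
from confluent_kafka import Consumer, TopicPartition

consumer = Consumer({
    "bootstrap.servers": "localhost:9092",
    "group.id": "fraud-detection",
    "enable.auto.commit": False,
})

partitions = [TopicPartition("orders", p) for p in range(60)]
for tp in consumer.committed(partitions, timeout=10):
    low, high = consumer.get_watermark_offsets(tp, timeout=10)
    committed = tp.offset if tp.offset >= 0 else low  # negative offset = no commit yet
    print(f"partition {tp.partition}: lag = {high - committed}")

consumer.close()
```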
Final Takeaway
Kafka doesn’t scale because it’s Kafka.
It scales because you designed it to.
- Math beats hope.
- Measurements beat myths.