Kafka Ingestion & Processing at Scale | Rajamohan Jabbala
Source: Dev.to

Most Kafka failures don’t happen because Kafka can’t scale.
They happen because teams never did the math.
What “Good” Looks Like (SLOs First)
A production Kafka pipeline should:
- Handle N msgs/sec per topic with low latency
- Scale linearly via partitions, consumers, brokers
- Guarantee at‑least‑once (or exactly‑once) semantics
- Support fan‑out via consumer groups
- Stay within clear lag, throughput, durability SLOs
Example SLOs
- Produce latency p99 ≤ X ms
- Consumer lag ≤ Y sec (steady state)
- Recovery ≤ Z min after a 2× spike
- Availability ≥ 99.9 %
If you don’t define these, Kafka tuning becomes superstition.
Kafka Mechanics (The Parts That Actually Matter)
- Topics scale via partitions.
- One partition → one consumer per group.
- Multiple consumer groups = independent re‑reads (fan‑out).
Rule:
Max useful consumers in one group = number of partitions.
Adding more consumers than partitions does not increase throughput.
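A minimal sketch of putting that rule into practice at topic-creation time, assuming confluent-kafka-python and a broker at localhost:9092 (topic name and counts are illustrative):

```python
from confluent_kafka.admin import AdminClient, NewTopic

admin = AdminClient({"bootstrap.servers": "localhost:9092"})

# 12 partitions caps any single consumer group at 12 active consumers;
# extra group members beyond that sit idle.
futures = admin.create_topics(
    [NewTopic("orders", num_partitions=12, replication_factor=3)]
)
for topic, future in futures.items():
    future.result()  # raises if topic creation failed
```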
Logical Architecture
Producers → orders topic (P partitions, RF=3)
Kafka cluster distributes:
• One leader per partition
• Replicas across brokers
Downstream consumer groups:
• Fraud consumer group
• Analytics consumer group
• ML feature consumer group
Each group owns its own offsets and scales independently.
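Independent group.ids are all it takes to get that fan-out. A sketch, assuming confluent-kafka-python (broker address and group names are illustrative):

```python
from confluent_kafka import Consumer

def make_consumer(group_id: str) -> Consumer:
    # Each group.id keeps its own committed offsets, so every group
    # re-reads the full orders topic independently of the others.
    consumer = Consumer({
        "bootstrap.servers": "localhost:9092",
        "group.id": group_id,
        "auto.offset.reset": "earliest",
    })
    consumer.subscribe(["orders"])
    return consumer

fraud = make_consumer("fraud-detection")
analytics = make_consumer("analytics")

msg = fraud.poll(1.0)
if msg is not None and msg.error() is None:
    print(msg.partition(), msg.offset(), msg.value())
```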
Capacity Planning (The Non‑Negotiable Step)
Inputs
| Symbol | Meaning |
|---|---|
| T | msgs/sec |
| S | avg msg size (bytes, compressed) |
| R | replication factor |
| C | msgs/sec per consumer (measured) |
| H | headroom (1.3–2×) |
| RetentionDays | retention period in days |
Core Formulas
- Partitions: P = ceil((T / C) × H)
- Ingress: Ingress = T × S × R
- Egress (per group): Egress = T × S
- Storage per day (leaders): T × S × 86,400
Multiply by R and by retention days for total storage.
Example (1 M msgs/sec)
| Parameter | Value |
|---|---|
| T (msgs/sec) | 1,000,000 |
| S (bytes) | 200 |
| C (msgs/sec/consumer) | 25,000 |
| H (headroom) | 1.5 |
| R (replication factor) | 3 |
Results
- Partitions: 60
- Ingress: ~572 MB/sec
- Egress per consumer group: ~191 MB/sec
- Storage for 3‑day retention (with replicas): ~155 TB
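The same arithmetic as a quick Python sanity check (pure math, nothing Kafka-specific):

```python
import math

def plan(T, S, C, H, R, retention_days):
    partitions = math.ceil((T / C) * H)
    ingress = T * S * R            # bytes/sec written, all replicas
    egress = T * S                 # bytes/sec read per consumer group
    storage = T * S * 86_400 * R * retention_days
    return partitions, ingress, egress, storage

p, ingress, egress, storage = plan(
    T=1_000_000, S=200, C=25_000, H=1.5, R=3, retention_days=3
)
print(p)                  # 60 partitions
print(ingress / 2**20)    # ~572 MB/sec ingress
print(egress / 2**20)     # ~191 MB/sec per consumer group
print(storage / 10**12)   # ~155 TB with replicas, 3-day retention
```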
This is why “just add brokers” isn’t a strategy.
Partitioning That Doesn’t Backfire
- Use high‑cardinality keys (e.g., order_id, not country); a keyed-producer sketch follows this list.
- Monitor skew aggressively.
- Slightly over‑partition early to avoid costly re‑sharding later.
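The sketch, assuming confluent-kafka-python (broker address and the order payload are illustrative):

```python
import json
from confluent_kafka import Producer

producer = Producer({"bootstrap.servers": "localhost:9092"})

def publish(order: dict) -> None:
    # Keying by order_id spreads writes evenly across partitions;
    # keying by country would funnel most traffic into a handful of them.
    producer.produce(
        "orders",
        key=str(order["order_id"]),
        value=json.dumps(order).encode("utf-8"),
    )

publish({"order_id": 123456, "country": "US", "amount": 42.0})
producer.flush()
```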
Consumer Group Scaling
- Scale consumers up to the number of partitions (P).
- Use separate groups for separate pipelines.
- Autoscale on lag growth, not on raw lag values (see the sketch after this list).
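A sketch of that last point (thresholds and sampling interval are illustrative, not prescriptive): scale out when the lag trend is upward, not merely when lag is large.

```python
def should_scale_out(lag_samples: list[int], min_growth: float = 1_000) -> bool:
    """Decide from a window of lag samples whether to add consumers."""
    if len(lag_samples) < 2:
        return False
    deltas = [b - a for a, b in zip(lag_samples, lag_samples[1:])]
    return sum(deltas) / len(deltas) > min_growth

print(should_scale_out([50_000, 48_000, 45_000]))  # False: high lag, but draining
print(should_scale_out([5_000, 9_000, 14_000]))    # True: low lag, but growing
```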
Reliability Defaults That Work
acks=all
min.insync.replicas=2 # with RF=3
enable.idempotence=true
unclean.leader.election.enable=false
# Rack/AZ‑aware replica placement
- Apply exactly‑once semantics only where business semantics demand it.
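The producer-side half of these defaults, sketched with confluent-kafka-python (min.insync.replicas and unclean.leader.election.enable are broker/topic-level settings, not client config):

```python
from confluent_kafka import Producer

producer = Producer({
    "bootstrap.servers": "localhost:9092",
    "acks": "all",               # wait for the full in-sync replica set
    "enable.idempotence": True,  # retries cannot introduce duplicates
})
```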
Observability > Tuning
Watch
- Lag growth per partition
- p95/p99 produce & consume latency
- Under‑replicated partitions
- Disk, NIC, controller health
Scale
- Consumers → when lag rises
- Partitions → when consumers saturate
- Brokers → when disk/NIC pressure increases
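A per-partition lag snapshot is the raw input for most of the watch list above. A sketch, assuming confluent-kafka-python (group/topic names and the partition count are illustrative):

```python
from confluent_kafka import Consumer, TopicPartition

consumer = Consumer({
    "bootstrap.servers": "localhost:9092",
    "group.id": "fraud-detection",
    "enable.auto.commit": False,
})

partitions = [TopicPartition("orders", p) for p in range(60)]
for tp in consumer.committed(partitions, timeout=10):
    low, high = consumer.get_watermark_offsets(tp, timeout=10)
    committed = tp.offset if tp.offset >= 0 else low  # negative offset = no commit yet
    print(f"partition {tp.partition}: lag = {high - committed}")

consumer.close()
```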
Final Takeaway
Kafka doesn’t scale because it’s Kafka.
It scales because you designed it to.
- Math beats hope.
- Measurements beat myths.