Kafka Ingestion & Processing at Scale | Rajamohan Jabbala

Published: January 7, 2026 at 03:04 AM EST
2 min read
Source: Dev.to


Most Kafka failures don’t happen because Kafka can’t scale.
They happen because teams never did the math.

What “Good” Looks Like (SLOs First)

A production Kafka pipeline should:

  • Handle N msgs/sec per topic with low latency
  • Scale linearly via partitions, consumers, brokers
  • Guarantee at‑least‑once (or exactly‑once) semantics
  • Support fan‑out via consumer groups
  • Stay within clear lag, throughput, durability SLOs

Example SLOs

  • Produce latency p99 ≤ X ms
  • Consumer lag ≤ Y sec (steady state)
  • Recovery ≤ Z min after a 2× spike
  • Availability ≥ 99.9 %

If you don’t define these, Kafka tuning becomes superstition.

Kafka Mechanics (The Parts That Actually Matter)

  • Topics scale via partitions.
  • One partition → one consumer per group.
  • Multiple consumer groups = independent re‑reads (fan‑out).

Rule:
Max useful consumers in one group = number of partitions.
Adding more consumers than partitions does not increase throughput.
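
As a concrete sketch of these mechanics (assuming the confluent-kafka Python client and a placeholder broker at localhost:9092), two consumers with different group.id values each receive the full orders stream, while extra consumers inside one group beyond the partition count simply sit idle:

```python
from confluent_kafka import Consumer

def make_consumer(group_id: str) -> Consumer:
    # Each group.id keeps its own committed offsets, so every group
    # re-reads the full topic independently (fan-out).
    return Consumer({
        "bootstrap.servers": "localhost:9092",   # placeholder
        "group.id": group_id,
        "auto.offset.reset": "earliest",
    })

fraud = make_consumer("fraud")
analytics = make_consumer("analytics")
for c in (fraud, analytics):
    c.subscribe(["orders"])

# Both groups receive every record on "orders"; within a single group,
# at most <partition count> consumers are assigned work, extras sit idle.
for name, c in (("fraud", fraud), ("analytics", analytics)):
    msg = c.poll(5.0)
    if msg is not None and msg.error() is None:
        print(f"{name} read offset {msg.offset()} from partition {msg.partition()}")
```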

Logical Architecture

Producers → orders topic (P partitions, RF=3)

Kafka cluster distributes:
  • One leader per partition
  • Replicas across brokers

Downstream consumer groups:
  • Fraud consumer group
  • Analytics consumer group
  • ML feature consumer group

Each group owns its own offsets and scales independently.
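
A minimal provisioning sketch for this topology, again assuming the confluent-kafka client; the partition count of 12 is purely illustrative (derive P from the capacity math below), and Kafka places one leader plus two follower replicas per partition across the brokers:

```python
from confluent_kafka.admin import AdminClient, NewTopic

admin = AdminClient({"bootstrap.servers": "localhost:9092"})  # placeholder

# 12 partitions is an illustrative value; compute P from the formulas below.
orders = NewTopic("orders", num_partitions=12, replication_factor=3)

futures = admin.create_topics([orders])
futures["orders"].result()  # raises if topic creation failed
```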

Capacity Planning (The Non‑Negotiable Step)

Inputs

| Symbol | Meaning |
| --- | --- |
| T | msgs/sec |
| S | avg msg size (bytes, compressed) |
| R | replication factor |
| C | msgs/sec per consumer (measured) |
| H | headroom (1.3–2×) |
| RetentionDays | retention period in days |

Core Formulas

  • Partitions: P = ceil((T / C) × H)
  • Ingress: Ingress = T × S × R
  • Egress (per group): Egress = T × S
  • Storage per day (leaders): T × S × 86,400
    Multiply by R and retention days for total storage.

Example (1 M msgs/sec)

| Parameter | Value |
| --- | --- |
| T (msgs/sec) | 1,000,000 |
| S (bytes) | 200 |
| C (msgs/sec/consumer) | 25,000 |
| H (headroom) | 1.5 |
| RF (replication factor) | 3 |

Results

  • Partitions: 60
  • Ingress: ~600 MB/sec (~572 MiB/sec)
  • Egress per consumer group: ~200 MB/sec (~191 MiB/sec)
  • Storage for 3-day retention (with replicas): ~155 TB
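
These numbers fall straight out of the formulas above; here is a minimal sketch that reproduces them (rates reported in MiB/sec and storage in decimal TB to match the figures):

```python
import math

def capacity(T, S, C, H, R, retention_days):
    # Partitions: P = ceil((T / C) × H)
    partitions = math.ceil((T / C) * H)
    ingress = T * S * R                              # bytes/sec written across replicas
    egress_per_group = T * S                         # bytes/sec read by one consumer group
    storage = T * S * 86_400 * R * retention_days    # bytes retained, replicas included
    return partitions, ingress, egress_per_group, storage

p, ingress, egress, storage = capacity(
    T=1_000_000, S=200, C=25_000, H=1.5, R=3, retention_days=3
)
MiB, TB = 1024**2, 10**12
print(f"Partitions: {p}")                                  # 60
print(f"Ingress: {ingress / MiB:.0f} MiB/sec")             # ~572
print(f"Egress per group: {egress / MiB:.0f} MiB/sec")     # ~191
print(f"Storage (3 days, RF=3): {storage / TB:.1f} TB")    # ~155.5
```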

This is why “just add brokers” isn’t a strategy.

Partitioning That Doesn’t Backfire

  • Use high-cardinality keys (e.g., order_id, not country); see the keying sketch after this list.
  • Monitor skew aggressively.
  • Slightly over‑partition early to avoid costly re‑sharding later.
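
A minimal keying sketch (confluent-kafka producer, placeholder broker address): the default partitioner hashes the record key, so a high-cardinality key such as order_id spreads writes evenly, while a key like country would funnel most traffic into a handful of partitions:

```python
import json
from confluent_kafka import Producer

producer = Producer({"bootstrap.servers": "localhost:9092"})  # placeholder

def publish_order(order: dict) -> None:
    # Keying by order_id (high cardinality) distributes records evenly;
    # keying by order["country"] would concentrate load on a few partitions.
    producer.produce(
        "orders",
        key=str(order["order_id"]),
        value=json.dumps(order).encode("utf-8"),
    )

publish_order({"order_id": 123456, "country": "DE", "amount": 42.0})
producer.flush()
```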

Consumer Group Scaling

  • Scale consumers up to the number of partitions (P).
  • Use separate groups for separate pipelines.
  • Autoscale on lag growth, not on raw lag values (see the sketch below).
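
The scaling signal is the trend of lag, not its absolute value; a minimal sketch of that decision, with lag samples assumed to come from whatever monitoring you already run:

```python
def should_scale_out(lag_samples: list[tuple[float, int]],
                     growth_threshold: float = 1000.0) -> bool:
    """lag_samples: (unix_timestamp, total_lag_in_messages) pairs, oldest first.

    A large but shrinking lag just means the group is catching up;
    only sustained positive growth means consumers can't keep pace.
    """
    (t0, lag0), (t1, lag1) = lag_samples[0], lag_samples[-1]
    growth_rate = (lag1 - lag0) / (t1 - t0)   # messages/sec
    return growth_rate > growth_threshold

# Lag fell from 120k to 90k over 60s: no scale-out despite the large absolute lag.
print(should_scale_out([(0, 120_000), (60, 90_000)]))    # False
# Lag grew from 10k to 400k over 60s: consumers are falling behind.
print(should_scale_out([(0, 10_000), (60, 400_000)]))    # True
```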

Reliability Defaults That Work

```properties
acks=all
min.insync.replicas=2   # with RF=3
enable.idempotence=true
unclean.leader.election.enable=false
# plus rack/AZ-aware replica placement
```

  • Apply exactly-once semantics only where business semantics demand it.
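
A sketch of the producer-side half of these defaults, assuming the confluent-kafka client; the remaining two settings are topic/broker configuration (they can be passed, for example, through NewTopic's config in the earlier provisioning sketch):

```python
from confluent_kafka import Producer

# Producer-side durability defaults; min.insync.replicas and
# unclean.leader.election.enable are set on the topic or broker instead.
producer = Producer({
    "bootstrap.servers": "localhost:9092",   # placeholder
    "acks": "all",                 # every write waits for all in-sync replicas
    "enable.idempotence": True,    # retries cannot create duplicates
})
```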

Observability > Tuning

Watch

  • Lag growth per partition (see the lag sketch below)
  • p95/p99 produce & consume latency
  • Under‑replicated partitions
  • Disk, NIC, controller health
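
Per-partition lag is simply the gap between the partition's high watermark and the group's committed offset; a minimal sketch that reads it straight from the broker, assuming the confluent-kafka client and the fraud group from earlier:

```python
from confluent_kafka import Consumer, TopicPartition

consumer = Consumer({
    "bootstrap.servers": "localhost:9092",   # placeholder
    "group.id": "fraud",                     # the group whose lag we measure
    "enable.auto.commit": False,
})

def partition_lag(topic: str, partition: int) -> int:
    tp = TopicPartition(topic, partition)
    # High watermark = offset of the next message to be written to the partition.
    _low, high = consumer.get_watermark_offsets(tp, timeout=5.0)
    committed = consumer.committed([tp], timeout=5.0)[0].offset
    # A group with no committed offset yet reports a negative sentinel value.
    return high - max(committed, 0)

for p in range(12):  # 12 = illustrative partition count
    print(f"orders[{p}] lag = {partition_lag('orders', p)}")
```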

Scale

  • Consumers → when lag rises
  • Partitions → when consumers saturate
  • Brokers → when disk/NIC pressure increases

Final Takeaway

Kafka doesn’t scale because it’s Kafka.
It scales because you designed it to.

  • Math beats hope.
  • Measurements beat myths.