Kafka Consumer Health Checks: Dead or Alive

Published: December 13, 2025 at 08:56 AM EST
4 min read
Source: Dev.to

Many of you have been there – it’s 3 AM, your phone buzzes, and you’re staring at an alert: Kafka consumer lag exceeded threshold.
You stumble to your laptop, check the metrics, and… everything looks fine. The messages are processing. The lag was just a temporary spike from a slow downstream service. You silence the alert and go back to bed, mentally adding another item to your “fix someday” list.

Sound familiar?

Here’s what’s actually happening: lag tells you how far behind you are, but not whether you’re making progress. A consumer can sit at 1 000 messages lag for 10 minutes because it’s stuck, or because it’s processing at exactly the rate messages arrive. From lag alone, you can’t tell the difference.

The real question isn’t “how much lag?” — it’s “are we making progress?”


Why Traditional Health Checks Fall Short

The “Always Healthy” Approach

import "net/http"

func healthHandler(w http.ResponseWriter, r *http.Request) {
    w.WriteHeader(http.StatusOK)
    w.Write([]byte("OK")) // "I'm alive!" (Are you though?)
}

This tells you nothing. Your consumer could be completely frozen, and Kubernetes (or any orchestrator) would happily keep it alive.

The “Ping the Broker” Approach

Verifying connectivity is better than nothing, but connectivity doesn’t mean your consumer is processing messages. Your network might be fine while your consumer group is stuck in an infinite rebalance loop.
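
As a rough sketch of what this approach usually looks like with Sarama (brokerPingHandler is a hypothetical helper name; it assumes you already have a sarama.Client):

import (
    "net/http"

    "github.com/IBM/sarama"
)

// brokerPingHandler only proves we can reach the cluster, nothing more.
func brokerPingHandler(client sarama.Client) http.HandlerFunc {
    return func(w http.ResponseWriter, r *http.Request) {
        // RefreshMetadata round-trips to the brokers; an error means connectivity is gone
        if err := client.RefreshMetadata(); err != nil {
            http.Error(w, "broker unreachable: "+err.Error(), http.StatusServiceUnavailable)
            return
        }
        w.WriteHeader(http.StatusOK)
        w.Write([]byte("connected")) // connected, but not necessarily consuming
    }
}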

The “Lag Threshold” Trap

Most teams monitor consumer lag through Prometheus or similar tools and set alerts when lag exceeds a threshold – maybe 100 messages, maybe 1 000, maybe a million.

What should that threshold be?

  • Set it too low → You wake up at 3 AM because a downstream API responded slowly for 30 seconds. The consumer is fine, just briefly overwhelmed. False alarm.
  • Set it too high → You miss genuine issues until they’ve already cascaded into customer‑visible problems. By the time your alert fires, you’re in damage‑control mode.

The granularity problem: A single stuck pod in a deployment of 50 can go unnoticed if you’re monitoring average lag. That pod quietly fails while the other 49 keep working, masking the issue in your aggregate metrics.

The fundamental tension: fast detection means false positives; reliable alerts mean delayed detection. Sophisticated heuristics or ML models add complexity and still detect problems with a noticeable delay.

The Key Insight: Progress vs. Position

What we really need is a way to answer a simple question: Is this specific consumer instance making progress?

Instead of measuring lag, we measure progress.

  1. Heartbeat – Track the timestamp of the last processed message for each partition. If we’re processing messages, we’re healthy.
  2. Verification – If enough time passes without new messages (a configurable timeout, called StuckTimeout in the implementation below), we query the Kafka broker for the latest offset and compare:
Comparison                         Result
Consumer offset < Broker offset    UNHEALTHY – messages are available but not being processed (stuck)
Consumer offset ≥ Broker offset    HEALTHY – caught up, just idle

This cleanly distinguishes a stuck consumer from an idle one.
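
A minimal sketch of this check, independent of any particular Kafka client (all names here are illustrative, not the library's actual API):

import (
    "sync"
    "time"
)

// progressTracker sketches the heartbeat + verification idea.
type progressTracker struct {
    mu           sync.Mutex
    lastSeen     map[int32]time.Time // partition -> time of last processed message
    lastOffset   map[int32]int64     // partition -> offset of last processed message
    stuckTimeout time.Duration
}

func newProgressTracker(stuckTimeout time.Duration) *progressTracker {
    return &progressTracker{
        lastSeen:     make(map[int32]time.Time),
        lastOffset:   make(map[int32]int64),
        stuckTimeout: stuckTimeout,
    }
}

// Track is called from the consume loop after every successfully processed message.
func (t *progressTracker) Track(partition int32, offset int64) {
    t.mu.Lock()
    defer t.mu.Unlock()
    t.lastSeen[partition] = time.Now()
    t.lastOffset[partition] = offset
}

// Healthy reports whether every tracked partition is either active or caught up.
// latestOffset asks the broker for a partition's high watermark (the next offset
// to be produced), so "caught up" means lastOffset == high watermark - 1.
func (t *progressTracker) Healthy(latestOffset func(partition int32) (int64, error)) (bool, error) {
    t.mu.Lock()
    defer t.mu.Unlock()
    for partition, seen := range t.lastSeen {
        if time.Since(seen) < t.stuckTimeout {
            continue // heartbeat: this partition processed a message recently
        }
        high, err := latestOffset(partition) // verification: ask the broker
        if err != nil {
            return false, err
        }
        if t.lastOffset[partition] < high-1 {
            return false, nil // messages are waiting but we are not consuming them: stuck
        }
        // no new messages on this partition: idle, not stuck
    }
    return true, nil
}

Because the broker is only queried after the timeout has elapsed, the common case (messages flowing) never touches the network.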

How It Works: Three Scenarios

Scenario 1: Active Processing (Healthy)

When messages are flowing and your consumer is processing them, health checks are instant – no broker queries needed because recent activity has been observed.

Scenario 2: Stuck Consumer (Unhealthy)

The consumer freezes while messages keep arriving. The broker query reveals there’s work to do, but the consumer isn’t processing it.

Scenario 3: Idle Consumer (Healthy)

The consumer is caught up and simply waiting for new messages.

The same timeout leads to different outcomes based on broker state.

Implementation

I’ve packaged this logic into kafka-pulse-go, a lightweight library that works with the most popular Kafka clients. The core logic is decoupled from any specific client implementation and ships adapters for the common Go Kafka clients; the example below uses the Sarama adapter.

Setting Up the Monitor

Below is a minimal example using Sarama:

import (
    "log"
    "time"

    "github.com/IBM/sarama"
    adapter "github.com/vmyroslav/kafka-pulse-go/adapter/sarama"
    "github.com/vmyroslav/kafka-pulse-go/pulse"
)

func main() {
    // Your existing Sarama setup
    config := sarama.NewConfig()
    client, err := sarama.NewClient([]string{"broker1:9092", "broker2:9092"}, config)
    if err != nil {
        log.Fatal(err)
    }

    // Create the health‑checker adapter
    brokerClient := adapter.NewClientAdapter(client)

    // Configure the monitor
    monitorConfig := pulse.Config{
        Logger:       log.Default(),
        StuckTimeout: 30 * time.Second,
    }

    monitor, err := pulse.NewHealthChecker(monitorConfig, brokerClient)
    if err != nil {
        log.Fatal(err)
    }

    // Pass `monitor` to your consumer and expose its health endpoint
}
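
To wire the monitor into an HTTP liveness probe, you can hide it behind a tiny interface. This is a sketch only: the Healthy method name below is illustrative, not necessarily kafka-pulse-go's real API, so check the library's README for the exact call.

import (
    "context"
    "net/http"
)

// progressChecker is a stand-in for whatever the monitor exposes; the method
// name is an assumption made for this example.
type progressChecker interface {
    Healthy(ctx context.Context) (bool, error)
}

// livenessHandler returns 200 while the consumer is making progress (or idle
// and caught up) and 503 when it looks stuck, so Kubernetes can restart the pod.
func livenessHandler(checker progressChecker) http.HandlerFunc {
    return func(w http.ResponseWriter, r *http.Request) {
        ok, err := checker.Healthy(r.Context())
        if err != nil || !ok {
            http.Error(w, "consumer is not making progress", http.StatusServiceUnavailable)
            return
        }
        w.WriteHeader(http.StatusOK)
    }
}

Register the handler on your HTTP server (for example at /healthz) and point the Kubernetes livenessProbe at that path, adapting the interface to whatever method the library actually exposes.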

What is StuckTimeout?
It defines how long the monitor will wait without seeing new messages before querying the broker. Choose a value based on your topic’s typical message frequency:

  • High‑throughput topics: 10–30 seconds
  • Medium‑volume topics: 1–2 minutes
  • Low‑volume topics: 5–10 minutes

Adjusting this timeout balances sensitivity (quick detection) against false positives (temporary pauses).

💡 Want to dive straight into the code? Check out the complete source code here.
