SQL on Kafka Data Does Not Require a Streaming Engine

Published: January 14, 2026 at 02:55 AM EST
4 min read
Source: Dev.to

Introduction

Stream processing engines solved a real problem: continuous computation over unbounded data. Flink, ksqlDB, and Kafka Streams gave teams a way to run SQL‑like queries against event streams without writing custom consumers.

The operational cost of that solution is widely acknowledged. Confluent’s own documentation notes that Flink “poses difficulties with deployment and cluster operations, such as tuning performance or resolving checkpoint failures” and that “organizations using Flink tend to require teams of experts dedicated to developing and maintaining it.”

For a large share of the questions teams ask of their Kafka data, a simpler architecture exists: SQL on immutable segments in object storage.

Typical Kafka Queries

In production debugging sessions and ops reviews, the questions are repetitive:

  • What is in this topic right now?
  • What happened around an incident window?
  • Where is the message with this key?
  • Are all partitions still producing data?

These are not streaming problems. They are bounded lookups over historical data. They run once, terminate, and do not need windows, watermarks, checkpoints, or state recovery.

Kafka’s Storage Model

  • Kafka does not persist records individually. It appends them to log segments and rolls those segments by size or time.
  • Each partition is an ordered, immutable sequence of records. Once a segment is closed, it is immutable.
  • Kafka maintains sparse indexes so readers can seek by offset and timestamp efficiently. Each segment file is accompanied by lightweight offset and timestamp indexes that allow consumers to seek directly to specific message positions without scanning entire files.
  • Retention deletes whole segments; compaction rewrites segments. This means Kafka data is already organized like a SQL‑on‑files dataset. The only difference is where the files live.
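The sparse-index seek described above can be sketched in a few lines: binary-search the index for the greatest entry at or below the target offset, then scan forward from that file position. The index data here is hypothetical and the entry density (Kafka defaults to roughly one entry per 4 KB of log data) is configurable, but the lookup shape is the same.

```python
import bisect

# Sparse offset index: (relative_offset, file_position) pairs, one entry
# per index interval. Sample values for illustration only.
index = [(0, 0), (35, 4096), (71, 8192), (104, 12288)]

def seek(target_offset):
    """Return the file position to start scanning from for target_offset.

    Binary-search for the greatest indexed offset <= target_offset;
    the reader then scans forward from that position, so the index can
    stay small without sacrificing seek efficiency.
    """
    offsets = [entry[0] for entry in index]
    i = bisect.bisect_right(offsets, target_offset) - 1
    if i < 0:
        return 0  # target precedes the first indexed entry
    return index[i][1]
```

Timestamp indexes work the same way, mapping timestamps to offsets instead of offsets to file positions.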

Since Kafka 3.6.0, tiered storage allows these segments to live in object storage (e.g., S3). As of Kafka 3.9.0, the feature is production‑ready, decoupling durability from compute without changing the data model.
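Enabling tiered storage is a configuration change, not a data-model change. A minimal sketch (broker- and topic-level settings per KIP-405; the remote storage plugin itself is vendor-specific and omitted here):

```properties
# broker config: turn on the tiered storage subsystem
remote.log.storage.system.enable=true

# topic config: tier this topic's closed segments to remote storage,
# keeping only a short local window on broker disk
remote.storage.enable=true
local.retention.ms=3600000
retention.ms=604800000
```

Closed segments are copied to object storage and deleted locally after `local.retention.ms`, while overall retention (`retention.ms`) is enforced against the remote tier.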

Overhead of Streaming Engines

Streaming engines pay for capabilities most queries never use:

  • Distributed state backends
  • Coordinated checkpoints
  • Watermark tracking
  • Long‑running cluster operations

That cost is justified for continuous aggregation, joins, and real‑time inference, but it is wasted for “show me the last 10 messages”.

Production Experience

Riskified migrated from ksqlDB to Flink, noting that ksqlDB’s strict limitations on evolving schemas made it impractical for real‑world production use cases and that operational complexity required fighting the system more than working with it.

Vendor surveys from Confluent and Redpanda show that approximately 56% of all Kafka clusters run at or below 1 MB/s. Most Kafka usage is small‑data, yet teams pay big‑data operational costs.

Query Planning on Immutable Segments

If Kafka data lives as immutable segments with sparse indexes, querying it looks like any other SQL‑on‑files workload.

Query planner steps

  1. Resolve the topic to segment files.
  2. Filter by timestamp or offset metadata.
  3. Read only relevant segments.
  4. Apply predicates and return results.

There are no consumer groups, no offset commits, and no streaming job lifecycle.
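The planner steps above reduce to metadata filtering. A minimal sketch, assuming each segment exposes its offset and timestamp range (the `Segment` shape here is hypothetical, not any particular system's API):

```python
from dataclasses import dataclass

@dataclass
class Segment:
    """Metadata for one immutable log segment (illustrative shape)."""
    path: str
    base_offset: int
    min_ts: int        # epoch millis of first record
    max_ts: int        # epoch millis of last record
    size_bytes: int

def plan(segments, start_ts, end_ts):
    """Planner step 2: prune segments whose time range cannot overlap
    the query window, so step 3 reads only the relevant files."""
    return [s for s in segments
            if s.max_ts >= start_ts and s.min_ts <= end_ts]
```

Because segments are immutable, this pruning needs no coordination: the metadata cannot change under the query.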

Example Queries

-- Last 10 messages
SELECT * FROM orders TAIL 10;
-- Time‑bounded scan
SELECT * FROM orders
WHERE ts BETWEEN '2026-01-08 09:00' AND '2026-01-08 09:05';
-- Key lookup with recent window
SELECT * FROM orders
WHERE key = 'order-12345'
  AND ts >= now() - interval '24 hours';

These are indexed file accesses with SQL semantics, not stream processing.

Performance Considerations

Object storage is slower than broker‑local disk: remote reads typically add latency compared with local block storage. For most debugging and ops workflows, a latency of a second or two is acceptable, whereas waiting minutes to deploy or restart a streaming job is not.

If you need sub‑second continuous results, use a streaming engine. That boundary is clear.

Cost Management

The main risk with SQL on object storage is unbounded scans. Object storage pricing is based on the amount of data stored and the number of API calls made.

A responsible system should make each query report:

  • How many segments will be read
  • How many bytes will be scanned
  • The estimated request cost

Queries without time bounds should require explicit opt‑in, keeping cost a conscious decision instead of a surprise.
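The pre-query report above can be sketched as a simple estimator over the planner's pruned segment list. The pricing constants below are placeholders, not any provider's actual rates, and the one-request-per-segment assumption is a simplification:

```python
def scan_report(segment_sizes, price_per_gb=0.0004, price_per_1k_requests=0.0004):
    """Estimate a query's scan footprint and request cost before running it.

    segment_sizes: byte sizes of the segments the planner selected.
    Pricing defaults are illustrative placeholders. Assumes one GET
    request per segment read.
    """
    total_bytes = sum(segment_sizes)
    requests = len(segment_sizes)
    cost = (total_bytes / 1e9) * price_per_gb \
         + (requests / 1000) * price_per_1k_requests
    return {
        "segments": requests,
        "bytes": total_bytes,
        "estimated_cost_usd": cost,
    }
```

Surfacing this report before execution, and refusing unbounded scans without an explicit flag, turns cost into a visible part of each query rather than a monthly surprise.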

When to Use Streaming Engines

Streaming engines remain the right tool for:

  • Continuous aggregations
  • Joins over live streams
  • Real‑time scoring
  • Exactly‑once outputs

Most Kafka interactions are not those. They are lookups and inspections that were forced into streaming infrastructure because no better interface existed.

Conclusion

Once Kafka data is durable as immutable segments, SQL becomes the simpler tool. Most teams do not need a streaming engine to answer Kafka questions; they need a clean, bounded way to query immutable data. SQL on Kafka segments provides exactly that.

