SQL on Kafka Data Does Not Require a Streaming Engine
Source: Dev.to
Introduction
Stream processing engines solved a real problem: continuous computation over unbounded data. Flink, ksqlDB, and Kafka Streams gave teams a way to run SQL‑like queries against event streams without writing custom consumers.
The operational cost of that solution is widely acknowledged. Confluent’s own documentation notes that Flink “poses difficulties with deployment and cluster operations, such as tuning performance or resolving checkpoint failures” and that “organizations using Flink tend to require teams of experts dedicated to developing and maintaining it.”
For a large share of the questions teams ask of their Kafka data, a simpler architecture exists: SQL on immutable segments in object storage.
Typical Kafka Queries
In production debugging sessions and ops reviews, the questions are repetitive:
- What is in this topic right now?
- What happened around an incident window?
- Where is the message with this key?
- Are all partitions still producing data?
These are not streaming problems. They are bounded lookups over historical data. They run once, terminate, and do not need windows, watermarks, checkpoints, or state recovery.
Kafka’s Storage Model
- Kafka does not persist records individually. It appends them to log segments and rolls those segments by size or time.
- Each partition is an ordered, immutable sequence of records. Once a segment is closed, it is immutable.
- Kafka maintains sparse indexes so readers can seek by offset and timestamp efficiently. Each segment file is accompanied by lightweight offset and timestamp indexes that allow consumers to seek directly to specific message positions without scanning entire files.
- Retention deletes whole segments; compaction rewrites them.
This means Kafka data is already organized like a SQL‑on‑files dataset. The only difference is where the files live.
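The sparse-index seek described above can be sketched in a few lines. The in-memory index here is a hypothetical simplification of Kafka's on-disk `.index` files, which map relative offsets to byte positions at intervals rather than per record:

```python
import bisect

# A sparse offset index: (relative_offset, byte_position) pairs, one entry
# every N records rather than one per record (hypothetical in-memory model).
sparse_index = [(0, 0), (100, 4096), (200, 8192), (300, 12288)]

def seek_position(target_offset: int) -> int:
    """Return the byte position from which a scan for target_offset starts.

    Binary-search the index for the last entry at or before the target,
    then scan forward from that position instead of from byte 0.
    """
    offsets = [entry[0] for entry in sparse_index]
    i = bisect.bisect_right(offsets, target_offset) - 1
    return sparse_index[max(i, 0)][1]

# Offset 250 falls between index entries 200 and 300: start at byte 8192.
print(seek_position(250))  # 8192
```

The timestamp index works the same way, mapping timestamps to offsets, which is what makes time-bounded queries cheap.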
Kafka 3.6.0 introduced tiered storage (KIP-405) as an early-access feature, allowing these segments to live in object storage (e.g., S3). As of Kafka 3.9.0, the feature is production‑ready, decoupling durability from compute without changing the data model.
Overhead of Streaming Engines
Streaming engines pay for capabilities most queries never use:
- Distributed state backends
- Coordinated checkpoints
- Watermark tracking
- Long‑running cluster operations
That cost is justified for continuous aggregation, joins, and real‑time inference, but it is wasted for “show me the last 10 messages”.
Production Experience
Riskified migrated from ksqlDB to Flink, noting that ksqlDB’s strict limitations on evolving schemas made it impractical for real‑world production use cases and that operational complexity required fighting the system more than working with it.
Vendor surveys from Confluent and Redpanda show that approximately 56% of all Kafka clusters run at or below 1 MB/s. Most Kafka usage is small‑data, yet teams pay big‑data operational costs.
Query Planning on Immutable Segments
If Kafka data lives as immutable segments with sparse indexes, querying it looks like any other SQL‑on‑files workload.
Query planner steps
- Resolve the topic to segment files.
- Filter by timestamp or offset metadata.
- Read only relevant segments.
- Apply predicates and return results.
There are no consumer groups, no offset commits, and no streaming job lifecycle.
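The planning steps above reduce to an overlap filter over segment metadata. A minimal sketch, assuming each segment's min/max timestamps are known from a catalog (the `Segment` fields here are an assumption, not an actual schema):

```python
from dataclasses import dataclass

@dataclass
class Segment:
    path: str    # segment file in object storage
    min_ts: int  # first record timestamp (ms)
    max_ts: int  # last record timestamp (ms)

def plan(segments: list[Segment], start_ts: int, end_ts: int) -> list[str]:
    """Select only the segments whose timestamp range overlaps the query window."""
    return [s.path for s in segments if s.max_ts >= start_ts and s.min_ts <= end_ts]

segments = [
    Segment("00000000.log", 1_000, 1_999),
    Segment("00010000.log", 2_000, 2_999),
    Segment("00020000.log", 3_000, 3_999),
]
print(plan(segments, 2_500, 3_200))  # ['00010000.log', '00020000.log']
```

Everything outside the window is never fetched, which is the whole advantage over replaying a topic through a consumer.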
Example Queries
-- Last 10 messages
SELECT * FROM orders TAIL 10;
-- Time‑bounded scan
SELECT * FROM orders
WHERE ts BETWEEN '2026-01-08 09:00' AND '2026-01-08 09:05';
-- Key lookup with recent window
SELECT * FROM orders
WHERE key = 'order-12345'
AND ts >= now() - interval '24 hours';
These are indexed file accesses with SQL semantics, not stream processing.
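The TAIL query above is cheap for the same reason: segment metadata bounds the read to the newest files. A sketch, assuming per-segment record counts are available from metadata:

```python
def segments_for_tail(segment_record_counts: list[int], n: int) -> list[int]:
    """Walk segments from newest to oldest until n records are covered,
    returning the indices of the segments that must be read."""
    needed, chosen = n, []
    for i in range(len(segment_record_counts) - 1, -1, -1):
        chosen.append(i)
        needed -= segment_record_counts[i]
        if needed <= 0:
            break
    return list(reversed(chosen))

# Four segments of 1,000 records each; TAIL 10 touches only the last one.
print(segments_for_tail([1000, 1000, 1000, 1000], 10))  # [3]
```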
Performance Considerations
Object storage is slower than broker‑local disk, so queries against tiered segments see higher latency than local reads. For most debugging and ops workflows, a one‑ or two‑second response is acceptable, whereas waiting minutes to deploy or restart a streaming job is not.
If you need sub‑second continuous results, use a streaming engine. That boundary is clear.
Cost Management
The main risk with SQL on object storage is unbounded scans. Object storage pricing is based on the amount of data stored and the number of API calls made.
A responsible system should make each query report:
- How many segments will be read
- How many bytes will be scanned
- The estimated request cost
Queries without time bounds should require explicit opt‑in, keeping cost a conscious decision instead of a surprise.
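A pre-flight report of this kind is straightforward to compute from the same segment metadata the planner uses. A minimal sketch; the per-request price is a hypothetical figure modeled on typical object-storage GET pricing, not a quote from any provider's price list:

```python
# Hypothetical per-GET price (an assumption for illustration only).
PRICE_PER_GET = 0.0000004

def query_report(segment_sizes: list[int]) -> dict:
    """Report what a query will touch before it runs."""
    return {
        "segments_read": len(segment_sizes),
        "bytes_scanned": sum(segment_sizes),
        "estimated_request_cost": len(segment_sizes) * PRICE_PER_GET,
    }

report = query_report([128 * 1024 * 1024] * 3)  # three 128 MiB segments
print(report["segments_read"], report["bytes_scanned"])
```

Surfacing these numbers before execution is what turns an unbounded scan from a billing surprise into an explicit choice.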
When to Use Streaming Engines
Streaming engines remain the right tool for:
- Continuous aggregations
- Joins over live streams
- Real‑time scoring
- Exactly‑once outputs
Most Kafka interactions are not those. They are lookups and inspections that were forced into streaming infrastructure because no better interface existed.
Conclusion
Once Kafka data is durable as immutable segments, SQL becomes the simpler tool. Most teams do not need a streaming engine to answer Kafka questions; they need a clean, bounded way to query immutable data. SQL on Kafka segments provides exactly that.