KSQL was the original name for Confluent's streaming SQL engine on Apache Kafka. In 2019 it was renamed and broadened into ksqlDB, which is what the product is officially called today. ksqlDB exposes a SQL-like interface for defining continuous queries (streams and tables) that read from and write to Kafka topics, with all the heavy lifting done by the embedded Kafka Streams library. It is not a separate database with its own storage - it runs as a tier on top of Kafka, with topics as the source of truth and RocksDB instances on ksqlDB servers as local materialized state.
How ksqlDB (formerly KSQL) Works
ksqlDB models the world as two object types:
- Streams: an unbounded sequence of immutable records, backed by a Kafka topic.
- Tables: a materialized view of the latest value per key, also backed by a Kafka topic (compacted, by convention).
You define them with SQL DDL:
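```sql
-- Declare a stream over an existing topic; event time is taken from the ts column.
CREATE STREAM clickstream (
  user_id VARCHAR,
  page_url VARCHAR,
  ts BIGINT
) WITH (
  KAFKA_TOPIC = 'clickstream',
  VALUE_FORMAT = 'JSON',
  TIMESTAMP = 'ts'
);

-- Continuously count clicks per user in one-hour tumbling windows.
CREATE TABLE click_count_per_user AS
  SELECT user_id, COUNT(*) AS clicks
  FROM clickstream
  WINDOW TUMBLING (SIZE 1 HOUR)
  GROUP BY user_id
  EMIT CHANGES;
```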
Behind the scenes, ksqlDB compiles each statement to a Kafka Streams topology, deploys it as a long-running query, materializes intermediate state in RocksDB on the ksqlDB servers, and publishes results back to Kafka topics. The continuous query keeps running until you TERMINATE it.
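As a rough illustration of that lifecycle, the statements below list the running queries, inspect one, and stop it. The query ID is hypothetical: real IDs are generated by the server and appear in the `SHOW QUERIES` output.

```sql
-- List persistent queries along with their server-generated IDs.
SHOW QUERIES;

-- Inspect the execution plan and topology of one query.
-- CTAS_CLICK_COUNT_PER_USER_1 is a made-up ID; substitute one from SHOW QUERIES.
EXPLAIN CTAS_CLICK_COUNT_PER_USER_1;

-- Stop the query. Its output topic and the data already written stay in Kafka.
TERMINATE CTAS_CLICK_COUNT_PER_USER_1;
```

Terminating a query stops processing only; it does not delete the output topic.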
Key Concepts
| Concept | What it is | Notes |
|---|---|---|
| Stream | Unbounded sequence of events | Append-only, no implicit dedup |
| Table | Latest value per key | Backed by changelog topic, supports updates |
| Persistent query | A CREATE STREAM/TABLE AS SELECT that writes a topic | Runs continuously |
| Push query | SELECT ... EMIT CHANGES for live results streamed to a client | Long-running over the network |
| Pull query | SELECT ... WHERE key = ? against a materialized table | Like a key-value lookup |
| Connector | ksqlDB-managed Kafka Connect source/sink | DDL: CREATE SOURCE CONNECTOR ... |
| User-Defined Function (UDF) | Java function callable from SQL | UDF, UDAF, UDTF supported |
The ksqlDB server is stateful: it holds the Kafka Streams state stores (RocksDB) on local disk and replicates them via Kafka changelog topics for fault tolerance.
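To make the push/pull distinction concrete, here is a minimal sketch against the `clickstream` stream and `click_count_per_user` table defined earlier. The filter pattern and the key value `'alice'` are illustrative assumptions, not values from any real dataset.

```sql
-- Push query: stays open and streams each new matching row to the client.
SELECT user_id, page_url
FROM clickstream
WHERE page_url LIKE '/checkout%'
EMIT CHANGES;

-- Pull query: reads the current materialized rows for one key, then completes.
-- Because the table is windowed, one row per retained window may come back.
SELECT user_id, clicks
FROM click_count_per_user
WHERE user_id = 'alice';
```

The only syntactic difference is `EMIT CHANGES`, but the push query is a subscription while the pull query is a point-in-time lookup against server-side state.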
KSQL Origins and Current State
ksqlDB's history matters because the name "KSQL" still appears in older blog posts, talks, and Confluent docs:
| Year | Event |
|---|---|
| 2017 | KSQL announced at Kafka Summit by Confluent |
| 2019 | KSQL renamed to ksqlDB; pull queries and connector management added |
| 2020-2023 | Continued development on ksqlDB, Confluent Cloud integration |
| 2024-2026 | Active maintenance under Confluent Community License; Confluent pushing customers toward Flink for new streaming workloads on Confluent Cloud |
The current state: ksqlDB is still supported and shipped, but Confluent has positioned Apache Flink (via Confluent Cloud's managed Flink offering) as the strategic streaming compute layer going forward. ksqlDB remains a reasonable choice for SQL-first stream processing on Kafka, especially for teams already invested in it. New projects on Confluent Cloud are increasingly steered toward Flink for richer SQL semantics, watermarks, and more flexible state.
ksqlDB vs Kafka Streams vs Flink
| Aspect | ksqlDB | Kafka Streams | Apache Flink |
|---|---|---|---|
| Interface | SQL (CLI, REST, UI) | Java/Scala DSL | SQL, DataStream API (Java/Scala/Python) |
| Runtime | ksqlDB servers (embed Kafka Streams) | Embedded library in your app | Dedicated cluster (JobManager + TaskManagers) |
| State backend | RocksDB on ksqlDB servers | RocksDB in your app | RocksDB, heap, or remote state |
| Deployment | Multiple ksqlDB nodes, scales by partitions | Scales by partitions within your app | Independent cluster, scales out separately |
| Watermarks / event time | Limited | Limited (improved over time) | First-class with full semantics |
| Best for | SQL-first stream apps on Kafka | Embedded streaming in JVM apps | Complex stateful stream processing, mixed SQL+code |
| License | Confluent Community License | Apache 2.0 | Apache 2.0 |
ksqlDB sits between "I want SQL and don't want to write code" (where it shines) and "I need complex CEP, exactly-once across multiple systems, or custom watermarks" (where Flink fits better).
When to Use ksqlDB
Good fits:
- ETL between Kafka topics with SQL transformations (filter, project, join, aggregate).
- Real-time materialized views queryable by key (pull queries).
- Continuous aggregations over windowed event streams.
- Ad-hoc exploration of Kafka topic contents via SQL.
- Stream-table joins (enriching events with reference data).
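The stream-table enrichment case might look like the sketch below. The `users` table, its topic, and the `country` column are hypothetical and only serve the example; the topic is assumed to already exist and to be keyed by `user_id`.

```sql
-- Hypothetical reference table keyed by user_id.
CREATE TABLE users (
  user_id VARCHAR PRIMARY KEY,
  country VARCHAR
) WITH (
  KAFKA_TOPIC = 'users',
  VALUE_FORMAT = 'JSON'
);

-- Enrich each click with the user's country as it arrives.
-- LEFT JOIN keeps clicks from users that have no row in the table yet.
CREATE STREAM clicks_enriched AS
  SELECT c.user_id AS user_id,
         c.page_url AS page_url,
         u.country AS country
  FROM clickstream c
  LEFT JOIN users u ON c.user_id = u.user_id
  EMIT CHANGES;
```

Each click is joined against whatever the table holds for that key at the time the click is processed, so later changes to `users` do not rewrite rows already emitted.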
Poor fits:
- Heavy custom logic that's awkward to express in SQL (write Kafka Streams or Flink directly).
- Workloads that need precise event-time semantics with watermarks.
- Cross-system transactional consistency (use Kafka Connect with proper EOS configuration or a stream processor with two-phase commit).
- Cold analytical queries over historical data - use ClickHouse, Snowflake, or similar OLAP store downstream.
Common ksqlDB Pitfalls
- Pull queries on non-existent materialized state return errors. Pull queries only work against tables built by CREATE TABLE AS SELECT.
- Mismatched partitioning between source streams in a join causes silent data loss. The join keys must agree with the topic's partitioning. Use PARTITION BY to repartition first.
- Untracked state size. RocksDB stores grow with key cardinality. Hot keys and wide aggregations balloon disk.
- Persistent queries left running consume ksqlDB resources indefinitely. Audit and TERMINATE unused ones.
- Schema evolution on JSON streams is loose - prefer Avro or Protobuf via Schema Registry for production.
- Late-arriving data under default windowing is dropped. Tune GRACE PERIOD explicitly.
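Two of these pitfalls, partitioning and late data, have direct SQL remedies. The sketch below reuses the `clickstream` stream from earlier; the derived object names and the 10-minute grace value are assumptions chosen for illustration, not recommendations.

```sql
-- Re-key the stream by user_id so it is co-partitioned with
-- other sources that are keyed by user_id.
CREATE STREAM clickstream_by_user AS
  SELECT *
  FROM clickstream
  PARTITION BY user_id
  EMIT CHANGES;

-- Accept records that arrive up to 10 minutes after their window closes.
CREATE TABLE click_count_per_user_graced AS
  SELECT user_id, COUNT(*) AS clicks
  FROM clickstream_by_user
  WINDOW TUMBLING (SIZE 1 HOUR, GRACE PERIOD 10 MINUTES)
  GROUP BY user_id
  EMIT CHANGES;
```

A longer grace period admits more late records but keeps window state around longer, which feeds back into the RocksDB disk-usage pitfall above.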
ksqlDB in Production
ksqlDB servers are stateful and need to be operated like a tier of their own: monitored for RocksDB disk usage, GC behavior, query lag, and re-partitioning hotspots. The Kafka topics underneath (changelog topics, repartition topics, output topics) also need lifecycle management - retention, compaction, and proper sizing.
For monitoring the Kafka cluster that ksqlDB depends on, Pulse provides observability of Kafka brokers, consumer lag, partition health, and downstream consumer behavior. Stream-processing jobs (ksqlDB or Kafka Streams) are only as healthy as their broker topology, and Pulse's automated root-cause analysis traces stream-processing slowdowns back to broker, partition, or replication-level issues.
Frequently Asked Questions
Q: Is KSQL still a thing, or is it ksqlDB now?
A: The product was renamed from KSQL to ksqlDB in 2019. Everything formerly called KSQL is now ksqlDB, though older documentation, blog posts, and Stack Overflow answers still use "KSQL." Both names refer to the same engine.
Q: How is ksqlDB different from Kafka Streams?
A: Kafka Streams is a Java/Scala library you embed in your application code. ksqlDB is a server that runs Kafka Streams topologies generated from SQL statements, with no Java/Scala code required. ksqlDB is built on top of Kafka Streams - they're the same engine under the hood.
Q: Does ksqlDB store data itself, or just process it?
A: ksqlDB is not a primary store. Source-of-truth data lives in Kafka topics. ksqlDB maintains local materialized state in RocksDB on its servers (for tables and aggregations), but that state is rebuildable from Kafka changelog topics. If the local state is deleted or lost, ksqlDB rebuilds it from Kafka on restart.
Q: Is ksqlDB open source?
A: ksqlDB is licensed under the Confluent Community License (CCL), which permits use but restricts offering ksqlDB as a managed service. It is source-available but not OSI-approved open source. Kafka Streams (the engine inside ksqlDB) is Apache 2.0.
Q: What's the difference between a stream and a table in ksqlDB?
A: A stream is an unbounded sequence of events - every record is independent. A table is the latest value per key - new records with the same key update the existing value. Streams are append-only logs; tables are materialized views over those logs.
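A small sketch of that difference follows. The `user_emails` stream, the `latest_email` table, and the sample values are all hypothetical.

```sql
-- A stream of email changes; PARTITIONS lets ksqlDB create the topic if it is missing.
CREATE STREAM user_emails (
  user_id VARCHAR KEY,
  email VARCHAR
) WITH (
  KAFKA_TOPIC = 'user_emails',
  VALUE_FORMAT = 'JSON',
  PARTITIONS = 1
);

-- A table that keeps only the most recent email per user.
CREATE TABLE latest_email AS
  SELECT user_id, LATEST_BY_OFFSET(email) AS email
  FROM user_emails
  GROUP BY user_id
  EMIT CHANGES;

-- Two events for the same key: the stream retains both as separate records.
INSERT INTO user_emails (user_id, email) VALUES ('u1', 'old@example.com');
INSERT INTO user_emails (user_id, email) VALUES ('u1', 'new@example.com');

-- The table holds one row per key, reflecting the latest update.
SELECT * FROM latest_email WHERE user_id = 'u1';
```

The table is created before the inserts because persistent queries, by default, read only records that arrive after they start.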
Q: When should I use ksqlDB instead of Apache Flink?
A: Use ksqlDB when your workload is SQL-shaped, runs on Kafka end-to-end, and doesn't need complex event-time semantics. Use Flink when you need watermarks, complex stateful logic, custom code paths, or mixed SQL+DataStream APIs. Confluent itself is positioning Flink for new projects on Confluent Cloud.
Q: Can I run ksqlDB on AWS MSK or Azure Event Hubs?
A: ksqlDB connects to Kafka API-compatible clusters, including AWS MSK and the Kafka endpoint of Azure Event Hubs, provided the service implements the Kafka features that Kafka Streams relies on; verify this for your service tier before committing. You self-host the ksqlDB servers; the cloud provider supplies the Kafka brokers. Managed Kafka offerings generally do not run ksqlDB for you - MSK, for example, does not provide ksqlDB as a managed service.
Related Reading
- Apache Kafka Glossary: foundational Kafka concepts
- What is Apache Kafka Topic: topics underpin streams and tables
- What is Apache Kafka Stream: streaming concepts
- What is Apache Kafka Connect: ksqlDB can manage connectors
- Apache Kafka Use Cases: where stream processing fits