What is KSQL (ksqlDB)? Streaming SQL on Apache Kafka, Explained

KSQL was the original name for Confluent's streaming SQL engine on Apache Kafka. In 2019 it was renamed and broadened into ksqlDB, which is what the product is officially called today. ksqlDB exposes a SQL-like interface for defining continuous queries (streams and tables) that read from and write to Kafka topics, with all the heavy lifting done by the embedded Kafka Streams library. It is not a separate database with its own storage - it runs as a tier on top of Kafka, with topics as the source of truth and RocksDB instances on ksqlDB servers as local materialized state.

How ksqlDB (formerly KSQL) Works

ksqlDB models the world as two object types:

  • Streams: an unbounded sequence of immutable records, backed by a Kafka topic.
  • Tables: a materialized view of the latest value per key, also backed by a Kafka topic (compacted, by convention).

You define them with SQL DDL:

CREATE STREAM clickstream (
  user_id  VARCHAR,
  page_url VARCHAR,
  ts       BIGINT
) WITH (
  KAFKA_TOPIC = 'clickstream',
  VALUE_FORMAT = 'JSON',
  TIMESTAMP = 'ts'
);

CREATE TABLE click_count_per_user AS
  SELECT user_id, COUNT(*) AS clicks
  FROM clickstream
  WINDOW TUMBLING (SIZE 1 HOUR)
  GROUP BY user_id
  EMIT CHANGES;

Behind the scenes, ksqlDB compiles each statement to a Kafka Streams topology, deploys it as a long-running query, materializes intermediate state in RocksDB on the ksqlDB servers, and publishes results back to Kafka topics. The continuous query keeps running until you TERMINATE it.
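
As a rough sketch of that lifecycle from the ksqlDB CLI, the statements below list the running persistent queries, inspect one, and stop it. The query ID `CTAS_CLICK_COUNT_PER_USER_1` is a placeholder. ksqlDB generates the real ID, so copy it from the SHOW QUERIES output.

```sql
-- List persistent queries together with their generated IDs.
SHOW QUERIES;

-- Show the execution plan and topology for one query.
-- The ID below is hypothetical; use the one reported by SHOW QUERIES.
EXPLAIN CTAS_CLICK_COUNT_PER_USER_1;

-- Stop the query. The table definition and its output topic stay in place,
-- but the table no longer receives updates.
TERMINATE CTAS_CLICK_COUNT_PER_USER_1;
```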

Key Concepts

Concept | What it is | Notes
Stream | Unbounded sequence of events | Append-only, no implicit dedup
Table | Latest value per key | Backed by changelog topic, supports updates
Persistent query | A CREATE STREAM/TABLE AS SELECT that writes a topic | Runs continuously
Push query | SELECT ... EMIT CHANGES for live results streamed to a client | Long-running over the network
Pull query | SELECT ... WHERE key = ? against a materialized table | Like a key-value lookup
Connector | ksqlDB-managed Kafka Connect source/sink | DDL: CREATE SOURCE CONNECTOR ...
User-Defined Function (UDF) | Java function callable from SQL | UDF, UDAF, UDTF supported
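
To make the push/pull distinction concrete, here is a minimal sketch against the clickstream stream and click_count_per_user table defined earlier. The '/checkout%' filter and the key 'alice' are made-up values for illustration only.

```sql
-- Push query: subscribes to the stream and keeps emitting matching rows
-- until the client disconnects or cancels the query.
SELECT user_id, page_url
FROM clickstream
WHERE page_url LIKE '/checkout%'
EMIT CHANGES;

-- Pull query: looks up the current state for one key and then completes.
-- Because the table is windowed, the result has one row per retained window.
SELECT user_id, WINDOWSTART, WINDOWEND, clicks
FROM click_count_per_user
WHERE user_id = 'alice';
```

The only syntactic difference is EMIT CHANGES, but the execution model differs: the push query is a subscription, while the pull query is a point-in-time lookup served from the table's materialized state.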

The ksqlDB server is stateful: it holds the Kafka Streams state stores (RocksDB) on local disk and backs them up to Kafka changelog topics for fault tolerance.

KSQL Origins and Current State

ksqlDB's history matters because the name "KSQL" still appears in older blog posts, talks, and Confluent docs:

Year | Event
2017 | KSQL announced at Kafka Summit by Confluent
2019 | KSQL renamed to ksqlDB; pull queries and connector management added
2020-2023 | Continued development on ksqlDB, Confluent Cloud integration
2024-2026 | Active maintenance under Confluent Community License; Confluent pushing customers toward Flink for new streaming workloads on Confluent Cloud

The current state: ksqlDB is still supported and shipped, but Confluent has positioned Apache Flink (via Confluent Cloud's managed Flink offering) as the strategic streaming compute layer going forward. ksqlDB remains a reasonable choice for SQL-first stream processing on Kafka, especially for teams already invested in it. New projects on Confluent Cloud are increasingly steered toward Flink for richer SQL semantics, watermarks, and more flexible state.

Aspect | ksqlDB | Kafka Streams | Apache Flink
Interface | SQL (CLI, REST, UI) | Java/Scala DSL | SQL, DataStream API (Java/Scala/Python)
Runtime | ksqlDB servers (embed Kafka Streams) | Embedded library in your app | Dedicated cluster (JobManager + TaskManagers)
State backend | RocksDB on ksqlDB servers | RocksDB in your app | RocksDB, heap, or remote state
Deployment | Multiple ksqlDB nodes, scales by partitions | Scales by partitions within your app | Independent cluster, scales out separately
Watermarks / event time | Limited | Limited (improved over time) | First-class with full semantics
Best for | SQL-first stream apps on Kafka | Embedded streaming in JVM apps | Complex stateful stream processing, mixed SQL+code
License | Confluent Community License | Apache 2.0 | Apache 2.0

ksqlDB shines when the requirement is "I want SQL and don't want to write code." Flink fits better when the requirement is "I need complex CEP, exactly-once across multiple systems, or custom watermarks."

When to Use ksqlDB

Good fits:

  • ETL between Kafka topics with SQL transformations (filter, project, join, aggregate).
  • Real-time materialized views queryable by key (pull queries).
  • Continuous aggregations over windowed event streams.
  • Ad-hoc exploration of Kafka topic contents via SQL.
  • Stream-table joins (enriching events with reference data).
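
As an illustration of the last item, the sketch below enriches click events with reference data. The users topic, its plan column, and the clickstream_enriched name are hypothetical and not part of the examples above; only clickstream comes from the earlier DDL.

```sql
-- Reference data as a table: latest value per user_id.
-- Assumes a 'users' topic already exists with JSON values.
CREATE TABLE users (
  user_id VARCHAR PRIMARY KEY,
  plan    VARCHAR
) WITH (
  KAFKA_TOPIC = 'users',
  VALUE_FORMAT = 'JSON'
);

-- Stream-table join: each click is looked up against the table's
-- current row for that user at the time the click is processed.
CREATE STREAM clickstream_enriched AS
  SELECT c.user_id  AS user_id,
         c.page_url AS page_url,
         u.plan     AS plan
  FROM clickstream c
  LEFT JOIN users u ON c.user_id = u.user_id
  EMIT CHANGES;
```

A LEFT JOIN keeps clicks from users with no matching row (plan is then NULL); an inner join would drop them. The two topics are assumed to have the same number of partitions.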

Poor fits:

  • Heavy custom logic that's awkward to express in SQL (write Kafka Streams or Flink directly).
  • Workloads that need precise event-time semantics with watermarks.
  • Cross-system transactional consistency (use Kafka Connect with proper EOS configuration or a stream processor with two-phase commit).
  • Cold analytical queries over historical data - use ClickHouse, Snowflake, or similar OLAP store downstream.

Common ksqlDB Pitfalls

  1. Pull queries against non-materialized state return an error. A table declared with a plain CREATE TABLE over an existing topic is not materialized; one built by CREATE TABLE AS SELECT is.
  2. Mismatched partitioning between the two sides of a join can silently drop matches. Both inputs must be co-partitioned: keyed on the join column, with the same partition count and the same partitioning strategy. Use PARTITION BY to repartition first.
  3. Untracked state size. RocksDB stores grow with key cardinality. Hot keys and wide aggregations balloon disk.
  4. Persistent queries left running consume ksqlDB resources indefinitely. Audit and TERMINATE unused ones.
  5. Schema evolution on JSON streams is loose - prefer Avro or Protobuf via Schema Registry for production.
  6. Late-arriving data is dropped once a window's grace period has elapsed. Set GRACE PERIOD explicitly rather than relying on the default.
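
Pitfalls 2 and 6 both have a direct SQL remedy. The sketch below reuses the clickstream stream from earlier. The names clickstream_by_user and click_count_per_user_graced are hypothetical, and the 10-minute grace period is an arbitrary value, not a recommendation.

```sql
-- Pitfall 2: rekey the stream on the join column before joining,
-- so records for the same user_id land in the same partition.
CREATE STREAM clickstream_by_user AS
  SELECT *
  FROM clickstream
  PARTITION BY user_id
  EMIT CHANGES;

-- Pitfall 6: state the grace period explicitly. Records arriving more
-- than 10 minutes after their window has ended are dropped.
CREATE TABLE click_count_per_user_graced AS
  SELECT user_id, COUNT(*) AS clicks
  FROM clickstream
  WINDOW TUMBLING (SIZE 1 HOUR, GRACE PERIOD 10 MINUTES)
  GROUP BY user_id
  EMIT CHANGES;
```

A longer grace period tolerates more lateness at the cost of keeping window state around for longer, which ties back to pitfall 3.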

ksqlDB in Production

ksqlDB servers are stateful and need to be operated like a tier of their own: monitored for RocksDB disk usage, GC behavior, query lag, and re-partitioning hotspots. The Kafka topics underneath (changelog topics, repartition topics, output topics) also need lifecycle management - retention, compaction, and proper sizing.
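
A first pass at that housekeeping can be done from the ksqlDB CLI. This is a minimal sketch that assumes the click_count_per_user table from earlier and default server settings. It does not replace broker-side monitoring.

```sql
-- Include hidden topics in the listing. With default settings this is
-- where ksqlDB's internal changelog and repartition topics show up.
SHOW ALL TOPICS;

-- Runtime details for one table: its backing topic, the queries that
-- write to it, and processing statistics.
DESCRIBE click_count_per_user EXTENDED;
```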

For monitoring the Kafka cluster that ksqlDB depends on, Pulse provides observability of Kafka brokers, consumer lag, partition health, and downstream consumer behavior. Stream-processing jobs (ksqlDB or Kafka Streams) are only as healthy as their broker topology, and Pulse's automated root-cause analysis traces stream-processing slowdowns back to broker, partition, or replication-level issues.

Frequently Asked Questions

Q: Is KSQL still a thing, or is it ksqlDB now?
A: The product was renamed from KSQL to ksqlDB in 2019. Everything formerly called KSQL is now ksqlDB, though older documentation, blog posts, and Stack Overflow answers still use "KSQL." Both names refer to the same engine.

Q: How is ksqlDB different from Kafka Streams?
A: Kafka Streams is a Java/Scala library you embed in your application code. ksqlDB is a server that runs Kafka Streams topologies generated from SQL statements, with no Java/Scala code required. ksqlDB is built on top of Kafka Streams - they're the same engine under the hood.

Q: Does ksqlDB store data itself, or just process it?
A: ksqlDB is not a primary store. Source-of-truth data lives in Kafka topics. ksqlDB maintains local materialized state in RocksDB on its servers (for tables and aggregations), but that state is rebuildable from Kafka changelog topics. If the local state is deleted or lost, ksqlDB rebuilds it from Kafka on restart.

Q: Is ksqlDB open source?
A: ksqlDB is licensed under the Confluent Community License (CCL), which permits use but restricts offering ksqlDB as a managed service. It is source-available but not OSI-approved open source. Kafka Streams (the engine inside ksqlDB) is Apache 2.0.

Q: What's the difference between a stream and a table in ksqlDB?
A: A stream is an unbounded sequence of events - every record is independent. A table is the latest value per key - new records with the same key update the existing value. Streams are append-only logs; tables are materialized views over those logs.

Q: When should I use ksqlDB instead of Apache Flink?
A: Use ksqlDB when your workload is SQL-shaped, runs on Kafka end-to-end, and doesn't need complex event-time semantics. Use Flink when you need watermarks, complex stateful logic, custom code paths, or mixed SQL+DataStream APIs. Confluent itself is positioning Flink for new projects on Confluent Cloud.

Q: Can I run ksqlDB on AWS MSK or Azure Event Hubs?
A: ksqlDB is designed to connect to any Kafka API-compatible cluster, including AWS MSK and the Kafka-compatible endpoint of Azure Event Hubs, though how well it works depends on how completely the service implements the Kafka APIs that Kafka Streams relies on. You self-host the ksqlDB servers; the cloud provider supplies the Kafka brokers. Note that managed Kafka offerings may not run ksqlDB themselves: MSK does not provide ksqlDB as a managed service.
