What is Kafka Streams? A Practical Introduction to the Stream Processing Library

Kafka Streams is a Java/Scala library for building stream processing applications on top of Apache Kafka. It reads from Kafka topics, transforms records (filter, map, join, aggregate, window), and writes back to Kafka topics. It's not a service or framework you deploy separately - it's a library you embed in your application, which makes it dramatically simpler operationally than dedicated stream processors.

If you've ever written code like "consume from topic A, transform each record, produce to topic B," Kafka Streams is the supported, scalable, fault-tolerant version of that pattern.

What Kafka Streams Is and Isn't

Is	Isn't
A Java/JVM library	A separate service or cluster
Embedded in your app	A SQL engine (that's ksqlDB)
Backed by Kafka topics for state	A database
Designed for per-key, per-partition processing	A batch processor
Capable of exactly-once semantics	Useful outside the JVM ecosystem

Because Kafka Streams runs inside your application JVM, you deploy it with whatever you already use for Java apps (containers, Kubernetes, systemd). There's no "Streams cluster" to operate.

The Two Core Abstractions

Kafka Streams has two ways to model a topic, and almost everything else is built on them:

KStream

A KStream is a record stream - an unbounded sequence of insert-only events. Each record is independent. Use this for events: clicks, orders, sensor readings.

KStream<String, OrderEvent> orders = builder.stream("orders");
orders.filter((k, v) -> v.getAmount() > 100)
      .to("high-value-orders");

KTable

A KTable is a changelog stream - an evolving "table" of the latest value per key. Use this for state: user profiles, inventory counts, configuration.

KTable<String, UserProfile> users = builder.table("user-profiles");

The same topic can be read as a KStream (every record matters) or a KTable (latest value per key matters). Compacted topics are the natural fit for KTable because they retain the latest value per key indefinitely.
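
To make the two readings concrete, here is a minimal plain-JDK sketch (no Kafka classes; the class name and the sample records are made up for illustration). It applies both interpretations to the same sequence of key/value records: the stream view keeps every record, while the table view treats each record as an upsert and a null value as a delete.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Plain-JDK illustration of stream vs. table semantics; no Kafka classes involved.
final class StreamVsTableSketch {

    // KStream view: every record is an independent event, so all of them are kept.
    static List<String> asStream(List<String[]> records) {
        List<String> events = new ArrayList<>();
        for (String[] r : records) {
            events.add(r[0] + "=" + r[1]);
        }
        return events;
    }

    // KTable view: each record is an upsert for its key; a null value deletes the key.
    static Map<String, String> asTable(List<String[]> records) {
        Map<String, String> table = new LinkedHashMap<>();
        for (String[] r : records) {
            if (r[1] == null) {
                table.remove(r[0]);
            } else {
                table.put(r[0], r[1]);
            }
        }
        return table;
    }

    public static void main(String[] args) {
        // Hypothetical contents of a "user-profiles" topic: {key, value} pairs in offset order.
        List<String[]> topic = Arrays.asList(
            new String[] {"alice", "Berlin"},
            new String[] {"bob", "Lisbon"},
            new String[] {"alice", "Madrid"});
        System.out.println(asStream(topic)); // all three events
        System.out.println(asTable(topic));  // one entry per key, holding the latest value
    }
}
```

The table view is also roughly what log compaction converges to on the broker side, which is why the two fit together.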

There's also GlobalKTable, which materializes the entire topic on every instance (useful for lookups). And KStream-to-KTable joins are the bread and butter of enriching events with state.
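
The enrichment pattern can be modelled without Kafka at all. The sketch below is a simplified plain-JDK model of an inner stream-table join, with hypothetical class and method names; in the DSL the equivalent is KStream#join with a KTable and a ValueJoiner. It ignores timestamp synchronization between the two inputs and only shows the lookup rule: table updates change state, stream records trigger a lookup.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Plain-JDK model of an inner stream-table join; names are illustrative only.
final class StreamTableJoinSketch {
    private final Map<String, String> profiles = new HashMap<>(); // table side: latest value per key
    private final List<String> output = new ArrayList<>();        // enriched stream records

    // A table update only changes state; it emits nothing by itself.
    void onTableUpdate(String userId, String profile) {
        if (profile == null) {
            profiles.remove(userId); // null acts as a delete
        } else {
            profiles.put(userId, profile);
        }
    }

    // A stream record triggers a lookup; an inner join drops records with no match.
    void onStreamRecord(String userId, String order) {
        String profile = profiles.get(userId);
        if (profile != null) {
            output.add(order + " by " + profile);
        }
    }

    List<String> output() {
        return output;
    }
}
```

Because each stream record is joined against whatever the table holds at that moment, an order that arrives before its user profile is dropped by an inner join. A left join would emit it with a null profile instead.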

What Kafka Streams Gives You

The library handles the things that are tedious to build by hand:

Scaling and partitioning. Streams apps form a consumer group; partitions are assigned to instances. Add more instances, get more parallelism, up to the partition count.
State stores. Aggregations and joins maintain local state (a RocksDB store on each instance) backed by a Kafka changelog topic. If an instance dies, another picks up its partitions and restores state from the changelog.
Windowing. Tumbling, hopping, sliding, and session windows for time-based aggregations.
Joins. Stream-stream (windowed), stream-table, and table-table joins.
Exactly-once semantics. With processing.guarantee=exactly_once_v2, Kafka Streams uses transactions across all input/output topics and changelogs.
Re-processing. Reset the application's offsets and replay everything; state stores rebuild from scratch.
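
For the windowing item above, it helps to see how a record's timestamp maps to windows. Tumbling and hopping windows in Kafka Streams are aligned to the epoch, with an inclusive start and an exclusive end. The following plain-JDK sketch (class and method names are hypothetical) computes which window start times contain a given timestamp. Sliding and session windows are data-driven and are not covered here.

```java
import java.util.ArrayList;
import java.util.List;

// Plain-JDK sketch of epoch-aligned time windows; not the library's own code.
final class WindowSketch {

    // Start times of all windows [start, start + sizeMs) that contain the timestamp.
    // advanceMs == sizeMs gives tumbling windows; advanceMs < sizeMs gives hopping windows.
    static List<Long> windowStartsFor(long timestampMs, long sizeMs, long advanceMs) {
        List<Long> starts = new ArrayList<>();
        // Earliest aligned start whose window still covers the timestamp (never below zero).
        long start = (Math.max(0L, timestampMs - sizeMs + advanceMs) / advanceMs) * advanceMs;
        while (start <= timestampMs) {
            starts.add(start);
            start += advanceMs;
        }
        return starts;
    }
}
```

With a 60-second tumbling window, a record at 125,000 ms falls into the single window starting at 120,000 ms. With a 60-second window hopping every 30 seconds, the same record belongs to two windows, starting at 90,000 ms and 120,000 ms, so it is counted twice.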

A Tiny Example

A word count - the canonical stream processing example:

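import java.util.Arrays;
import java.util.Properties;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.*;
import org.apache.kafka.streams.kstream.*;

Properties config = new Properties();
config.put(StreamsConfig.APPLICATION_ID_CONFIG, "word-count");         // illustrative application id
config.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");  // assumes a local broker

StreamsBuilder builder = new StreamsBuilder();
KStream<String, String> text = builder.stream("input-text", Consumed.with(Serdes.String(), Serdes.String()));

text.flatMapValues(value -> Arrays.asList(value.toLowerCase().split("\\W+")))      // split each line into words
    .groupBy((key, word) -> word, Grouped.with(Serdes.String(), Serdes.String()))  // re-key by word (adds a repartition topic)
    .count(Materialized.as("word-counts"))                                         // state store backed by a changelog topic
    .toStream()
    .to("word-count-output", Produced.with(Serdes.String(), Serdes.Long()));

KafkaStreams streams = new KafkaStreams(builder.build(), config);
streams.start();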

That's a complete, scalable, fault-tolerant word counter. Run multiple instances and the work is partitioned across them. Kill an instance and another picks up its state from the changelog topic.

Topology and Internal Topics

Kafka Streams turns your code into a topology - a graph of processing nodes connected by streams. The topology is what gets deployed across instances.

The Streams runtime automatically creates internal topics for:

Repartition topics: when an operation requires a different key (e.g., groupBy), Streams writes records to an internal topic keyed correctly, then reads them back. This is what makes joins and aggregations work across partitions.
Changelog topics: backing store for state. One per state store. Compacted, so they keep only the latest value per key (changelogs for windowed stores also expire old windows by retention).

These internal topics are created automatically with naming like <app-id>-<store-name>-changelog. They count against your cluster's partition budget, so a complex topology with several joins and aggregations can multiply your topic count significantly.
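
When writing ACLs or dashboards it can help to derive the internal topic names up front. The sketch below is plain JDK with hypothetical helper names. The changelog pattern is the one given above; the repartition pattern, <app-id>-<name>-repartition, follows the same scheme. The partition estimate assumes every internal topic has as many partitions as the input topic, which is the usual default for simple topologies but is not guaranteed.

```java
// Plain-JDK helpers for reasoning about internal topics; names are illustrative.
final class InternalTopicNames {

    // Changelog topic for a state store: <app-id>-<store-name>-changelog.
    static String changelog(String applicationId, String storeName) {
        return applicationId + "-" + storeName + "-changelog";
    }

    // Repartition topic created for a re-keying operation: <app-id>-<name>-repartition.
    static String repartition(String applicationId, String operatorName) {
        return applicationId + "-" + operatorName + "-repartition";
    }

    // Rough partition budget, ASSUMING each internal topic mirrors the input partition count.
    static int estimatedInternalPartitions(int inputPartitions, int changelogTopics, int repartitionTopics) {
        return inputPartitions * (changelogTopics + repartitionTopics);
    }
}
```

Under that assumption, a topology with three state stores and two repartition steps on a 12-partition input adds about 60 partitions to the cluster.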

Exactly-Once Semantics

processing.guarantee=exactly_once_v2 (introduced in Kafka Streams 2.6 as exactly_once_beta per KIP-447 and renamed to exactly_once_v2 in 3.0; it requires brokers on version 2.5 or newer) wraps the read-process-write loop in a Kafka transaction. Input offsets, output records, and state changelog writes all commit or abort atomically.

Caveats:

EOS adds latency (transaction commit round-trips). Under EOS the default commit.interval.ms is 100 ms (versus 30 seconds for at-least-once), and downstream consumers reading with isolation.level=read_committed see output only after each commit. Tune the interval for your latency vs throughput needs.
EOS only protects in-Kafka pipelines. The moment your application talks to an external system (a database, an HTTP API), the guarantee drops to at-least-once unless the writes to that system are idempotent.
EOS is a per-application guarantee. Two separate Streams apps don't compose into one transaction; each is independent.
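
As a configuration sketch, the settings discussed above look like this. The block uses plain string keys with java.util.Properties so that it compiles without the Kafka libraries; in a real application you would more likely use the constants on StreamsConfig. The class name, application id, and broker address are placeholders.

```java
import java.util.Properties;

// Builds a Properties object that enables exactly-once processing (sketch, string keys).
final class ExactlyOnceConfigSketch {

    static Properties exactlyOnceProps(String applicationId, String bootstrapServers) {
        Properties props = new Properties();
        props.put("application.id", applicationId);
        props.put("bootstrap.servers", bootstrapServers);
        // Wrap consume-process-produce in Kafka transactions.
        props.put("processing.guarantee", "exactly_once_v2");
        // Already the default under EOS; set explicitly here to show the tuning knob.
        props.put("commit.interval.ms", "100");
        return props;
    }
}
```

A larger commit.interval.ms means fewer, bigger transactions (better throughput) at the cost of output becoming visible to read_committed consumers later.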

For pure topic-to-topic transformations and aggregations, EOS is essentially "turn it on and stop worrying about duplicates." For anything involving side effects, you still need idempotent design.
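
One common way to get idempotent side effects is to derive a stable ID for each source record (topic, partition, offset) and let the external system refuse an ID it has already seen. The sketch below is a hypothetical plain-JDK wrapper that shows the idea with an in-memory set. In practice the set of applied IDs would have to live in the external system itself, for example as a unique key written in the same database transaction as the effect, otherwise it does not survive a restart.

```java
import java.util.HashSet;
import java.util.Set;
import java.util.function.Consumer;

// Hypothetical wrapper that makes an external side effect safe to retry.
final class IdempotentSink {
    // Stand-in for a durable unique-key table in the external system.
    private final Set<String> applied = new HashSet<>();
    private final Consumer<String> sideEffect;

    IdempotentSink(Consumer<String> sideEffect) {
        this.sideEffect = sideEffect;
    }

    // Returns true if the effect ran, false if this source record was already applied.
    boolean apply(String topic, int partition, long offset, String payload) {
        String recordId = topic + "-" + partition + "-" + offset;
        if (!applied.add(recordId)) {
            return false; // duplicate delivery after a retry or an aborted transaction
        }
        sideEffect.accept(payload);
        return true;
    }
}
```

With this shape, a record that is reprocessed after an aborted transaction reaches the external system at most once, which is what restores an effectively-once outcome outside Kafka.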

When to Use Kafka Streams (and When Not To)

Use Kafka Streams when:

You're already on the JVM and using Kafka heavily.
The processing fits the per-key, per-partition model (most event processing does).
You want low operational overhead: no separate cluster, easy deployment with your app.
Latency is important (Streams typically processes records in milliseconds).

Use something else when:

You're not on the JVM. Use ksqlDB (SQL interface), Faust (Python), or a separate stream processor.
You need cross-cluster or cross-source joins. Apache Flink handles this far better.
Your processing is fundamentally batch. Use Spark or a data warehouse.
You want declarative SQL. Use ksqlDB or Flink SQL.
You need more parallelism than your input topics have partitions. Streams scales out across many JVMs, but only up to the partition count, not arbitrarily.

Common Mistakes

Over-partitioning to scale. Streams parallelism is capped at the partition count, but more partitions mean more tasks, more state store instances, more changelog partitions, and more memory. Right-size for your peak parallelism.
Ignoring state store growth. Stateful operations build local RocksDB stores. On a long-running app, these can fill disk if not bounded by retention or TTL.
Forgetting cleanUp() between test runs. Streams apps reuse the same application.id and state directory across runs by default. In tests or after schema changes, call streams.cleanUp() (or delete the state directory) to start fresh. cleanUp() only works while the instance is not running, so call it before start() or after close().
Mixing transactional and non-transactional output. If exactly-once is on, all Kafka writes go through transactions. Side effects (HTTP calls, DB writes) happen outside that transaction, so they can be duplicated when a transaction aborts and the records are reprocessed.
Running one Streams instance. Single-instance apps work but defeat the fault tolerance story. Run at least two so a surviving instance can take over the failed one's partitions.
Not setting num.standby.replicas. Without standby replicas, instance failure means a slow state restore from the changelog before processing can resume. Standbys keep a warm copy.
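
Two of the mistakes above come down to configuration. As a hedged sketch using plain string keys (so it compiles without the Kafka libraries; the helper name and the example values are illustrative), standby replicas and an explicit state directory can be layered onto an existing configuration like this:

```java
import java.util.Properties;

// Adds resilience-related settings to a base Streams configuration (sketch, string keys).
final class ResilienceConfigSketch {

    static Properties withStandbys(Properties base, int standbyReplicas, String stateDir) {
        Properties props = new Properties();
        props.putAll(base);
        // Default is 0; each standby keeps a warm copy of a store on another instance.
        props.put("num.standby.replicas", Integer.toString(standbyReplicas));
        // Where local state stores live; this is the directory cleanUp() wipes for the app.
        props.put("state.dir", stateDir);
        return props;
    }
}
```

Each standby roughly doubles the local disk and changelog read traffic for the stores it shadows, so one standby is a common starting point rather than a universal rule.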

Operational Considerations

Each Streams instance is a regular JVM process. Standard operational concerns apply:

Memory: heap for the JVM, plus off-heap memory for RocksDB. Budget at least 2-4 GB per instance for non-trivial topologies.
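Disk: local state stores (RocksDB files), sized for the full state of the partitions an instance owns plus any standbys. SSD strongly preferred.
Network: changelog and repartition traffic can double or triple the traffic your application puts on the Kafka cluster. Plan accordingly.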
Restarts: stateful Streams apps need to restore state on startup. The time depends on changelog size; num.standby.replicas reduces it.

Monitoring Kafka Streams

The metrics that matter:

Records consumed/produced per second per processor.
Process latency (process-latency-avg, process-latency-max).
Commit rate and duration - high commit latency usually points to slow brokers or oversized transactions.
Rebalances - frequent rebalances mean instances are joining/leaving the group, which usually points to instability.
State store lag (changelog topic lag) - if standby or restoring stores are far behind their changelogs, failover means a long state restore before processing resumes.
RocksDB metrics (cache hit rate, compaction stats) for stateful applications.
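
Lag, whether for a consumer group or a restoring state store, is the same arithmetic: the log end offset minus the position already processed. Kafka Streams exposes store-level numbers through KafkaStreams#allLocalStorePartitionLags(); the plain-JDK sketch below (hypothetical names) only shows the calculation. It treats a partition with no recorded position as starting from offset 0, which overstates lag if the log start offset has moved forward.

```java
import java.util.HashMap;
import java.util.Map;

// Plain-JDK lag arithmetic; inputs would come from your metrics or admin tooling.
final class LagSketch {

    // Lag per partition = log end offset minus the position already consumed or restored.
    static Map<Integer, Long> lagPerPartition(Map<Integer, Long> endOffsets, Map<Integer, Long> positions) {
        Map<Integer, Long> lag = new HashMap<>();
        for (Map.Entry<Integer, Long> e : endOffsets.entrySet()) {
            long position = positions.getOrDefault(e.getKey(), 0L); // assumption: unknown means 0
            lag.put(e.getKey(), Math.max(0L, e.getValue() - position));
        }
        return lag;
    }

    // Sum across partitions, handy for a single alerting threshold.
    static long totalLag(Map<Integer, Long> lagPerPartition) {
        long total = 0L;
        for (long value : lagPerPartition.values()) {
            total += value;
        }
        return total;
    }
}
```

Alerting on the per-partition maximum is often more useful than the total, since a single stuck partition can hide behind healthy ones.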

Pulse monitors Kafka and the consumer groups underlying Kafka Streams applications, including changelog topic health, repartition topic lag, and consumer group rebalance storms. Start a free trial to see the operational picture for your streams apps.

Frequently Asked Questions

Q: What's the difference between Kafka Streams and Kafka Connect?
A: Kafka Connect moves data in and out of Kafka without transformation (or with simple SMTs). Kafka Streams transforms data inside Kafka - reading from topics, processing, writing back. Connect is for I/O; Streams is for processing. They're complementary.

Q: What's the difference between Kafka Streams and ksqlDB?
A: ksqlDB is built on top of Kafka Streams. It offers a SQL interface and runs as a separate cluster of "ksqlDB servers" that execute SQL queries as Streams topologies. Kafka Streams is the underlying library; ksqlDB is a SQL frontend with its own server runtime.

Q: What's the difference between Kafka Streams and Apache Flink?
A: Flink is a standalone distributed stream processor with its own cluster manager, fully featured SQL, and richer windowing/joining capabilities including cross-cluster work. It's more powerful and more operationally complex. Kafka Streams is simpler and works only against Kafka, embedded in your app. For Kafka-centric workloads, Streams is usually enough; for cross-source processing or massive scale, Flink wins.

Q: Does Kafka Streams require a separate cluster?
A: No. Kafka Streams is a library that runs inside your application's JVM. You scale it by running more instances of your app. It uses the Kafka cluster you already have for state storage (changelog topics) and partition assignment (consumer groups).

Q: Can Kafka Streams be used outside Java?
A: Officially, no - it's a JVM library. Kotlin and Scala work natively. For other languages, use ksqlDB (any language can submit SQL) or alternatives like Faust (Python), though they have different feature sets.

Q: Does Kafka Streams guarantee exactly-once?
A: Yes, with processing.guarantee=exactly_once_v2. It uses Kafka transactions to atomically commit input offsets, output writes, and changelog updates. The guarantee covers in-Kafka pipelines; side effects to external systems require idempotent design.

Q: What is a Kafka Streams state store?
A: A local key-value store (default backing: RocksDB) that holds the working state of stateful operations - aggregations, joins, windowed computations. Each store is backed by a compacted Kafka topic (the changelog), so state survives instance failure.

Q: How do I scale a Kafka Streams application?
A: Run more instances of the same application (same application.id). Kafka assigns partitions to instances, so each instance processes a subset. Parallelism caps at the partition count - more instances than partitions just leaves the extras idle.
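
To see why extra instances sit idle, here is a deliberately simplified plain-JDK model that spreads partitions round-robin over instances. It is not the assignor Kafka Streams actually uses (the real one also weighs state locality and standbys), and the names are hypothetical, but the cap it demonstrates is the same.

```java
import java.util.ArrayList;
import java.util.List;

// Simplified round-robin model of partition assignment; NOT Kafka's real assignor.
final class AssignmentSketch {

    // Returns, for each instance, the list of partition numbers it would process.
    static List<List<Integer>> assign(int partitions, int instances) {
        if (instances <= 0) {
            throw new IllegalArgumentException("instances must be positive");
        }
        List<List<Integer>> result = new ArrayList<>();
        for (int i = 0; i < instances; i++) {
            result.add(new ArrayList<>());
        }
        for (int p = 0; p < partitions; p++) {
            result.get(p % instances).add(p);
        }
        return result;
    }

    // Instances beyond the partition count receive nothing.
    static int idleInstances(int partitions, int instances) {
        return Math.max(0, instances - partitions);
    }
}
```

With 4 partitions and 6 instances, two instances get no work. With 6 partitions and 4 instances, two instances carry two partitions each. The same cap applies to stream threads within an instance, since threads and instances draw from the same pool of tasks.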