Kafka consumer lag is the per-partition delta between the latest offset written by producers (the log end offset) and the last offset a consumer group has committed for that partition. If producers are at offset 1,000,000 and the group's committed offset is 999,400, lag is 600 messages. Lag is the single most important health signal for any Kafka pipeline: it answers "are my consumers keeping up with my producers?"
How Kafka Consumer Lag Is Calculated
Lag is computed per (consumer group, topic, partition) triple:
lag = log_end_offset - committed_offset
log_end_offset is the offset of the next message Kafka will write to that partition, exposed as LogEndOffset in JMX and via the ListOffsets API. committed_offset is what the consumer group has committed to the internal __consumer_offsets topic, retrieved via OffsetFetch or kafka-consumer-groups.sh --describe.
Aggregate lag for a topic is the sum across its partitions. A group's total lag is the sum across every topic-partition it's subscribed to. Lag is not "elapsed time" by default - it's a record count. To translate to time, divide by the producer's rate, or use the consumer_lag_seconds metric some libraries (and Pulse) expose by tracking the timestamp of the last consumed record.
# Inspect lag for a consumer group
kafka-consumer-groups.sh --bootstrap-server localhost:9092 \
--describe --group payments-processor
# Output columns:
# GROUP TOPIC PARTITION CURRENT-OFFSET LOG-END-OFFSET LAG
# payments-processor orders 0 999400 1000000 600
A small steady-state lag is normal - it's the in-flight batch whose offsets the consumer hasn't committed yet. Sustained growth, or lag that climbs after a deployment and doesn't drain, is the problem.
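The per-partition formula and the aggregation rules above can be sketched in a few lines of Python. This is an illustrative sketch rather than a client-library API: the function names, the dictionary shapes, the partition 1 offsets, and the 350 records/s produce rate are assumptions; only the partition 0 offsets come from the example above.

```python
from collections import defaultdict

def partition_lag(log_end_offsets, committed_offsets):
    """Return {(topic, partition): lag} using lag = log_end_offset - committed_offset."""
    lags = {}
    for tp, end_offset in log_end_offsets.items():
        # Assumption: a partition with no commit yet is treated as committed at offset 0.
        committed = committed_offsets.get(tp, 0)
        # Clamp at 0: the two offsets are read at slightly different times,
        # so the difference can briefly come out negative.
        lags[tp] = max(0, end_offset - committed)
    return lags

def topic_lag(lags):
    """Sum per-partition lag into a per-topic total."""
    totals = defaultdict(int)
    for (topic, _partition), lag in lags.items():
        totals[topic] += lag
    return dict(totals)

def lag_seconds(lag_records, produce_rate_per_sec):
    """Rough time equivalent of a record-count lag: records / (records per second)."""
    return lag_records / produce_rate_per_sec

log_end = {("orders", 0): 1_000_000, ("orders", 1): 500_000}
committed = {("orders", 0): 999_400, ("orders", 1): 499_900}

lags = partition_lag(log_end, committed)   # per (topic, partition)
by_topic = topic_lag(lags)                 # per topic
group_total = sum(lags.values())           # whole consumer group
approx_seconds = lag_seconds(group_total, 350)
```

The time conversion is only as good as the rate estimate; a timestamp-based metric avoids that dependency.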
Why Consumer Lag Grows
Lag rises when consumer throughput falls below producer throughput. The common causes:
| Cause | How to spot it |
|---|---|
| Slow per-record processing | High records-consumed-rate per partition is fine but downstream calls are slow; check application latency |
| Too few consumer instances | Fewer consumer threads than partitions, so each thread handles several partitions; scale toward one consumer per partition |
| Too few partitions | Consumer group is already at one consumer per partition and still lagging; add partitions |
| Rebalances | Frequent rebalance-rate-per-hour spikes; consumers spend time pausing instead of processing |
| Large messages | fetch.max.bytes or max.partition.fetch.bytes too low forces many small fetches |
| Garbage-collection pauses | JVM GC pauses on consumer side suspend polling beyond max.poll.interval.ms |
| Downstream backpressure | DB, HTTP, or sink connector saturated; processing thread blocks |
| Producer burst | Lag spikes after a producer surge then drains; not always a consumer problem |
A useful debugging axiom: if only one partition is lagging, the cause is partition-specific (key skew, slow downstream for that shard, or a stuck consumer). If all partitions are lagging, the cause is the consumer fleet as a whole.
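As a rough illustration of that axiom, the sketch below labels a snapshot of per-partition lag as partition-specific or fleet-wide. The function name, the 1,000-record "lagging" threshold, and the 10x skew factor are arbitrary assumptions chosen for the example, not recommended values.

```python
from statistics import median_low

def classify_lag(lags_by_partition, lagging_threshold=1000, skew_factor=10):
    """Return (verdict, partitions) for a {partition: lag} snapshot."""
    lagging = [p for p, lag in lags_by_partition.items() if lag >= lagging_threshold]
    if not lagging:
        return "healthy", []

    # The low median represents a "typical" partition; max(..., 1) avoids
    # multiplying by zero when most partitions are fully caught up.
    typical = max(median_low(lags_by_partition.values()), 1)
    hot = [p for p in lagging if lags_by_partition[p] >= skew_factor * typical]

    if hot:
        # A few partitions are far behind the rest: suspect key skew,
        # a slow downstream shard, or a stuck consumer.
        return "partition-specific", sorted(hot)
    # Everything is behind by a similar amount: suspect the consumer fleet.
    return "fleet-wide", sorted(lagging)

skewed = classify_lag({0: 50, 1: 40, 2: 90_000, 3: 60})
uniform = classify_lag({0: 20_000, 1: 25_000, 2: 22_000})
```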
How to Reduce Kafka Consumer Lag
Mitigations in order of leverage:
- Add consumers up to the partition count. A consumer group's parallelism is capped at the partition count. If a topic has 12 partitions and you're running 4 consumer instances, scaling to 12 can roughly triple throughput, assuming the consumers (not a downstream system) are the bottleneck. Past that, extras sit idle.
- Add partitions. When you're already at one consumer per partition and still lagging, increase the partition count (note: this changes key hashing, so per-key ordering may break across the change point).
- Increase batch size on the consumer. Raise max.poll.records (default 500) and fetch.max.bytes (default 52428800 = 50 MiB) so each poll pulls more work. Pair this with a higher max.poll.interval.ms if processing a batch takes longer than 5 minutes (default 300000 ms).
- Make per-record processing async. If the consumer is blocking on an HTTP call or DB write, parallelize within a partition using a bounded executor while still committing offsets in order.
- Tune fetch.min.bytes and fetch.max.wait.ms. For high-throughput topics, raise fetch.min.bytes to amortize fetch overhead. For low-throughput ones, keep fetch.max.wait.ms low to avoid idle waiting.
- Eliminate rebalances. Use static membership (group.instance.id) and cooperative rebalancing (partition.assignment.strategy=org.apache.kafka.clients.consumer.CooperativeStickyAssignor) so a rolling restart doesn't drop processing.
- Watch for poison messages. A consumer that crashes on a specific record retries forever in the same partition, blocking everything behind it. Add a DLQ pattern.
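The "async processing with in-order commits" item is the least obvious of these, so here is a minimal sketch of the bookkeeping it needs. It is broker-free and uses only the standard library; InOrderCommitTracker and process_batch are hypothetical names, and a real consumer would pass the returned offset to its client's commit call.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

class InOrderCommitTracker:
    """Records out-of-order completions and exposes the offset that is safe to commit."""

    def __init__(self, first_offset):
        # Kafka convention: the committed offset is the NEXT offset to read,
        # so every offset below it must already be processed.
        self._next_to_commit = first_offset
        self._finished = set()

    def mark_done(self, offset):
        self._finished.add(offset)
        # Advance only across a contiguous run of finished offsets.
        while self._next_to_commit in self._finished:
            self._finished.discard(self._next_to_commit)
            self._next_to_commit += 1

    @property
    def committable_offset(self):
        return self._next_to_commit

def process_batch(records, handler, tracker, max_workers=4):
    """Process (offset, value) pairs on a bounded pool; return the offset safe to commit."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(handler, value): offset for offset, value in records}
        for future in as_completed(futures):
            future.result()  # re-raise handler errors rather than skipping the record
            # mark_done runs only on this thread, so the tracker needs no lock.
            tracker.mark_done(futures[future])
    return tracker.committable_offset

tracker = InOrderCommitTracker(first_offset=100)
tracker.mark_done(102)                 # finished early, but 100 and 101 are still pending
stalled_at = tracker.committable_offset
tracker.mark_done(100)
tracker.mark_done(101)
caught_up = tracker.committable_offset
```

Note that this keeps commits ordered, not processing: handlers for records in the same partition may run concurrently, so it only suits workloads that tolerate that.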
Monitoring Consumer Lag
Lag is so central to Kafka that every monitoring stack has its own lag check. The basics:
- Built-in JMX: kafka.consumer:type=consumer-fetch-manager-metrics,client-id=* exposes records-lag-max per consumer.
- kafka-consumer-groups.sh --describe: scriptable, but expensive at scale and only point-in-time.
- Burrow / Kafka Lag Exporter: long-time community options that compute lag from the broker side without a running consumer.
- Cloud-native tools and APMs: Datadog, New Relic, Confluent Cloud all surface lag dashboards.
Threshold-only alerting (e.g., "page if lag > 10,000") is a trap: a topic doing 100k msg/s has different lag tolerance than one doing 10. Alert on rate of change (lag growing over a window) and age of unconsumed messages (the WarpStream-style "time lag" metric) rather than raw count.
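A rate-of-change plus age check might look like the sketch below. The function names, the 50 records/s growth limit, and the 120-second age limit are illustrative assumptions; the inputs are plain numbers you would obtain from whatever lag collector you already run.

```python
def lag_growth_per_sec(samples):
    """Average lag slope over a window of (unix_time_sec, lag) samples, oldest first."""
    if len(samples) < 2:
        raise ValueError("need at least two samples to compute a slope")
    (t0, lag0), (t1, lag1) = samples[0], samples[-1]
    if t1 <= t0:
        raise ValueError("samples must span a positive time window")
    return (lag1 - lag0) / (t1 - t0)

def time_lag_sec(now_sec, oldest_unconsumed_timestamp_sec):
    """Age of the oldest record the group has not consumed yet."""
    return max(0.0, now_sec - oldest_unconsumed_timestamp_sec)

def should_alert(samples, now_sec, oldest_unconsumed_timestamp_sec,
                 max_growth_per_sec=50.0, max_age_sec=120.0):
    """Alert when lag is climbing quickly OR unread data is getting too old."""
    growing = lag_growth_per_sec(samples) > max_growth_per_sec
    stale = time_lag_sec(now_sec, oldest_unconsumed_timestamp_sec) > max_age_sec
    return growing or stale

# Small but rapidly growing lag: alert.
climbing = should_alert([(0, 1_000), (60, 1_200), (120, 13_000)],
                        now_sec=120, oldest_unconsumed_timestamp_sec=110)

# Large but flat lag on a busy topic with fresh data: no alert.
steady = should_alert([(0, 500_000), (120, 500_000)],
                      now_sec=120, oldest_unconsumed_timestamp_sec=115)
```

A raw-count threshold of 10,000 would get both cases wrong: it would stay silent on the first for most of the window and page on the second.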
Pulse provides AI-powered Kafka monitoring that catches consumer lag patterns most threshold systems miss: lag drift after a deploy, single-partition skew from a hot key, slow-consumer-blocking-the-group symptoms, and lag combined with rebalance storms. Pulse runs across Kafka, Elasticsearch, OpenSearch, and ClickHouse with agentic root-cause analysis - so you get "this consumer is GC-thrashing" instead of "lag is high".
Frequently Asked Questions
Q: How do I measure consumer lag in Apache Kafka?
A: Use kafka-consumer-groups.sh --bootstrap-server <broker> --describe --group <group> to print per-partition lag. Programmatically, call AdminClient.listConsumerGroupOffsets() for the group and AdminClient.listOffsets() for the partitions, then subtract. Most monitoring tools (Burrow, Kafka Lag Exporter, Pulse, Datadog, Prometheus exporters) automate this.
Q: What is an acceptable consumer lag in Kafka?
A: It depends on your SLA, not on Kafka. For tick-by-tick trading pipelines, anything more than sub-second lag is a failure. For nightly batch jobs reading off Kafka, hours of lag is fine. Set the alert threshold by the time the data is allowed to be stale, then convert to message count using the topic's average produce rate.
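The conversion from a staleness budget to a record-count threshold is a single multiplication. In this small sketch the function name and the 30-second budget are assumptions; the two rates are the 100k msg/s and 10 msg/s examples used earlier.

```python
def lag_threshold_records(max_staleness_sec, avg_produce_rate_per_sec):
    """Records of lag that correspond to the allowed staleness at the average produce rate."""
    return int(max_staleness_sec * avg_produce_rate_per_sec)

busy_topic = lag_threshold_records(30, 100_000)  # same 30 s budget, very different counts
quiet_topic = lag_threshold_records(30, 10)
```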
Q: How can I reduce Kafka consumer lag quickly?
A: Scale consumer instances up to the partition count, increase max.poll.records and fetch.max.bytes, parallelize per-partition processing with a bounded thread pool, and remove blocking downstream calls. If the consumer is already at the partition ceiling and still falling behind, add partitions to the topic.
Q: Does consumer lag mean Kafka has lost data?
A: No. Lag means the consumer hasn't caught up yet - data is sitting safely in the partition log. Data loss only happens if records age past retention.ms or retention.bytes before the consumer reaches them. If lag grows so large that messages reach retention while still unread, the consumer will skip them on next poll and auto.offset.reset determines where it resumes.
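One way to check whether retention has already overtaken a group is to compare its committed offset with the partition's log start offset (the earliest offset still retained). The sketch below assumes a non-compacted topic, where offsets are contiguous; the function name and sample offsets are hypothetical.

```python
def records_lost_to_retention(committed_offset, log_start_offset):
    """Records deleted by retention before the group read them (0 if none)."""
    # The committed offset is the next record the group wants to read. If the
    # log now starts beyond it, everything in between has been deleted.
    return max(0, log_start_offset - committed_offset)

lost = records_lost_to_retention(committed_offset=5_000, log_start_offset=7_500)
safe = records_lost_to_retention(committed_offset=8_000, log_start_offset=7_500)
```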
Q: What's the difference between consumer lag and end-to-end latency?
A: Lag is a count of unread messages at a point in time. End-to-end latency is wall-clock time from produce to consume. A consumer can have low lag but high latency if records sit in producer batches before send, or high lag and low latency if it just hasn't polled yet. Both are worth tracking.
Q: Why does consumer lag show 1 forever even when the consumer is idle?
A: This is a quirk of how compacted topics and the __consumer_offsets topic report offsets. The committed offset can lag the log end offset by one when no new records have arrived since the last commit. It's a known false positive: confirm that the consumer is actually polling, and if it is, don't alert on this lag.
Q: How does consumer lag behave during a rebalance?
A: During a rebalance, partition ownership shifts and no consumer is processing the affected partitions. Lag for those partitions grows for the duration of the rebalance (typically seconds to tens of seconds with the cooperative sticky assignor). Frequent rebalances therefore look like periodic lag spikes - usually caused by failing health checks or max.poll.interval.ms violations.
Related Reading
- Kafka Consumer: the client whose progress lag measures
- Consumer Offset: the committed offset side of the lag formula
- Kafka Topic: the unit lag is reported per
- Kafka Partition: lag is computed per partition
- Kafka Producer: the side that defines the log end offset
- Kafka Commit Log: the storage backing each partition's offsets
- Logstash Kafka Consumer Lag Detected: a related troubleshooting page
- Apache Kafka Glossary: all Kafka terms in one place