"ZooKeeper session has expired" is one of the most common warnings in busy ClickHouse clusters. A single ClickHouse server maintains exactly one TCP connection to its ZooKeeper ensemble, served by two internal threads, one sending requests and one receiving responses. When that connection breaks, every in-flight request from every query thread receives a session expiration exception at the same moment. Isolated occurrences are usually harmless, but recurring expiry points to one of a small set of root causes worth fixing.
This article complements the related guide on Cannot create a new ZooKeeper session (which covers the case where the initial handshake fails) and applies to both Apache ZooKeeper and ClickHouse Keeper deployments.
What an Expired Session Actually Means
ClickHouse opens a single ZooKeeper session per server. ClickHouse keeps the session alive with periodic heartbeats, and the ensemble enforces a timeout (session_timeout_ms, default 30 seconds), which it tracks in units of its tickTime. If the ensemble receives no heartbeat from ClickHouse within the timeout, it closes the session and discards every ephemeral node the session owned. The next ClickHouse operation that touches ZooKeeper throws:
DB::Exception: Session expired (Session expired)
Coordination::Exception: ZSESSIONEXPIRED
Once a session is gone, ClickHouse establishes a new one automatically. Mutations, inserts into replicated tables, and DDL tasks that were in flight when the session expired may need to be retried.
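From the caller's side, retrying after an expired session can be sketched as follows. This is a minimal illustration, not ClickHouse's internal logic: `SessionExpiredError`, `run_with_retry`, and the backoff values are hypothetical names and numbers chosen for the example, and the operation is assumed to be idempotent.

```python
import time


class SessionExpiredError(Exception):
    """Hypothetical stand-in for a 'Session expired' error seen by a client."""


def run_with_retry(operation, max_attempts=5, base_delay_s=0.5, sleep=time.sleep):
    """Call operation(); on session expiry, back off exponentially and retry.

    Assumes the operation is idempotent, so repeating it after a lost
    session is safe.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except SessionExpiredError:
            if attempt == max_attempts:
                raise  # out of attempts: surface the last error
            # The delay doubles on each failure: 0.5 s, 1 s, 2 s, ...
            sleep(base_delay_s * 2 ** (attempt - 1))
```

The `sleep` parameter is injectable so the backoff schedule can be checked without actually waiting.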
Common Causes
Network drops or ensemble unavailability
The simplest case. A switch reboot, a firewall flap, or a ZooKeeper node losing quorum closes the TCP socket. Check the ZooKeeper server logs around the timestamp of the ClickHouse error for Closing connection from /CH_IP or quorum loss messages.
jute.maxbuffer overflow
ZooKeeper enforces a hard limit on the size of a single request payload, controlled by the JVM property jute.maxbuffer. The default is 0xfffff bytes, just under 1 MB. ClickHouse hits this when:
- Running ALTER TABLE ... UPDATE or DELETE mutations on tables with many parts (typically more than 5,000).
- Performing large ATTACH PARTITION or REPLACE PARTITION operations.
- Listing very large znodes such as /clickhouse/tables/.../block_numbers/.
The fix is to raise the limit consistently on every ZooKeeper node and every ClickHouse server:
# /etc/zookeeper/conf/java.env on ZooKeeper nodes
export JVMFLAGS="-Djute.maxbuffer=8388608"
<!-- ClickHouse config.xml -->
<zookeeper>
  <node>
    <host>zk1</host>
    <port>2181</port>
  </node>
  <session_timeout_ms>30000</session_timeout_ms>
  <jute_maxbuffer>8388608</jute_maxbuffer>
</zookeeper>
Perform a rolling restart of the ZooKeeper ensemble first, then restart ClickHouse. Mismatched limits cause silent truncation, which is worse than the original error.
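To get a feel for when a child listing approaches the limit, the payload size can be estimated with simple arithmetic. The sketch below assumes a child list is serialized as a 4-byte element count followed by each name as a 4-byte length prefix plus its UTF-8 bytes, and it ignores packet headers. The function names and the sample input are hypothetical.

```python
def estimate_children_payload(child_names):
    """Approximate size in bytes of a serialized child list.

    Assumption: a 4-byte element count, then each name as a 4-byte
    length prefix followed by its UTF-8 bytes. Packet headers are ignored.
    """
    return 4 + sum(4 + len(name.encode("utf-8")) for name in child_names)


def fits_in_buffer(child_names, limit_bytes):
    """True if the estimated payload stays within the given limit."""
    return estimate_children_payload(child_names) <= limit_bytes


# Illustrative input: 100,000 children with 20-character names.
names = [f"{i:020d}" for i in range(100_000)]
print(estimate_children_payload(names), fits_in_buffer(names, 8_388_608))
```

Under these assumptions the sample listing comes to about 2.4 MB: too large for a limit of roughly 1 MB, but comfortably inside the 8388608-byte value used in the configuration above.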
XID counter overflow
Every ZooKeeper transaction carries a monotonically increasing XID, stored as a 32-bit signed integer. When the counter reaches INT_MAX (2,147,483,647), the ensemble closes all client sessions to reset it. Reaching that value within weeks takes on the order of a hundred million transactions per day; at a few million per day it takes years. The remedy is to reduce write volume by avoiding pathological patterns: small frequent inserts, excessive ON CLUSTER DDL, and chatty mutations.
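The time to overflow is plain division, which makes it easy to check against a cluster's own request rate. The sketch below assumes the counter advances by one per request at a steady rate; the function name and the sample rates are illustrative.

```python
INT32_MAX = 2**31 - 1  # 2,147,483,647


def days_until_xid_overflow(requests_per_day, start_xid=0):
    """Days until a 32-bit signed counter starting at start_xid reaches INT32_MAX."""
    if requests_per_day <= 0:
        raise ValueError("requests_per_day must be positive")
    return (INT32_MAX - start_xid) / requests_per_day


for rate in (1_000_000, 10_000_000, 100_000_000):
    print(f"{rate:>11,} requests/day -> {days_until_xid_overflow(rate):8.1f} days")
```

Under these assumptions, one million requests per day takes nearly six years to wrap the counter, while one hundred million per day wraps it in about three weeks.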
operation_timeout_ms exceeded during large reads
Each ZooKeeper request has its own per-operation timeout (operation_timeout_ms, default 10 seconds), distinct from the session timeout. The classic offender is merges that read block metadata:
Code: 999. Coordination::Exception: Operation timeout (no response)
for request List for path: /clickhouse/tables/<shard>/<table>/block_numbers/
ClickHouse never garbage-collects old block numbers when partitions are dropped, so block_numbers/ grows monotonically with the lifetime of the table. Eventually a single getChildren call exceeds 10 seconds and the operation fails, often dragging the session down with it.
Mitigations:
- Use ALTER TABLE ... DROP PART instead of dropping whole partitions when feasible.
- Periodically clean stale entries from block_numbers/ after dropping old partitions. This requires care; back up first.
- Increase operation_timeout_ms to 30000, which trades latency for survival.
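Before changing timeouts, it can help to measure how long the listing actually takes. The sketch below times a single child listing and reports it as a fraction of the operation timeout. It accepts any client object exposing `get_children(path)`; the kazoo library's `KazooClient` is one such object, assuming it is installed and already connected. The helper name is hypothetical, and the measurement comes from the script's own session, so treat it as a rough indicator rather than what ClickHouse itself observes.

```python
import time


def time_list_call(client, path, operation_timeout_ms=10_000):
    """List the children of path and report how close the call came to the timeout.

    client is any object with a get_children(path) method that returns a list.
    """
    started = time.perf_counter()
    children = client.get_children(path)
    elapsed_ms = (time.perf_counter() - started) * 1000
    return {
        "children": len(children),
        "elapsed_ms": elapsed_ms,
        # Fraction of the per-operation timeout consumed by this one call.
        "timeout_used": elapsed_ms / operation_timeout_ms,
    }
```

A `timeout_used` value that creeps toward 1.0 over time suggests the listing is heading for an operation timeout.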
JVM garbage collection pauses (ZooKeeper only)
A long stop-the-world GC pause on a ZooKeeper node delays heartbeat responses. If the pause exceeds session_timeout_ms, every client session on that node expires. Tune the JVM:
- Allocate at most 75% of the node's RAM to the heap, leaving room for the OS page cache.
- Use G1GC with -XX:MaxGCPauseMillis=50.
- Disable swap. A swapped JVM is fatal.
ClickHouse Keeper is written in C++ and has no GC, eliminating this class of failure.
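For ZooKeeper deployments, the heap and GC guidance above can be turned into a flag string with a small helper. This is a sketch under stated assumptions: the function name is hypothetical, and setting the initial heap (-Xms) equal to the maximum heap (-Xmx) is a common practice chosen for the example, not something the guidance above requires.

```python
def zookeeper_jvm_flags(total_ram_mb, heap_fraction=0.75, max_gc_pause_ms=50):
    """Build JVM flags following the 75% heap cap and the G1GC pause target."""
    if not 0 < heap_fraction <= 0.75:
        raise ValueError("heap_fraction must be in (0, 0.75]")
    heap_mb = int(total_ram_mb * heap_fraction)
    return (
        f"-Xms{heap_mb}m -Xmx{heap_mb}m "
        f"-XX:+UseG1GC -XX:MaxGCPauseMillis={max_gc_pause_ms}"
    )


print(zookeeper_jvm_flags(16384))
```

The resulting string could be placed in JVMFLAGS in java.env alongside any other properties set there.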
Configuration Tunables
| Setting | Default | Recommended | Where |
|---|---|---|---|
| session_timeout_ms | 30000 | 30000-60000 | ClickHouse <zookeeper> |
| operation_timeout_ms | 10000 | 30000 | ClickHouse <zookeeper> |
| jute.maxbuffer | 1048575 | 8388608 | ZooKeeper JVM and ClickHouse |
| tickTime | 2000 | 2000 | ZooKeeper zoo.cfg |
| maxSessionTimeout | 20 * tickTime | 60000000 | ZooKeeper zoo.cfg |
session_timeout_ms on the client must not exceed maxSessionTimeout on the server, or ZooKeeper silently clamps it.
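The clamping rule is easy to model. The sketch below computes the timeout a ZooKeeper server would grant for a requested value, using ZooKeeper's default bounds of 2 * tickTime and 20 * tickTime when minSessionTimeout and maxSessionTimeout are not set. The function name is hypothetical.

```python
def negotiated_session_timeout_ms(requested_ms, tick_time_ms=2000,
                                  min_session_timeout_ms=None,
                                  max_session_timeout_ms=None):
    """Timeout the server grants for a requested session timeout.

    Unset bounds fall back to ZooKeeper's defaults of 2 * tickTime and
    20 * tickTime.
    """
    lower = 2 * tick_time_ms if min_session_timeout_ms is None else min_session_timeout_ms
    upper = 20 * tick_time_ms if max_session_timeout_ms is None else max_session_timeout_ms
    return max(lower, min(requested_ms, upper))


# Requesting 60 s against default bounds versus a raised maxSessionTimeout.
print(negotiated_session_timeout_ms(60_000))
print(negotiated_session_timeout_ms(60_000, max_session_timeout_ms=60_000_000))
```

With tickTime at 2000 and no explicit maxSessionTimeout, a requested 60000 ms is reduced to 40000 ms, which is why raising session_timeout_ms on the ClickHouse side alone may have less effect than expected.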
Diagnostic Checklist
- Confirm the expiration is recurring, not a one-off. Single events followed by a successful retry are expected behavior in distributed systems.
- Check ZooKeeper logs at the failure timestamp for Closing connection, Slow fsync, or LATENCY warnings.
- Inspect system.zookeeper_log (ClickHouse 21.11+) for the failing operation:
SELECT type, event_time, op_num, path, error
FROM system.zookeeper_log
WHERE event_time > now() - INTERVAL 1 HOUR
  AND error != 'ZOK'
ORDER BY event_time DESC;
- Check the ensemble for slow responses and request backlog with the mntr four-letter command:
echo mntr | nc zk_host 2181
Look at zk_max_latency and zk_outstanding_requests.
- Count parts on tables that recently failed mutations:
SELECT table, count() AS parts
FROM system.parts
WHERE active
GROUP BY table
ORDER BY parts DESC
LIMIT 20;
More than a few thousand active parts on a single table is a red flag.
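The mntr check can also be scripted. The sketch below sends the four-letter command over a plain socket and parses the tab-separated reply into a dictionary. The function names and the sample values are illustrative, and the ensemble is assumed to allow the mntr command (recent ZooKeeper versions require it to be listed in 4lw.commands.whitelist).

```python
import socket


def parse_mntr(text):
    """Turn tab-separated 'key<TAB>value' lines into a dict, converting integers."""
    metrics = {}
    for line in text.splitlines():
        key, sep, value = line.partition("\t")
        if not sep:
            continue  # skip blank or malformed lines
        value = value.strip()
        metrics[key.strip()] = int(value) if value.lstrip("-").isdigit() else value
    return metrics


def fetch_mntr(host, port=2181, timeout_s=5.0):
    """Send the mntr four-letter command and return the raw reply text."""
    with socket.create_connection((host, port), timeout=timeout_s) as conn:
        conn.sendall(b"mntr")
        chunks = []
        while chunk := conn.recv(4096):  # the server closes the socket when done
            chunks.append(chunk)
    return b"".join(chunks).decode("utf-8", errors="replace")


# Illustrative reply fragment; real output contains many more keys.
sample = "zk_server_state\tleader\nzk_max_latency\t120\nzk_outstanding_requests\t0\n"
print(parse_mntr(sample))
```

Against a live node, `parse_mntr(fetch_mntr("zk_host"))` yields the same structure, so zk_max_latency and zk_outstanding_requests can be compared across nodes or sampled over time.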
Common Pitfalls
- Tuning only the ClickHouse side of jute.maxbuffer. The ZooKeeper ensemble must agree.
- Increasing session_timeout_ms to 5 minutes to "fix" expirations. This delays recovery without addressing the cause and stretches the window in which ephemeral locks live on after a real failure.
- Chasing single expirations. They are normal. Investigate only when frequency rises above background levels (more than a few per node per day) or when they correlate with mutation failures.
- Running ZooKeeper on the same disks as the OS or other busy services. Transaction log fsync latency dominates session stability.
- Sticking with ZooKeeper when GC pauses keep causing expirations. ClickHouse Keeper, packaged with ClickHouse since 21.3 and GA since 22.3, is a drop-in replacement that eliminates JVM GC as a failure mode.
Frequently Asked Questions
Q: Is one session expiry per day a problem?
A: Usually not. ClickHouse retries automatically and replicated tables are designed to tolerate it. Investigate only if expirations correlate with stuck mutations, dropped ephemeral locks, or query failures.
Q: How do I tell whether jute.maxbuffer is the cause?
A: Look for Packet len ... is out of range in the ZooKeeper server log, or Len error in the ClickHouse log. Both indicate a request exceeded the limit.
Q: Will ClickHouse Keeper avoid these errors?
A: Keeper eliminates JVM GC pauses, which removes one major cause. The protocol limits (analogous to jute.maxbuffer) and operation timeouts still apply, but Keeper's defaults and C++ implementation tend to behave better under load.
Q: How do I clean old block numbers safely?
A: There is no built-in command. The safe path is to use SYSTEM RESTART REPLICA after extending operation_timeout_ms, or migrate the table to a fresh ZooKeeper path with CREATE TABLE ... ENGINE = ReplicatedMergeTree('/new/path', ...). Manual znode deletion via zkCli.sh requires great care and a tested backup.
Q: What's the difference between session expired and operation timeout?
A: A session expiration kills the whole connection and all ephemeral state. An operation timeout fails one request while the session stays alive. Operation timeouts often precede session expirations because a stuck request blocks the heartbeat thread.