ClickHouse ZooKeeper Session Expired: Causes and Fixes

Zookeeper session has expired is one of the most common warnings in busy ClickHouse clusters. A single ClickHouse server maintains exactly one TCP connection to its ZooKeeper ensemble, multiplexed by two internal threads, one for sending requests and one for receiving responses. When that connection breaks, every in-flight request from every query thread receives a session expiration exception at the same moment. Isolated occurrences are usually harmless, but recurring expiry points to one of a small set of root causes worth fixing.

This article complements the related guide on Cannot create a new ZooKeeper session (which covers the case where the initial handshake fails) and applies to both Apache ZooKeeper and ClickHouse Keeper deployments.

What an Expired Session Actually Means

ClickHouse opens a single ZooKeeper session per server. The client keeps the session alive with periodic heartbeats, and the session carries a timeout (session_timeout_ms, default 30 seconds); on the ensemble side, session expiry is tracked in units of tickTime. If the ensemble receives no heartbeat from ClickHouse within the timeout, it closes the session and discards every ephemeral node the session owned. The next ClickHouse operation that touches ZooKeeper throws:

DB::Exception: Session expired (Session expired)
Coordination::Exception: ZSESSIONEXPIRED

Once a session is gone, ClickHouse establishes a new one automatically. In-flight mutations, inserts to replicated tables, and DDL tasks that were mid-flight may need to retry.
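Client code can absorb an isolated expiry by retrying the failed statement. The following is a minimal sketch, assuming a hypothetical zero-argument `operation` callable that raises an exception whose text contains `Session expired` or `ZSESSIONEXPIRED`; how the error actually surfaces depends on the client library in use. The helper names and parameters are illustrative and not part of any ClickHouse API.

```python
import time

# Substrings taken from the error text shown above.
SESSION_EXPIRED_MARKERS = ("Session expired", "ZSESSIONEXPIRED")


def is_session_expired(exc: Exception) -> bool:
    """Return True if the exception text looks like a ZooKeeper session expiry."""
    text = str(exc)
    return any(marker in text for marker in SESSION_EXPIRED_MARKERS)


def retry_on_session_expired(operation, attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call operation(); retry with exponential backoff only on session expiry.

    `operation` is a hypothetical callable, for example a function that runs
    one INSERT through whatever client library is in use.
    """
    for attempt in range(attempts):
        try:
            return operation()
        except Exception as exc:
            last_attempt = attempt == attempts - 1
            if last_attempt or not is_session_expired(exc):
                raise
            # Give ClickHouse time to establish a new session before retrying.
            sleep(base_delay * (2 ** attempt))
```

Retrying an identical INSERT into a replicated table is normally safe because of block deduplication; other operations may need an idempotency check before being retried blindly.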

Common Causes

Network drops or ensemble unavailability

The simplest case. A switch reboot, a firewall flap, or a ZooKeeper node losing quorum closes the TCP socket. Check the ZooKeeper server logs around the timestamp of the ClickHouse error for Closing connection from /CH_IP or quorum loss messages.

`jute.maxbuffer` overflow

ZooKeeper enforces a hard limit on the size of a single request payload, controlled by the JVM property jute.maxbuffer. The default is just under 1 MB (0xfffff bytes). ClickHouse hits this when:

Running ALTER TABLE ... UPDATE or DELETE mutations on tables with many parts (typically more than 5,000).
Performing large ATTACH PARTITION or REPLACE PARTITION operations.
Listing very large znodes such as /clickhouse/tables/.../block_numbers/.

The fix is to raise the limit consistently on every ZooKeeper node and every ClickHouse server:

# /etc/zookeeper/conf/java.env on ZooKeeper nodes
export JVMFLAGS="-Djute.maxbuffer=8388608"

<!-- ClickHouse config.xml -->
<zookeeper>
    <node><host>zk1</host><port>2181</port></node>
    <session_timeout_ms>30000</session_timeout_ms>
    <jute_maxbuffer>8388608</jute_maxbuffer>
</zookeeper>

Perform a rolling restart of the ZooKeeper ensemble, then restart ClickHouse. Mismatched limits cause silent truncation, which is worse than the original error.

XID counter overflow

Every ZooKeeper transaction carries a monotonically increasing XID. The counter is a 32-bit signed integer. When it overflows INT_MAX (2,147,483,647), the ensemble closes all client sessions to reset the counter. At roughly a hundred million transactions per day the counter is exhausted in about three weeks; at a few million per day it lasts years, so this cause mainly affects very chatty clusters. The remedy is to reduce write volume by avoiding pathological patterns: small frequent inserts, excessive ON CLUSTER DDL, and chatty mutations.
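To judge whether XID overflow is plausible for a given cluster, the arithmetic can be sketched as below. It assumes a constant daily rate and a counter that starts at zero after each reset. The function name is illustrative, and the rates in the loop are example inputs, not measurements.

```python
INT32_MAX = 2**31 - 1  # largest value of a 32-bit signed counter


def days_until_xid_overflow(requests_per_day: float, current_xid: int = 0) -> float:
    """Days until a 32-bit signed counter starting at current_xid reaches INT32_MAX."""
    if requests_per_day <= 0:
        raise ValueError("requests_per_day must be positive")
    return (INT32_MAX - current_xid) / requests_per_day


if __name__ == "__main__":
    # Example rates only; substitute the rate observed on the cluster.
    for rate in (1_000_000, 10_000_000, 100_000_000):
        days = days_until_xid_overflow(rate)
        print(f"{rate:>11,} per day -> {days:8.1f} days")
```

Under these assumptions, one million transactions per day takes several years to exhaust the counter, while one hundred million per day takes about three weeks.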

`operation_timeout_ms` exceeded during large reads

Each ZooKeeper request has its own per-operation timeout (operation_timeout_ms, default 10 seconds), distinct from the session timeout. The classic offender is merges that read block metadata:

Code: 999. Coordination::Exception: Operation timeout (no response)
for request List for path: /clickhouse/tables/<shard>/<table>/block_numbers/

ClickHouse never garbage-collects old block numbers when partitions are dropped, so block_numbers/ grows monotonically with the lifetime of the table. Eventually a single getChildren call exceeds 10 seconds and the operation fails, often dragging the session down with it.

Mitigations:

Use ALTER TABLE ... DROP PART instead of dropping whole partitions when feasible.
Periodically clean stale entries from block_numbers/ after dropping old partitions. This requires care; back up first.
Increase operation_timeout_ms to 30000, which trades latency for survival.

JVM garbage collection pauses (ZooKeeper only)

A long stop-the-world GC pause on a ZooKeeper node delays heartbeat responses. If the pause exceeds session_timeout_ms, every client session on that node expires. Tune the JVM:

Allocate at most 75% of the node's RAM to the heap, leaving room for the OS page cache.
Use G1GC with -XX:MaxGCPauseMillis=50.
Disable swap. A swapped JVM is fatal.

ClickHouse Keeper is written in C++ and has no GC, eliminating this class of failure.

Configuration Tunables

Setting	Default	Recommended	Where
`session_timeout_ms`	30000	30000-60000	ClickHouse `<zookeeper>`
`operation_timeout_ms`	10000	30000	ClickHouse `<zookeeper>`
`jute.maxbuffer`	1048575	8388608	ZooKeeper JVM and ClickHouse
`tickTime`	2000	2000	ZooKeeper `zoo.cfg`
`maxSessionTimeout`	20 * tickTime	60000000	ZooKeeper `zoo.cfg`

session_timeout_ms on the client must not exceed maxSessionTimeout on the server, or ZooKeeper silently clamps it.
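The clamping can be sketched as a small function. This is an illustration of the negotiation rule rather than ZooKeeper's actual code; it relies on the documented defaults of 2 * tickTime for minSessionTimeout and 20 * tickTime for maxSessionTimeout, and the function and parameter names are made up for the example.

```python
def negotiated_session_timeout(requested_ms, tick_time_ms=2000,
                               min_session_timeout_ms=None,
                               max_session_timeout_ms=None):
    """Clamp the client's requested timeout into the server's allowed range.

    When a bound is not configured, fall back to ZooKeeper's defaults of
    2 * tickTime (lower bound) and 20 * tickTime (upper bound).
    """
    lower = 2 * tick_time_ms if min_session_timeout_ms is None else min_session_timeout_ms
    upper = 20 * tick_time_ms if max_session_timeout_ms is None else max_session_timeout_ms
    return max(lower, min(requested_ms, upper))


if __name__ == "__main__":
    # Default zoo.cfg: a 60 s request is clamped to 20 * 2000 ms.
    print(negotiated_session_timeout(60000))
    # With maxSessionTimeout raised, the requested value is honored.
    print(negotiated_session_timeout(60000, max_session_timeout_ms=60000000))
```

With tickTime at 2000 and maxSessionTimeout left unset, any session_timeout_ms above 40000 is reduced to 40000, which is why the table raises maxSessionTimeout alongside the client setting.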

Diagnostic Checklist

Confirm the expiration is recurring, not a one-off. Single events with successful retry are expected behavior in distributed systems.
Check ZooKeeper logs at the failure timestamp for Closing connection, Slow fsync, or LATENCY warnings.

Inspect system.zookeeper_log (ClickHouse 21.11+) for the failing operation:

SELECT type, event_time, op_num, path, error
FROM system.zookeeper_log
WHERE event_time > now() - INTERVAL 1 HOUR
  AND type = 'Response'
  AND error != 'ZOK'
ORDER BY event_time DESC;

Check ensemble latency and request backlog with the mntr four-letter command (on ZooKeeper 3.5.3 and later it must be allowed through 4lw.commands.whitelist):
```
echo mntr | nc zk_host 2181
```
Look at zk_max_latency and zk_outstanding_requests.
Count parts on tables that recently failed mutations:
```
-- Group by database as well, so same-named tables are not merged.
SELECT database, table, count() AS parts
FROM system.parts
WHERE active
GROUP BY database, table
ORDER BY parts DESC
LIMIT 20;
```
More than a few thousand active parts on a single table is a red flag.
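The mntr check from the list above can be scripted. The sketch below sends the four-letter command over a plain socket, as `nc` does, and parses the tab-separated reply. The thresholds in `mntr_warnings` are illustrative assumptions, not recommended values, and all function names are made up for the example.

```python
import socket


def fetch_mntr(host: str, port: int = 2181, timeout: float = 5.0) -> str:
    """Send the `mntr` four-letter command and return the raw reply."""
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(b"mntr")
        chunks = []
        while True:
            chunk = sock.recv(4096)
            if not chunk:  # the server closes the connection after replying
                break
            chunks.append(chunk)
    return b"".join(chunks).decode("utf-8", errors="replace")


def _to_number(value: str):
    """Convert numeric strings to float; leave everything else as text."""
    try:
        return float(value)
    except ValueError:
        return value


def parse_mntr(raw: str) -> dict:
    """Turn mntr output (one tab-separated key/value pair per line) into a dict."""
    stats = {}
    for line in raw.splitlines():
        key, sep, value = line.partition("\t")
        if not sep:
            continue  # skip blank or malformed lines
        stats[key.strip()] = _to_number(value.strip())
    return stats


def mntr_warnings(stats: dict, max_latency_ms: float = 1000.0,
                  max_outstanding: float = 10.0) -> list:
    """Return warnings for the two counters named above (assumed thresholds)."""
    checks = (("zk_max_latency", max_latency_ms),
              ("zk_outstanding_requests", max_outstanding))
    warnings = []
    for key, limit in checks:
        value = stats.get(key)
        if isinstance(value, float) and value > limit:
            warnings.append(f"{key}={value:g} exceeds {limit:g}")
    return warnings
```

A typical call would be `mntr_warnings(parse_mntr(fetch_mntr("zk_host")))`, run against each ensemble member in turn.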

Common Pitfalls

Tuning only the ClickHouse side of jute.maxbuffer. The ZooKeeper ensemble must agree.
Increasing session_timeout_ms to 5 minutes to "fix" expirations. This delays recovery without addressing the cause and stretches the window where ephemeral locks live after a real failure.
Ignoring single expirations. They are normal. Only investigate when frequency rises above background levels (more than a few per node per day) or when they correlate with mutation failures.
Running ZooKeeper on the same disks as the OS or other busy services. Transaction log fsync latency dominates session stability.
Sticking with ZooKeeper when GC pauses keep causing expirations. ClickHouse Keeper, packaged with ClickHouse since 21.3 and GA since 22.3, is a drop-in replacement that eliminates JVM GC as a failure mode.
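To tell background noise from a rising trend, expirations can be counted per day from the ClickHouse server log. The sketch below assumes the standard text log format, where each entry starts with a `YYYY.MM.DD` date. It counts matching log lines rather than distinct sessions, and a single expiry can fail many in-flight requests at once, so the result is an upper bound. The threshold of three per day is an assumed stand-in for "a few".

```python
import re
from collections import Counter

# ClickHouse text logs begin each entry with a date such as 2024.05.01.
LINE_DATE = re.compile(r"^(\d{4}\.\d{2}\.\d{2}) ")
MARKERS = ("Session expired", "ZSESSIONEXPIRED")


def expirations_per_day(lines) -> Counter:
    """Count log lines mentioning a session expiry, keyed by log date."""
    counts = Counter()
    for line in lines:
        match = LINE_DATE.match(line)
        if match and any(marker in line for marker in MARKERS):
            counts[match.group(1)] += 1
    return counts


def noisy_days(counts: Counter, threshold: int = 3) -> list:
    """Return the days whose count exceeds the (assumed) background threshold."""
    return sorted(day for day, n in counts.items() if n > threshold)
```

Because `expirations_per_day` accepts any iterable of strings, an open log file can be passed to it directly.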

Frequently Asked Questions

Q: Is one session expiry per day a problem? A: Usually not. ClickHouse retries automatically and replicated tables are designed to tolerate it. Investigate only if expirations correlate with stuck mutations, dropped ephemeral locks, or query failures.

Q: How do I tell whether jute.maxbuffer is the cause? A: Look for Packet len ... is out of range in the ZooKeeper server log, or Len error in the ClickHouse log. Both indicate a request exceeded the limit.

Q: Will ClickHouse Keeper avoid these errors? A: Keeper eliminates JVM GC pauses, which removes one major cause. The protocol limits (analogous to jute.maxbuffer) and operation timeouts still apply, but Keeper's defaults and C++ implementation tend to behave better under load.

Q: How do I clean old block numbers safely? A: There is no built-in command. The safe path is to use SYSTEM RESTART REPLICA after extending operation_timeout_ms, or migrate the table to a fresh ZooKeeper path with CREATE TABLE ... ENGINE = ReplicatedMergeTree('/new/path', ...). Manual znode deletion via zkCli.sh requires great care and a tested backup.

Q: What's the difference between session expired and operation timeout? A: A session expiration kills the whole connection and all ephemeral state. An operation timeout fails one request while the session stays alive. Operation timeouts often precede session expirations because a stuck request blocks the heartbeat thread.