ClickHouse ZooKeeper/Keeper Coordination Bottlenecks

ClickHouse's ReplicatedMergeTree tables and ON CLUSTER DDL operations all route through a single coordination service: either Apache ZooKeeper or the built-in ClickHouse Keeper. When this layer slows down or loses sessions, the effects spread across the cluster — replication lag grows, DDL hangs, and insert latency spikes through no fault of the query or the data. The coordination layer fails silently enough that teams often look at hardware or query plans before tracing the problem back to Keeper.

What This Error Means

ZooKeeper and ClickHouse Keeper store only metadata, not data. For every ReplicatedMergeTree table, the coordination service tracks part lists, checksums, replication log entries, block deduplication hashes, mutation records, and replica state. Actual data files live on local disk or object storage.

When the coordination service is slow or unavailable, ClickHouse surfaces the problem through several different symptoms:

  • Replicas enter read-only mode when their Keeper session expires.
  • ON CLUSTER DDL statements time out or return partial results.
  • Replication queues grow as replicas cannot confirm task completion.
  • Insert latency climbs because each insert must register a block hash in Keeper for deduplication.
  • Log messages contain Coordination::Exception, ZCONNECTIONLOSS, or ZOPERATIONTIMEOUT.

A single ZooKeeper session is shared across all threads on one ClickHouse server. A session expiry therefore affects all replicated tables simultaneously, and all concurrent threads receive ZSESSIONEXPIRED at the same moment. ClickHouse automatically reconnects and resumes replication once the session is re-established; this is not a permanent failure, but it can appear catastrophic until the session recovers.

Common Causes

  1. ZooKeeper ensemble overloaded. ClickHouse generates high request rates: part registrations per insert, merge coordination, block deduplication writes, DDL task updates. A small or underpowered ZooKeeper ensemble cannot sustain the throughput of a busy cluster, and average latency climbs until operation timeouts (ZOPERATIONTIMEOUT) begin.

  2. Network latency above the recommended threshold. ClickHouse's recommended network latency between ClickHouse servers and the Keeper ensemble is under 20 ms. Cross-datacenter coordination or unreliable networks push average Keeper operation time well past this, which cascades into replication lag.

  3. Znode accumulation. Each part, replica, log entry, block deduplication record, and mutation creates one or more znodes. Clusters with many tables, fine-grained partitioning, or long-running replicas with large replicated_deduplication_window values accumulate millions of znodes. Apache ZooKeeper's in-memory data model degrades under this load; ClickHouse Keeper handles it more efficiently.

  4. JVM GC pauses (ZooKeeper only). Apache ZooKeeper runs on the JVM. Stop-the-world GC pauses of 500 ms or more are common without explicit tuning. ClickHouse is written in C++ and sends heartbeats every fraction of a second; a long GC pause on the ZooKeeper side looks like a connectivity loss and can trigger a session expiry. ClickHouse Keeper avoids this entirely as it has no JVM dependency.

  5. zxid overflow (ZooKeeper only). ZooKeeper's internal transaction counter is 32 bits and wraps after approximately 2 billion transactions. When this happens, the ZooKeeper ensemble requires a restart to reset state. ClickHouse Keeper does not have this limitation.

  6. Orphaned znodes from dropped tables. Tables that were dropped without proper cleanup leave their ZooKeeper paths intact. Over time, these accumulate and inflate snapshot size, slowing leader elections and ZooKeeper server startup.

  7. DDLWorker thread crash. The DDLWorker subprocess inside clickhouse-server runs two threads: DDLWorker and DDLWorkerCleaner. If either crashes, ON CLUSTER DDL execution stops silently. New DDL submissions queue in Keeper but are never picked up until the server restarts.

  8. Misconfigured session timeout. The session_timeout_ms setting in the <zookeeper> client block of config.xml (default 30,000 ms) controls how long ClickHouse will tolerate a lack of response before declaring the session expired. Setting this too low on a network with any jitter causes frequent spurious session expirations.

How to Fix

1. Establish baseline Keeper health

Before tuning anything, verify the coordination service is reachable and responding quickly:

-- Verify active connection and session state
SELECT
    name,
    host,
    port,
    is_expired,
    session_uptime_elapsed_seconds,
    session_timeout_ms,
    keeper_api_version,
    last_zxid_seen,
    enabled_feature_flags
FROM system.zookeeper_connection;

is_expired = 1 means the session has already lapsed. session_uptime_elapsed_seconds reset to a low value recently indicates a recent reconnect. keeper_api_version of 0 means the server is running plain ZooKeeper rather than ClickHouse Keeper.

For Apache ZooKeeper, the four-letter word commands give ensemble-level latency data:

echo mntr | nc zookeeper-host 2181 | grep -E "zk_avg_latency|zk_outstanding|zk_watch_count|zk_num_alive_connections"

zk_outstanding_requests consistently above zero means the ensemble is falling behind. zk_avg_latency above a few milliseconds on a local network is a warning sign.

2. Identify slow or erroring Keeper operations

system.zookeeper_log records every individual ZooKeeper client request and response with microsecond precision. Enable it in config.xml if not already active and then query for errors and slow operations:

-- Find slow or erroring ZooKeeper operations in the last hour
SELECT
    event_time,
    type,
    op_num,
    path,
    duration_microseconds,
    error
FROM system.zookeeper_log
WHERE error != 'ZOK'
  AND event_time > now() - INTERVAL 1 HOUR
ORDER BY duration_microseconds DESC
LIMIT 50;
-- Average and worst latency by operation type
SELECT
    op_num,
    count()                       AS ops,
    avg(duration_microseconds)    AS avg_us,
    max(duration_microseconds)    AS max_us,
    countIf(error != 'ZOK')       AS errors
FROM system.zookeeper_log
WHERE type = 'Response'
  AND event_time > now() - INTERVAL 1 HOUR
GROUP BY op_num
ORDER BY avg_us DESC;

Recurring ZCONNECTIONLOSS errors point to network instability or ensemble unavailability. Recurring ZOPERATIONTIMEOUT errors mean the ensemble is reachable but too slow to respond within operation_timeout_ms (default 10,000 ms).

3. Check replication health and lag

-- Replicas with lag, large queues, or expired sessions
SELECT
    database,
    table,
    replica_name,
    is_leader,
    absolute_delay,
    queue_size,
    inserts_in_queue,
    merges_in_queue,
    log_pointer,
    log_max_index,
    log_max_index - log_pointer    AS log_lag,
    future_parts,
    active_replicas,
    total_replicas,
    is_session_expired,
    zookeeper_exception
FROM system.replicas
WHERE queue_size > 0 OR absolute_delay > 60 OR is_session_expired = 1
ORDER BY absolute_delay DESC;

log_lag (the difference between log_max_index and log_pointer) shows how many replication log entries a replica has not yet applied. A value of zero means the replica is fully caught up. Values in the thousands that are not decreasing mean the replica is unable to process the queue.

4. Investigate stuck replication queue tasks

-- Replication tasks with errors
SELECT
    database,
    table,
    replica_name,
    type,
    create_time,
    last_attempt_time,
    num_tries,
    last_exception,
    postpone_reason
FROM system.replication_queue
WHERE last_exception != ''
ORDER BY create_time ASC
LIMIT 20;

last_exception containing Coordination::Exception or ZCONNECTIONLOSS confirms the root cause is Keeper connectivity, not a missing part or peer replica.

5. Diagnose stuck ON CLUSTER DDL

-- DDL tasks that are not finished or have errors
SELECT
    entry,
    cluster,
    query,
    query_create_time,
    host,
    status,
    exception_code,
    exception_text,
    query_duration_ms
FROM system.distributed_ddl_queue
WHERE status != 'Finished'
   OR exception_code != 0
ORDER BY query_create_time DESC
LIMIT 30;

A task stuck in Active status for one specific host means that host is offline or in read-only mode. A task stuck in Inactive for all hosts usually means the DDLWorker thread has stopped processing. Check the ClickHouse server log for DDLWorker crash messages.

For slow clusters, extend the per-session DDL timeout:

SET distributed_ddl_task_timeout = 600;
CREATE TABLE my_table ON CLUSTER 'my_cluster' (...) ENGINE = ReplicatedMergeTree(...);

Setting distributed_ddl_task_timeout = -1 makes the client wait indefinitely. Setting it to 0 submits the DDL asynchronously without waiting for completion.

6. Recover read-only replicas

Replicas automatically enter read-only mode when the Keeper session is lost and return to normal once the session is re-established. If a replica remains in read-only mode after connectivity is restored, reinitialize its Keeper state:

-- Restart ZooKeeper session for all replicated tables on this server
SYSTEM RESTART REPLICAS;

-- Or for a specific table
SYSTEM RESTART REPLICA my_database.my_table;

SYSTEM RESTART REPLICA compares the replica's current state with what is in Keeper and rebuilds the replication queue accordingly. It does not drop data and does not require the table to be detached.

If Keeper metadata for a table was lost entirely (for example after a Keeper cluster disaster), use SYSTEM RESTORE REPLICA. This command requires the table to be in read-only mode. It detaches all parts, recreates the Keeper metadata from local disk state, then reattaches the parts:

SYSTEM RESTORE REPLICA my_database.my_table;

To remove a dead replica whose server no longer exists:

SYSTEM DROP REPLICA 'dead-replica-01' FROM TABLE my_database.my_table;

7. Reduce znode volume

Each ReplicatedMergeTree insert writes at least one block deduplication hash to Keeper. The replicated_deduplication_window table setting (default 100) controls how many recent hashes are retained. On tables with very high insert rates, reducing this value decreases the znode footprint at the cost of a narrower deduplication window:

ALTER TABLE my_table MODIFY SETTING replicated_deduplication_window = 50;

Coarser partitioning also reduces znode volume. Each partition maintains its own replication log, so a table with thousands of active partitions produces far more coordination traffic than a table with dozens.

8. Migrate to ClickHouse Keeper

ClickHouse Keeper is the recommended coordination service for new and existing deployments. It is written in C++, avoids JVM GC pauses, uses up to 46x less memory than ZooKeeper for the same volume of data, and does not suffer from zxid overflow. It uses the Raft consensus algorithm via the NuRaft library rather than ZooKeeper's ZAB protocol.

A minimal three-node Keeper configuration in config.xml:

<keeper_server>
    <tcp_port>9181</tcp_port>
    <server_id>1</server_id>
    <coordination_settings>
        <operation_timeout_ms>10000</operation_timeout_ms>
        <session_timeout_ms>100000</session_timeout_ms>
        <force_sync>true</force_sync>
        <!-- Enable on v23.9+ for improved write throughput -->
        <!-- <async_replication>true</async_replication> -->
    </coordination_settings>
    <raft_configuration>
        <server><id>1</id><hostname>keeper1</hostname><port>9234</port></server>
        <server><id>2</id><hostname>keeper2</hostname><port>9234</port></server>
        <server><id>3</id><hostname>keeper3</hostname><port>9234</port></server>
    </raft_configuration>
</keeper_server>

Point ClickHouse clients at Keeper using port 9181 (Keeper's default client port, vs ZooKeeper's 2181):

<zookeeper>
    <node>
        <host>keeper1</host>
        <port>9181</port>
    </node>
    <node>
        <host>keeper2</host>
        <port>9181</port>
    </node>
    <node>
        <host>keeper3</host>
        <port>9181</port>
    </node>
    <session_timeout_ms>30000</session_timeout_ms>
    <operation_timeout_ms>10000</operation_timeout_ms>
</zookeeper>

If running ClickHouse v23.9 or later on all Keeper nodes, enable async_replication in coordination_settings to improve write throughput. This setting is not safe to enable on mixed-version ensembles with nodes below v23.9.

Root-Cause Analysis

Monitor in-flight Keeper requests and session state

-- Point-in-time Keeper/ZooKeeper gauge metrics
SELECT metric, value, description
FROM system.metrics
WHERE metric IN (
    'ZooKeeperRequest',
    'ZooKeeperSession',
    'ZooKeeperWatch',
    'ZooKeeperSessionExpired',
    'EphemeralNode',
    'KeeperAliveConnections',
    'KeeperOutstandingRequests'
);

ZooKeeperSession should be 1 on each ClickHouse server. A value of 0 means the session is gone. ZooKeeperRequest above zero during a quiet period means Keeper operations are queuing.

Track error rates over time

-- Cumulative ZooKeeper exception counters
SELECT event, value, description
FROM system.events
WHERE event LIKE 'ZooKeeper%'
ORDER BY event;

ZooKeeperHardwareExceptions counts ZCONNECTIONLOSS errors (network-level failures). ZooKeeperUserExceptions counts ZNONODE and ZBADVERSION errors (metadata inconsistencies). A rising ZooKeeperHardwareExceptions rate points directly at network or ensemble availability problems.

Check Keeper server health (embedded Keeper)

-- Keeper server metrics when running embedded ClickHouse Keeper
SELECT metric, value
FROM system.asynchronous_metrics
WHERE metric LIKE 'Keeper%'
ORDER BY metric;

KeeperAvgLatency and KeeperMaxLatency give the ensemble's view of its own latency. KeeperIsLeader confirms which node holds leadership. KeeperOutstandingRequests above zero indicates the leader is falling behind.

Inspect the znode tree directly

-- Browse top-level ClickHouse namespace in Keeper
SELECT name, numChildren, ctime, mtime
FROM system.zookeeper
WHERE path = '/clickhouse';

-- Count replication log entries for one table (proxy for coordination load)
SELECT count() AS log_entries
FROM system.zookeeper
WHERE path = '/clickhouse/tables/01/my_db/my_table/log';

system.zookeeper requires a WHERE path = '...' clause. Every query hits the live Keeper ensemble, so avoid running it in tight loops or against deep paths with millions of children.

Preventive Measures

  • Deploy ClickHouse Keeper instead of Apache ZooKeeper for new clusters. Keeper eliminates JVM GC pauses, zxid overflow, and the 1 MB per-node data size limit that ZooKeeper imposes by default.
  • Run three or five Keeper/ZooKeeper nodes on dedicated hardware with SSDs. Three nodes tolerate one failure; five nodes tolerate two. Co-locating Keeper with ClickHouse on the same host means ZooKeeper disk IO competes with ClickHouse merge IO.
  • Keep the coordination ensemble in the same datacenter as ClickHouse nodes. Round-trip latency above 20 ms is a known risk factor for session instability under write load.
  • Design partition keys to produce hundreds of active partitions per table, not thousands. Each partition maintains its own log, so fine-grained partitioning multiplies Keeper traffic.
  • Use the recommended zookeeper_path convention /clickhouse/tables/{shard}/{database}/{table} with the {uuid} macro when tables are frequently recreated to avoid stale znode collisions.
  • For Apache ZooKeeper, configure: autopurge.purgeInterval=1, autopurge.snapRetainCount=10, snapCount=3000000, and use OpenJDK 11 with G1GC and -XX:MaxGCPauseMillis=50 to minimize GC pauses.
  • Monitor ZooKeeperHardwareExceptions, ZooKeeperRequest, and system.replicas.absolute_delay continuously. Alert on is_session_expired = 1 or absolute_delay exceeding your replication SLA.
  • Clean up orphaned znodes from dropped tables using clickhouse-keeper-client (available since ClickHouse 22.x) or zkCli.sh for ZooKeeper, then run SYSTEM RESTART REPLICA on affected tables.

Resolve Coordination Bottlenecks Automatically with Pulse

Pulse monitors Keeper session health, replication lag, DDL queue depth, and per-operation error rates across your ClickHouse clusters in real time. When coordination bottlenecks develop, Pulse pinpoints whether the issue is ensemble latency, a network partition, znode accumulation, or a stuck DDLWorker thread — and surfaces the specific tables and settings most likely to resolve it — without requiring manual queries against system tables. Visit pulse.support to learn more.

Frequently Asked Questions

Q: What is the difference between ClickHouse Keeper and Apache ZooKeeper for ClickHouse?
A: ClickHouse Keeper uses the Raft consensus algorithm (via NuRaft) and is written in C++. Apache ZooKeeper uses ZAB (ZooKeeper Atomic Broadcast) and runs on the JVM. For ClickHouse workloads, Keeper avoids JVM GC pauses, uses up to 46x less memory, has no 1 MB per-znode data size limit, and does not suffer from zxid overflow. For new deployments, Keeper is the recommended choice.

Q: My replica shows is_session_expired = 1. Do I need to intervene manually?
A: Usually not. ClickHouse automatically reconnects to Keeper and resumes replication once the session is re-established. If the session remains expired for more than a few minutes after connectivity is restored, run SYSTEM RESTART REPLICA on the affected table. SYSTEM RESTART REPLICA is also the correct tool when a replica's Keeper state is stale or inconsistent after a longer outage.

Q: ON CLUSTER DDL is hanging. What should I check first?
A: Query system.distributed_ddl_queue to identify which hosts have not completed the task. If a specific host is stuck in Active status, that host is likely offline or in read-only mode. Fix the replica first; the DDL task will complete when it comes back. If all hosts show Inactive, the DDLWorker thread may have crashed — check clickhouse-server.log and restart the server. You can also increase distributed_ddl_task_timeout with SET distributed_ddl_task_timeout = 600 for the current session.

Q: How many Keeper or ZooKeeper nodes should I run?
A: Three nodes is the minimum for fault tolerance; a three-node ensemble tolerates one node failure. Five nodes tolerate two failures and are appropriate for clusters where coordination availability is critical. More than five nodes increases write latency because Raft must wait for a quorum of nodes to acknowledge each commit. Do not use ZooKeeper observers for ClickHouse coordination; the ClickHouse documentation explicitly advises against it.

Q: Can I share one Keeper ensemble across multiple ClickHouse clusters?
A: It is technically supported by using distinct path prefixes for each cluster, but it is not recommended for production. A heavy workload or misconfiguration in one cluster can exhaust Keeper capacity and affect all clusters sharing the ensemble. Use separate Keeper ensembles when possible, and configure the <distributed_ddl><path> element in each cluster's config.xml to keep DDL queues isolated.

Q: What causes ZOPERATIONTIMEOUT errors, and how do I fix them?
A: ZOPERATIONTIMEOUT means a single Keeper operation did not complete within operation_timeout_ms (default 10,000 ms in the <zookeeper> client section of config.xml). This is almost always caused by an overloaded ensemble, not a network disconnect. Diagnose with echo mntr | nc keeper-host 2181 to inspect zk_avg_latency and zk_outstanding_requests. Fix by scaling the ensemble hardware, migrating to ClickHouse Keeper, or reducing the znode footprint through coarser partitioning and lower replicated_deduplication_window values. Only increase operation_timeout_ms as a short-term measure to avoid cascading failures while you address the root cause.

Subscribe to the Pulse Newsletter

Get early access to new Pulse features, insightful blogs & exclusive events , webinars, and workshops.

We use cookies to provide an optimized user experience and understand our traffic. To learn more, read our use of cookies; otherwise, please choose 'Accept Cookies' to continue using our website.