ClickHouse ZooKeeper Configuration Guide: Sizing and Tuning

Q: Can I run ClickHouse without ZooKeeper or Keeper?

Yes, but only for non-replicated single-node deployments. ReplicatedMergeTree , distributed DDL, and any cross-replica coordination require a coordination service. The same service can also propagate RBAC state across replicas; see Replicating ClickHouse RBAC via Keeper .

Q: Can ZooKeeper be shared across multiple ClickHouse clusters?

Yes. Use a different path per cluster in each ClickHouse config. Sizing must account for the combined load.

Q: What's the easiest way to inspect ZooKeeper state from ClickHouse?

Query the system.zookeeper table directly: SELECT name, ctime, numChildren FROM system.zookeeper WHERE path = '/clickhouse' . You can also use zkCli.sh from any ZooKeeper installation for low-level navigation.

ZooKeeper is the coordination backbone of every replicated ClickHouse cluster. It stores replication metadata, distributed DDL tasks, and ephemeral locks. ClickHouse hammers it with small synchronous writes, so the ensemble's latency and throughput directly limit how fast a cluster can ingest data and converge after failures. This guide covers hardware sizing, JVM tuning, zoo.cfg parameters, monitoring, and when to migrate to ClickHouse Keeper.

Hardware and Topology

A ClickHouse-grade ZooKeeper ensemble looks nothing like a default Hadoop deployment. The dominant cost is transaction log fsync latency, not throughput or CPU.

Resource	Recommendation
Nodes	3 for most clusters, 5 for very large or multi-region setups. Always odd.
RAM	Minimum 4 GB, 8-16 GB for busy clusters. Swap disabled.
Disk for transaction log	Dedicated NVMe SSD. Never share with snapshots, OS, or other services.
Disk for snapshots	Separate volume from the transaction log.
Network	Low latency between ensemble members (sub-millisecond ideal). Bandwidth is not the constraint.
CPU	Reserved cores; protect from noisy neighbours.

Co-locating ZooKeeper with clickhouse-server is permitted but risky. A heavy merge on the ClickHouse side starves the JVM and triggers session expirations. If you must co-locate, pin CPU and IO with cgroups.

Three-node ensembles tolerate one node failure. Five-node ensembles tolerate two but require quorum from three, doubling the fsync cost of every write. Going beyond five rarely improves availability and always hurts write latency.

zoo.cfg Parameters

A baseline zoo.cfg for ClickHouse:

tickTime=2000
initLimit=300
syncLimit=10
dataDir=/var/lib/zookeeper
dataLogDir=/var/log/zookeeper/transactions
clientPort=2181
autopurge.snapRetainCount=10
autopurge.purgeInterval=1
maxClientCnxns=2000
maxSessionTimeout=60000000
4lw.commands.whitelist=*
preAllocSize=131072
snapCount=3000000
leaderServes=yes
standaloneEnabled=false
admin.enableServer=false

server.1=zk1:2888:3888
server.2=zk2:2888:3888
server.3=zk3:2888:3888

Key choices:

dataDir and dataLogDir on different physical disks. The transaction log disk must be fast and otherwise idle.
autopurge.purgeInterval=1 keeps the snapshot directory from filling up.
maxClientCnxns=2000 is generous, but a busy ClickHouse cluster can blow past the default 60.
4lw.commands.whitelist=* enables monitoring commands. Restrict to specific commands in security-sensitive environments.
preAllocSize=131072 (KB) reduces fsync amplification on small writes.

JVM Tuning

ZooKeeper is a JVM application. GC pauses are the single biggest source of instability for ClickHouse clusters. A pause longer than session_timeout_ms (default 30 seconds, but effectively much less under load) drops every ClickHouse session on that node.

In /etc/zookeeper/conf/java.env:

export JAVA_OPTS="\
  -Xms3G -Xmx3G \
  -XX:+UseG1GC \
  -XX:MaxGCPauseMillis=50 \
  -XX:+AlwaysPreTouch \
  -XX:+ParallelRefProcEnabled \
  -XX:+DisableExplicitGC"

export JVMFLAGS="-Djute.maxbuffer=8388608 $JAVA_OPTS"

Heap sizing rules:

Set -Xms equal to -Xmx to avoid heap resize stalls.
Never exceed 75% of system RAM. The OS page cache holds the snapshot file, and cache pressure is worse than a smaller heap.
For a 4 GB box, 3 GB heap is the maximum. For 16 GB, prefer 8-12 GB.
Disable swap. A swapped JVM stalls under GC and never recovers in time.

jute.maxbuffer must match between every ZooKeeper node and every ClickHouse server. 8 MB is the recommended value for clusters that run mutations on large tables. A mismatched or undersized buffer is a frequent cause of dropped sessions; see ClickHouse ZooKeeper Session Expired for how this surfaces at the ClickHouse layer.

ClickHouse Configuration

Point ClickHouse at the ensemble in config.xml:

<zookeeper>
    <node>
        <host>zk1.internal</host>
        <port>2181</port>
    </node>
    <node>
        <host>zk2.internal</host>
        <port>2181</port>
    </node>
    <node>
        <host>zk3.internal</host>
        <port>2181</port>
    </node>
    <session_timeout_ms>30000</session_timeout_ms>
    <operation_timeout_ms>30000</operation_timeout_ms>
    <root>/clickhouse</root>
</zookeeper>

The <root> element scopes every znode ClickHouse writes under a single prefix, which makes multi-cluster ZooKeeper hosting cleaner and limits the blast radius of an accidental delete.

Monitoring

ZooKeeper exposes four-letter commands and a Prometheus exporter. Treat these metrics as required, not optional:

# Quick health check
echo ruok | nc zk1 2181        # expect "imok"

# Detailed metrics
echo mntr | nc zk1 2181

# Show follower sync state
echo mntr | nc zk1 2181 | grep foll

Critical metrics:

Metric	Alert when
`zk_followers`	Less than `(ensemble_size - 1)`
`zk_synced_followers`	Less than `zk_followers`
`zk_outstanding_requests`	Sustained above 10
`zk_max_latency`	Spikes above 1000 ms
`zk_avg_latency`	Sustained above 50 ms
`zk_pending_syncs`	Non-zero on followers

From the ClickHouse side, watch system.zookeeper_log (21.11+) for OperationTimeout and ConnectionLoss events, and system.metrics for ZooKeeperSession, ZooKeeperWatch, and ZooKeeperRequest.

When these metrics show sustained queueing or rising latency, the cause is usually request pressure rather than a hardware fault. For a deeper diagnostic walkthrough, see ClickHouse ZooKeeper and Keeper Coordination Bottlenecks. To inspect specific replication metadata stored in the ensemble, see Checking Table Metadata in ZooKeeper.

ClickHouse Keeper as the Modern Alternative

ClickHouse Keeper has been GA since version 22.3 and is now the recommended choice for new deployments. It is written in C++, uses the RAFT consensus protocol, speaks the ZooKeeper wire protocol so existing ClickHouse clients work unchanged, and runs either embedded in clickhouse-server or as a standalone clickhouse-keeper binary. For a full overview of its architecture and configuration, see What Is ClickHouse Keeper.

Advantages over ZooKeeper:

No JVM, so no GC pauses and no jute.maxbuffer ambiguity.
Lower memory footprint for the same workload.
Native snapshot and log compression.
One operational binary across the whole stack.

Migration from ZooKeeper to Keeper is documented in the official ClickHouse docs and uses the clickhouse-keeper-converter tool to translate snapshots. Plan the migration during a maintenance window. Mixing ZooKeeper and Keeper inside a single coordination cluster is not supported.

Common Pitfalls

Co-locating the transaction log with other busy IO. Slow fsync is the dominant failure mode.
Allocating too much heap. A swapping ZooKeeper is worse than a small one.
Enabling swap on ZooKeeper hosts. Always swapoff -a and remove the entry from /etc/fstab.
Default jute.maxbuffer of 1 MB. Mutations on large tables hit this limit and trigger session expirations.
Exposing ZooKeeper publicly. Deploy behind a firewall; ZooKeeper has no meaningful authentication by default.
Running an even number of nodes. Two nodes are strictly worse than one for availability.
Skipping monitoring. ZooKeeper degrades silently until a quorum-affecting event reveals the rot.

Frequently Asked Questions

Q: Can I run ClickHouse without ZooKeeper or Keeper? A: Yes, but only for non-replicated single-node deployments. ReplicatedMergeTree, distributed DDL, and any cross-replica coordination require a coordination service. The same service can also propagate RBAC state across replicas; see Replicating ClickHouse RBAC via Keeper.

Q: How many ZooKeeper nodes should I run for ClickHouse? A: Three for most production clusters. Five if you operate across multiple availability zones or run more than a few dozen ClickHouse nodes. Always odd.

Q: Should I migrate from ZooKeeper to ClickHouse Keeper? A: For new clusters, start with Keeper. For existing healthy ZooKeeper deployments, migrate when you next have a maintenance window, especially if you have hit GC-related session expirations.

Q: Can ZooKeeper be shared across multiple ClickHouse clusters? A: Yes. Use a different <root> path per cluster in each ClickHouse config. Sizing must account for the combined load.

Q: What's the easiest way to inspect ZooKeeper state from ClickHouse? A: Query the system.zookeeper table directly: SELECT name, ctime, numChildren FROM system.zookeeper WHERE path = '/clickhouse'. You can also use zkCli.sh from any ZooKeeper installation for low-level navigation.