ZooKeeper is the coordination backbone of every replicated ClickHouse cluster. It stores replication metadata, distributed DDL tasks, and ephemeral locks. ClickHouse hammers it with small synchronous writes, so the ensemble's latency and throughput directly limit how fast a cluster can ingest data and converge after failures. This guide covers hardware sizing, JVM tuning, zoo.cfg parameters, monitoring, and when to migrate to ClickHouse Keeper.
Hardware and Topology
A ClickHouse-grade ZooKeeper ensemble looks nothing like a default Hadoop deployment. The dominant cost is transaction log fsync latency, not throughput or CPU.
| Resource | Recommendation |
|---|---|
| Nodes | 3 for most clusters, 5 for very large or multi-region setups. Always odd. |
| RAM | Minimum 4 GB, 8-16 GB for busy clusters. Swap disabled. |
| Disk for transaction log | Dedicated NVMe SSD. Never share with snapshots, OS, or other services. |
| Disk for snapshots | Separate volume from the transaction log. |
| Network | Low latency between ensemble members (sub-millisecond ideal). Bandwidth is not the constraint. |
| CPU | Reserved cores; protect from noisy neighbours. |
Co-locating ZooKeeper with clickhouse-server is permitted but risky. A heavy merge on the ClickHouse side starves the JVM and triggers session expirations. If you must co-locate, pin CPU and IO with cgroups.
Three-node ensembles tolerate one node failure. Five-node ensembles tolerate two, but every write must then be acknowledged by three nodes instead of two, which puts more fsyncs and network round trips on the commit path. Going beyond five rarely improves availability and always hurts write latency.
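The failure-tolerance figures above follow from simple majority arithmetic. The sketch below is illustrative only; the function names are made up for this example and are not part of any ZooKeeper tooling.

```python
def quorum_size(nodes: int) -> int:
    """Smallest strict majority of an ensemble with `nodes` voting members."""
    return nodes // 2 + 1


def tolerated_failures(nodes: int) -> int:
    """Members that can be lost while a majority can still be formed."""
    return nodes - quorum_size(nodes)


if __name__ == "__main__":
    for n in range(1, 8):
        print(f"{n} nodes: quorum={quorum_size(n)}, tolerates={tolerated_failures(n)}")
```

Running it shows why even sizes buy nothing: a four-node ensemble needs a quorum of three and still tolerates only one failure, the same as a three-node ensemble.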
zoo.cfg Parameters
A baseline zoo.cfg for ClickHouse:
tickTime=2000
initLimit=300
syncLimit=10
dataDir=/var/lib/zookeeper
dataLogDir=/var/log/zookeeper/transactions
clientPort=2181
autopurge.snapRetainCount=10
autopurge.purgeInterval=1
maxClientCnxns=2000
maxSessionTimeout=60000000
4lw.commands.whitelist=*
preAllocSize=131072
snapCount=3000000
leaderServes=yes
standaloneEnabled=false
admin.enableServer=false
server.1=zk1:2888:3888
server.2=zk2:2888:3888
server.3=zk3:2888:3888
Key choices:
- dataDir and dataLogDir on different physical disks. The transaction log disk must be fast and otherwise idle.
- autopurge.purgeInterval=1 keeps the snapshot directory from filling up.
- maxClientCnxns=2000 is generous, but a busy ClickHouse cluster can blow past the default 60.
- 4lw.commands.whitelist=* enables monitoring commands. Restrict to specific commands in security-sensitive environments.
- preAllocSize=131072 (KB) reduces fsync amplification on small writes.
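Several values in the baseline are easier to reason about once converted to wall-clock time: initLimit and syncLimit are counted in ticks, and ZooKeeper bounds negotiated session timeouts between minSessionTimeout (2 ticks when unset) and maxSessionTimeout (20 ticks when unset). A minimal sketch of that conversion follows; the helper names are hypothetical and the parser handles only plain key=value lines.

```python
def parse_zoo_cfg(text: str) -> dict:
    """Parse key=value lines, skipping blank lines and # comments."""
    settings = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        settings[key.strip()] = value.strip()
    return settings


def derived_timeouts_ms(settings: dict) -> dict:
    """Convert tick-based limits to milliseconds and resolve session bounds."""
    tick = int(settings["tickTime"])
    return {
        # Time a follower gets to connect to the leader and finish its initial sync.
        "init_timeout_ms": int(settings["initLimit"]) * tick,
        # How far a follower may lag behind the leader before it is dropped.
        "sync_timeout_ms": int(settings["syncLimit"]) * tick,
        # Session timeouts requested by clients are clamped to this range.
        "min_session_timeout_ms": int(settings.get("minSessionTimeout", 2 * tick)),
        "max_session_timeout_ms": int(settings.get("maxSessionTimeout", 20 * tick)),
    }


if __name__ == "__main__":
    baseline = "tickTime=2000\ninitLimit=300\nsyncLimit=10\nmaxSessionTimeout=60000000\n"
    for name, value in derived_timeouts_ms(parse_zoo_cfg(baseline)).items():
        print(f"{name} = {value}")
```

For the baseline above, initLimit=300 at tickTime=2000 gives a follower 600 seconds to load a large snapshot, and syncLimit=10 allows 20 seconds of lag. The high maxSessionTimeout means the session timeout a ClickHouse client requests is unlikely to be clamped downward.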
JVM Tuning
ZooKeeper is a JVM application. GC pauses are the single biggest source of instability for ClickHouse clusters. A pause longer than session_timeout_ms (default 30 seconds, but effectively much less under load) drops every ClickHouse session on that node.
In /etc/zookeeper/conf/java.env:
export JAVA_OPTS="\
-Xms3G -Xmx3G \
-XX:+UseG1GC \
-XX:MaxGCPauseMillis=50 \
-XX:+AlwaysPreTouch \
-XX:+ParallelRefProcEnabled \
-XX:+DisableExplicitGC"
export JVMFLAGS="-Djute.maxbuffer=8388608 $JAVA_OPTS"
Heap sizing rules:
- Set -Xms equal to -Xmx to avoid heap resize stalls.
- Never exceed 75% of system RAM. The OS page cache holds the snapshot file, and cache pressure is worse than a smaller heap.
- For a 4 GB box, 3 GB heap is the maximum. For 16 GB, prefer 8-12 GB.
- Disable swap. A swapped JVM stalls under GC and never recovers in time.
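The first two rules can be encoded as a small guard in a provisioning script. This is a sketch under the 75% ceiling stated above; the function names and the choice of megabytes as the unit are assumptions for the example.

```python
def max_heap_mb(system_ram_mb: int, ceiling_fraction: float = 0.75) -> int:
    """Largest heap allowed under the 75%-of-RAM rule, in megabytes."""
    return int(system_ram_mb * ceiling_fraction)


def heap_flags(heap_mb: int, system_ram_mb: int) -> str:
    """Return matching -Xms/-Xmx flags, refusing heaps above the ceiling."""
    limit = max_heap_mb(system_ram_mb)
    if heap_mb > limit:
        raise ValueError(
            f"heap of {heap_mb} MB exceeds the {limit} MB ceiling "
            f"for a host with {system_ram_mb} MB of RAM"
        )
    # Equal -Xms and -Xmx so the JVM never resizes the heap at runtime.
    return f"-Xms{heap_mb}m -Xmx{heap_mb}m"


if __name__ == "__main__":
    print(heap_flags(3072, 4096))
```

The guard only enforces the upper bound. Picking a value inside the 8-12 GB range for a 16 GB host remains a judgment call based on znode count and snapshot size.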
jute.maxbuffer must match between every ZooKeeper node and every ClickHouse server. 8 MB is the recommended value for clusters that run mutations on large tables.
ClickHouse Configuration
Point ClickHouse at the ensemble in config.xml:
<zookeeper>
<node>
<host>zk1.internal</host>
<port>2181</port>
</node>
<node>
<host>zk2.internal</host>
<port>2181</port>
</node>
<node>
<host>zk3.internal</host>
<port>2181</port>
</node>
<session_timeout_ms>30000</session_timeout_ms>
<operation_timeout_ms>30000</operation_timeout_ms>
<root>/clickhouse</root>
</zookeeper>
The <root> element scopes every znode ClickHouse writes under a single prefix, which makes multi-cluster ZooKeeper hosting cleaner and limits the blast radius of an accidental delete.
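When several clusters share one ensemble, generating this block from a template helps keep each cluster's root distinct. The sketch below builds the same structure as the snippet above with Python's standard library; the function name and the cluster_a path are hypothetical, and ET.indent requires Python 3.9 or newer.

```python
import xml.etree.ElementTree as ET


def zookeeper_config(hosts, root, port=2181,
                     session_timeout_ms=30000, operation_timeout_ms=30000):
    """Render a <zookeeper> block with one <node> per ensemble member."""
    zk = ET.Element("zookeeper")
    for host in hosts:
        node = ET.SubElement(zk, "node")
        ET.SubElement(node, "host").text = host
        ET.SubElement(node, "port").text = str(port)
    ET.SubElement(zk, "session_timeout_ms").text = str(session_timeout_ms)
    ET.SubElement(zk, "operation_timeout_ms").text = str(operation_timeout_ms)
    # A distinct root per cluster keeps their znodes under separate prefixes.
    ET.SubElement(zk, "root").text = root
    ET.indent(zk, space="    ")
    return ET.tostring(zk, encoding="unicode")


if __name__ == "__main__":
    members = ["zk1.internal", "zk2.internal", "zk3.internal"]
    print(zookeeper_config(members, "/clickhouse/cluster_a"))
```

Calling it a second time with a different root, for example /clickhouse/cluster_b, produces the config for another cluster on the same ensemble.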
Monitoring
ZooKeeper exposes four-letter commands and a Prometheus exporter. Treat these metrics as required, not optional:
# Quick health check
echo ruok | nc zk1 2181 # expect "imok"
# Detailed metrics
echo mntr | nc zk1 2181
# Show follower sync state
echo mntr | nc zk1 2181 | grep foll
Critical metrics:
| Metric | Alert when |
|---|---|
| zk_followers | Less than (ensemble_size - 1) |
| zk_synced_followers | Less than zk_followers |
| zk_outstanding_requests | Sustained above 10 |
| zk_max_latency | Spikes above 1000 ms |
| zk_avg_latency | Sustained above 50 ms |
| zk_pending_syncs | Non-zero on followers |
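The thresholds in this table can be checked against raw mntr output with a few lines of code. The sketch below assumes the usual mntr format of one key and value per line separated by a tab, and it evaluates a single sample, so the "sustained" conditions still need to be confirmed over several scrapes. The function names are hypothetical. The follower-related keys are checked only when present, since they are typically reported by the leader alone.

```python
def _number(value: str):
    """Return value as int or float when possible, otherwise unchanged."""
    for cast in (int, float):
        try:
            return cast(value)
        except ValueError:
            pass
    return value


def parse_mntr(text: str) -> dict:
    """Turn mntr output into a dict, converting numeric values."""
    metrics = {}
    for line in text.splitlines():
        parts = line.split(None, 1)  # key, then the rest of the line
        if len(parts) == 2:
            metrics[parts[0]] = _number(parts[1].strip())
    return metrics


def check_alerts(metrics: dict, ensemble_size: int) -> list:
    """Compare one mntr sample against the thresholds in the table."""
    alerts = []
    followers = metrics.get("zk_followers")
    if followers is not None:
        if followers < ensemble_size - 1:
            alerts.append("zk_followers below ensemble_size - 1")
        synced = metrics.get("zk_synced_followers")
        if synced is not None and synced < followers:
            alerts.append("zk_synced_followers below zk_followers")
    if metrics.get("zk_outstanding_requests", 0) > 10:
        alerts.append("zk_outstanding_requests above 10")
    if metrics.get("zk_max_latency", 0) > 1000:
        alerts.append("zk_max_latency above 1000 ms")
    if metrics.get("zk_avg_latency", 0) > 50:
        alerts.append("zk_avg_latency above 50 ms")
    if metrics.get("zk_pending_syncs", 0) > 0:
        alerts.append("zk_pending_syncs non-zero")
    return alerts
```

Feed it the text captured from echo mntr | nc zk1 2181 and pass the ensemble size from your own inventory; an empty list means no threshold was crossed in that sample.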
From the ClickHouse side, watch system.zookeeper_log (21.11+) for OperationTimeout and ConnectionLoss events, and system.metrics for ZooKeeperSession, ZooKeeperWatch, and ZooKeeperRequest.
ClickHouse Keeper as the Modern Alternative
ClickHouse Keeper has been GA since version 22.3 and is now the recommended choice for new deployments. It is written in C++, uses the Raft consensus protocol, speaks the ZooKeeper wire protocol so existing ClickHouse clients work unchanged, and runs either embedded in clickhouse-server or as a standalone clickhouse-keeper binary.
Advantages over ZooKeeper:
- No JVM, so no GC pauses and no jute.maxbuffer ambiguity.
- Lower memory footprint for the same workload.
- Native snapshot and log compression.
- One operational binary across the whole stack.
Migration from ZooKeeper to Keeper is documented in the official ClickHouse docs and uses the clickhouse-keeper-converter tool to translate snapshots. Plan the migration during a maintenance window. Mixing ZooKeeper and Keeper inside a single coordination cluster is not supported.
Common Pitfalls
- Co-locating the transaction log with other busy IO. Slow fsync is the dominant failure mode.
- Allocating too much heap. A swapping ZooKeeper is worse than a small one.
- Enabling swap on ZooKeeper hosts. Always swapoff -a and remove the entry from /etc/fstab.
- Default jute.maxbuffer of 1 MB. Mutations on large tables hit this limit and trigger session expirations.
- Exposing ZooKeeper publicly. Deploy behind a firewall; ZooKeeper has no meaningful authentication by default.
- Running an even number of nodes. Two nodes are strictly worse than one for availability.
- Skipping monitoring. ZooKeeper degrades silently until a quorum-affecting event reveals the rot.
Frequently Asked Questions
Q: Can I run ClickHouse without ZooKeeper or Keeper?
A: Yes, but only for non-replicated single-node deployments. ReplicatedMergeTree, distributed DDL, and any cross-replica coordination require a coordination service.
Q: How many ZooKeeper nodes should I run for ClickHouse?
A: Three for most production clusters. Five if you operate across multiple availability zones or run more than a few dozen ClickHouse nodes. Always odd.
Q: Should I migrate from ZooKeeper to ClickHouse Keeper?
A: For new clusters, start with Keeper. For existing healthy ZooKeeper deployments, migrate when you next have a maintenance window, especially if you have hit GC-related session expirations.
Q: Can ZooKeeper be shared across multiple ClickHouse clusters?
A: Yes. Use a different <root> path per cluster in each ClickHouse config. Sizing must account for the combined load.
Q: What's the easiest way to inspect ZooKeeper state from ClickHouse?
A: Query the system.zookeeper table directly: SELECT name, ctime, numChildren FROM system.zookeeper WHERE path = '/clickhouse'. You can also use zkCli.sh from any ZooKeeper installation for low-level navigation.