ClickHouse Keeper is the built-in coordination service that ClickHouse uses for replication metadata, leader election, distributed DDL, and distributed locks. It implements the Raft consensus algorithm in C++ and speaks the ZooKeeper client protocol, so it is a drop-in replacement for Apache ZooKeeper in any ClickHouse cluster.
This guide covers how to deploy Keeper, size and tune it, point ClickHouse at it, migrate off ZooKeeper, and operate it in production. For a high-level definition see what is ClickHouse Keeper; for the ZooKeeper-side equivalents see the ZooKeeper configuration guide.
Embedded vs. Standalone Deployment
Keeper ships inside the ClickHouse server package, so you can run it two ways:
| Mode | How it runs | When to use |
|---|---|---|
| Embedded | A <keeper_server> block in clickhouse-server's config; Keeper runs as a thread inside the server process | Small clusters, dev/test, where you want fewer moving parts |
| Standalone | The dedicated clickhouse-keeper binary on separate hosts | Production: isolates coordination from query/merge load and lets you scale and restart independently |
Start a standalone instance with either invocation:
clickhouse-keeper --config /etc/clickhouse-keeper/keeper_config.xml
# or, using the multi-call binary:
clickhouse keeper --config /etc/clickhouse-keeper/keeper_config.xml
For anything beyond a single test box, prefer standalone Keeper on dedicated nodes. Coordination latency directly affects insert and DDL throughput, and you do not want a heavy merge or a memory-hungry query starving the Raft log writer. This separation also avoids the kind of contention described in the Keeper coordination bottlenecks guide.
Sizing and Quorum
Keeper forms a Raft quorum, so you always deploy an odd number of nodes. A majority of the N voting nodes, floor(N/2) + 1, must be available to commit writes:
| Nodes | Quorum | Tolerated failures |
|---|---|---|
| 1 | 1 | 0 (no HA) |
| 3 | 2 | 1 |
| 5 | 3 | 2 |
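The arithmetic behind the table can be sketched in a few lines of Python. The function names here are illustrative, not part of any ClickHouse tooling; the sketch only restates the majority rule and shows why an even node count buys no extra fault tolerance.

```python
def quorum_size(nodes: int) -> int:
    """Smallest majority of an ensemble with `nodes` voting members."""
    if nodes < 1:
        raise ValueError("an ensemble needs at least one voting node")
    return nodes // 2 + 1


def tolerated_failures(nodes: int) -> int:
    """Voting nodes that can be lost while a majority is still reachable."""
    return nodes - quorum_size(nodes)


if __name__ == "__main__":
    # Four nodes tolerate the same single failure as three.
    for n in (1, 2, 3, 4, 5):
        print(f"nodes={n} quorum={quorum_size(n)} tolerated={tolerated_failures(n)}")
```

Running it shows that 2 nodes tolerate no failures and 4 nodes tolerate only one, the same as 3, which is why odd counts are the norm.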
Three nodes is the right answer for almost everyone. Altinity explicitly recommends not running more than three voting Keeper nodes (excluding observers): larger ensembles increase election time and per-write commit latency because every write must be acknowledged by a majority, which can slow inserts and DDL without improving anything. Go to five only when you genuinely need to survive two simultaneous node losses.
Hardware guidance, drawn from Keeper/ZooKeeper production practice:
- RAM: at least 4 GB, with swap disabled. Keeper holds its data set in memory.
- Disk: a fast, dedicated disk for the Raft log is the single most important factor. NVMe/SSD is strongly preferred because every committed write is fsync'd to the log before it is acknowledged (force_sync is on by default). ~128 GB is typically plenty.
- Network: low latency between Keeper nodes matters far more than bandwidth. Raft commit time is gated by the slowest acknowledging follower.
Distribute the three nodes across separate failure domains (racks or availability zones) so a single domain outage cannot take down the quorum.
The keeper_server Config Block
Coordination is configured under <keeper_server>. Each node gets a unique server_id, and the full ensemble topology is listed identically in <raft_configuration> on every node.
<clickhouse>
<keeper_server>
<tcp_port>9181</tcp_port>
<server_id>1</server_id>
<log_storage_path>/var/lib/clickhouse/coordination/log</log_storage_path>
<snapshot_storage_path>/var/lib/clickhouse/coordination/snapshots</snapshot_storage_path>
<coordination_settings>
<operation_timeout_ms>10000</operation_timeout_ms>
<session_timeout_ms>30000</session_timeout_ms>
<raft_logs_level>information</raft_logs_level>
</coordination_settings>
<raft_configuration>
<server>
<id>1</id>
<hostname>keeper1.internal</hostname>
<port>9234</port>
</server>
<server>
<id>2</id>
<hostname>keeper2.internal</hostname>
<port>9234</port>
</server>
<server>
<id>3</id>
<hostname>keeper3.internal</hostname>
<port>9234</port>
</server>
</raft_configuration>
</keeper_server>
</clickhouse>
Key parameters:
- tcp_port: client (ZooKeeper protocol) port. Conventionally 9181 for ClickHouse Keeper. ZooKeeper's own default client port is 2181, but ClickHouse deployments standardize on 9181 so Keeper does not collide with a real ZooKeeper.
- server_id: unique integer per node. Never reuse or shuffle a server_id for a different host; keep the id-to-hostname mapping stable for the life of the cluster.
- log_storage_path / snapshot_storage_path: put the log on your fastest, least-busy disk, ideally separate from snapshots. The log is on the hot write path; snapshots are written periodically.
- <port>9234</port> inside raft_configuration is the inter-node Raft port, distinct from the client tcp_port.
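Because the ensemble topology has to be identical on every node and each server_id has to be unique, a small pre-deployment check can catch drift before a node refuses to join. The following is a minimal sketch using only the Python standard library; the function names and the idea of passing config text keyed by a label are assumptions for illustration, and it only understands the element layout shown in the example above.

```python
import xml.etree.ElementTree as ET


def parse_keeper_config(xml_text):
    """Return (server_id, sorted raft members) from one Keeper config document."""
    root = ET.fromstring(xml_text)
    keeper = root.find("keeper_server")
    if keeper is None:
        raise ValueError("no <keeper_server> block found")
    server_id = int(keeper.findtext("server_id"))
    members = sorted(
        (int(s.findtext("id")), s.findtext("hostname"), int(s.findtext("port")))
        for s in keeper.findall("raft_configuration/server")
    )
    return server_id, members


def check_ensemble(configs):
    """configs maps a label (e.g. a file name) to config XML text.

    Returns a list of human-readable problems; an empty list means the
    configs agree with each other.
    """
    problems = []
    seen_ids = {}
    reference = None
    for name, text in configs.items():
        server_id, members = parse_keeper_config(text)
        if server_id in seen_ids:
            problems.append(
                f"{name}: server_id {server_id} already used by {seen_ids[server_id]}"
            )
        seen_ids.setdefault(server_id, name)
        if server_id not in {member[0] for member in members}:
            problems.append(
                f"{name}: server_id {server_id} missing from raft_configuration"
            )
        if reference is None:
            reference = members  # first config is the baseline
        elif members != reference:
            problems.append(f"{name}: raft_configuration differs from the first config")
    return problems
```

Feeding it the text of each node's config file and failing the rollout when the returned list is non-empty is one way to use it; it does not replace testing the configuration against a real Keeper binary.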
coordination_settings worth knowing
These live under <keeper_server><coordination_settings>:
| Setting | Default | Notes |
|---|---|---|
| operation_timeout_ms | 10000 | Timeout for a single operation |
| session_timeout_ms | 100000 | Max client session timeout (many deployments lower this to 30000) |
| heart_beat_interval_ms | 500 | Leader → follower heartbeat interval |
| election_timeout_lower_bound_ms | 1000 | Lower bound before a follower starts an election |
| election_timeout_upper_bound_ms | 2000 | Upper bound for the same |
| snapshot_distance | 100000 | Log records between snapshots |
| force_sync | true | fsync each log write before ack; keep on for durability |
| auto_forwarding | true | Followers forward writes to the leader |
| async_replication | false | See below; turn this on |
Enable async_replication
If every node in the ensemble runs a version that supports it (v23.9+), enable async replication. ClickHouse's own docs recommend it because it improves coordination performance with no downsides; it is off by default only for backward compatibility.
<coordination_settings>
<async_replication>true</async_replication>
</coordination_settings>
Roll it out only once all Keeper nodes are upgraded.
Pointing ClickHouse at Keeper
ClickHouse servers connect to Keeper through the same <zookeeper> block they would use for real ZooKeeper — the client protocol is identical. List all Keeper nodes and the client port:
<clickhouse>
<zookeeper>
<node>
<host>keeper1.internal</host>
<port>9181</port>
</node>
<node>
<host>keeper2.internal</host>
<port>9181</port>
</node>
<node>
<host>keeper3.internal</host>
<port>9181</port>
</node>
</zookeeper>
</clickhouse>
Verify the connection from a ClickHouse server:
SELECT * FROM system.zookeeper WHERE path = '/' LIMIT 10;
-- Which Keeper node this server is currently connected to:
SELECT * FROM system.zookeeper_connection;
A populated result from system.zookeeper means the server is talking to the ensemble. ReplicatedMergeTree tables and ON CLUSTER DDL will then work against Keeper exactly as they did against ZooKeeper.
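If the Keeper host list is managed in a deployment script, the <zookeeper> block can be generated rather than hand-edited, which keeps it identical across ClickHouse servers. This is a minimal sketch with the Python standard library; zookeeper_block is a hypothetical helper name, and the default port of 9181 follows the convention used in this guide.

```python
import xml.etree.ElementTree as ET


def zookeeper_block(hosts, port=9181):
    """Render a <clickhouse><zookeeper> config listing every Keeper host."""
    root = ET.Element("clickhouse")
    zookeeper = ET.SubElement(root, "zookeeper")
    for host in hosts:
        node = ET.SubElement(zookeeper, "node")
        ET.SubElement(node, "host").text = host
        ET.SubElement(node, "port").text = str(port)
    ET.indent(root, space="    ")  # pretty-printing; requires Python 3.9+
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    print(zookeeper_block(["keeper1.internal", "keeper2.internal", "keeper3.internal"]))
```

The output can be written to a file under the server's config.d directory or wherever your deployment keeps configuration overrides.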
Migrating from ZooKeeper
You cannot mix ZooKeeper and Keeper for the same cluster, and Keeper's on-disk log/snapshot format is not ZooKeeper-compatible — so migration means converting state, not copying files.
1. Stop ZooKeeper writes. Take all ClickHouse servers offline (or otherwise stop all coordination writes) so the ZooKeeper snapshot is consistent.
2. Create a fresh ZooKeeper snapshot (for example by stopping and restarting the ZooKeeper leader, which forces it to write one) so logs and snapshots are flushed to disk.
3. Convert the ZooKeeper data (3.4+) to a Keeper snapshot:

   clickhouse-keeper-converter \
     --zookeeper-logs-dir /var/lib/zookeeper/version-2 \
     --zookeeper-snapshots-dir /var/lib/zookeeper/version-2 \
     --output-dir /var/lib/clickhouse/coordination/snapshots

4. Distribute the generated snapshot to every Keeper node before starting any of them. Copy the same snapshot into each node's snapshot_storage_path. If a node starts without the snapshot, it may be elected leader with empty state and you lose your data.
5. Start the Keeper ensemble, then update each ClickHouse server's <zookeeper> block to point at the Keeper nodes and restart.
After migration, check system.zookeeper and replication status on a few tables before resuming traffic.
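Since a node that starts without the converted snapshot can win the election with empty state, it is worth verifying the copies before the first start. The sketch below compares SHA-256 digests of the files in each node's snapshot directory against a reference directory. It assumes the directories are reachable as local paths (for example after gathering them onto one host or running it per node); the function names are hypothetical.

```python
import hashlib
from pathlib import Path


def _sha256(path, chunk_size=1 << 20):
    """Hash a file in chunks so large snapshots are not loaded into memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def snapshot_digests(directory):
    """Map each regular file in `directory` to its SHA-256 hex digest."""
    return {
        path.name: _sha256(path)
        for path in sorted(Path(directory).iterdir())
        if path.is_file()
    }


def nodes_missing_snapshot(reference_dir, node_dirs):
    """Return labels of nodes whose directory does not match the reference.

    node_dirs maps a label (e.g. a hostname) to that node's snapshot directory.
    """
    expected = snapshot_digests(reference_dir)
    if not expected:
        raise ValueError("reference directory contains no snapshot files")
    return [
        label
        for label, directory in node_dirs.items()
        if snapshot_digests(directory) != expected
    ]
```

An empty result means every listed directory holds byte-identical copies of the converter output; any label returned should be fixed before the ensemble is started.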
Monitoring Keeper
Keeper supports ZooKeeper's "four-letter word" (4lw) commands over the client port. They must be allowlisted via <keeper_server><four_letter_word_white_list> (a sensible default set is enabled; mntr,conf,ruok and others are commonly allowed).
# Is this node alive? -> "imok"
echo ruok | nc localhost 9181
# Monitoring metrics: leader/follower state, latency, znode count, ...
echo mntr | nc localhost 9181
# Raft log indices, terms, committed/last-snapshot state:
echo lgif | nc localhost 9181
# Is this node read-only (lost quorum)? -> "ro" / "rw"
echo isro | nc localhost 9181
From mntr, watch zk_followers / zk_synced_followers (quorum health), zk_avg_latency / zk_max_latency (commit latency), and zk_znode_count (state size growth). Inside ClickHouse, the system.zookeeper_connection, system.zookeeper_log, and Keeper-specific metrics in system.metrics / system.events give you the server-side view. A growing zk_znode_count paired with rising latency is the classic sign of coordination overload — see the coordination bottlenecks guide for how to reduce part churn and DDL pressure.
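For alerting, the same mntr call can be made from a script and turned into checks. The following is a rough sketch using a plain TCP socket; it assumes mntr replies with one whitespace-separated key/value pair per line and that the reply includes a zk_server_state field. The function names, the expected follower count, and the 1000 ms latency threshold are illustrative assumptions to tune for your cluster.

```python
import socket


def four_letter_word(host, command, port=9181, timeout=5.0):
    """Send a 4lw command to a Keeper node and return the raw text reply."""
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(command.encode("ascii"))
        chunks = []
        while True:
            data = sock.recv(4096)
            if not data:  # server closed the connection after replying
                break
            chunks.append(data)
    return b"".join(chunks).decode("utf-8", errors="replace")


def parse_mntr(text):
    """Parse 'key<whitespace>value' lines into a dict of strings."""
    metrics = {}
    for line in text.splitlines():
        parts = line.split(None, 1)
        if len(parts) == 2:
            metrics[parts[0]] = parts[1].strip()
    return metrics


def quorum_warnings(metrics, expected_followers=2, max_latency_ms=1000):
    """Return warning strings; thresholds are assumptions, not recommendations."""
    warnings = []
    if metrics.get("zk_server_state") == "leader":
        synced = int(metrics.get("zk_synced_followers", 0))
        if synced < expected_followers:
            warnings.append(
                f"only {synced} of {expected_followers} followers are in sync"
            )
    latency = int(metrics.get("zk_max_latency", 0))
    if latency > max_latency_ms:
        warnings.append(f"zk_max_latency is {latency} ms")
    return warnings
```

A typical call would be quorum_warnings(parse_mntr(four_letter_word("keeper1.internal", "mntr"))), run against each node, since the follower counts are only meaningful on the leader.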
Snapshots, Logs, and Disk Hygiene
Keeper writes a Raft log continuously and takes a snapshot every snapshot_distance (default 100,000) records, after which older log segments can be reclaimed. Two operational points:
- Keep the log and snapshot paths on disks with healthy free space and good fsync latency. A slow or full log disk stalls every coordination write cluster-wide.
- Do not hand-delete files under log_storage_path or snapshot_storage_path. Keeper manages retention itself; manual deletion can corrupt state and is a common cause of Raft errors. If a single node's state is corrupt, the safe recovery is usually to stop it, wipe its coordination directory, and let it re-sync a fresh snapshot from the leader.
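Free space on the log and snapshot disks is easy to check from a script. This is a minimal sketch with the Python standard library; the helper names and the 20% threshold are assumptions, and it covers free space only, not fsync latency.

```python
import shutil


def free_fraction(path):
    """Fraction of the filesystem holding `path` that is still free (0.0 to 1.0)."""
    usage = shutil.disk_usage(path)
    return usage.free / usage.total


def low_space_paths(paths, min_free_fraction=0.2):
    """Return the paths whose filesystem has less free space than the threshold."""
    return [path for path in paths if free_fraction(path) < min_free_fraction]


if __name__ == "__main__":
    # Paths taken from the example config in this guide; adjust to your layout.
    keeper_paths = [
        "/var/lib/clickhouse/coordination/log",
        "/var/lib/clickhouse/coordination/snapshots",
    ]
    try:
        for path in low_space_paths(keeper_paths):
            print(f"low free space on the disk holding {path}")
    except FileNotFoundError as error:
        print(f"path not found: {error.filename}")
```

Wiring the result into an existing alerting pipeline is usually preferable to running it by hand.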
For more on what lives in coordination state and how to inspect it, see checking table metadata in ZooKeeper/Keeper.
Best Practices
- Run three dedicated standalone Keeper nodes across separate failure domains for any production cluster. Co-locating Keeper with busy ClickHouse servers invites latency spikes.
- Give Keeper a fast, dedicated disk for the log, separate from snapshots and from ClickHouse data, with force_sync left on.
- Disable swap and provision ≥ 4 GB RAM on Keeper nodes.
- Enable async_replication once all nodes support it (v23.9+).
- Keep server_id ↔ hostname mappings immutable. Document them; never recycle an id.
- Stay on a recent ClickHouse version. Keeper improves steadily, and newer feature flags are enabled by default. If you skip many versions, upgrade through an intermediate release.
- Monitor quorum and latency, not just liveness. Alert on lost synced followers and on rising zk_max_latency.
Common Issues
- No quorum after one failure: you deployed 2 voting nodes. A 2-node ensemble loses quorum on any single failure, and an even count in general tolerates no more failures than the next lower odd count. Use 3.
- Cannot create new ZooKeeper session / session expired: usually network latency, an overloaded Keeper disk, or a session_timeout_ms that is too tight. See cannot create new ZooKeeper session and ZooKeeper session expired.
- Raft errors / a node won't join: mismatched raft_configuration across nodes, a reused server_id, or corrupt local state. See Raft error.
- Slow inserts and DDL: often too many voting Keeper nodes, an undersized log disk, or too much part/DDL churn hammering coordination. See coordination bottlenecks.
How Pulse Helps
Coordination problems are some of the hardest ClickHouse failures to diagnose because they show up indirectly — as stalled replication, failing DDL, or session expirations rather than an obvious "Keeper is down." Pulse continuously monitors the health of your Keeper ensemble alongside the rest of your ClickHouse cluster: quorum status, commit latency, znode growth, session churn, and the replication queues that depend on coordination. It surfaces the early warning signs — a lagging follower, a log disk filling up, latency creeping past safe thresholds — and ties them back to the tables and operations they affect, so you can fix a coordination bottleneck before it turns into a cluster-wide outage. Pulse is run by ClickHouse experts who can advise on Keeper topology, sizing, and ZooKeeper migration.
Frequently Asked Questions
Q: Should I run Keeper embedded in clickhouse-server or as a standalone process?
Standalone on dedicated nodes for production. Embedded mode is convenient for dev and small setups, but co-locating coordination with query and merge workloads risks latency spikes that slow inserts and DDL across the whole cluster.
Q: How many Keeper nodes do I need?
Three for almost every cluster — it tolerates one node failure and keeps commit latency low. Use five only if you must survive two simultaneous failures. Always an odd number, and never more than three voting nodes unless you have a concrete reason; larger ensembles slow down writes.
Q: Can I mix ZooKeeper and ClickHouse Keeper in the same cluster?
No. A given ClickHouse cluster uses one or the other. Their on-disk formats differ, so switching requires converting state with clickhouse-keeper-converter, not copying files.
Q: Which port does Keeper use?
The client (ZooKeeper-protocol) port is conventionally 9181 in ClickHouse deployments, set via tcp_port. Inter-node Raft traffic uses a separate port (commonly 9234) defined inside each <server> entry of raft_configuration.
Q: How do I check whether ClickHouse is actually connected to Keeper?
Run SELECT * FROM system.zookeeper WHERE path = '/' on a ClickHouse server — a result means the connection works. system.zookeeper_connection shows which node each server is connected to. On the Keeper side, echo mntr | nc <host> 9181 reports leader/follower state and latency.
Q: Is it safe to delete old files in the Keeper log or snapshot directory to free space?
No. Keeper manages snapshot and log retention itself based on snapshot_distance. Manually deleting files can corrupt a node's state and trigger Raft errors. If a node is corrupt, stop it, clear its coordination directory entirely, and let it re-sync a fresh snapshot from the current leader.