The ClickHouse ZooKeeper schema is the tree of znodes that a ReplicatedMergeTree table creates and maintains in ZooKeeper (or ClickHouse Keeper) to coordinate replication. ZooKeeper does not store table data itself; it stores only coordination metadata: the replication log, the set of parts each replica holds, block numbers for deduplication, pending mutations, and per-replica state. This page is a structured reference of that znode tree and how to inspect it.
The Table Root Path
Every replicated table lives under a root path defined by its engine parameters, conventionally prefixed with /clickhouse/tables/. The path and replica name are usually parameterized with macros:
CREATE TABLE events ON CLUSTER my_cluster
(
    event_date Date,
    event_id UInt64
)
ENGINE = ReplicatedMergeTree(
    '/clickhouse/tables/{shard}/{database}/{table}',
    '{replica}'
)
PARTITION BY toYYYYMM(event_date)
ORDER BY event_id;
The first argument is the zookeeper_path (shared by all replicas of the same shard); the second is the unique replica_name. The built-in {database} and {table} macros expand automatically, while {shard} and {replica} come from the server's macros configuration. Tables on different shards must use different paths. For scoping the root and configuring connectivity, see the ZooKeeper configuration guide.
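To make the path rules concrete, here is a minimal Python sketch of macro substitution. It is a simplified model, not ClickHouse's own expansion code: the `expand_macros` helper and the per-server macro values are hypothetical, chosen only to show that replicas of one shard resolve to the same zookeeper_path while another shard resolves to a different one.

```python
import re

def expand_macros(template: str, macros: dict) -> str:
    """Replace each {name} placeholder with its value; fail loudly on unknown names."""
    def substitute(match):
        name = match.group(1)
        if name not in macros:
            raise KeyError(f"no value for macro {{{name}}}")
        return macros[name]
    return re.sub(r"\{(\w+)\}", substitute, template)

PATH_TEMPLATE = "/clickhouse/tables/{shard}/{database}/{table}"

# Hypothetical macro values: two replicas of shard 01 and one replica of shard 02.
servers = [
    {"shard": "01", "replica": "replica1", "database": "default", "table": "events"},
    {"shard": "01", "replica": "replica2", "database": "default", "table": "events"},
    {"shard": "02", "replica": "replica1", "database": "default", "table": "events"},
]

paths = [expand_macros(PATH_TEMPLATE, m) for m in servers]
replica_names = [expand_macros("{replica}", m) for m in servers]

for path, name in zip(paths, replica_names):
    print(path, name)
```

The first two servers share a path and differ only in replica name, which is what lets them find each other under replicas/; the third server lands under a separate root.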
Table-Level Znodes
These znodes sit directly under the table root path and describe the reference state that all replicas converge toward:
| Znode | Contents |
|---|---|
| metadata | Reference table schema: engine, partition key, sorting key, sampling expression, index granularity. Replicas compare their local schema against this. |
| columns | The reference column list (names and types). Replicas reconcile their columns to this state. |
| log | The replication log: an ordered, append-only sequence of log-NNNNNNNNNN entries describing actions such as GET_PART, MERGE_PARTS, MUTATE_PART, and DROP_RANGE. |
| replicas | Parent node listing every registered replica of the table. |
| block_numbers | Per-partition sequential block number allocation, used to order parts consistently across replicas. |
| blocks | Recently inserted block hashes used for insert deduplication. |
| mutations | Queue of ALTER ... UPDATE/DELETE mutations applied to the table. |
| quorum | State for quorum (insert_quorum) inserts, including the last_part and failed_parts tracking. |
| leader_election | Ephemeral nodes used for leader election. Leaders schedule merges and mutations; older releases elected a single leader, while current releases allow several replicas to act as leaders concurrently. |
| alter_partition_version | Version counter coordinating concurrent partition-level ALTER operations. |
The log and block_numbers znodes are the highest-traffic parts of the schema and the most common source of Keeper coordination bottlenecks on busy clusters.
Per-Replica Znodes
Under replicas/<replica_name>/, each replica maintains its own state so other replicas know what it holds and how far it has progressed:
| Znode | Contents |
|---|---|
| is_active | Ephemeral node present only while the replica is connected. Its absence signals the replica is down. |
| host | Connection details (host, port) used by other replicas to fetch parts. |
| log_pointer | The replica's position in the shared log: the index of the highest entry it has copied into its local replication queue, plus one. |
| queue | The replica's pending tasks copied from the shared log. |
| parts | The set of data parts this replica currently has. |
| columns / metadata | The replica's own view of its schema, compared against the table-level reference. |
| metadata_version | The schema version the replica is currently on. |
| mutation_pointer | The last mutation this replica has executed. |
| min_unprocessed_insert_time / max_processed_insert_time | Timestamps used to compute replication lag. |
| is_lost | Set when a replica falls too far behind and must re-sync from scratch. |
A replica becomes "lost" when the total number of records in the shared log exceeds max_replicated_logs_to_keep while it is inactive, at which point ClickHouse trims the log and the stale replica must recover its part set.
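The relationship between the shared log and a replica's log_pointer can be sketched in a few lines of Python. This is a simplified model, not ClickHouse's cleanup logic: it assumes log_pointer holds the index of the next entry the replica will copy, and the helper names (`log_index`, `entries_behind`, `trimmed_past`) are hypothetical. The only format detail it relies on is that sequential znodes carry a zero-padded 10-digit counter suffix.

```python
def log_index(name: str) -> int:
    """Parse the numeric suffix of a sequential znode name such as 'log-0000000042'."""
    prefix, _, digits = name.rpartition("-")
    if prefix != "log" or not digits.isdigit():
        raise ValueError(f"not a log entry name: {name!r}")
    return int(digits)

def entries_behind(log_names, log_pointer: int) -> int:
    """Count entries still in the shared log that the replica has not copied yet."""
    return sum(1 for name in log_names if log_index(name) >= log_pointer)

def trimmed_past(log_names, log_pointer: int) -> bool:
    """True if the oldest surviving entry is newer than the one the replica needs next."""
    if not log_names:
        return False
    return min(log_index(name) for name in log_names) > log_pointer

# Hypothetical snapshot: the log currently holds entries 120 through 129.
log_names = [f"log-{i:010d}" for i in range(120, 130)]

print(entries_behind(log_names, 125))  # replica still needs 125..129
print(trimmed_past(log_names, 125))    # its next entry is still present
print(trimmed_past(log_names, 100))    # entries 100..119 are gone
```

In this model a replica whose pointer falls below the oldest surviving entry cannot replay the gap from the log, which is the situation that forces it to recover its part set instead.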
Inspecting the Schema with system.zookeeper
The system.zookeeper virtual table exposes the live znode tree. A WHERE path = ... (or path IN (...)) clause is mandatory, because each query performs a real read against the coordination layer:
-- List the table-level znodes
SELECT name, value, ctime, mtime
FROM system.zookeeper
WHERE path = '/clickhouse/tables/01/default/events';

-- Inspect one replica's state
SELECT name, value
FROM system.zookeeper
WHERE path = '/clickhouse/tables/01/default/events/replicas/replica1';

-- Peek at the tail of the replication log
SELECT name, value
FROM system.zookeeper
WHERE path = '/clickhouse/tables/01/default/events/log'
ORDER BY name DESC
LIMIT 10;
You can resolve the root path dynamically from system.replicas.zookeeper_path. For a ready-made consistency check across replicas (comparing each replica's metadata, columns, and is_active), see check table metadata in ZooKeeper.
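When a monitoring script builds these queries for many tables, it helps to compose the znode path and the SQL text in one place. The Python sketch below is one possible way to do that; `znode_path`, `quote_sql_string`, and `zookeeper_query` are hypothetical helpers, and the root path is assumed to have been read from system.replicas.zookeeper_path beforehand. It only produces query text and does not connect to a server.

```python
def znode_path(root: str, *parts: str) -> str:
    """Join a table root path and child znode names into one absolute path."""
    segments = [root.rstrip("/")] + [part.strip("/") for part in parts]
    return "/".join(segments)

def quote_sql_string(value: str) -> str:
    """Quote a value as a single-quoted SQL string literal with backslash escapes."""
    escaped = value.replace("\\", "\\\\").replace("'", "\\'")
    return "'" + escaped + "'"

def zookeeper_query(path: str, columns=("name", "value")) -> str:
    """Build a system.zookeeper query that always carries the required path filter."""
    column_list = ", ".join(columns)
    return (
        f"SELECT {column_list} FROM system.zookeeper "
        f"WHERE path = {quote_sql_string(path)}"
    )

# Assumed to come from system.replicas.zookeeper_path for the table of interest.
root = "/clickhouse/tables/01/default/events"

replica_path = znode_path(root, "replicas", "replica1")
replica_query = zookeeper_query(replica_path)
print(replica_query)
```

Because every query goes through `zookeeper_query`, the path predicate can never be left out by accident.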
Best Practices
- Always parameterize zookeeper_path with macros so paths stay unique per shard and portable across clusters.
- Never edit znodes manually with zkCli or keeper-client on a live table; let ClickHouse manage the schema and use SQL (SYSTEM commands, ALTER) instead.
- Keep the log short on high-ingest tables by tuning merge and insert batching rather than letting per-partition znodes accumulate.
- Filter system.zookeeper queries by path and avoid recursive scans of the whole tree in monitoring jobs.
- Treat the znode tree as the source of truth for replication state, but remember it never holds your actual data, only coordination metadata.
Common Issues
- Orphaned znodes after DROP TABLE when a replica was unreachable, leaving stale paths under the root; these block recreating the table at the same path.
- Bloated block_numbers from many small partitions, inflating Keeper memory and snapshot size. See removing stale block numbers.
- Node exists errors when two replicas race to create the same znode (see ZooKeeper node exists).
- Replicas marked lost because the log was trimmed past their log_pointer while inactive.
Understanding the znode layout is essential for diagnosing replication problems, but reading raw system.zookeeper output across many tables is tedious and easy to get wrong. Pulse continuously monitors ClickHouse and Keeper, tracking replication log growth, replica lag, lost replicas, and Keeper znode counts, and surfaces actionable recommendations before coordination metadata becomes a bottleneck. This gives teams a clear view of replication health without hand-writing system.zookeeper queries for every table.
Frequently Asked Questions
Q: Does ClickHouse store table data in ZooKeeper?
A: No. ZooKeeper/Keeper stores only coordination metadata: the replication log, block numbers, per-replica part sets, mutations, and schema references. Actual parts are transferred between replicas over the network, not through Keeper.
Q: What is the difference between the table-level metadata znode and a replica's metadata znode?
A: The table-level metadata is the reference schema all replicas converge to; each replica's metadata node reflects its own current schema, which ClickHouse compares against the reference to detect drift.
Q: Why must I include a path filter when querying system.zookeeper?
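A: Each row read triggers a real request to the coordination layer. Without a path predicate the query would attempt to walk the entire tree, which is expensive and is therefore disallowed by default (the allow_unrestricted_reads_from_keeper setting lifts the restriction, at the cost of that extra load on Keeper).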
Q: What lives under the log znode and why does it grow?
A: log holds ordered log-NNNNNNNNNN entries describing replicated actions. It grows with insert, merge, and mutation activity, and is trimmed when the total record count exceeds max_replicated_logs_to_keep; replicas that are inactive when the log is trimmed past their pointer are marked lost.
Q: How are inserts deduplicated using the schema?
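A: Each inserted block's hash is recorded under blocks. If the same block is inserted again while its hash is still retained there (the most recent replicated_deduplication_window blocks, further bounded by replicated_deduplication_window_seconds), ClickHouse sees the existing znode and skips the duplicate write.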
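The windowing behaviour can be illustrated with a small in-memory model. This Python sketch is an assumption-laden simplification: the `DedupWindow` class is hypothetical, SHA-256 stands in for whatever hash ClickHouse computes over the block, eviction happens immediately rather than in a background cleanup, and the time-based bound is ignored.

```python
import hashlib
from collections import OrderedDict

class DedupWindow:
    """Remember the most recent `window` block ids, like the children of blocks."""

    def __init__(self, window: int):
        self.window = window
        self._blocks = OrderedDict()  # block id -> None, oldest first

    def insert(self, partition_id: str, data: bytes) -> bool:
        """Return True if the block is written, False if skipped as a duplicate."""
        # Stand-in block id: partition plus a truncated content hash (an assumption).
        block_id = f"{partition_id}_{hashlib.sha256(data).hexdigest()[:16]}"
        if block_id in self._blocks:
            return False
        self._blocks[block_id] = None
        # Evict the oldest ids once the window is exceeded.
        while len(self._blocks) > self.window:
            self._blocks.popitem(last=False)
        return True

window = DedupWindow(window=2)
results = [
    window.insert("202401", b"batch-a"),  # first write
    window.insert("202401", b"batch-a"),  # retry of the same block
    window.insert("202401", b"batch-b"),
    window.insert("202401", b"batch-c"),  # pushes batch-a out of the window
    window.insert("202401", b"batch-a"),  # no longer remembered
]
print(results)
```

The last call shows the limit of the mechanism: once a block id has aged out of the window, a late retry is written again as new data.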
Q: How can I tell if a replica is currently online from the schema?
A: The replica's is_active znode is ephemeral and exists only while the replica holds a live session. If it is missing, the replica is disconnected.