ClickHouse ZooKeeper Schema: What Data Lives in Keeper

The ClickHouse ZooKeeper schema is the tree of znodes that a ReplicatedMergeTree table creates and maintains in ZooKeeper (or ClickHouse Keeper) to coordinate replication. ZooKeeper does not store table data itself; it stores only coordination metadata: the replication log, the set of parts each replica holds, block numbers for deduplication, pending mutations, and per-replica state. This page is a structured reference of that znode tree and how to inspect it.

The Table Root Path

Every replicated table lives under a root path defined by its engine parameters, conventionally prefixed with /clickhouse/tables/. The path and replica name are usually parameterized with macros:

CREATE TABLE events ON CLUSTER my_cluster
(
    event_date Date,
    event_id   UInt64
)
ENGINE = ReplicatedMergeTree(
    '/clickhouse/tables/{shard}/{database}/{table}',
    '{replica}'
)
PARTITION BY toYYYYMM(event_date)
ORDER BY event_id;

The first argument is the zookeeper_path (shared by all replicas of the same shard); the second is the unique replica_name. The built-in {database} and {table} macros expand automatically, while {shard} and {replica} come from the server's macros configuration. Tables on different shards must use different paths. For scoping the root and configuring connectivity, see the ZooKeeper configuration guide.
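
To confirm how the macros resolved on a given server, the following sketch reads the configured substitutions and then the path the table actually registered with. It assumes the events table from the example above lives in the default database; adjust both names for your environment.

```sql
-- Macros defined in this server's configuration (for example shard and replica)
SELECT macro, substitution
FROM system.macros;

-- The expanded Keeper path and replica name this table is registered under
SELECT database, table, zookeeper_path, replica_name, replica_path
FROM system.replicas
WHERE database = 'default' AND table = 'events';
```

Running the second query on every replica of a shard should return the same zookeeper_path and a different replica_name on each server.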

Table-Level Znodes

These znodes sit directly under the table root path and describe the reference state that all replicas converge toward:

Znode | Contents
metadata | Reference table schema: engine, partition key, sorting key, sampling expression, index granularity. Replicas compare their local schema against this.
columns | The reference column list (names and types). Replicas reconcile their columns to this state.
log | The replication log: an ordered, append-only sequence of log-NNNNNNNNNN entries describing actions such as GET_PART, MERGE_PARTS, MUTATE_PART, and DROP_RANGE.
replicas | Parent node listing every registered replica of the table.
block_numbers | Per-partition sequential block number allocation, used to order parts consistently across replicas.
blocks | Recently inserted block hashes used for insert deduplication.
mutations | Queue of ALTER ... UPDATE/DELETE mutations applied to the table.
quorum | State for quorum (insert_quorum) inserts, including the last_part and failed_parts tracking.
leader_election | Ephemeral nodes used to elect the leader replica, which is responsible for scheduling merges and mutations. Newer ClickHouse versions allow more than one replica to be leader at a time.
alter_partition_version | Version counter coordinating concurrent partition-level ALTER operations.

The log and block_numbers znodes are the highest-traffic parts of the schema and the most common source of Keeper coordination bottlenecks on busy clusters.
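
One rough way to see which table-level znodes are largest is to sort the children of the table root by their child count. This is only a sketch; it reuses the example path /clickhouse/tables/01/default/events from this page, which you would replace with your own root path.

```sql
-- numChildren counts direct children only, so 'log' shows the number of
-- retained log entries and 'block_numbers' shows the number of partitions.
SELECT name, numChildren, dataLength, mtime
FROM system.zookeeper
WHERE path = '/clickhouse/tables/01/default/events'
ORDER BY numChildren DESC;
```

Because the counts are not recursive, a znode with few direct children can still hold a large subtree; query the child path directly to look one level deeper.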

Per-Replica Znodes

Under replicas/<replica_name>/, each replica maintains its own state so other replicas know what it holds and how far it has progressed:

Znode | Contents
is_active | Ephemeral node present only while the replica is connected. Its absence signals the replica is down.
host | Connection details (host, port) used by other replicas to fetch parts.
log_pointer | The last log entry this replica has copied into its local replication queue.
queue | The replica's pending tasks copied from the shared log.
parts | The set of data parts this replica currently has.
columns / metadata | The replica's own view of its schema, compared against the table-level reference.
metadata_version | The schema version the replica is currently on.
mutation_pointer | The last mutation this replica has executed.
min_unprocessed_insert_time / max_processed_insert_time | Timestamps used to compute replication lag.
is_lost | Set when a replica falls too far behind and must re-sync from scratch.

A replica becomes "lost" when the total number of records in the shared log exceeds max_replicated_logs_to_keep while it is inactive, at which point ClickHouse trims the log and the stale replica must recover its part set.
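
The checks below are a minimal sketch for seeing how close a replica is to that threshold. They assume the events table in the default database and the example path and replica name (replica1) used elsewhere on this page.

```sql
-- The server-wide default for the trimming threshold
-- (a table-level SETTINGS override would not appear here)
SELECT name, value
FROM system.merge_tree_settings
WHERE name = 'max_replicated_logs_to_keep';

-- How far the local replica's pointer trails the newest shared log entry
SELECT database, table, log_max_index, log_pointer,
       log_max_index - log_pointer AS entries_behind,
       queue_size, absolute_delay
FROM system.replicas
WHERE database = 'default' AND table = 'events';

-- The lost flag and log pointer stored in Keeper for one specific replica
SELECT name, value
FROM system.zookeeper
WHERE path = '/clickhouse/tables/01/default/events/replicas/replica1'
  AND name IN ('is_lost', 'log_pointer');
```

system.replicas describes only the replica on the server you are connected to, so the Keeper query is the one to use for a replica that is offline.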

Inspecting the Schema with system.zookeeper

The system.zookeeper virtual table exposes the live znode tree. A WHERE path = ... (or path IN (...)) clause is mandatory, because each query performs a real read against the coordination layer:

-- List the table-level znodes
SELECT name, value, ctime, mtime
FROM system.zookeeper
WHERE path = '/clickhouse/tables/01/default/events';

-- Inspect one replica's state
SELECT name, value
FROM system.zookeeper
WHERE path = '/clickhouse/tables/01/default/events/replicas/replica1';

-- Peek at the tail of the replication log
SELECT name, value
FROM system.zookeeper
WHERE path = '/clickhouse/tables/01/default/events/log'
ORDER BY name DESC
LIMIT 10;

You can resolve the root path dynamically from system.replicas.zookeeper_path. For a ready-made consistency check across replicas (comparing each replica's metadata, columns, and is_active), see check table metadata in ZooKeeper.
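
A sketch of that dynamic lookup is shown here. It assumes your ClickHouse version lets system.zookeeper take a subquery inside path IN (...); if the server rejects it, run the inner query first and paste the resulting path as a literal.

```sql
-- Look up the table root in system.replicas instead of hard-coding it
SELECT path, name, numChildren
FROM system.zookeeper
WHERE path IN (
    SELECT zookeeper_path
    FROM system.replicas
    WHERE database = 'default' AND table = 'events'
);
```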

Best Practices

  1. Always parameterize zookeeper_path with macros so paths stay unique per shard and portable across clusters.
  2. Never edit znodes manually with zkCli or keeper-client on a live table; let ClickHouse manage the schema and use SQL (SYSTEM commands, ALTER) instead.
  3. Keep the log short on high-ingest tables by batching inserts into fewer, larger blocks and tuning merges, so that log entries and per-partition znodes do not accumulate.
  4. Filter system.zookeeper queries by path and avoid recursive scans of the whole tree in monitoring jobs.
  5. Treat the znode tree as the source of truth for replication state, but remember it never holds your actual data, only coordination metadata.

Common Issues

  1. Orphaned znodes after DROP TABLE when a replica was unreachable, leaving stale paths under the root; these block recreating the table at the same path.
  2. Bloated block_numbers from many small partitions, inflating Keeper memory and snapshot size. See removing stale block numbers.
  3. Node exists errors when two replicas race to create the same znode (see ZooKeeper node exists).
  4. Replicas marked lost because the log was trimmed past their log_pointer while inactive.
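
For the orphaned-znode case, ClickHouse provides a SQL command that removes a dead replica's metadata from Keeper, which is generally preferable to deleting znodes by hand. The sketch below assumes the example table path from this page; replica2 is a hypothetical name for a replica that no longer exists.

```sql
-- 1. See which replicas are still registered under the table root
SELECT name, ctime
FROM system.zookeeper
WHERE path = '/clickhouse/tables/01/default/events/replicas';

-- 2. Remove the Keeper metadata of a replica that is gone for good.
--    'replica2' is hypothetical; confirm the name from step 1 first.
SYSTEM DROP REPLICA 'replica2' FROM ZKPATH '/clickhouse/tables/01/default/events';
```

Run the second statement only after confirming the replica will never come back, since its registration and queue are removed from Keeper.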

Understanding the znode layout is essential for diagnosing replication problems, but reading raw system.zookeeper output across many tables is tedious and easy to get wrong. Pulse continuously monitors ClickHouse and Keeper, tracking replication log growth, replica lag, lost replicas, and Keeper znode counts, and surfaces actionable recommendations before coordination metadata becomes a bottleneck. This gives teams a clear view of replication health without hand-writing system.zookeeper queries for every table.

Frequently Asked Questions

Q: Does ClickHouse store table data in ZooKeeper?
A: No. ZooKeeper/Keeper stores only coordination metadata: the replication log, block numbers, per-replica part sets, mutations, and schema references. Actual parts are transferred between replicas over the network, not through Keeper.

Q: What is the difference between the table-level metadata znode and a replica's metadata znode?
A: The table-level metadata is the reference schema all replicas converge to; each replica's metadata node reflects its own current schema, which ClickHouse compares against the reference to detect drift.

Q: Why must I include a path filter when querying system.zookeeper?
A: Every system.zookeeper query issues real requests to the coordination layer. Without a path predicate the query would have to walk the entire tree, which is expensive and is therefore disallowed.

Q: What lives under the log znode and why does it grow?
A: log holds ordered log-NNNNNNNNNN entries describing replicated actions. It grows with insert, merge, and mutation activity, and is trimmed when the total record count exceeds max_replicated_logs_to_keep; replicas that are inactive when the log is trimmed past their pointer are marked lost.

Q: How are inserts deduplicated using the schema?
A: Each inserted block's hash is recorded under blocks. If an identical block is inserted again while its hash is still among the most recent replicated_deduplication_window entries, ClickHouse sees the existing znode and skips the duplicate write.
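
As a rough illustration, the queries below compare the number of hashes currently retained for one table with the configured window. The path is the example one used on this page, and the settings shown are server defaults rather than any per-table override.

```sql
-- Number of block hashes currently kept for deduplication
SELECT count() AS retained_block_hashes
FROM system.zookeeper
WHERE path = '/clickhouse/tables/01/default/events/blocks';

-- Window size (in blocks) and the related time-based limit
SELECT name, value
FROM system.merge_tree_settings
WHERE name LIKE 'replicated_deduplication_window%';
```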

Q: How can I tell if a replica is currently online from the schema?
A: The replica's is_active znode is ephemeral and exists only while the replica holds a live session. If it is missing, the replica is disconnected.
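
A minimal sketch of that check, assuming the example path and the replica name replica1 from earlier on this page:

```sql
-- Returns one row while the replica holds a live session; an empty
-- result means the ephemeral node is gone and the replica is disconnected.
SELECT name, ephemeralOwner, ctime
FROM system.zookeeper
WHERE path = '/clickhouse/tables/01/default/events/replicas/replica1'
  AND name = 'is_active';

-- The same information summarized from the local replica's point of view
SELECT database, table, total_replicas, active_replicas
FROM system.replicas
WHERE database = 'default' AND table = 'events';
```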
