ClickHouse: Check Replicated Table Metadata in ZooKeeper

Q: Why is the metadata znode used as the source of truth, not the local table file?

For ReplicatedMergeTree , ZooKeeper holds the canonical schema. Replicas pull DDL from /clickhouse/tables/.../metadata and /clickhouse/tables/.../columns . Any divergence between the local .sql file and ZooKeeper is a bug that needs investigation.

Q: What do I do when `looks_good = 0`?

Identify which replica is the outlier by inspecting the path column. Compare its value with the other replicas. Decide which schema is correct, then either issue the missing DDL on the lagging replica or run SYSTEM RESTART REPLICA to force re-sync. Take a backup before any corrective action.

Q: Can I run this query across a cluster in one shot?

Yes. Wrap system.zookeeper and system.replicas in clusterAllReplicas('{cluster}', ...) so the path discovery happens on every node. Add hostName() to the projection to identify the source.

Q: Does this work for non-replicated MergeTree tables?

No. Plain MergeTree tables do not have ZooKeeper paths. The system.replicas lookup returns no rows and the query is empty. Use system.tables and system.columns instead.

Replicated tables in ClickHouse keep their authoritative schema and column list in ZooKeeper. When DDL is applied unevenly, when a replica is restored from a backup, or when migrations partially fail, the metadata and columns znodes can drift between replicas. The query in this article walks every replica of a given table and flags any node whose value differs from its peers, so you can confirm a cluster is consistent before running maintenance or troubleshooting replication lag. Run it from any node that can read system.zookeeper and system.replicas.

Comparing metadata across replicas

The query below reads three znodes per replica: metadata (the DDL hash), columns (the column list), and is_active (a per-replica liveness marker). It then uses lagInFrame inside an ordered window to compare each row to the previous one and computes a looks_good flag. Rows where looks_good = 0 indicate drift.

SELECT
    *,
    if(
        prev_name = name AND name != 'is_active',
        prev_value = value,
        1
    ) AS looks_good
FROM (
    SELECT
        name,
        path,
        ctime,
        mtime,
        value,
        lagInFrame(name)  OVER w AS prev_name,
        lagInFrame(value) OVER w AS prev_value
    FROM system.zookeeper
    WHERE (path IN (
        SELECT arrayJoin(groupUniqArray(if(path LIKE '%/replicas', concat(path, '/', name), path)))
        FROM system.zookeeper
        WHERE path IN (
            SELECT arrayJoin([zookeeper_path, concat(zookeeper_path, '/replicas')])
            FROM system.replicas
            WHERE table = 'test_repl'
        )
    )) AND (name IN ('metadata', 'columns', 'is_active'))
    WINDOW w AS (ORDER BY name = 'is_active', name ASC, path ASC
                 ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)
)

Replace 'test_repl' with the table name you want to inspect. The result groups rows by node type, so all columns values appear together, then all metadata values, then is_active. The is_active znode is skipped from the comparison because it differs per replica by design.

Reading the results

Column	Meaning
`name`	znode name (`metadata`, `columns`, `is_active`)
`path`	Full ZooKeeper path of the parent (the replica directory)
`value`	Contents of the znode
`ctime` / `mtime`	ZooKeeper create and modify timestamps
`looks_good`	`1` if value matches the previous replica, `0` if drift

If you see looks_good = 0 on any metadata or columns row, the cluster has schema drift. The most common causes are an ALTER TABLE that was not propagated, a manual restore that ran out of order, or a system restart replica performed while DDL was in flight.

Inspecting the local table metadata

To cross-check ZooKeeper against what the server itself believes, read system.tables:

SELECT metadata_modification_time, create_table_query
FROM system.tables
WHERE name = 'test_repl'

metadata_modification_time is the timestamp of the last DDL applied locally. create_table_query is the full CREATE TABLE statement reconstructed from the local schema. If two replicas return different metadata_modification_time values for the same table, the lagging replica has not yet applied a DDL change.

When to run these queries

Run the ZooKeeper comparison query in these scenarios:

After a large DDL change such as ALTER TABLE ... ADD COLUMN, MODIFY COLUMN, or MATERIALIZE COLUMN.
After restoring a replica from backup, or after detaching and re-attaching a replica.
When system.replication_queue shows persistent errors mentioning metadata or columns.
Before running SYSTEM RESTORE REPLICA or SYSTEM SYNC REPLICA.
As part of routine cluster health checks.

Common Pitfalls

The query is scoped to a single table. To audit a whole database, wrap it in a script that iterates system.replicas.
system.zookeeper requires a path predicate. The nested subqueries provide that. Do not strip them or the query will fail with Missing path.
A metadata mismatch is not always a bug. If a DDL was issued seconds before, the laggy replica may simply not have applied it yet. Re-run the query after a short pause.
is_active is excluded from drift detection because it is supposed to vary per replica. Do not remove that check.
Reading large columns values from ZooKeeper through system.zookeeper consumes memory on the coordinator node. For very wide tables, run the query on an idle replica.

Frequently Asked Questions

Q: Why is the metadata znode used as the source of truth, not the local table file? A: For ReplicatedMergeTree, ZooKeeper holds the canonical schema. Replicas pull DDL from /clickhouse/tables/.../metadata and /clickhouse/tables/.../columns. Any divergence between the local .sql file and ZooKeeper is a bug that needs investigation.

Q: What do I do when looks_good = 0? A: Identify which replica is the outlier by inspecting the path column. Compare its value with the other replicas. Decide which schema is correct, then either issue the missing DDL on the lagging replica or run SYSTEM RESTART REPLICA to force re-sync. Take a backup before any corrective action.

Q: Can I run this query across a cluster in one shot? A: Yes. Wrap system.zookeeper and system.replicas in clusterAllReplicas('{cluster}', ...) so the path discovery happens on every node. Add hostName() to the projection to identify the source.

Q: Does this work for non-replicated MergeTree tables? A: No. Plain MergeTree tables do not have ZooKeeper paths. The system.replicas lookup returns no rows and the query is empty. Use system.tables and system.columns instead.

Q: Why does my query return more rows than expected? A: The path expression matches both the shared table path and each replica subpath under /replicas. The query intentionally walks each replica directory so it can detect drift. The row count equals (2 znodes shared + 2 znodes per replica) * replicas.