Recovering from Complete ZooKeeper/Keeper Metadata Loss

ZooKeeper (or ClickHouse Keeper) holds the reference state for every ReplicatedMergeTree table: the table schema, the list of parts each replica owns, the block-number counters used for deduplication, the replication queue, and the DDL queue. When that coordination state is lost entirely — a wiped ensemble, a deleted /clickhouse root, or an empty fresh Keeper — your replicated tables drop into read-only mode. Data is still queryable, but inserts, merges, and DDL stop working.

This guide covers full coordination-state loss, which is different from the scenario in the disaster recovery guide (where ZooKeeper paths survive and only local data parts are gone). Here the data on disk is usually fine and the metadata is what must be rebuilt. The modern path is SYSTEM RESTORE REPLICA; the manual recreate-and-attach procedure remains the fallback for older versions or partial failures.

How to Recognize Complete Metadata Loss

Symptoms of lost coordination state include:

  • Replicated tables refuse inserts with errors about being in read-only mode.
  • system.replicas shows is_readonly = 1 for affected tables.
  • The Keeper/ZooKeeper paths under your table's znode root (e.g. /clickhouse/tables/...) are missing or empty.
  • Server logs report that the table's metadata node does not exist in ZooKeeper.

SELECT database, table, is_readonly, zookeeper_path, replica_name
FROM system.replicas
WHERE is_readonly = 1;

If you can still see the znodes but they look inconsistent, you may have a partial problem (orphaned znodes, stale block numbers, a stuck DDL or replication queue) rather than total loss. Those cases are handled by narrower procedures: see the guides on ZooKeeper/Keeper coordination bottlenecks, removing stale block numbers, and the replication queue. Before acting, use the guide on checking table metadata in ZooKeeper to confirm what is actually present.
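One quick way to see what is present is to list the children of the table's znode through the system.zookeeper table. This is a minimal sketch: the path below is a hypothetical example, so substitute the zookeeper_path value that system.replicas reports for your table.

```sql
-- List the child znodes under one table's root path.
-- '/clickhouse/tables/01/events' is a hypothetical path; use your table's zookeeper_path.
SELECT name, numChildren, mtime
FROM system.zookeeper
WHERE path = '/clickhouse/tables/01/events'
ORDER BY name;
```

On an intact table this typically returns children such as metadata, columns, replicas, log, and blocks. If the path itself is gone, expect the query to fail with a "no node" style error instead of returning rows, which points to total loss rather than a partial problem.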

Why You Cannot Simply Restore a ZooKeeper Backup

ZooKeeper/Keeper stores the live state of a distributed system, not durable data. Which replica holds which part, which merges are pending, which mutations are in flight — all of this changes constantly. A snapshot taken even seconds ago is already inconsistent with the actual parts on disk.

Because of this, the practical recovery strategy is not "restore the ZK snapshot." Instead you keep the ClickHouse data files safe (they are the source of truth) and rebuild the coordination state from the data on disk. That is precisely what SYSTEM RESTORE REPLICA does. The same logic applies to backups: back up ClickHouse data, not Keeper, and reconstruct metadata afterward. See ClickHouse backup for data-backup strategy.

Method 1: SYSTEM RESTORE REPLICA (ClickHouse 21.7+)

Since ClickHouse 21.7, SYSTEM RESTORE REPLICA rebuilds a table's ZooKeeper metadata from the parts present locally. It works only on read-only replicas — which is exactly the state a table enters after losing its metadata, so the precondition is satisfied automatically.

Syntax:

SYSTEM RESTORE REPLICA [db.]table_name [ON CLUSTER cluster_name];

Prerequisites

  1. The Keeper/ZooKeeper ensemble itself must be running and reachable again (even if its /clickhouse data is empty). SYSTEM RESTORE REPLICA writes metadata into Keeper; it cannot run against a dead ensemble.
  2. The table must still exist locally on the replica (its .sql metadata file under /var/lib/clickhouse/metadata/), and its data parts should be intact on disk.
  3. The table must be in read-only mode.

If the znode path is completely empty, create the parent path first or rely on ClickHouse to recreate the table-level node. In practice, restart the server (or run SYSTEM RESTART REPLICA) so the replica re-attaches and discovers the missing metadata, then run the restore.
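The first two prerequisites can be checked from a client session before running the restore. The queries below are a sketch that assumes the my_database.events table used in the procedure that follows; adjust the names to your own tables.

```sql
-- Prerequisite 1: Keeper is reachable.
-- Listing the root path only succeeds when the server has a working Keeper session.
SELECT name
FROM system.zookeeper
WHERE path = '/';

-- Prerequisite 2: the table exists locally and its parts are loaded.
-- my_database.events is the example table name, not a required one.
SELECT
    database,
    table,
    count() AS active_parts,
    sum(rows) AS total_rows,
    formatReadableSize(sum(bytes_on_disk)) AS size_on_disk
FROM system.parts
WHERE active AND database = 'my_database' AND table = 'events'
GROUP BY database, table;
```

An empty result from the second query means the replica has no active local parts for that table, so there is nothing for SYSTEM RESTORE REPLICA to rebuild from on this node.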

Procedure

-- 1. Confirm the table is read-only
SELECT database, table, is_readonly FROM system.replicas WHERE is_readonly = 1;

-- 2. Restore on a single replica
SYSTEM RESTORE REPLICA my_database.events;

-- 3. Or restore the whole cluster in one statement
SYSTEM RESTORE REPLICA my_database.events ON CLUSTER my_cluster;

What it does internally

  • It pushes the table's metadata (schema, replica node) back into ZooKeeper/Keeper.
  • It moves all local parts into the detached/ folder, clears internal state, then re-attaches the committed parts and registers them in Keeper.
  • Parts that were already present locally before the loss are not re-fetched over the network unless they are outdated, which makes recovery fast and low-bandwidth.
  • When run ON CLUSTER, only the first query for a given table's metadata succeeds in writing the reference schema; other replicas attach against it. This makes it safe to run in parallel.

After restore, verify:

SELECT database, table, is_readonly, active_replicas, total_replicas
FROM system.replicas
WHERE database = 'my_database';

is_readonly should now be 0, and inserts should succeed again.
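Beyond the read-only flag, it can help to compare what each replica holds and to look for parts left behind in detached/. The following is a sketch that assumes the my_cluster and my_database.events names from the procedure above.

```sql
-- Compare active part and row totals for the table on every replica.
SELECT
    hostName() AS host,
    count() AS active_parts,
    sum(rows) AS total_rows
FROM clusterAllReplicas('my_cluster', system.parts)
WHERE active AND database = 'my_database' AND table = 'events'
GROUP BY host
ORDER BY host;

-- List parts still sitting in detached/ on the local replica after the restore.
SELECT partition_id, name, reason
FROM system.detached_parts
WHERE database = 'my_database' AND table = 'events';
```

Row totals should converge between replicas of the same shard once replication catches up. Part counts can differ for a while because merges run independently, and replicas of different shards are not expected to match.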

What gets reset (and the caveats)

SYSTEM RESTORE REPLICA rebuilds state from local parts, so be aware of what is not carried over from the old (lost) metadata:

  • Block numbers / insert deduplication: the deduplication history is rebuilt fresh. Re-inserting previously inserted blocks may no longer be deduplicated, so avoid replaying old inserts blindly.
  • Replication queue: the in-flight queue is gone. Pending fetches/merges that existed only in the queue are lost; the cluster re-derives work from the restored part lists.
  • In-flight mutations and DDL: mutations or ALTERs that had not finished are not recovered. Re-issue them after confirming the schema is correct.
  • Schema source of truth: the schema is taken from the local table definition. If replicas disagreed before the loss, restore from the replica with the correct, most complete data.
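Because the schema is taken from the local definition, it is worth checking whether replicas agree before choosing where to restore first. A minimal sketch, assuming the my_cluster and my_database.events names used earlier:

```sql
-- Hash the stored CREATE statement on every replica and compare.
-- Identical hashes suggest the local definitions match.
SELECT
    hostName() AS host,
    cityHash64(create_table_query) AS ddl_hash,
    metadata_modification_time
FROM clusterAllReplicas('my_cluster', system.tables)
WHERE database = 'my_database' AND name = 'events'
ORDER BY host;
```

This comparison assumes the DDL uses macros such as {shard} and {replica} rather than literal per-node values. If replica names are hard-coded in the engine arguments, the hashes will differ even when the column definitions agree, and you would need to compare the statements by eye instead.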

Method 2: Manual Recreate-and-Attach (Older Versions / Fallback)

If you are on a version without SYSTEM RESTORE REPLICA, or the restore cannot run for a specific table, you can rebuild metadata manually by detaching the table, attaching it as a plain MergeTree, then attaching its partitions to a freshly created ReplicatedMergeTree table that registers clean znodes.

-- 1. Detach the broken replicated table
DETACH TABLE events;

Edit the table's metadata file (e.g. /var/lib/clickhouse/metadata/my_database/events.sql) and change the engine from ReplicatedMergeTree(...) to plain MergeTree(). Keep a backup of the original definition — you need the exact ZooKeeper path and replica macros for step 4.

-- 2. Attach as a non-replicated table (data parts stay in place)
ATTACH TABLE events;

-- 3. Rename it out of the way
RENAME TABLE events TO events_old;

-- 4. Recreate the replicated table with the ORIGINAL DDL.
--    This registers fresh, clean metadata in ZooKeeper/Keeper.
CREATE TABLE events
(
    event_date Date,
    user_id    UInt64,
    value      Float64
)
ENGINE = ReplicatedMergeTree('/clickhouse/tables/{shard}/events', '{replica}')
PARTITION BY toYYYYMM(event_date)
ORDER BY (event_date, user_id);

-- 5. Attach every partition from the old table to the new replicated one
--    (a hardlink copy; events_old keeps its data until step 6)
ALTER TABLE events ATTACH PARTITION 202401 FROM events_old;
ALTER TABLE events ATTACH PARTITION 202402 FROM events_old;
-- ... repeat per partition; script this for many partitions

-- 6. Drop the leftover table once all partitions are attached and verified
DROP TABLE events_old;

ATTACH PARTITION ... FROM copies parts via hardlinks within the same disk, so it is fast, and the data remains in the source table until you explicitly drop it (step 6 above). Do the recreate on one designated "master" replica first. The other replicas then fetch parts from it through normal replication, so do not attach the same partitions on every node: each attach is replicated, and repeating it would store the same data more than once.
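For tables with many partitions, the ATTACH statements in step 5 can be generated from system.parts instead of being typed by hand. This is a sketch that assumes both tables live in my_database; it uses the PARTITION ID form so the generated statements do not depend on how the partition key is written.

```sql
-- Generate one ATTACH statement per partition of the old table.
-- Review the output, then run it on the designated master replica only.
SELECT concat(
           'ALTER TABLE my_database.events ATTACH PARTITION ID \'',
           partition_id,
           '\' FROM my_database.events_old;'
       ) AS statement
FROM system.parts
WHERE active AND database = 'my_database' AND table = 'events_old'
GROUP BY partition_id
ORDER BY partition_id;

-- Before step 6, compare row totals between the old and the new table.
SELECT
    table,
    count() AS active_parts,
    sum(rows) AS total_rows
FROM system.parts
WHERE active AND database = 'my_database' AND table IN ('events', 'events_old')
GROUP BY table
ORDER BY table;
```

If no inserts have reached the new table yet, the two total_rows values should be equal. Part counts may differ once background merges start on the new table.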

For many tables, the archived Altinity clickhouse-zookeeper-recovery script partially automates this flow (hardlink backup, rebuild as non-replicated, recreate replicated, designate one master replica). It requires ClickHouse to be offline and does not support multi-disk setups — for modern versions, prefer SYSTEM RESTORE REPLICA.

SYSTEM RESTORE REPLICA vs. SYSTEM RESTART REPLICA

These two are easy to confuse but do opposite things:

| Aspect | SYSTEM RESTORE REPLICA | SYSTEM RESTART REPLICA |
|---|---|---|
| Source of truth | Local data parts on disk | ZooKeeper/Keeper |
| Use when | Keeper metadata is lost/empty, table is read-only | Keeper is intact; you want to re-sync the session and rebuild the local queue |
| Effect on Keeper | Writes table metadata + part list back into Keeper | Reads Keeper, re-queues missing tasks locally |
| Precondition | Table must be read-only | Table can be normal |
| Risk if misused | None for true loss; unnecessary churn otherwise | Will not help if Keeper itself is empty, because it trusts Keeper |

Rule of thumb: if Keeper lost the metadata, you need RESTORE. If Keeper is fine but the local replica drifted, you need RESTART.
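To gather the facts this rule of thumb depends on, a single query over system.replicas is usually enough. This is a sketch rather than a decision procedure, and the zookeeper_exception column is only available on newer releases, so drop it if your version reports an unknown column.

```sql
-- Show replicas that are read-only, have an expired session, or have queued work.
SELECT
    database,
    table,
    is_readonly,
    is_session_expired,
    queue_size,
    zookeeper_exception
FROM system.replicas
WHERE is_readonly OR is_session_expired OR queue_size > 0
ORDER BY database, table;
```

A read-only table whose znodes are missing is a candidate for RESTORE. A table whose znodes are present but whose session expired or whose queue is stuck is a candidate for RESTART.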

Common Issues

  • Running RESTORE before Keeper is back: SYSTEM RESTORE REPLICA writes to Keeper. If the ensemble is still down or the connection string is wrong, fix Keeper first. See the "cannot create new ZooKeeper session" guide and the ZooKeeper configuration guide.
  • Table not read-only: RESTORE refuses to run on a healthy table. If the table did not flip to read-only automatically, restart the replica so it detects the missing metadata.
  • Duplicate data after manual recovery: this happens when the same partitions are attached on multiple replicas instead of letting replication copy them. Recreate on one master replica only.
  • Inserts still rejected after restore: confirm is_readonly = 0 in system.replicas, and check that the ZooKeeper path in the table DDL matches what was recreated. A path mismatch leaves the table looking for metadata that does not exist. The guide on checking table metadata in ZooKeeper shows how to confirm this.
  • Stale leftovers from before the loss: orphaned znodes or stale block numbers from dropped tables can linger. Clean them up with the coordination bottlenecks and stale block numbers procedures.
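For the path-mismatch case, it helps to see how the macros in the DDL expand on each node and which paths the table is actually using. A minimal sketch, assuming the my_database.events example table:

```sql
-- Macro values this server substitutes into {shard}, {replica}, and similar placeholders.
SELECT macro, substitution
FROM system.macros;

-- The expanded Keeper paths the table is bound to on this replica.
SELECT database, table, zookeeper_path, replica_path
FROM system.replicas
WHERE database = 'my_database' AND table = 'events';
```

If the substitutions differ from what they were before the incident (for example after a config change or a host rebuild), the table will look for its metadata under a different path than the one that was restored.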

Best Practices

  1. Run a proper 3-node (or 5-node) Keeper/ZooKeeper ensemble. Quorum redundancy prevents most "complete loss" events in the first place. See ClickHouse Keeper.
  2. Back up ClickHouse data, not Keeper. Keeper state changes constantly, so any snapshot is out of step with the parts on disk almost immediately; rebuild metadata from the data with SYSTEM RESTORE REPLICA instead. See ClickHouse backup.
  3. Keep table DDL under version control. The manual procedure needs the exact original CREATE TABLE statement, including the ZooKeeper path and macros.
  4. Test recovery in staging. Practise SYSTEM RESTORE REPLICA on a non-production cluster so the read-only precondition and verification steps are familiar before a real incident.
  5. After recovery, audit deduplication. Block numbers reset; do not blindly replay historical inserts expecting them to be deduplicated.
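For practice 3, the stored definitions can be exported straight from the server and committed to version control. This is a sketch that selects every table whose engine name starts with "Replicated"; narrow the filter if you only need some databases.

```sql
-- Export the CREATE statement of every replicated table on this server.
SELECT database, name, create_table_query
FROM system.tables
WHERE engine LIKE 'Replicated%'
ORDER BY database, name
FORMAT Vertical;
```

The create_table_query column includes the engine arguments, so the ZooKeeper path and replica macros needed by the manual procedure are captured along with the columns.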

How Pulse Helps

Recovering from complete coordination-state loss is high-stakes: a wrong ATTACH PARTITION or a RESTORE against a half-configured Keeper can duplicate data or extend an outage. Pulse continuously monitors ClickHouse and Keeper/ZooKeeper health — read-only replicas, missing znodes, quorum loss, and replication-queue stalls — and surfaces them before they cascade into a full outage. When something does go wrong, Pulse gives you the diagnostic context (which replicas are read-only, which znode paths are missing, whether local parts are intact) to choose between SYSTEM RESTORE REPLICA and the manual procedure with confidence, and to verify the cluster is genuinely healthy afterward.

Frequently Asked Questions

Q: Will I lose data when recovering from ZooKeeper metadata loss?

Generally no, as long as the data parts on disk are intact on at least one replica. Both SYSTEM RESTORE REPLICA and the manual procedure rebuild metadata from existing local parts. What you lose is non-durable coordination state: the replication queue, in-flight mutations/DDL, and deduplication history.

Q: Do I need to restore a ZooKeeper/Keeper backup first?

No. ZooKeeper state is constantly changing and a snapshot is inconsistent with the actual parts on disk almost immediately. The recommended approach is to get the ensemble running again (even empty) and rebuild table metadata from the data with SYSTEM RESTORE REPLICA.

Q: What is the difference between SYSTEM RESTORE REPLICA and SYSTEM RESTART REPLICA?

RESTORE treats the local data on disk as the source of truth and writes metadata back into Keeper — use it when Keeper lost the metadata. RESTART treats Keeper as the source of truth and re-syncs the local replica from it — use it only when Keeper is intact.

Q: Can I run SYSTEM RESTORE REPLICA across the whole cluster at once?

Yes. SYSTEM RESTORE REPLICA db.table ON CLUSTER cluster_name runs on all replicas. Only the first query writes the reference schema into Keeper; the rest attach against it, so parallel execution is safe.

Q: What happens to insert deduplication after recovery?

The block-number / deduplication history is rebuilt fresh. Previously inserted blocks are no longer guaranteed to be recognized as duplicates, so avoid replaying old insert batches after a restore unless you have other deduplication in place.

Q: My ClickHouse version is older than 21.7 — what are my options?

Use the manual recreate-and-attach procedure (detach, change engine to MergeTree, attach, recreate as ReplicatedMergeTree, attach partitions from the old table), or the archived Altinity clickhouse-zookeeper-recovery script for many tables. Upgrading to a current 24.x/25.x release is strongly recommended so SYSTEM RESTORE REPLICA is available.
