ClickHouse Disaster Recovery: Restoring After Complete Data Loss

When a ClickHouse node loses all local data but the cluster still has a healthy replica, full recovery is possible without restoring from a backup. The procedure relies on cloning database and table metadata from a peer, then letting ClickHouse repopulate parts from ZooKeeper/Keeper and the surviving replica. This guide walks through the steps for ReplicatedMergeTree tables with both Atomic and Ordinary databases, using the force_restore_data flag to bypass safety checks during the rebuild.

Use this procedure when:

A node's disk was wiped or replaced and you still have at least one healthy replica.
ZooKeeper/Keeper paths for the affected tables are intact.
The replicas use a shared ZooKeeper path (the normal ReplicatedMergeTree setup).

If you have no surviving replicas, restore from a clickhouse-backup archive instead.

Scenario

Two servers, srv1 and srv2, run a replicated cluster. srv2 has lost all of its data; srv1 is healthy.

Step 1: Generate database DDL on the healthy node

On srv1, export CREATE DATABASE statements for every user database:

SELECT concat(
  'CREATE DATABASE "', name, '" ENGINE = ', engine,
  ' COMMENT ''', comment, ''';'
)
FROM system.databases
WHERE name NOT IN ('INFORMATION_SCHEMA', 'information_schema', 'system', 'default')
INTO OUTFILE '/tmp/create_database.sql'
FORMAT TabSeparatedRaw;

Copy create_database.sql to srv2:

scp /tmp/create_database.sql srv2:/tmp/

Step 2: Archive metadata on the healthy node

The metadata/ directory contains one .sql file per database and one .sql file per table. Archive it with symlinks dereferenced (Atomic databases store metadata as symlinks):

sudo tar -cvhf /tmp/metadata_schema.tar -C /var/lib/clickhouse metadata
scp /tmp/metadata_schema.tar srv2:/tmp/

The -h flag is critical: without it, you archive the symlinks instead of the real files and the restore fails.
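
One way to confirm the archive was built with -h before copying it is to count the symlink entries it contains. The helper below is a minimal sketch; the function name count_tar_symlinks is hypothetical, and it assumes a tar whose verbose listing starts each line with the file mode, as GNU tar does.

```shell
# Count symlink entries in a tar archive. In `tar -tv` output the first
# character of the mode column is "l" for a symbolic link.
count_tar_symlinks() {
    tar -tvf "$1" | awk 'substr($1, 1, 1) == "l" { n++ } END { print n + 0 }'
}

# Example, using the archive path from this guide:
#   count_tar_symlinks /tmp/metadata_schema.tar
```

A result of 0 means every entry is a real file or directory. Any other number suggests the archive was created without -h and should be rebuilt.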

Step 3: Wipe the damaged node

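On srv2, stop ClickHouse and remove all data. The find form is used rather than rm -rf /var/lib/clickhouse/* because the * glob is expanded by your own shell before sudo runs, so it fails if your user cannot list the directory, and it also skips hidden files:

sudo systemctl stop clickhouse-server
sudo find /var/lib/clickhouse -mindepth 1 -delete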

If /var/lib/clickhouse lives on a freshly mounted disk, it should already be empty. Verify before continuing.
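
The emptiness check can be scripted. This is a small sketch with a hypothetical helper name, dir_is_empty; it must run as a user that can read the directory (for example as root), otherwise an unreadable directory would look empty.

```shell
# Succeed (exit 0) only if the directory exists and has no entries at all,
# including hidden ones that a "*" glob would miss.
dir_is_empty() {
    [ -d "$1" ] && [ -z "$(ls -A "$1")" ]
}

# Example:
#   dir_is_empty /var/lib/clickhouse && echo "empty, safe to continue"
```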

Step 4: Recreate databases

Start the server so it initializes default system databases, then apply the DDL from step 1:

sudo systemctl start clickhouse-server
clickhouse-client < /tmp/create_database.sql
sudo systemctl stop clickhouse-server

This creates empty database directories under /var/lib/clickhouse/metadata/ and /var/lib/clickhouse/store/ (for Atomic engines).

Step 5: Restore table metadata

Extract the metadata archive from step 2 over the now-initialized data directory:

sudo tar -xkvf /tmp/metadata_schema.tar -C /var/lib/clickhouse

The -k flag keeps existing files, so the system database metadata created during initialization is preserved while user table .sql files are added.

Fix ownership:

sudo chown -R clickhouse:clickhouse /var/lib/clickhouse/metadata
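
To double-check the result, you can list anything under metadata/ that is still owned by another user. The sketch below uses a hypothetical helper, list_foreign_owned. It follows symlinks on purpose, on the assumption that Atomic database entries under metadata/ may be symlinks, which chown -R does not traverse by default.

```shell
# List everything under DIR (first argument) that is not owned by USER
# (second argument, a name or numeric ID). -L makes find follow symlinks.
list_foreign_owned() {
    find -L "$1" ! -user "$2" -print
}

# Example, run as root with the path and user from this guide:
#   list_foreign_owned /var/lib/clickhouse/metadata clickhouse
```

Empty output means every file reachable from metadata/ belongs to the given user. Any path it prints may need its own chown before the server starts.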

Step 6: Set the force_restore_data flag

This flag tells ClickHouse, on next startup, to attach tables whose data directories are missing or empty without failing, and to fetch parts from the replica via ZooKeeper:

sudo -u clickhouse touch /var/lib/clickhouse/flags/force_restore_data

The flag is consumed on first startup and removed automatically.

Step 7: Start the server and let it sync

sudo systemctl start clickhouse-server

For each ReplicatedMergeTree table, the server connects to ZooKeeper, finds the existing replica path, and starts fetching parts from srv1. Monitor progress:

SELECT database, table, is_leader, total_replicas, active_replicas,
       queue_size, inserts_in_queue, merges_in_queue
FROM system.replicas
ORDER BY queue_size DESC;

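To inspect individual queue entries, including fetches that keep failing:

SELECT database, table, type, source_replica, new_part_name,
       num_tries, last_exception
FROM system.replication_queue
ORDER BY create_time
LIMIT 20;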

Once queue_size drops to zero across all tables, recovery is complete.
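
Rather than re-running the query by hand, the wait can be scripted. The following is a sketch only: wait_for_drain and the POLL_SECONDS variable are hypothetical names, and the function simply re-runs whatever command you pass it until that command prints 0.

```shell
# Re-run the given command until it prints 0. The command is expected to
# print the total number of outstanding replication queue entries.
wait_for_drain() {
    while :; do
        remaining=$("$@") || return 1
        if [ "$remaining" = "0" ]; then
            return 0
        fi
        echo "queue entries remaining: $remaining" >&2
        sleep "${POLL_SECONDS:-30}"
    done
}

# Example on srv2:
#   wait_for_drain clickhouse-client -q "SELECT sum(queue_size) FROM system.replicas"
```

Passing the query as a command keeps the loop independent of how you connect to the server, so host, user, and password options can be added to the clickhouse-client call as needed.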

Step 8: Verify

Compare row counts between the two replicas:

-- on srv1 and srv2
SELECT database, table, sum(rows) AS rows, sum(bytes_on_disk) AS bytes
FROM system.parts
WHERE active
GROUP BY database, table
ORDER BY database, table;

Numbers should match. Some lag is normal if writes are ongoing; let the cluster settle.
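
With many tables, comparing the two result sets by eye is error-prone. The sketch below assumes you saved the query output from each server as a tab-separated file with the columns database, table, rows, bytes. The helper name compare_row_counts and the file paths are hypothetical, and the sketch compares only the rows column.

```shell
# Print tables whose row counts differ between two TSV exports.
# Exit status is 0 when every table matches, 1 otherwise.
compare_row_counts() {
    awk -F '\t' '
        FILENAME == ARGV[1] { rows[$1 "." $2] = $3; next }  # first file
        {
            key = $1 "." $2
            seen[key] = 1
            if (!(key in rows)) {
                print key ": missing in first file"; bad = 1
            } else if (rows[key] != $3) {
                print key ": " rows[key] " vs " $3; bad = 1
            }
        }
        END {
            for (key in rows)
                if (!(key in seen)) { print key ": missing in second file"; bad = 1 }
            exit bad + 0
        }' "$1" "$2"
}

# Example, after saving the verification query output (FORMAT TSV) per host:
#   compare_row_counts /tmp/srv1_parts.tsv /tmp/srv2_parts.tsv
```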

Recovering non-replicated tables

The procedure above relies on replication to fetch data. For MergeTree (non-replicated) tables, there is no automatic peer to fetch from. You have two options:

Restore from a clickhouse-backup archive.
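Manually copy the part directories from a peer that happens to hold the same data into the table's detached/ directory, then attach them with ALTER TABLE ... ATTACH PART (or ATTACH PARTITION). SYSTEM RESTORE REPLICA does not apply here; it works only on ReplicatedMergeTree tables.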

For production, every MergeTree table that holds important data should either be ReplicatedMergeTree or covered by a backup schedule.

Common Pitfalls

Tar without -h. Archiving symlinks instead of real files leaves you with broken metadata after extraction. Always use tar -cvhf.
Mismatched ZooKeeper path. If srv2 registers a different replica name or path than what is in ZooKeeper, it will create a second replica rather than rejoining. Confirm macros.xml matches the original host.
Skipping force_restore_data. Without it, the server refuses to start tables whose data directories are missing, and you have to either delete the table metadata or set the flag and restart.
Insufficient disk on the recovering node. srv2 needs at least as much free space as the dataset occupies on srv1, and a full clone can take hours and consume a lot of bandwidth. Check disk capacity and network limits before starting.
Ongoing writes during recovery. Recovery still works under load, but the fetch queue grows as new parts arrive. For very high-throughput clusters, pause writes or throttle ingestion until the queue clears.
detached/ parts on the source. Parts in detached/ are not replicated. If important data lives there, copy it manually after the main recovery and run ALTER TABLE ... ATTACH PART.
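
For the mismatched ZooKeeper path pitfall, one quick check is to read the macros the recovering server actually loaded and compare them with the replica name registered in ZooKeeper. The helper below is a sketch: macro_value is a hypothetical name, and it assumes the replica name is defined through a macro called replica, which is a common convention but not guaranteed in your configuration.

```shell
# Read "macro<TAB>substitution" rows from stdin and print the value of the
# named macro. Exits non-zero if the macro is not defined.
macro_value() {
    awk -F '\t' -v name="$1" '
        $1 == name { print $2; found = 1 }
        END { exit !found }'
}

# Example on srv2:
#   clickhouse-client -q "SELECT macro, substitution FROM system.macros" \
#       | macro_value replica
```

The printed value should equal the replica name srv2 had before the data loss. If it differs, fix the macros before setting force_restore_data.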

Frequently Asked Questions

Q: Do I need to stop the healthy replica during recovery? A: No. srv1 keeps serving reads and writes while srv2 fetches parts in the background.

Q: How long does recovery take? A: Roughly the time to copy the dataset over the network, bottlenecked by the slowest of source disk read, network bandwidth, and target disk write. For a 1 TB dataset on a 1 Gbps link, expect 2 to 4 hours.

Q: What if I do not have a healthy replica? A: Then the cluster has no live source of truth. Restore from a clickhouse-backup archive using clickhouse-backup restore_remote <name>. If you also have no backups, the data is lost.

Q: Will the recovered replica become the leader? A: Leadership is per-table and is assigned automatically. In recent ClickHouse versions several replicas of a table can be leaders at the same time, so srv2 may become a leader alongside srv1. Either way, it does not affect correctness.

Q: Do I need to drop and recreate the ZooKeeper paths? A: No. The existing paths are what allow the recovering node to rejoin. Touching ZooKeeper paths during recovery is the most common way to lose data permanently.

Q: Can I use SYSTEM RESTORE REPLICA instead? A: Not for this scenario. SYSTEM RESTORE REPLICA handles the opposite failure: the local data is present but the table's metadata in ZooKeeper is lost, and it works only on read-only ReplicatedMergeTree tables. Here the ZooKeeper metadata is intact and the local data is gone, so the metadata-tar approach with force_restore_data is the right tool, and it scales to many tables at once.
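
The recovery-time estimate in the FAQ can be reproduced with simple arithmetic. The sketch below computes a lower bound only; it assumes decimal terabytes, a link running at full line rate, and no disk or protocol overhead. The function name min_transfer_hours is hypothetical.

```shell
# Lower-bound transfer time in hours: terabytes (first argument) over a
# link of the given speed in Gbps (second argument).
# 1 TB = 8000 gigabits, so seconds = tb * 8000 / gbps.
min_transfer_hours() {
    LC_ALL=C awk -v tb="$1" -v gbps="$2" \
        'BEGIN { printf "%.1f\n", tb * 8000 / gbps / 3600 }'
}

# Example:
#   min_transfer_hours 1 1    # prints 2.2
```

The 2.2-hour floor for 1 TB over 1 Gbps is consistent with the 2 to 4 hours quoted above once real-world overhead is included.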