ClickHouse: Recreate Table After Replication Queue Corruption

Q: Do I need to drop the table on every replica?

Yes. Otherwise the new CREATE TABLE collides with the old znode in Keeper and replication coordination stays broken.

When system.replication_queue is full of unrecoverable tasks, ZooKeeper paths are inconsistent, and SYSTEM RESTART REPLICA no longer fixes the situation, the last-resort move is to detach every part on the best surviving replica, drop the broken replicated table, then reattach the parts into a clean table. This preserves your data while throwing away all of the corrupted coordination state. It requires full downtime on the table and should only be attempted after lighter remedies, SYSTEM RESTART REPLICA, SYSTEM DROP REPLICA plus SYSTEM RESTORE REPLICA, have failed.

When to use this procedure

Use this only when:

The replication queue has tasks that cannot be cleared by retrying or by dropping referenced partitions.
SYSTEM RESTORE REPLICA does not bring the table back to a usable state.
You have identified one replica with the most complete and consistent active parts.

If the corruption is limited to one replica, the simpler fix is to drop that replica with SYSTEM DROP REPLICA and let it re-clone from the others. The procedure below applies when every replica is in trouble.

Preparation

Pick the best replica. The one with the most active parts and the freshest data. Treat this as the source of truth.
Freeze the data as a safety net:
```
ALTER TABLE mydatabase.mybadtable FREEZE;
```
This creates hardlinks in shadow/ so you can roll back if anything goes wrong.
Stop all writers. Pause ingestion pipelines, materialized view sources, and any application that touches the table. Stop ad-hoc queries too. The procedure assumes a quiet table.

Step 1: Clean any pre-existing detached parts

Stale detached parts will collide with what we are about to detach. Generate drop statements for whatever sits in system.detached_parts:

SELECT concat(
    'ALTER TABLE ', database, '.', table,
    ' DROP DETACHED PART \'', name, '\' SETTINGS allow_drop_detached = 1;'
)
FROM system.detached_parts
WHERE database = 'mydatabase' AND table = 'mybadtable'
FORMAT TSVRaw;

Execute the output, then verify cleanup:

SELECT count()
FROM system.detached_parts
WHERE database = 'mydatabase' AND table = 'mybadtable';

The count must be zero before continuing.

Step 2: Detach all active parts

Generate DETACH PARTITION commands for every active partition on the chosen replica:

SELECT concat(
    'ALTER TABLE ', database, '.', table,
    ' DETACH PARTITION ID \'', partition_id, '\';'
) AS detach
FROM system.parts
WHERE active = 1
  AND database = 'mydatabase'
  AND table = 'mybadtable'
GROUP BY detach
ORDER BY detach ASC
FORMAT TSVRaw;

Run the generated statements. After this the table is empty from the engine's point of view, and the data lives under detached/.

Step 3: Drop the broken replicated table

Drop the table on every replica. With the queue corrupted, expect to use DROP TABLE ... SYNC or SYSTEM DROP REPLICA on the ZooKeeper path so the znode is cleared as well. Confirm in Keeper that the table's path under /clickhouse/tables/... is gone before recreating.

Step 4: Recreate the table with the same schema

Create the table again on each replica, using the same engine parameters and ZooKeeper path. Use the original CREATE TABLE statement from your schema repository or SHOW CREATE TABLE output captured earlier. The empty table will register fresh metadata in Keeper.

Step 5: Reattach parts on the source replica

Only on the replica that holds the detached parts, regenerate attach statements from system.detached_parts:

SELECT concat(
    'ALTER TABLE ', database, '.', table,
    ' ATTACH PART \'', a.name, '\';'
)
FROM system.detached_parts AS a
WHERE database = 'mydatabase' AND table = 'mybadtable'
FORMAT TSVRaw;

Execute the output. Other replicas pick the parts up through normal replication once they are attached on the source.

Step 6: Validate

Confirm the part count, partition count, and total size look right:

SELECT
    formatReadableSize(sum(bytes)) AS size,
    sum(rows)                     AS rows,
    count()                        AS part_count,
    uniqExact(partition)           AS partition_count
FROM system.parts
WHERE active = 1
  AND database = 'mydatabase'
  AND table = 'mybadtable';

Compare against a pre-recovery snapshot. Then check the other replicas are catching up:

SELECT replica_name, log_max_index, log_pointer, queue_size
FROM system.replicas
WHERE table = 'mybadtable';

Once queue_size drains to zero on every replica, restart writers.

Common Pitfalls

Skipping the FREEZE step. Without it there is no rollback if a detach corrupts further.
Running the recovery while ingestion is live. New parts arriving mid-procedure get lost.
Forgetting to clear system.detached_parts first. Old detached parts collide with new ones by name.
Dropping the table before its znode is removed in Keeper. Recreation fails with a Replica path already exists style error.
Attaching parts on multiple replicas simultaneously. Let one replica attach, then let replication ship the parts to the others.

Frequently Asked Questions

Q: Is data ever lost? A: Only if the chosen source replica was missing parts that other replicas had. Pick the replica with the highest active part count to minimize this.

Q: How long does this take? A: Detach and reattach are metadata operations and run in seconds even for large tables. The slow part is the other replicas re-fetching parts from the source, bounded by network throughput.

Q: Do I need to drop the table on every replica? A: Yes. Otherwise the new CREATE TABLE collides with the old znode in Keeper and replication coordination stays broken.

Q: Can I run this on a live table without downtime? A: No. The detach step temporarily empties the table from the engine's perspective. Writes during this window will fail or land in the wrong state.

Q: What if SYSTEM RESTORE REPLICA would have been enough? A: Try it first. It is the lighter-weight fix and works when the local part data is intact but the Keeper znode is missing. Only use the recreate procedure when restore does not converge.