ClickHouse Error: Rows in Filesystem Are Suspicious

When a Replicated* table starts up, ClickHouse compares the parts on local disk against what ZooKeeper says the replica should contain. If too many parts differ, startup is aborted with the error "X rows of Y total rows in filesystem are suspicious". This safety check stops ClickHouse from corrupting a replica by attaching the wrong filesystem layout to a healthy ZooKeeper state. This guide explains the cause and the recovery procedure.

The Check Behind the Error

The safety ratio is controlled by replicated_max_ratio_of_wrong_parts, with a default of 0.5 (50%). The check is weighted by rows, which is why the error message reports rows rather than a part count: if the rows in local parts that ZooKeeper does not expect, divided by the total rows on the local filesystem, exceed the ratio, ClickHouse refuses to attach the table.

On a small table, even a single bad part can push the ratio above the default. On a large table, you need a substantial drift before the check trips.
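
The arithmetic can be illustrated with a small Python model. This is a simplified sketch, not ClickHouse's actual implementation: it assumes the ratio is computed over rows in parts that exist on disk but not in ZooKeeper, as the wording of the error message suggests. The function names and part names are hypothetical and chosen only for illustration.

```python
from typing import Dict, Set, Tuple


def classify_parts(local_rows: Dict[str, int], zk_parts: Set[str]) -> Tuple[Set[str], Set[str]]:
    """Split the drift into parts only on disk and parts only in ZooKeeper."""
    local = set(local_rows)
    unexpected = local - zk_parts  # on disk, unknown to ZooKeeper
    missing = zk_parts - local     # in ZooKeeper, absent on disk
    return unexpected, missing


def check_trips(local_rows: Dict[str, int], zk_parts: Set[str], max_ratio: float = 0.5) -> bool:
    """True when rows in unexpected parts exceed max_ratio of all rows on disk."""
    unexpected, _ = classify_parts(local_rows, zk_parts)
    suspicious_rows = sum(local_rows[name] for name in unexpected)
    total_rows = sum(local_rows.values())
    return suspicious_rows > total_rows * max_ratio


# Small table: one unexpected part holds 1,500 of 2,000 rows (75%).
small = {"all_1_1_0": 500, "all_2_2_0": 1500}
print(check_trips(small, zk_parts={"all_1_1_0"}))  # True

# Large table: one unexpected part holds 1,000 of 100,000 rows (1%).
large = {f"all_{i}_{i}_0": 1000 for i in range(1, 101)}
print(check_trips(large, zk_parts=set(large) - {"all_100_100_0"}))  # False
```

The same single bad part trips the check on the two-part table and goes unnoticed on the hundred-part one, which is the asymmetry described above.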

Common Causes

Hard restart after a crash. A part written just before a kernel panic might be missing on disk while ZooKeeper still references it, or present on disk without having been committed to ZooKeeper. On a small table, a single such part can trip the 50% ratio immediately.

Storage policy changes. Removing a disk from a policy, then adding it back later, causes parts to disappear from the visible filesystem and reappear. ClickHouse compares filesystem state to ZooKeeper at startup and sees the discrepancy.

Manual filesystem manipulation. Anyone who copies, moves, or deletes part directories outside ClickHouse will create exactly this drift.

Restored backups. Restoring /var/lib/clickhouse/data/... from a backup without coordinating with ZooKeeper produces a stale replica state.

Recovery on ClickHouse 21.7 and Later

The reliable procedure uses SYSTEM RESTORE REPLICA:

Stop the server.
Move the table's SQL metadata aside so the server starts without trying to attach the table:

sudo mv /var/lib/clickhouse/metadata/default/tbl.sql /tmp/tbl.sql.bak

Start the server. The table will not be attached.
Drop the stale replica entry from ZooKeeper:

SYSTEM DROP REPLICA 'replica-0' FROM ZKPATH '/clickhouse/tables/0/default/tbl';

Replace 'replica-0' with the value of the <replica> macro for this node, and the ZKPATH with the actual ZooKeeper path used by your table.

Restore the metadata file:

sudo mv /tmp/tbl.sql.bak /var/lib/clickhouse/metadata/default/tbl.sql

Reattach the table:

ATTACH TABLE default.tbl;

Tell ClickHouse to reconcile the replica with peers:

SYSTEM RESTORE REPLICA default.tbl;
SYSTEM SYNC REPLICA default.tbl;

SYSTEM RESTORE REPLICA recreates the replica's metadata in ZooKeeper from the local parts. SYSTEM SYNC REPLICA then waits while the replica fetches anything still missing from healthy replicas.
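
Because the order of the steps matters, it can help to generate the commands for a specific table before touching the server. The following Python sketch only assembles the commands shown above for a given database, table, replica name, and ZooKeeper path; it does not run anything. The function name, the (kind, text) tuple layout, and the default directories are assumptions made for illustration, and the sketch assumes none of the names contain quotes or spaces.

```python
from typing import List, Tuple


def recovery_steps(database: str, table: str, replica: str, zk_path: str,
                   metadata_root: str = "/var/lib/clickhouse/metadata",
                   stash_dir: str = "/tmp") -> List[Tuple[str, str]]:
    """Return (kind, text) pairs in the order the procedure runs them."""
    live = f"{metadata_root}/{database}/{table}.sql"   # metadata file the server reads
    stash = f"{stash_dir}/{table}.sql.bak"             # temporary location while detached
    name = f"{database}.{table}"
    return [
        ("manual", "Stop the server."),
        ("shell", f"sudo mv {live} {stash}"),
        ("manual", "Start the server. The table will not be attached."),
        ("sql", f"SYSTEM DROP REPLICA '{replica}' FROM ZKPATH '{zk_path}';"),
        ("shell", f"sudo mv {stash} {live}"),
        ("sql", f"ATTACH TABLE {name};"),
        ("sql", f"SYSTEM RESTORE REPLICA {name};"),
        ("sql", f"SYSTEM SYNC REPLICA {name};"),
    ]


for kind, text in recovery_steps("default", "tbl", "replica-0",
                                 "/clickhouse/tables/0/default/tbl"):
    print(f"[{kind}] {text}")
```

Printing the plan first makes it easy to review the replica name and ZKPATH with a colleague before the DROP REPLICA step, which is the one that is hard to undo.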

Quick Workaround: Raise the Ratio

If the drift is known to be safe (for example, you just removed and re-added a disk and understand that ZooKeeper still has the truth), raise the ratio temporarily:

ALTER TABLE default.tbl
MODIFY SETTING replicated_max_ratio_of_wrong_parts = 1.0;

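Or globally, in the merge_tree section of the server configuration, which also works when the table cannot be attached and therefore cannot be altered: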

<merge_tree>
    <replicated_max_ratio_of_wrong_parts>1.0</replicated_max_ratio_of_wrong_parts>
</merge_tree>

With the ratio at 1.0, the startup check can no longer fail. Use this only when you fully understand the cause, because it can mask real corruption, and revert the setting once the replica is healthy.

Verifying After Recovery

Confirm the replica is in good shape:

SELECT
    database, table, is_leader, is_readonly,
    absolute_delay, queue_size, inserts_in_queue, merges_in_queue
FROM system.replicas
WHERE table = 'tbl';

is_readonly = 0, low absolute_delay, and a draining queue_size mean the replica is back to normal.
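
Since "draining" describes a trend rather than a single value, one way to apply these criteria is to compare two samples of the system.replicas row taken some time apart. The Python sketch below assumes each sample is a dict keyed by the column names from the query above. The function name and the 60-second delay threshold are hypothetical; pick a threshold that suits your workload.

```python
from typing import Mapping


def replica_looks_healthy(earlier: Mapping[str, int], later: Mapping[str, int],
                          max_delay_seconds: int = 60) -> bool:
    """Apply the three criteria to two samples of the same system.replicas row."""
    writable = later["is_readonly"] == 0
    caught_up = later["absolute_delay"] <= max_delay_seconds
    # Draining: the queue is empty, or shorter than it was in the earlier sample.
    draining = later["queue_size"] == 0 or later["queue_size"] < earlier["queue_size"]
    return writable and caught_up and draining


first = {"is_readonly": 0, "absolute_delay": 45, "queue_size": 40}
second = {"is_readonly": 0, "absolute_delay": 3, "queue_size": 12}
print(replica_looks_healthy(first, second))  # True
```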

Common Pitfalls

Running SYSTEM RESTORE REPLICA before dropping the old ZooKeeper replica path. The new state collides with the stale one.
Forgetting to put the metadata file back. The table never reappears in the schema.
Using the wrong ZKPATH in DROP REPLICA. Look at system.zookeeper to verify the path exists.
Setting replicated_max_ratio_of_wrong_parts = 1.0 globally and forgetting to revert. You disable an important safety check.
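
To guard against the wrong-ZKPATH pitfall, the replica names registered under a table's ZooKeeper path can be listed before running DROP REPLICA. The Python sketch below builds such a system.zookeeper query and checks the result. It assumes that replicas are registered under the replicas child node of the table path and that the path contains no single quotes; the function names are hypothetical, and running the query is left to whatever client you use.

```python
from typing import Iterable


def replicas_listing_query(zk_path: str) -> str:
    """Build a system.zookeeper query listing the replicas under zk_path."""
    replicas_node = zk_path.rstrip("/") + "/replicas"
    return f"SELECT name FROM system.zookeeper WHERE path = '{replicas_node}'"


def replica_is_registered(names: Iterable[str], replica: str) -> bool:
    """True when the replica name appears among the names the query returned."""
    return replica in set(names)


print(replicas_listing_query("/clickhouse/tables/0/default/tbl"))
# Suppose the query returned these two names (illustrative values).
print(replica_is_registered(["replica-0", "replica-1"], "replica-0"))  # True
```

If the name you intend to drop is not in the returned list, the ZKPATH or the replica name is probably wrong for this table.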

Frequently Asked Questions

Q: Is my data lost? A: Usually not. For Replicated* tables, the data still exists either locally or on peer replicas. The error stops startup so you can choose how to reconcile.

Q: Why is the default ratio 0.5? A: It balances catching real corruption against allowing recovery from small discrepancies. On large tables it takes substantial drift to trip the check; on small tables, a single bad part can be enough.

Q: Does SYSTEM RESTORE REPLICA delete data? A: No. It re-registers existing local parts in ZooKeeper. Nothing on disk is removed.

Q: Can I run this without taking the server down? A: If the table is already attached but in a degraded state, SYSTEM RESTORE REPLICA works online. The full procedure with metadata renaming is for the case where startup itself fails.

Q: How is this different from "Suspiciously many broken parts"? A: That error is about parts that fail checksum or have corrupted files locally. This one is about disagreement between local filesystem and ZooKeeper. They have separate settings and separate recovery paths.