NEW

Pulse 2025 Product Roundup: From Monitoring to AI-Native Control Plane

ClickHouse Parts Consistency Check Queries

Every active part in a ClickHouse MergeTree table has a min_block_number and max_block_number. After merges and replication, the block numbers across all active parts in a partition should form a contiguous sequence. A gap usually means a part was lost, never replicated, or removed by a recovery procedure. Likewise, ZooKeeper holds metadata for every part on a Replicated table, and any drift between ZooKeeper and the local disk is a signal of replication issues.

The three queries below cover the most common consistency checks: missing blocks within a partition, total block counts, and ZooKeeper-versus-disk drift. Run them when investigating data loss alerts, replica desynchronization, or after a failed restore.

1. Find Missing Block Numbers

This query scans active parts in a single partition, walks the sorted block range, and reports any gaps:

SELECT
    database,
    table,
    partition_id,
    ranges.1 AS previous_part,
    ranges.2 AS next_part,
    ranges.3 AS previous_block_number,
    ranges.4 AS next_block_number,
    range(toUInt64(previous_block_number + 1), toUInt64(next_block_number)) AS missing_block_numbers
FROM
(
    WITH
        arrayPopFront(groupArray(min_block_number) AS min) AS min_adj,
        arrayPopBack(groupArray(max_block_number) AS max) AS max_adj,
        arrayFilter((x, y, z) -> (y != (z + 1)),
            arrayZip(arrayPopBack(groupArray(name) AS name_arr), arrayPopFront(name_arr), max_adj, min_adj),
            min_adj, max_adj) AS missing_ranges
    SELECT
        database,
        table,
        partition_id,
        missing_ranges
    FROM
    (
        SELECT *
        FROM system.parts
        WHERE active AND (table = 'query_thread_log') AND (partition_id = '202108') AND active
        ORDER BY min_block_number ASC
    )
    GROUP BY database, table, partition_id
)
ARRAY JOIN missing_ranges AS ranges;

Replace table and partition_id with the partition under investigation. Each output row identifies the part on either side of a gap (previous_part and next_part), the block numbers at the boundary, and the full list of missing block numbers between them.

A gap is not always a problem. Some block numbers are reserved internally and never appear in a part. But long gaps, or gaps that overlap with rows you expected to see, are strong evidence of data loss.

2. Count Total Blocks in a Partition

The companion check confirms how many block numbers have been allocated for a partition:

SELECT
    database,
    table,
    partition_id,
    sum(max_block_number - min_block_number) AS blocks_count
FROM system.parts
WHERE active AND (table = 'query_thread_log') AND (partition_id = '202108') AND active
GROUP BY database, table, partition_id;

Compare blocks_count against the number of inserts you expect. The two numbers will not match exactly because each insert can produce multiple block numbers depending on data volume and max_insert_block_size, but the order of magnitude should line up.

3. Compare ZooKeeper Parts With Disk Parts

On a Replicated table, ZooKeeper holds a record for every part the replica is supposed to own. Drift between ZooKeeper and the local disk indicates either an orphan ZooKeeper entry or a part that was removed without notifying the keeper.

SELECT
    zoo.p_path AS part_zoo,
    zoo.ctime,
    zoo.mtime,
    disk.p_path AS part_disk
FROM
(
    SELECT concat(path, '/', name) AS p_path, ctime, mtime
    FROM system.zookeeper
    WHERE path IN (SELECT concat(replica_path, '/parts') FROM system.replicas)
) AS zoo
LEFT JOIN
(
    SELECT concat(replica_path, '/parts/', name) AS p_path
    FROM system.parts
    INNER JOIN system.replicas USING (database, table)
) AS disk ON zoo.p_path = disk.p_path
WHERE part_disk = '' AND zoo.mtime <= now() - INTERVAL 1 HOUR
ORDER BY part_zoo;

This returns ZooKeeper paths for parts the keeper believes the replica owns but that do not exist on disk. The one-hour filter on mtime skips parts in active fetch or merge windows, since transient drift is normal during those operations.

Interpreting the Output

Result Meaning Action
No rows from query 1 Partition is contiguous No action needed
Small gaps in query 1 Internal block reservations Compare with blocks_count from query 2
Large or persistent gaps Data loss or failed replication Investigate logs, consider restore
Rows from query 3 Orphan ZooKeeper entries Investigate before cleaning

Recovery Path When Drift is Found

If query 3 returns orphan ZooKeeper entries, do not delete them blindly. Confirm first whether the part exists on another replica:

SELECT hostName(), name, active, partition_id
FROM clusterAllReplicas('{cluster}', system.parts)
WHERE name = 'part_name';

If another replica holds the part, SYSTEM SYNC REPLICA on the lagging node usually pulls it back. If no replica has the part, the entry is truly orphaned. Remove the stale ZooKeeper znode with a Keeper client (clickhouse-keeper-client or zkCli.sh) and then run SYSTEM RESTART REPLICA db_name.table_name so the replica re-reads metadata from ZooKeeper.

Common Pitfalls

  • These queries hit system.zookeeper, which can be expensive on large clusters. Scope to one database or table at a time.
  • A small number of skipped block numbers per partition is normal and not a sign of loss.
  • Query 3 filters out parts younger than one hour. Lowering that window adds noise from in-flight fetches.
  • ReplicatedMergeTree quorum failures can leave ZooKeeper ahead of disk. Confirm quorum settings before assuming local corruption.
  • CHECK TABLE performs the equivalent of these checks plus checksum verification on every part. It is slower but more authoritative when triaging a specific table.

Frequently Asked Questions

Q: What does a gap in block numbers mean in ClickHouse? A: A gap means the block numbers between two active parts were not retained. Some gaps are internal reservations, but large or persistent gaps indicate a lost or never-replicated insert.

Q: How do I detect orphaned ZooKeeper parts? A: Run query 3 against system.zookeeper and system.parts. Entries that exist in ZooKeeper but not on disk for more than an hour are likely orphans.

Q: Should I delete every part from ZooKeeper that does not exist on disk? A: No. First check if another replica holds the part. Run SYSTEM SYNC REPLICA to recover before considering manual cleanup.

Q: Is CHECK TABLE enough for this kind of investigation? A: CHECK TABLE verifies checksums and existence of parts. It complements the block-number checks but does not directly report gaps in block numbering.

Q: How often should I run these consistency checks? A: As part of a monthly maintenance review, after any unplanned ZooKeeper outage, and immediately after a restore from backup.

Subscribe to the Pulse Newsletter

Get early access to new Pulse features, insightful blogs & exclusive events , webinars, and workshops.

We use cookies to provide an optimized user experience and understand our traffic. To learn more, read our use of cookies; otherwise, please choose 'Accept Cookies' to continue using our website.