ClickHouse: Remove Stale block_numbers from ZooKeeper for Dropped Partitions

Q: Will this affect replication?

FORGET PARTITION is replicated through the normal DDL queue. It runs on every replica. Manual ZooKeeper deletes are not replicated and risk inconsistency between replicas.

When you drop a partition from a ReplicatedMergeTree table, ClickHouse removes the data parts but keeps the partition's entry under /block_numbers/<partition_id> in ZooKeeper. This is by design: the block number counter must remain stable so that out-of-order inserts to a previously-dropped partition cannot collide with old block IDs. Over years of operation on tables with high partition cardinality, those leftover znodes accumulate. They inflate ZooKeeper snapshot size, slow down recovery, and bloat memory on the Keeper ensemble. This article shows how to identify the orphans and remove them safely.

Detecting orphaned block_numbers

The query below joins system.zookeeper with system.parts across the cluster. Any block_numbers/<id> that has no matching part in any replica is a candidate for cleanup. The result emits a delete command per orphan, ready to feed to a Keeper client.

SELECT DISTINCT concat('delete ', zk.block_numbers_path, zk.partition_id) FROM
(
    SELECT r.database, r.table, zk.block_numbers_path, zk.partition_id, p.partition_id
    FROM
    (
        SELECT path AS block_numbers_path, name AS partition_id
        FROM system.zookeeper
        WHERE path IN (
            SELECT concat(zookeeper_path, '/block_numbers/') AS block_numbers_path
            FROM clusterAllReplicas('{cluster}', system.replicas)
        )
    ) AS zk
    LEFT JOIN
    (
        SELECT database, table, concat(zookeeper_path, '/block_numbers/') AS block_numbers_path
        FROM clusterAllReplicas('{cluster}', system.replicas)
    ) AS r ON (r.block_numbers_path = zk.block_numbers_path)
    LEFT JOIN
    (
        SELECT DISTINCT partition_id, database, table
        FROM clusterAllReplicas('{cluster}', system.parts)
    ) AS p ON (p.partition_id = zk.partition_id AND p.database = r.database AND p.table = r.table)
    WHERE p.partition_id = ''  AND zk.partition_id <> 'all'
    ORDER BY r.database, r.table, zk.block_numbers_path, zk.partition_id, p.partition_id
) t
FORMAT TSVRaw;

zk.partition_id <> 'all' excludes the special all partition used by unpartitioned tables. The output looks like a script of delete /clickhouse/tables/.../block_numbers/202101 lines you would feed to zkCli.sh or the clickhouse-keeper-client.

ClickHouse 24.3 and later: FORGET PARTITION

Manually deleting znodes is error-prone. From 24.3 onward, ClickHouse provides ALTER TABLE ... FORGET PARTITION which removes the block number entries through the server, with proper locking. The query below generates one ALTER TABLE statement per table, listing every orphan partition older than 120 days:

WITH
    now() - INTERVAL 120 DAY AS retain_old_partitions,
    replicas AS (
        SELECT DISTINCT database, table, zookeeper_path || '/block_numbers' AS block_numbers_path
        FROM system.replicas
    ),
    zk_data AS (
        SELECT DISTINCT name AS partition_id, path AS block_numbers_path
        FROM system.zookeeper
        WHERE path IN (SELECT block_numbers_path FROM replicas)
          AND mtime < retain_old_partitions
          AND partition_id <> 'all'
    ),
    zk_partitions AS (
        SELECT DISTINCT database, table, partition_id
        FROM replicas
        JOIN zk_data USING block_numbers_path
    ),
    partitions AS (
        SELECT DISTINCT database, table, partition_id
        FROM system.parts
    )
SELECT
    format(
        'ALTER TABLE `{}`.`{}` {};',
        database,
        table,
        arrayStringConcat(arraySort(groupArray('FORGET PARTITION ID \'' || partition_id || '\'')), ', ')
    ) AS query
FROM zk_partitions
WHERE (database, table, partition_id) NOT IN (SELECT * FROM partitions)
GROUP BY database, table
ORDER BY database, table
FORMAT TSVRaw;

Each generated row is a complete ALTER TABLE statement that you can execute as-is. The 120-day cutoff ensures recent partitions (where lingering inserts might still arrive) are left alone.

Cluster-aware variant

The version above runs against the local replica. To operate on every shard from one entry point, wrap the system tables in clusterAllReplicas and emit ON CLUSTER ALTERs:

WITH
    now() - INTERVAL 120 DAY AS retain_old_partitions,
    replicas AS (
        SELECT DISTINCT database, table, zookeeper_path || '/block_numbers' AS block_numbers_path
        FROM clusterAllReplicas('{cluster}', system.replicas)
    ),
    zk_data AS (
        SELECT DISTINCT name AS partition_id, path AS block_numbers_path
        FROM system.zookeeper
        WHERE path IN (SELECT block_numbers_path FROM replicas)
          AND mtime < retain_old_partitions
          AND partition_id <> 'all'
    ),
    zk_partitions AS (
        SELECT DISTINCT database, table, partition_id
        FROM replicas
        JOIN zk_data USING block_numbers_path
    ),
    partitions AS (
        SELECT DISTINCT database, table, partition_id
        FROM clusterAllReplicas('{cluster}', system.parts)
    )
SELECT
    format(
        'ALTER TABLE `{}`.`{}` ON CLUSTER \'\' {};',
        database,
        table,
        arrayStringConcat(arraySort(groupArray('FORGET PARTITION ID \'' || partition_id || '\'')), ', ')
    ) AS query
FROM zk_partitions
WHERE (database, table, partition_id) NOT IN (SELECT * FROM partitions)
GROUP BY database, table
ORDER BY database, table
FORMAT TSVRaw;

Running the generated commands safely

Run the detection query and save the output.
Eyeball it. Confirm none of the listed partitions are partitions you expect to receive future inserts.
Take a Keeper snapshot or backup before bulk operations.
Execute the ALTER TABLE ... FORGET PARTITION statements in a maintenance window.
Re-run the detection query to confirm the cleanup.

For the legacy delete form (pre-24.3), pipe the output to clickhouse-keeper-client or zkCli.sh. Always validate against a single small table first.

Common Pitfalls

Do not delete ZooKeeper paths under block_numbers/ while inserts to the same table are running. Block number allocation can race with the delete and produce duplicate parts.
FORGET PARTITION refuses to operate on partitions that still have parts. The detection query already filters those out; the server provides a second layer of safety.
The 120-day retention threshold is a default, not a rule. If you ingest historical data into year-old partitions, raise the cutoff or skip those tables.
partition_id is not the same as the partition expression. It is a hash. Make sure you are reading znode names, not user-facing partition values.
On Keeper (not classic ZooKeeper) the mtime field is the modification time of the znode in the local store. After a snapshot restore, the times reset. Treat the 120-day cutoff cautiously after a restore.
These queries require permission to read system.zookeeper. That privilege is restricted by default; grant it explicitly to the operator role.

Frequently Asked Questions

Q: Why does ClickHouse keep block_numbers znodes after a partition is dropped? A: The block number counter prevents duplicate part IDs if a late insert to the dropped partition arrives. Keeping the znode means future inserts to that partition get a fresh, non-overlapping block range.

Q: What does FORGET PARTITION actually do? A: It removes the partition's metadata from ZooKeeper, including the block_numbers entries. The partition is "forgotten" and any subsequent insert to the same partition gets fresh block numbers starting from 1.

Q: Will this affect replication? A: FORGET PARTITION is replicated through the normal DDL queue. It runs on every replica. Manual ZooKeeper deletes are not replicated and risk inconsistency between replicas.

Q: How much space can this free? A: Each block_numbers znode is small, on the order of bytes plus ZooKeeper overhead. On clusters with millions of stale entries the savings are significant, both in snapshot size and in Keeper RAM. On a fresh cluster, the gain is negligible.

Q: Is there a way to set automatic cleanup? A: Not directly. The server keeps block_numbers for safety. Schedule the detection-and-FORGET workflow as a maintenance cron, scoped to partitions older than your insert lookback window.