When you drop a partition from a ReplicatedMergeTree table, ClickHouse removes the data parts but keeps the partition's entry under /block_numbers/<partition_id> in ZooKeeper. This is by design: the block number counter must remain stable so that out-of-order inserts to a previously-dropped partition cannot collide with old block IDs. Over years of operation on tables with high partition cardinality, those leftover znodes accumulate. They inflate ZooKeeper snapshot size, slow down recovery, and bloat memory on the Keeper ensemble. This article shows how to identify the orphans and remove them safely.
Detecting orphaned block_numbers
The query below joins system.zookeeper with system.parts across the cluster. Any block_numbers/<id> that has no matching part in any replica is a candidate for cleanup. The result emits a delete command per orphan, ready to feed to a Keeper client.
SELECT DISTINCT concat('delete ', zk.block_numbers_path, zk.partition_id) FROM
(
SELECT r.database, r.table, zk.block_numbers_path, zk.partition_id, p.partition_id
FROM
(
SELECT path AS block_numbers_path, name AS partition_id
FROM system.zookeeper
WHERE path IN (
SELECT concat(zookeeper_path, '/block_numbers/') AS block_numbers_path
FROM clusterAllReplicas('{cluster}', system.replicas)
)
) AS zk
LEFT JOIN
(
SELECT database, table, concat(zookeeper_path, '/block_numbers/') AS block_numbers_path
FROM clusterAllReplicas('{cluster}', system.replicas)
) AS r ON (r.block_numbers_path = zk.block_numbers_path)
LEFT JOIN
(
SELECT DISTINCT partition_id, database, table
FROM clusterAllReplicas('{cluster}', system.parts)
) AS p ON (p.partition_id = zk.partition_id AND p.database = r.database AND p.table = r.table)
WHERE p.partition_id = '' AND zk.partition_id <> 'all'
ORDER BY r.database, r.table, zk.block_numbers_path, zk.partition_id, p.partition_id
) t
FORMAT TSVRaw;
zk.partition_id <> 'all' excludes the special all partition used by unpartitioned tables. The output looks like a script of delete /clickhouse/tables/.../block_numbers/202101 lines you would feed to zkCli.sh or the clickhouse-keeper-client.
ClickHouse 24.3 and later: FORGET PARTITION
Manually deleting znodes is error-prone. From 24.3 onward, ClickHouse provides ALTER TABLE ... FORGET PARTITION which removes the block number entries through the server, with proper locking. The query below generates one ALTER TABLE statement per table, listing every orphan partition older than 120 days:
WITH
now() - INTERVAL 120 DAY AS retain_old_partitions,
replicas AS (
SELECT DISTINCT database, table, zookeeper_path || '/block_numbers' AS block_numbers_path
FROM system.replicas
),
zk_data AS (
SELECT DISTINCT name AS partition_id, path AS block_numbers_path
FROM system.zookeeper
WHERE path IN (SELECT block_numbers_path FROM replicas)
AND mtime < retain_old_partitions
AND partition_id <> 'all'
),
zk_partitions AS (
SELECT DISTINCT database, table, partition_id
FROM replicas
JOIN zk_data USING block_numbers_path
),
partitions AS (
SELECT DISTINCT database, table, partition_id
FROM system.parts
)
SELECT
format(
'ALTER TABLE `{}`.`{}` {};',
database,
table,
arrayStringConcat(arraySort(groupArray('FORGET PARTITION ID \'' || partition_id || '\'')), ', ')
) AS query
FROM zk_partitions
WHERE (database, table, partition_id) NOT IN (SELECT * FROM partitions)
GROUP BY database, table
ORDER BY database, table
FORMAT TSVRaw;
Each generated row is a complete ALTER TABLE statement that you can execute as-is. The 120-day cutoff ensures recent partitions (where lingering inserts might still arrive) are left alone.
Cluster-aware variant
The version above runs against the local replica. To operate on every shard from one entry point, wrap the system tables in clusterAllReplicas and emit ON CLUSTER ALTERs:
WITH
now() - INTERVAL 120 DAY AS retain_old_partitions,
replicas AS (
SELECT DISTINCT database, table, zookeeper_path || '/block_numbers' AS block_numbers_path
FROM clusterAllReplicas('{cluster}', system.replicas)
),
zk_data AS (
SELECT DISTINCT name AS partition_id, path AS block_numbers_path
FROM system.zookeeper
WHERE path IN (SELECT block_numbers_path FROM replicas)
AND mtime < retain_old_partitions
AND partition_id <> 'all'
),
zk_partitions AS (
SELECT DISTINCT database, table, partition_id
FROM replicas
JOIN zk_data USING block_numbers_path
),
partitions AS (
SELECT DISTINCT database, table, partition_id
FROM clusterAllReplicas('{cluster}', system.parts)
)
SELECT
format(
'ALTER TABLE `{}`.`{}` ON CLUSTER \'\' {};',
database,
table,
arrayStringConcat(arraySort(groupArray('FORGET PARTITION ID \'' || partition_id || '\'')), ', ')
) AS query
FROM zk_partitions
WHERE (database, table, partition_id) NOT IN (SELECT * FROM partitions)
GROUP BY database, table
ORDER BY database, table
FORMAT TSVRaw;
Running the generated commands safely
- Run the detection query and save the output.
- Eyeball it. Confirm none of the listed partitions are partitions you expect to receive future inserts.
- Take a Keeper snapshot or backup before bulk operations.
- Execute the
ALTER TABLE ... FORGET PARTITIONstatements in a maintenance window. - Re-run the detection query to confirm the cleanup.
For the legacy delete form (pre-24.3), pipe the output to clickhouse-keeper-client or zkCli.sh. Always validate against a single small table first.
Common Pitfalls
- Do not
deleteZooKeeper paths underblock_numbers/while inserts to the same table are running. Block number allocation can race with the delete and produce duplicate parts. FORGET PARTITIONrefuses to operate on partitions that still have parts. The detection query already filters those out; the server provides a second layer of safety.- The 120-day retention threshold is a default, not a rule. If you ingest historical data into year-old partitions, raise the cutoff or skip those tables.
partition_idis not the same as thepartitionexpression. It is a hash. Make sure you are reading znode names, not user-facing partition values.- On Keeper (not classic ZooKeeper) the
mtimefield is the modification time of the znode in the local store. After a snapshot restore, the times reset. Treat the 120-day cutoff cautiously after a restore. - These queries require permission to read
system.zookeeper. That privilege is restricted by default; grant it explicitly to the operator role.
Frequently Asked Questions
Q: Why does ClickHouse keep block_numbers znodes after a partition is dropped? A: The block number counter prevents duplicate part IDs if a late insert to the dropped partition arrives. Keeping the znode means future inserts to that partition get a fresh, non-overlapping block range.
Q: What does FORGET PARTITION actually do? A: It removes the partition's metadata from ZooKeeper, including the block_numbers entries. The partition is "forgotten" and any subsequent insert to the same partition gets fresh block numbers starting from 1.
Q: Will this affect replication?
A: FORGET PARTITION is replicated through the normal DDL queue. It runs on every replica. Manual ZooKeeper deletes are not replicated and risk inconsistency between replicas.
Q: How much space can this free? A: Each block_numbers znode is small, on the order of bytes plus ZooKeeper overhead. On clusters with millions of stale entries the savings are significant, both in snapshot size and in Keeper RAM. On a fresh cluster, the gain is negligible.
Q: Is there a way to set automatic cleanup? A: Not directly. The server keeps block_numbers for safety. Schedule the detection-and-FORGET workflow as a maintenance cron, scoped to partitions older than your insert lookback window.