ClickHouse Merge Stalls in Replicated Tables

Q: What is the difference between SYSTEM RESTART REPLICA and SYSTEM RESTORE REPLICA?

SYSTEM RESTART REPLICA reconnects the ZooKeeper session and rebuilds the in-memory queue from existing ZooKeeper metadata. Use it when data is present and ZooKeeper metadata is intact but the session or queue state is stale. SYSTEM RESTORE REPLICA is for the case where ZooKeeper metadata itself has been lost (e.g., after ZooKeeper data directory deletion). It detaches all local parts, recreates the ZooKeeper metadata structure as if the table were new, then reattaches the parts. Using RESTORE when RESTART is sufficient risks unnecessary data disruption.

Q: Does OPTIMIZE TABLE bypass the replication system on ReplicatedMergeTree?

No. OPTIMIZE TABLE on a ReplicatedMergeTree writes a merge request to the shared ZooKeeper replication log. All replicas pull that entry into their own queues and execute the merge independently. It is not a local-only operation. The alter_sync session setting controls how long the client waits after issuing the command: 0 is fire-and-forget, 1 waits for the current replica, 2 waits for all active replicas.

Q: Can I permanently disable merges on one replica to offload CPU?

Yes. Set always_fetch_merged_part = 1 as a ReplicatedMergeTree table setting. The replica will fetch merge results from other replicas instead of executing merges locally. This setting is stored in ZooKeeper table metadata and persists across restarts. Alternatively, set max_replicated_merges_in_queue = 0 to prevent the replica from pulling any MERGE_PARTS entries into its queue.

Q: SYSTEM STOP MERGES was not enough to stop background merging — why?

SYSTEM STOP MERGES only prevents new merge selection on the current server process. It does not stop merges that are already executing, and it does not affect other replicas in the cluster. Additionally, it is not persistent — a server restart resumes merges automatically. For replicated tables, the replication log continues to accumulate entries from the leader; STOP MERGES only delays their execution locally.

Q: What does num_tries exceeding 97,000 in system.replication_queue indicate?

Extremely high num_tries with the error Source part metadata version X is newer than the table metadata version Y. ALTER_METADATA is still in progress is a known regression introduced in ClickHouse 24.1.5.6. It occurs when upgrading from 23.7 and causes MERGE_PARTS entries to retry indefinitely without succeeding. The fix is to upgrade to a version where this regression has been resolved.

Merge stalls in ReplicatedMergeTree tables are among the most operationally disruptive failures in ClickHouse. Symptoms build gradually: system.merges shows the same entries for hours, system.replication_queue fills with MERGE_PARTS entries that never clear, active part counts climb steadily, and eventually INSERT queries fail with DB::Exception: Too many parts. The cluster appears healthy — ZooKeeper is up, the network is fine — but no merges complete. This article explains the mechanics behind the failure, how to identify which cause applies to your cluster, and the precise steps to recover.

What This Error Means

ReplicatedMergeTree coordinates merges across replicas using ZooKeeper (or ClickHouse Keeper). One or more replicas that hold is_leader = 1 in system.replicas select which parts to merge and write MERGE_PARTS entries to the shared ZooKeeper replication log at /clickhouse/tables/{shard}/{table}/log. Every replica pulls log entries into its own /replicas/<name>/queue and executes the merge independently.

A merge stall means one or more stages of this pipeline have stopped advancing:

The leader is not scheduling new merges (no new MERGE_PARTS entries appear in the log).
Entries are in the queue but none are being executed (is_currently_executing = 0 everywhere).
Merges are executing but making no progress (progress in system.merges stays near 0.0 for extended periods).

When the replication queue backs up beyond parts_to_delay_insert (default: 1000 active parts per partition since ClickHouse 23.6; 150 before), inserts slow down. When it reaches parts_to_throw_insert (default: 3000 parts per partition since 23.6; 300 before), inserts are rejected entirely.

Common Causes

Circular GET_PART / MERGE_PARTS queue dependency. A MERGE_PARTS entry cannot proceed because a required source part is listed as being fetched in a GET_PART entry, while that GET_PART is blocked because the part it targets falls inside the range the MERGE_PARTS would produce. Neither task can execute; the queue loops indefinitely. The ClickHouse log reports: Not executing log entry queue-XXXXXXXXX of type MERGE_PARTS for part <part_name> because part <source_part> is not ready yet (log entry for that part is being processed).
Background merge pool exhaustion. The pool shared by all merge and mutation tasks has background_pool_size threads (default: 16) and an effective concurrency ceiling of background_pool_size * background_merges_mutations_concurrency_ratio tasks (default: 16 * 2 = 32). When all slots are occupied by long-running or stuck merges, no new merges can start. system.metrics will show BackgroundMergesAndMutationsPoolTask at or near the ceiling.
ZooKeeper or ClickHouse Keeper session expiry. When a replica loses its ZooKeeper session, is_readonly = 1 and is_session_expired = 1 appear in system.replicas. The replica stops scheduling merges and cannot process queue entries until the session is restored. ZooKeeper JVM garbage collection pauses are a common trigger on busy clusters.
ZooKeeper connection race leaving queue out of sync. After a ZooKeeper connection loss and reconnect, the log_pointer may be updated in ZooKeeper without the corresponding queue entries being added to the in-memory queue. The replica thinks it is current but is actually missing tasks. SYSTEM RESTART REPLICA reconciles the in-memory queue state with ZooKeeper.
Part size ceiling blocking all candidates. Background merges will not select a set of parts whose total compressed size exceeds max_bytes_to_merge_at_max_space_in_pool (default: 150 GB). If every partition contains a single large part already close to or above this threshold, the scheduler finds nothing mergeable and the queue grows with new small parts that never get consolidated.
Stuck mutations consuming all pool slots. Mutations (ALTER UPDATE, ALTER DELETE) share the same background pool as merges. A stuck or very slow mutation occupying many slots can starve merges entirely. system.mutations will show pending entries with non-empty latest_fail_reason.
Version-specific metadata version regression. In ClickHouse 24.1.5.6, a regression caused MERGE_PARTS entries to accumulate with very high num_tries (exceeding 97,000) when upgrading from 23.7. The error in last_exception reads: Source part metadata version X is newer than the table metadata version Y. ALTER_METADATA is still in progress. Upgrading past 24.1.5.6 resolves this.

How to Fix

Step 1: Identify the active cause

Check for stuck merges across all cluster nodes:

-- Identify merges that have been running for a long time with little progress
SELECT
    hostName() AS host,
    database,
    table,
    result_part_name,
    round(elapsed / 60, 1) AS elapsed_minutes,
    round(progress * 100, 2) AS progress_pct,
    formatReadableSize(total_size_bytes_compressed) AS compressed_size,
    num_parts
FROM clusterAllReplicas(default, system.merges)
WHERE elapsed > 3600
   OR (elapsed > 600 AND progress < 0.01)
ORDER BY elapsed DESC;

Check the replication queue for high-retry entries:

SELECT
    hostName() AS host,
    database,
    table,
    type,
    new_part_name,
    is_currently_executing,
    num_tries,
    num_postponed,
    postpone_reason,
    last_exception,
    dateDiff('minute', last_attempt_time, now()) AS minutes_since_last_attempt
FROM clusterAllReplicas(default, system.replication_queue)
WHERE type IN ('MERGE_PARTS', 'GET_PART')
  AND (num_tries > 100 OR dateDiff('hour', create_time, now()) > 1)
ORDER BY num_tries DESC
LIMIT 50;

Check replica health and session state:

SELECT
    hostName() AS host,
    database,
    table,
    is_leader,
    is_readonly,
    is_session_expired,
    queue_size,
    inserts_in_queue,
    merges_in_queue,
    absolute_delay,
    last_queue_update
FROM clusterAllReplicas(default, system.replicas)
WHERE is_readonly = 1
   OR is_session_expired = 1
   OR merges_in_queue > 100
   OR absolute_delay > 300
ORDER BY absolute_delay DESC;

Check background pool saturation:

SELECT metric, value, description
FROM system.metrics
WHERE metric IN (
    'BackgroundMergesAndMutationsPoolTask',
    'BackgroundFetchesPoolTask',
    'BackgroundSchedulePoolTask',
    'ReplicatedChecks'
);

Step 2: Recover from a stale or desynchronized queue

For a replica with a stale in-memory queue (ZooKeeper metadata is intact but the session was interrupted), reinitialize the replica:

SYSTEM RESTART REPLICA mydb.mytable;

This reconnects the ZooKeeper session and rebuilds the in-memory queue from existing ZooKeeper metadata. It is the correct first-line recovery for is_readonly = 1 caused by session expiry or a post-reconnect queue desync. Do not confuse this with SYSTEM RESTORE REPLICA, which is for the separate case where ZooKeeper metadata has been lost but local part data still exists.

Step 3: Resolve a circular GET_PART / MERGE_PARTS deadlock

When postpone_reason in system.replication_queue references part dependency cycles and num_tries is high with is_currently_executing = 0, SYSTEM RESTART REPLICA will usually break the deadlock by rebuilding the queue state. If the deadlock persists after a restart, stopping and restarting merges on the affected table clears the in-memory lock:

SYSTEM STOP MERGES mydb.mytable;
SYSTEM START MERGES mydb.mytable;

Note that SYSTEM STOP MERGES is not persistent across server restarts. It only holds an in-memory lock. After a server restart, merges resume automatically. Use it only as a short-term diagnostic step.

Step 4: Force-merge a specific partition

Once the queue blockage is cleared, manually trigger a merge on partitions with high part counts:

-- Trigger merge on a specific partition; wait for current replica to complete
SET alter_sync = 1;
OPTIMIZE TABLE mydb.mytable PARTITION '202401';

-- Force full compaction to one part per partition, ignoring the
-- max_bytes_to_merge_at_max_space_in_pool size ceiling
OPTIMIZE TABLE mydb.mytable FINAL;

alter_sync (formerly replication_alter_partitions_sync) is a session-level setting. Value 1 waits for the current replica only; value 2 waits for all active replicas and can block indefinitely if any replica is unhealthy. OPTIMIZE FINAL bypasses max_bytes_to_merge_at_max_space_in_pool and uses only available free disk space as the limit.

Step 5: Handle stuck mutations exhausting pool slots

Check for pending mutations that are failing or stalled:

SELECT database, table, command, parts_to_do, is_done, latest_fail_reason
FROM system.mutations
WHERE NOT is_done
ORDER BY create_time;

Kill mutations that are failing and blocking pool slots:

KILL MUTATION WHERE mutation_id = 'mutation_id_here'
  AND database = 'mydb' AND table = 'mytable';

Step 6: Clean up a dead replica from ZooKeeper

If a decommissioned replica was never cleanly removed and its MERGE_PARTS entries are blocking progress, remove its ZooKeeper metadata:

-- Remove a dead replica's ZooKeeper metadata (does not affect the local/live replica)
SYSTEM DROP REPLICA 'dead_replica_hostname' FROM TABLE mydb.mytable;

-- When the table has already been dropped locally:
SYSTEM DROP REPLICA 'dead_replica_hostname' FROM ZKPATH '/clickhouse/tables/01/mydb/mytable';

SYSTEM DROP REPLICA cannot remove the currently running local replica. Use DROP TABLE for that.

Root-Cause Analysis

Identify partitions accumulating parts faster than merges can process them:

SELECT
    database,
    table,
    partition_id,
    count() AS active_parts,
    formatReadableSize(sum(bytes_on_disk)) AS total_size_on_disk
FROM system.parts
WHERE active = 1
GROUP BY database, table, partition_id
HAVING active_parts > 50
ORDER BY active_parts DESC;

Inspect ZooKeeper exception counts to correlate connectivity problems with replication instability:

SELECT event, value
FROM system.events
WHERE event LIKE '%ZooKeeper%'
ORDER BY event;

A rising ZooKeeperHardwareExceptions value alongside replica is_readonly = 1 confirms ZooKeeper-driven session expiry as the root cause.

Review recent merge completions — or gaps where merges stopped — using the part log:

SELECT
    event_time,
    database,
    table,
    event_type,
    part_name,
    duration_ms,
    rows,
    formatReadableSize(size_in_bytes) AS size
FROM system.part_log
WHERE event_type = 'MergeParts'
  AND event_time > now() - INTERVAL 1 HOUR
ORDER BY event_time DESC
LIMIT 50;

Detect the circular GET_PART / MERGE_PARTS pattern explicitly:

SELECT
    rq.type,
    rq.new_part_name,
    rq.postpone_reason,
    rq.num_tries,
    rq.last_exception
FROM system.replication_queue AS rq
WHERE rq.is_currently_executing = 0
  AND rq.num_tries > 0
  AND rq.type IN ('MERGE_PARTS', 'GET_PART')
ORDER BY rq.type, rq.create_time;

Preventive Measures

Monitor system.replication_queue for entries with num_tries > 100 or non-empty postpone_reason and alert before the backlog becomes critical.
Track BackgroundMergesAndMutationsPoolTask in system.metrics. When it consistently approaches background_pool_size * background_merges_mutations_concurrency_ratio, the pool is saturated and new merges will queue.
Watch absolute_delay in system.replicas per host. A replica with growing delay and is_readonly = 1 needs immediate attention.
Keep part counts per partition below parts_to_delay_insert (default: 1000 since 23.6). Batch inserts rather than writing many tiny parts.
Use ClickHouse Keeper instead of Apache ZooKeeper. Keeper uses a Raft-based C++ implementation that avoids JVM GC pauses, uses significantly less memory, and has fewer known replication edge cases.
Keep all replicas on the same ClickHouse minor version. Mixed versions can introduce metadata version mismatches that block MERGE_PARTS execution.
Set execute_merges_on_single_replica_time_threshold on high-replication-factor clusters to have only one replica execute the merge and let others fetch the result, reducing cluster-wide CPU and pool saturation risk.
Run SYSTEM SYNC REPLICA mydb.mytable LIGHTWEIGHT after significant operational changes (failovers, upgrades) to confirm the replication queue has drained data-movement tasks.

Resolve Merge Stalls Automatically with Pulse

Pulse monitors system.merges, system.replication_queue, and system.replicas continuously across your ClickHouse cluster. It surfaces stuck merges, high-retry queue entries, pool saturation, and replica read-only transitions as actionable alerts before part counts reach the insert-rejection threshold. Rather than manually correlating metrics across multiple system tables during an incident, Pulse identifies the root cause and recommends the specific recovery steps for your cluster's state.

Frequently Asked Questions

Q: Why does restarting the ClickHouse server temporarily fix stuck merges?
A: A restart rebuilds the in-memory queue state from ZooKeeper and releases all in-memory locks (including SYSTEM STOP MERGES locks). This clears transient deadlocks like circular GET_PART / MERGE_PARTS dependencies. It does not fix the underlying cause — ZooKeeper connectivity instability, pool exhaustion, or version mismatches — so the stall typically recurs. Use SYSTEM RESTART REPLICA first, since it achieves the same queue reconciliation without a full service disruption.

Q: What is the difference between SYSTEM RESTART REPLICA and SYSTEM RESTORE REPLICA?
A: SYSTEM RESTART REPLICA reconnects the ZooKeeper session and rebuilds the in-memory queue from existing ZooKeeper metadata. Use it when data is present and ZooKeeper metadata is intact but the session or queue state is stale. SYSTEM RESTORE REPLICA is for the case where ZooKeeper metadata itself has been lost (e.g., after ZooKeeper data directory deletion). It detaches all local parts, recreates the ZooKeeper metadata structure as if the table were new, then reattaches the parts. Using RESTORE when RESTART is sufficient risks unnecessary data disruption.

Q: Does OPTIMIZE TABLE bypass the replication system on ReplicatedMergeTree?
A: No. OPTIMIZE TABLE on a ReplicatedMergeTree writes a merge request to the shared ZooKeeper replication log. All replicas pull that entry into their own queues and execute the merge independently. It is not a local-only operation. The alter_sync session setting controls how long the client waits after issuing the command: 0 is fire-and-forget, 1 waits for the current replica, 2 waits for all active replicas.

Q: Can I permanently disable merges on one replica to offload CPU?
A: Yes. Set always_fetch_merged_part = 1 as a ReplicatedMergeTree table setting. The replica will fetch merge results from other replicas instead of executing merges locally. This setting is stored in ZooKeeper table metadata and persists across restarts. Alternatively, set max_replicated_merges_in_queue = 0 to prevent the replica from pulling any MERGE_PARTS entries into its queue.

Q: SYSTEM STOP MERGES was not enough to stop background merging — why?
A: SYSTEM STOP MERGES only prevents new merge selection on the current server process. It does not stop merges that are already executing, and it does not affect other replicas in the cluster. Additionally, it is not persistent — a server restart resumes merges automatically. For replicated tables, the replication log continues to accumulate entries from the leader; STOP MERGES only delays their execution locally.

Q: What does num_tries exceeding 97,000 in system.replication_queue indicate?
A: Extremely high num_tries with the error Source part metadata version X is newer than the table metadata version Y. ALTER_METADATA is still in progress is a known regression introduced in ClickHouse 24.1.5.6. It occurs when upgrading from 23.7 and causes MERGE_PARTS entries to retry indefinitely without succeeding. The fix is to upgrade to a version where this regression has been resolved.