The replication queue is the per-table to-do list that every ReplicatedMergeTree replica reads from Keeper. Entries describe parts to fetch, merges and mutations to run, partitions to drop, and metadata changes to apply. Each replica picks tasks off the queue and executes them in order, retrying on failure. Most of the time the queue is empty within seconds of an insert. When tasks pile up or fail repeatedly, this is the table to interrogate.
How the queue works
When a replica writes a new part, it appends an entry to the table's log in Keeper. Other replicas see the new log entries via watches, materialize them into their local system.replication_queue, and execute them. Entries are processed in order with bounded parallelism per replica. A failed task is retried with exponential backoff, accumulating num_tries and num_postponed counters until it either succeeds or you intervene.
The queue is durable: it lives in Keeper, not in memory. Restarting a replica does not lose the work it owes.
Monitoring query
A single aggregation query gives a good overview of what is happening across all replicated tables:
SELECT
database,
table,
type,
max(last_exception) AS last_exception,
max(postpone_reason) AS postpone_reason,
min(create_time) AS oldest_entry,
max(last_attempt_time) AS last_attempt,
max(last_postpone_time) AS last_postpone,
max(num_postponed) AS max_postponed,
max(num_tries) AS max_tries,
min(num_tries) AS min_tries,
countIf(last_exception != '') AS count_err,
countIf(num_postponed > 0) AS count_postponed,
countIf(is_currently_executing) AS count_executing,
count() AS count_all
FROM system.replication_queue
GROUP BY database, table, type
ORDER BY count_all DESC;
The columns mean:
| Column | Use |
|---|---|
type |
What kind of task this is (see below) |
count_all |
Total entries of this type for the table |
count_err |
Entries with a recent exception |
count_postponed |
Entries the server decided to retry later |
count_executing |
Entries currently in flight |
max_tries, max_postponed |
Highest retry counters seen |
oldest_entry |
Age of the longest-lived entry |
last_exception, postpone_reason |
The most informative diagnostic text |
A healthy table shows low counts, recent timestamps, and empty exceptions. Anything else is worth investigating.
Entry types
The type column identifies the kind of operation. The common ones:
GET_PART: fetch a part from another replica. Most common entry on a replica that received new data from elsewhere.MERGE_PARTS: merge several parts into one. The replica chosen to perform the merge produces the new part; others see aGET_PARTfor the result.MUTATE_PART: apply anALTER TABLE ... UPDATE/DELETEmutation to a part.DROP_RANGE: drop a range of parts, used byALTER TABLE ... DROP PARTITIONand TTL cleanup.REPLACE_RANGE: replace a range of parts atomically, used byATTACH PARTITION FROMandREPLACE PARTITION.ATTACH_PART: attach a specific part fromdetached/.
Knowing the type focuses diagnosis. A pile of GET_PART failures usually means a peer is unreachable. Stuck MUTATE_PART entries point to mutation problems. DROP_RANGE failures often mean a referenced part is missing.
Detail query for one table
When you know which table is in trouble, drill in:
SELECT
node_name,
type,
new_part_name,
create_time,
last_attempt_time,
num_tries,
num_postponed,
last_exception,
postpone_reason,
is_currently_executing
FROM system.replication_queue
WHERE database = 'mydatabase' AND table = 'mytable'
ORDER BY create_time;
This shows each entry in order. The combination of last_exception and postpone_reason usually tells you exactly what is blocking progress.
Common stuck scenarios
Tasks referencing dropped partitions. Occurs when a partition was removed mid-merge or after a partial restore. Re-issuing the drop, or detaching and reattaching the partition, typically clears them. A useful filter:
SELECT database, table, count(), sum(num_tries)
FROM system.replication_queue
GROUP BY database, table
HAVING count() > 100 OR sum(num_tries) > 1000;
Mutations finished but queue not cleared. Cross-check system.mutations:
SELECT m.database, m.table, m.mutation_id, m.is_done, q.type, q.num_tries
FROM system.mutations m
LEFT JOIN system.replication_queue q
ON m.database = q.database AND m.table = q.table
WHERE m.is_done = 1
AND q.type = 'MUTATE_PART'
AND q.num_tries > 0;
If is_done = 1 and the queue still shows the mutation, the queue entry is stale.
Missing source part for GET_PART. All replicas lost the part. The queue cannot heal itself; you need to either restore from backup or accept the data loss and use ALTER TABLE ... DROP PART to clear the reference.
Read-only replica blocking the log. If one replica cannot process the log, the global log can grow unbounded. Restore the read-only replica or drop it with SYSTEM DROP REPLICA so the others move on.
Remediation patterns
Generic recovery actions:
-- Force the replica to re-read its queue from Keeper
SYSTEM RESTART REPLICA mydatabase.mytable;
-- Remove a dead replica entirely so the rest of the cluster moves on
SYSTEM DROP REPLICA 'replica_name' FROM TABLE mydatabase.mytable;
-- Rebuild a replica's metadata while keeping local parts
SYSTEM RESTORE REPLICA mydatabase.mytable;
-- Tell ClickHouse a part is gone and stop trying to fetch it
ALTER TABLE mydatabase.mytable DROP PART 'part_name';
Apply the lightest action that works. SYSTEM RESTART REPLICA is non-destructive. SYSTEM DROP REPLICA should be used only when you are sure that replica is dead, otherwise you risk data loss if it held unique parts.
Common Pitfalls
- Reading
count_allwithoutlast_exception. A long queue might just be a busy ingest; the exceptions tell you whether it is actually broken. - Running
SYSTEM DROP REPLICAagainst the only replica holding a part. Always checksystem.partson the other replicas first. - Ignoring
postpone_reason. It often gives a more useful hint thanlast_exception. - Increasing parallelism (
background_pool_size) when the bottleneck is Keeper. The queue gets shorter only when the underlying error is fixed. - Letting
MUTATE_PARTfailures sit. Stuck mutations also block subsequent merges on the same parts.
Frequently Asked Questions
Q: Where is the replication queue stored?
A: In Keeper or ZooKeeper, under each table's znode. system.replication_queue is a per-replica view materialized from it.
Q: Can I edit entries by hand in Keeper? A: Strongly discouraged. Use ClickHouse system commands. Manual edits desync the cluster.
Q: Why is one replica's queue much longer than the others?
A: Either that replica is behind on processing (CPU, disk, or Keeper-bound) or its peers cannot reach it to deliver parts. Compare absolute_delay in system.replicas.
Q: Is the DDL queue the same as the replication queue?
A: No. DDL with ON CLUSTER goes through system.distributed_ddl_queue. Per-table replication tasks live in system.replication_queue. Both can be stuck independently.
Q: How can I clear stale entries safely?
A: Identify the offending part or partition, then issue the corresponding ALTER TABLE ... DROP PART or re-run the original DDL. Restart the replica afterward to force a fresh queue read.