ClickHouse Replication Queue: Monitoring and Troubleshooting

Q: Where is the replication queue stored?

In Keeper or ZooKeeper, under each table's znode. system.replication_queue is a per-replica view materialized from it.

Q: Is the DDL queue the same as the replication queue?

No. DDL with ON CLUSTER goes through system.distributed_ddl_queue . Per-table replication tasks live in system.replication_queue . Both can be stuck independently.

The replication queue is the per-table to-do list that every ReplicatedMergeTree replica reads from Keeper. Entries describe parts to fetch, merges and mutations to run, partitions to drop, and metadata changes to apply. Each replica picks tasks off the queue and executes them in order, retrying on failure. Most of the time the queue is empty within seconds of an insert. When tasks pile up or fail repeatedly, this is the table to interrogate.

How the queue works

When a replica writes a new part, it appends an entry to the table's log in Keeper. Other replicas see the new log entries via watches, materialize them into their local system.replication_queue, and execute them. Entries are processed in order with bounded parallelism per replica. A failed task is retried with exponential backoff, accumulating num_tries and num_postponed counters until it either succeeds or you intervene.

The queue is durable: it lives in Keeper, not in memory. Restarting a replica does not lose the work it owes.

Monitoring query

A single aggregation query gives a good overview of what is happening across all replicated tables:

SELECT
    database,
    table,
    type,
    max(last_exception)                  AS last_exception,
    max(postpone_reason)                 AS postpone_reason,
    min(create_time)                     AS oldest_entry,
    max(last_attempt_time)               AS last_attempt,
    max(last_postpone_time)              AS last_postpone,
    max(num_postponed)                   AS max_postponed,
    max(num_tries)                       AS max_tries,
    min(num_tries)                       AS min_tries,
    countIf(last_exception != '')        AS count_err,
    countIf(num_postponed > 0)           AS count_postponed,
    countIf(is_currently_executing)      AS count_executing,
    count()                              AS count_all
FROM system.replication_queue
GROUP BY database, table, type
ORDER BY count_all DESC;

The columns mean:

Column	Use
`type`	What kind of task this is (see below)
`count_all`	Total entries of this type for the table
`count_err`	Entries with a recent exception
`count_postponed`	Entries the server decided to retry later
`count_executing`	Entries currently in flight
`max_tries`, `max_postponed`	Highest retry counters seen
`oldest_entry`	Age of the longest-lived entry
`last_exception`, `postpone_reason`	The most informative diagnostic text

A healthy table shows low counts, recent timestamps, and empty exceptions. Anything else is worth investigating.

Entry types

The type column identifies the kind of operation. The common ones:

GET_PART: fetch a part from another replica. Most common entry on a replica that received new data from elsewhere.
MERGE_PARTS: merge several parts into one. The replica chosen to perform the merge produces the new part; others see a GET_PART for the result.
MUTATE_PART: apply an ALTER TABLE ... UPDATE/DELETE mutation to a part.
DROP_RANGE: drop a range of parts, used by ALTER TABLE ... DROP PARTITION and TTL cleanup.
REPLACE_RANGE: replace a range of parts atomically, used by ATTACH PARTITION FROM and REPLACE PARTITION.
ATTACH_PART: attach a specific part from detached/.

Knowing the type focuses diagnosis. A pile of GET_PART failures usually means a peer is unreachable. Stuck MUTATE_PART entries point to mutation problems. DROP_RANGE failures often mean a referenced part is missing.

Detail query for one table

When you know which table is in trouble, drill in:

SELECT
    node_name,
    type,
    new_part_name,
    create_time,
    last_attempt_time,
    num_tries,
    num_postponed,
    last_exception,
    postpone_reason,
    is_currently_executing
FROM system.replication_queue
WHERE database = 'mydatabase' AND table = 'mytable'
ORDER BY create_time;

This shows each entry in order. The combination of last_exception and postpone_reason usually tells you exactly what is blocking progress.

Common stuck scenarios

Tasks referencing dropped partitions. Occurs when a partition was removed mid-merge or after a partial restore. Re-issuing the drop, or detaching and reattaching the partition, typically clears them. A useful filter:

SELECT database, table, count(), sum(num_tries)
FROM system.replication_queue
GROUP BY database, table
HAVING count() > 100 OR sum(num_tries) > 1000;

Mutations finished but queue not cleared. Cross-check system.mutations:

SELECT m.database, m.table, m.mutation_id, m.is_done, q.type, q.num_tries
FROM system.mutations m
LEFT JOIN system.replication_queue q
  ON m.database = q.database AND m.table = q.table
WHERE m.is_done = 1
  AND q.type = 'MUTATE_PART'
  AND q.num_tries > 0;

If is_done = 1 and the queue still shows the mutation, the queue entry is stale.

Missing source part for GET_PART. All replicas lost the part. The queue cannot heal itself; you need to either restore from backup or accept the data loss and use ALTER TABLE ... DROP PART to clear the reference.

Read-only replica blocking the log. If one replica cannot process the log, the global log can grow unbounded. Restore the read-only replica or drop it with SYSTEM DROP REPLICA so the others move on.

Remediation patterns

Generic recovery actions:

-- Force the replica to re-read its queue from Keeper
SYSTEM RESTART REPLICA mydatabase.mytable;

-- Remove a dead replica entirely so the rest of the cluster moves on
SYSTEM DROP REPLICA 'replica_name' FROM TABLE mydatabase.mytable;

-- Rebuild a replica's metadata while keeping local parts
SYSTEM RESTORE REPLICA mydatabase.mytable;

-- Tell ClickHouse a part is gone and stop trying to fetch it
ALTER TABLE mydatabase.mytable DROP PART 'part_name';

Apply the lightest action that works. SYSTEM RESTART REPLICA is non-destructive. SYSTEM DROP REPLICA should be used only when you are sure that replica is dead, otherwise you risk data loss if it held unique parts.

Common Pitfalls

Reading count_all without last_exception. A long queue might just be a busy ingest; the exceptions tell you whether it is actually broken.
Running SYSTEM DROP REPLICA against the only replica holding a part. Always check system.parts on the other replicas first.
Ignoring postpone_reason. It often gives a more useful hint than last_exception.
Increasing parallelism (background_pool_size) when the bottleneck is Keeper. The queue gets shorter only when the underlying error is fixed.
Letting MUTATE_PART failures sit. Stuck mutations also block subsequent merges on the same parts.

Frequently Asked Questions

Q: Where is the replication queue stored? A: In Keeper or ZooKeeper, under each table's znode. system.replication_queue is a per-replica view materialized from it.

Q: Can I edit entries by hand in Keeper? A: Strongly discouraged. Use ClickHouse system commands. Manual edits desync the cluster.

Q: Why is one replica's queue much longer than the others? A: Either that replica is behind on processing (CPU, disk, or Keeper-bound) or its peers cannot reach it to deliver parts. Compare absolute_delay in system.replicas.

Q: Is the DDL queue the same as the replication queue? A: No. DDL with ON CLUSTER goes through system.distributed_ddl_queue. Per-table replication tasks live in system.replication_queue. Both can be stuck independently.

Q: How can I clear stale entries safely? A: Identify the offending part or partition, then issue the corresponding ALTER TABLE ... DROP PART or re-run the original DDL. Restart the replica afterward to force a fresh queue read.