Diagnose ClickHouse Replication and DDL Queue Problems

Q: What is the difference between `num_tries` and `num_postponed`?

num_tries counts actual execution attempts that failed. num_postponed counts deferred attempts where the server decided to retry later (for example, waiting for dependencies). Both growing without progress is bad.

Q: My DDL hangs with one node offline. Can I force it through?

Use distributed_ddl_task_timeout to bound how long the client waits. The task remains queued for the offline node; it completes when that node returns.

Q: How do I monitor replication health proactively?

Export system.replicas.absolute_delay , queue_size , and is_readonly to Prometheus. Alert on is_readonly , sustained queue_size , or absolute_delay beyond your SLA.

When ClickHouse replication misbehaves, the symptoms are usually visible long before queries fail outright: queue depth grows, replicas drift, or DDL statements hang on ON CLUSTER. The diagnostic path runs through a small set of system tables, system.replicas, system.replication_queue, system.distributed_ddl_queue, system.mutations, and system.errors. Walking through them in order tells you whether the cluster is busy, broken, or just slow.

Step 1: Inspect system.replicas

This is the first stop. system.replicas tells you whether each replica has a healthy Keeper session, whether it is read-only, and how far behind it is.

SELECT
    database,
    table,
    is_readonly,
    is_session_expired,
    future_parts,
    parts_to_check,
    queue_size,
    inserts_in_queue,
    merges_in_queue,
    absolute_delay,
    last_queue_update_exception,
    zookeeper_exception
FROM system.replicas
WHERE is_readonly OR queue_size > 0 OR last_queue_update_exception != '';

Things to watch for:

is_readonly = 1: the replica lost contact with Keeper or could not load metadata. Replication on this table is paused.
is_session_expired = 1: the Keeper session dropped. The server should reconnect automatically; if it does not, restart it.
zookeeper_exception: shows the last Keeper-side error, useful for connectivity problems.
absolute_delay: seconds behind the leader. Persistent growth means writes outrun replication.

Step 2: Drill into system.replication_queue

Once you know which tables are unhealthy, look at their queue entries.

SELECT
    database,
    table,
    type,
    max(last_exception)    AS last_exception,
    max(postpone_reason)    AS postpone_reason,
    min(create_time)        AS oldest,
    max(last_attempt_time)  AS last_attempt,
    max(num_postponed)      AS max_postponed,
    max(num_tries)          AS max_tries,
    countIf(last_exception != '')      AS count_err,
    countIf(num_postponed > 0)         AS count_postponed,
    countIf(is_currently_executing)    AS count_executing,
    count()                            AS count_all
FROM system.replication_queue
GROUP BY database, table, type
ORDER BY count_all DESC;

Signals of trouble:

Entries older than 24 hours. Healthy queues churn in seconds to minutes.
num_tries above a few hundred, or num_postponed climbing without progress.
Repeating last_exception strings. The text usually names the missing part or unreachable peer.

Typical exception patterns and what they mean:

Exception fragment	Likely cause
`No active replica has part`	All replicas were missing a part; quorum cannot heal automatically
`Cannot resolve host`	DNS or hostname issue, see the host resolution article
`Code: 999. Coordination::Exception`	Keeper connectivity or znode missing
`Part X already exists`	Race after partial restore or manual file move

Step 3: Check distributed DDL

DDL issued with ON CLUSTER runs through Keeper. Stuck DDL parks tasks in:

SELECT
    entry,
    cluster,
    query,
    initiator,
    status,
    exception_code,
    exception_text,
    query_create_time,
    query_finish_time
FROM system.distributed_ddl_queue
WHERE status != 'Finished'
ORDER BY query_create_time;

A node that never completes a DDL entry is usually offline or in read-only mode. Restart it or fix its replica state first, then the queue drains.

Step 4: Cross-check with mutations

When the replication queue mentions a mutation, confirm whether the mutation itself is finished:

SELECT database, table, mutation_id, command, is_done, latest_fail_reason
FROM system.mutations
WHERE is_done = 0
ORDER BY create_time;

If a mutation is is_done = 1 but the queue still has tasks referencing it, the queue entry is stale and can usually be cleared by retrying or by detaching and reattaching the affected partition.

Step 5: Look at system.errors and the part log

system.errors aggregates server-wide error counters and is a quick way to spot recurring problems:

SELECT name, value, last_error_time, last_error_message
FROM system.errors
WHERE value > 0
ORDER BY value DESC;

system.part_log captures historical part operations (merges, downloads, mutations) and is invaluable when reconstructing what happened to a specific part:

SELECT event_time, event_type, table, part_name, error, exception
FROM system.part_log
WHERE table = 'mytable' AND error != 0
ORDER BY event_time DESC
LIMIT 100;

Recovering read-only replicas

If system.replicas shows is_readonly = 1, the typical recovery is:

SYSTEM RESTART REPLICA mydatabase.mytable;

For deeper corruption where Keeper metadata is missing or wrong, generate a sequence of statements to drop and restore each affected replica. This bulk-recovery pattern is useful when many tables are read-only:

SELECT
    '-- Table ' || toString(row_number() OVER ()) || '\n' ||
    'DETACH TABLE `' || database || '`.`' || table || '`;\n' ||
    'SYSTEM DROP REPLICA ''' || replica_name ||
    ''' FROM ZKPATH ''' || zookeeper_path || ''';\n' ||
    'ATTACH TABLE `' || database || '`.`' || table || '`;\n' ||
    'SYSTEM RESTORE REPLICA `' || database || '`.`' || table || '`;\n'
    AS recovery_sql
FROM system.replicas
WHERE is_readonly = 1
FORMAT TSVRaw;

Execute the output sequentially per replica. Running these against multiple replicas in parallel causes race conditions in Keeper.

Common Pitfalls

Treating queue_size > 0 as automatic failure. Busy clusters often have a non-zero queue that drains quickly.
Restarting replicas without checking last_exception first. The cause repeats after restart.
Running SYSTEM DROP REPLICA on the only healthy copy. Verify another replica has the data.
Ignoring DDL queue when only one node is offline. The whole cluster waits for that node before considering the DDL finished.
Confusing mutation queue with replication queue. They are tracked separately and need separate diagnosis.

Frequently Asked Questions

Q: How long is too long for a task to sit in the replication queue? A: More than a few hours on an idle table is suspicious. More than 24 hours almost always indicates a real problem.

Q: What is the difference between num_tries and num_postponed? A: num_tries counts actual execution attempts that failed. num_postponed counts deferred attempts where the server decided to retry later (for example, waiting for dependencies). Both growing without progress is bad.

Q: Can I delete entries from system.replication_queue directly? A: No. The queue is in Keeper and edits must go through ClickHouse. Use SYSTEM DROP REPLICA and SYSTEM RESTORE REPLICA for surgical fixes.

Q: My DDL hangs with one node offline. Can I force it through? A: Use distributed_ddl_task_timeout to bound how long the client waits. The task remains queued for the offline node; it completes when that node returns.

Q: How do I monitor replication health proactively? A: Export system.replicas.absolute_delay, queue_size, and is_readonly to Prometheus. Alert on is_readonly, sustained queue_size, or absolute_delay beyond your SLA.