When ClickHouse replication misbehaves, the symptoms are usually visible long before queries fail outright: queue depth grows, replicas drift, or DDL statements hang on ON CLUSTER. The diagnostic path runs through a small set of system tables, system.replicas, system.replication_queue, system.distributed_ddl_queue, system.mutations, and system.errors. Walking through them in order tells you whether the cluster is busy, broken, or just slow.
Step 1: Inspect system.replicas
This is the first stop. system.replicas tells you whether each replica has a healthy Keeper session, whether it is read-only, and how far behind it is.
SELECT
database,
table,
is_readonly,
is_session_expired,
future_parts,
parts_to_check,
queue_size,
inserts_in_queue,
merges_in_queue,
absolute_delay,
last_queue_update_exception,
zookeeper_exception
FROM system.replicas
WHERE is_readonly OR queue_size > 0 OR last_queue_update_exception != '';
Things to watch for:
is_readonly = 1: the replica lost contact with Keeper or could not load metadata. Replication on this table is paused.is_session_expired = 1: the Keeper session dropped. The server should reconnect automatically; if it does not, restart it.zookeeper_exception: shows the last Keeper-side error, useful for connectivity problems.absolute_delay: seconds behind the leader. Persistent growth means writes outrun replication.
Step 2: Drill into system.replication_queue
Once you know which tables are unhealthy, look at their queue entries.
SELECT
database,
table,
type,
max(last_exception) AS last_exception,
max(postpone_reason) AS postpone_reason,
min(create_time) AS oldest,
max(last_attempt_time) AS last_attempt,
max(num_postponed) AS max_postponed,
max(num_tries) AS max_tries,
countIf(last_exception != '') AS count_err,
countIf(num_postponed > 0) AS count_postponed,
countIf(is_currently_executing) AS count_executing,
count() AS count_all
FROM system.replication_queue
GROUP BY database, table, type
ORDER BY count_all DESC;
Signals of trouble:
- Entries older than 24 hours. Healthy queues churn in seconds to minutes.
num_triesabove a few hundred, ornum_postponedclimbing without progress.- Repeating
last_exceptionstrings. The text usually names the missing part or unreachable peer.
Typical exception patterns and what they mean:
| Exception fragment | Likely cause |
|---|---|
No active replica has part |
All replicas were missing a part; quorum cannot heal automatically |
Cannot resolve host |
DNS or hostname issue, see the host resolution article |
Code: 999. Coordination::Exception |
Keeper connectivity or znode missing |
Part X already exists |
Race after partial restore or manual file move |
Step 3: Check distributed DDL
DDL issued with ON CLUSTER runs through Keeper. Stuck DDL parks tasks in:
SELECT
entry,
cluster,
query,
initiator,
status,
exception_code,
exception_text,
query_create_time,
query_finish_time
FROM system.distributed_ddl_queue
WHERE status != 'Finished'
ORDER BY query_create_time;
A node that never completes a DDL entry is usually offline or in read-only mode. Restart it or fix its replica state first, then the queue drains.
Step 4: Cross-check with mutations
When the replication queue mentions a mutation, confirm whether the mutation itself is finished:
SELECT database, table, mutation_id, command, is_done, latest_fail_reason
FROM system.mutations
WHERE is_done = 0
ORDER BY create_time;
If a mutation is is_done = 1 but the queue still has tasks referencing it, the queue entry is stale and can usually be cleared by retrying or by detaching and reattaching the affected partition.
Step 5: Look at system.errors and the part log
system.errors aggregates server-wide error counters and is a quick way to spot recurring problems:
SELECT name, value, last_error_time, last_error_message
FROM system.errors
WHERE value > 0
ORDER BY value DESC;
system.part_log captures historical part operations (merges, downloads, mutations) and is invaluable when reconstructing what happened to a specific part:
SELECT event_time, event_type, table, part_name, error, exception
FROM system.part_log
WHERE table = 'mytable' AND error != 0
ORDER BY event_time DESC
LIMIT 100;
Recovering read-only replicas
If system.replicas shows is_readonly = 1, the typical recovery is:
SYSTEM RESTART REPLICA mydatabase.mytable;
For deeper corruption where Keeper metadata is missing or wrong, generate a sequence of statements to drop and restore each affected replica. This bulk-recovery pattern is useful when many tables are read-only:
SELECT
'-- Table ' || toString(row_number() OVER ()) || '\n' ||
'DETACH TABLE `' || database || '`.`' || table || '`;\n' ||
'SYSTEM DROP REPLICA ''' || replica_name ||
''' FROM ZKPATH ''' || zookeeper_path || ''';\n' ||
'ATTACH TABLE `' || database || '`.`' || table || '`;\n' ||
'SYSTEM RESTORE REPLICA `' || database || '`.`' || table || '`;\n'
AS recovery_sql
FROM system.replicas
WHERE is_readonly = 1
FORMAT TSVRaw;
Execute the output sequentially per replica. Running these against multiple replicas in parallel causes race conditions in Keeper.
Common Pitfalls
- Treating
queue_size > 0as automatic failure. Busy clusters often have a non-zero queue that drains quickly. - Restarting replicas without checking
last_exceptionfirst. The cause repeats after restart. - Running
SYSTEM DROP REPLICAon the only healthy copy. Verify another replica has the data. - Ignoring DDL queue when only one node is offline. The whole cluster waits for that node before considering the DDL finished.
- Confusing mutation queue with replication queue. They are tracked separately and need separate diagnosis.
Frequently Asked Questions
Q: How long is too long for a task to sit in the replication queue? A: More than a few hours on an idle table is suspicious. More than 24 hours almost always indicates a real problem.
Q: What is the difference between num_tries and num_postponed?
A: num_tries counts actual execution attempts that failed. num_postponed counts deferred attempts where the server decided to retry later (for example, waiting for dependencies). Both growing without progress is bad.
Q: Can I delete entries from system.replication_queue directly?
A: No. The queue is in Keeper and edits must go through ClickHouse. Use SYSTEM DROP REPLICA and SYSTEM RESTORE REPLICA for surgical fixes.
Q: My DDL hangs with one node offline. Can I force it through?
A: Use distributed_ddl_task_timeout to bound how long the client waits. The task remains queued for the offline node; it completes when that node returns.
Q: How do I monitor replication health proactively?
A: Export system.replicas.absolute_delay, queue_size, and is_readonly to Prometheus. Alert on is_readonly, sustained queue_size, or absolute_delay beyond your SLA.