NEW

Pulse 2025 Product Roundup: From Monitoring to AI-Native Control Plane

Diagnose ClickHouse Replication and DDL Queue Problems

When ClickHouse replication misbehaves, the symptoms are usually visible long before queries fail outright: queue depth grows, replicas drift, or DDL statements hang on ON CLUSTER. The diagnostic path runs through a small set of system tables, system.replicas, system.replication_queue, system.distributed_ddl_queue, system.mutations, and system.errors. Walking through them in order tells you whether the cluster is busy, broken, or just slow.

Step 1: Inspect system.replicas

This is the first stop. system.replicas tells you whether each replica has a healthy Keeper session, whether it is read-only, and how far behind it is.

SELECT
    database,
    table,
    is_readonly,
    is_session_expired,
    future_parts,
    parts_to_check,
    queue_size,
    inserts_in_queue,
    merges_in_queue,
    absolute_delay,
    last_queue_update_exception,
    zookeeper_exception
FROM system.replicas
WHERE is_readonly OR queue_size > 0 OR last_queue_update_exception != '';

Things to watch for:

  • is_readonly = 1: the replica lost contact with Keeper or could not load metadata. Replication on this table is paused.
  • is_session_expired = 1: the Keeper session dropped. The server should reconnect automatically; if it does not, restart it.
  • zookeeper_exception: shows the last Keeper-side error, useful for connectivity problems.
  • absolute_delay: seconds behind the leader. Persistent growth means writes outrun replication.

Step 2: Drill into system.replication_queue

Once you know which tables are unhealthy, look at their queue entries.

SELECT
    database,
    table,
    type,
    max(last_exception)    AS last_exception,
    max(postpone_reason)    AS postpone_reason,
    min(create_time)        AS oldest,
    max(last_attempt_time)  AS last_attempt,
    max(num_postponed)      AS max_postponed,
    max(num_tries)          AS max_tries,
    countIf(last_exception != '')      AS count_err,
    countIf(num_postponed > 0)         AS count_postponed,
    countIf(is_currently_executing)    AS count_executing,
    count()                            AS count_all
FROM system.replication_queue
GROUP BY database, table, type
ORDER BY count_all DESC;

Signals of trouble:

  • Entries older than 24 hours. Healthy queues churn in seconds to minutes.
  • num_tries above a few hundred, or num_postponed climbing without progress.
  • Repeating last_exception strings. The text usually names the missing part or unreachable peer.

Typical exception patterns and what they mean:

Exception fragment Likely cause
No active replica has part All replicas were missing a part; quorum cannot heal automatically
Cannot resolve host DNS or hostname issue, see the host resolution article
Code: 999. Coordination::Exception Keeper connectivity or znode missing
Part X already exists Race after partial restore or manual file move

Step 3: Check distributed DDL

DDL issued with ON CLUSTER runs through Keeper. Stuck DDL parks tasks in:

SELECT
    entry,
    cluster,
    query,
    initiator,
    status,
    exception_code,
    exception_text,
    query_create_time,
    query_finish_time
FROM system.distributed_ddl_queue
WHERE status != 'Finished'
ORDER BY query_create_time;

A node that never completes a DDL entry is usually offline or in read-only mode. Restart it or fix its replica state first, then the queue drains.

Step 4: Cross-check with mutations

When the replication queue mentions a mutation, confirm whether the mutation itself is finished:

SELECT database, table, mutation_id, command, is_done, latest_fail_reason
FROM system.mutations
WHERE is_done = 0
ORDER BY create_time;

If a mutation is is_done = 1 but the queue still has tasks referencing it, the queue entry is stale and can usually be cleared by retrying or by detaching and reattaching the affected partition.

Step 5: Look at system.errors and the part log

system.errors aggregates server-wide error counters and is a quick way to spot recurring problems:

SELECT name, value, last_error_time, last_error_message
FROM system.errors
WHERE value > 0
ORDER BY value DESC;

system.part_log captures historical part operations (merges, downloads, mutations) and is invaluable when reconstructing what happened to a specific part:

SELECT event_time, event_type, table, part_name, error, exception
FROM system.part_log
WHERE table = 'mytable' AND error != 0
ORDER BY event_time DESC
LIMIT 100;

Recovering read-only replicas

If system.replicas shows is_readonly = 1, the typical recovery is:

SYSTEM RESTART REPLICA mydatabase.mytable;

For deeper corruption where Keeper metadata is missing or wrong, generate a sequence of statements to drop and restore each affected replica. This bulk-recovery pattern is useful when many tables are read-only:

SELECT
    '-- Table ' || toString(row_number() OVER ()) || '\n' ||
    'DETACH TABLE `' || database || '`.`' || table || '`;\n' ||
    'SYSTEM DROP REPLICA ''' || replica_name ||
    ''' FROM ZKPATH ''' || zookeeper_path || ''';\n' ||
    'ATTACH TABLE `' || database || '`.`' || table || '`;\n' ||
    'SYSTEM RESTORE REPLICA `' || database || '`.`' || table || '`;\n'
    AS recovery_sql
FROM system.replicas
WHERE is_readonly = 1
FORMAT TSVRaw;

Execute the output sequentially per replica. Running these against multiple replicas in parallel causes race conditions in Keeper.

Common Pitfalls

  • Treating queue_size > 0 as automatic failure. Busy clusters often have a non-zero queue that drains quickly.
  • Restarting replicas without checking last_exception first. The cause repeats after restart.
  • Running SYSTEM DROP REPLICA on the only healthy copy. Verify another replica has the data.
  • Ignoring DDL queue when only one node is offline. The whole cluster waits for that node before considering the DDL finished.
  • Confusing mutation queue with replication queue. They are tracked separately and need separate diagnosis.

Frequently Asked Questions

Q: How long is too long for a task to sit in the replication queue? A: More than a few hours on an idle table is suspicious. More than 24 hours almost always indicates a real problem.

Q: What is the difference between num_tries and num_postponed? A: num_tries counts actual execution attempts that failed. num_postponed counts deferred attempts where the server decided to retry later (for example, waiting for dependencies). Both growing without progress is bad.

Q: Can I delete entries from system.replication_queue directly? A: No. The queue is in Keeper and edits must go through ClickHouse. Use SYSTEM DROP REPLICA and SYSTEM RESTORE REPLICA for surgical fixes.

Q: My DDL hangs with one node offline. Can I force it through? A: Use distributed_ddl_task_timeout to bound how long the client waits. The task remains queued for the offline node; it completes when that node returns.

Q: How do I monitor replication health proactively? A: Export system.replicas.absolute_delay, queue_size, and is_readonly to Prometheus. Alert on is_readonly, sustained queue_size, or absolute_delay beyond your SLA.

Subscribe to the Pulse Newsletter

Get early access to new Pulse features, insightful blogs & exclusive events , webinars, and workshops.

We use cookies to provide an optimized user experience and understand our traffic. To learn more, read our use of cookies; otherwise, please choose 'Accept Cookies' to continue using our website.