The error There are N unfinished hosts (0 of them are currently active) is raised by the initiator of an ON CLUSTER query after distributed_ddl_task_timeout elapses without all replicas reporting Finished. The "0 active" part is important: it means none of the lagging nodes are even attempting to execute the task right now. This is almost always a symptom of hostname resolution problems, dead replicas, or a DDL queue that has moved past entries some nodes never processed.
What the Error Actually Means
When you issue ALTER TABLE t ON CLUSTER c ..., the initiator writes a task to /clickhouse/task_queue/ddl/query-NNNNNNNNNN and waits up to distributed_ddl_task_timeout seconds (default 180) for every expected host to register itself under the task's finished/ child znode. If the timeout fires before completion, ClickHouse returns this error and counts how many hosts:
unfinished: never wrote a result intofinished/currently active: wrote intoactive/(meaning a DDLWorker is processing) but have not finished
0 active means the DDLWorker on the unfinished nodes did not even pick up the task. The query may still complete later if you just wait, but the initiator has already returned.
Root Causes
Node cannot identify itself as local
A node only executes a task if it recognises one of the cluster entries as itself. Check:
SELECT cluster, shard_num, replica_num, host_name, host_address, is_local
FROM system.clusters
WHERE cluster = 'your_cluster';
is_local must be 1 on exactly one row per node. The classic offender is Debian/Ubuntu mapping the hostname to 127.0.1.1 in /etc/hosts, so the resolved address never matches the network interface. Fix the host entry or set an explicit <interserver_http_host> in config.xml.
Heavy operations blocking the DDLWorker
DDLWorker runs DDL tasks serially per node. A long ALTER or OPTIMIZE blocks everything behind it. Look for in-flight work:
SHOW PROCESSLIST;
SELECT * FROM system.merges;
SELECT * FROM system.mutations WHERE is_done = 0;
Stuck task ahead of yours in the queue
DDLWorker processes tasks in order. A poisoned earlier task stalls every node behind it. Inspect the queue directly in ZooKeeper:
SELECT path, name, ctime, mtime
FROM system.zookeeper
WHERE path = '/clickhouse/task_queue/ddl/'
ORDER BY ctime
SETTINGS allow_unrestricted_reads_from_keeper = 1;
For a specific task, count completed hosts:
SELECT name, numChildren AS finished_nodes
FROM system.zookeeper
WHERE path = '/clickhouse/task_queue/ddl/query-0000001000/'
AND name = 'finished'
SETTINGS allow_unrestricted_reads_from_keeper = 1;
Obsolete or dead replicas
Replicas that were decommissioned without SYSTEM DROP REPLICA still appear in the cluster's expected host list and will never finish DDL tasks:
SELECT database, table, zookeeper_path, total_replicas, active_replicas
FROM system.replicas
WHERE total_replicas != active_replicas;
SYSTEM DROP REPLICA 'dead_replica_name' FROM ZKPATH '/clickhouse/tables/01/my_table';
DDL queue path changed without a restart
Editing <distributed_ddl><path> in config.xml and reloading the config silently keeps DDLWorker watching the old path. New tasks land in the new path, get ignored, and time out. Always restart clickhouse-server when changing this setting.
Task purged by task_max_lifetime
If a node was offline longer than task_max_lifetime (default 7 days), tasks it should have run no longer exist in ZooKeeper. It cannot catch up on its own. Restore the node from a backup or recreate the affected tables manually.
Diagnostic Queries
Track DDL progress across the whole cluster in one query:
SELECT FQDN(), metric, value
FROM clusterAllReplicas('your_cluster', system.metrics)
WHERE metric LIKE '%MaxDDLEntryID%';
Find log entries for a specific task:
grep -C 40 "query-0000001000" /var/log/clickhouse-server/clickhouse-server*.log
Check the active list (hosts currently executing the task):
SELECT name, ctime, mtime
FROM system.zookeeper
WHERE path = '/clickhouse/task_queue/ddl/query-0000001000/active/'
SETTINGS allow_unrestricted_reads_from_keeper = 1;
Recovery Steps
Identify which hosts are unfinished by comparing the cluster's expected hosts against the
finished/children of the task znode.On each unfinished host, confirm
is_local = 1for its row insystem.clustersand thatDDLWorkeris running (ps -eL | grep DDL).Drop dead replicas with
SYSTEM DROP REPLICA.If the DDL queue is blocked by a single bad task, you can remove that znode from ZooKeeper with
zkCli.shafter stopping affected nodes, then restart them. This is destructive, so back up the queue first withCREATE TABLE ... AS SELECT * FROM system.distributed_ddl_queue.Re-issue the failed query on individual nodes without
ON CLUSTERif you need it to land immediately:ALTER TABLE my_table ADD COLUMN c String;As a last resort for severely lagging replicas, restore from backup rather than waiting for catch-up.
Increasing the Timeout
If your cluster legitimately needs more than 180 seconds for DDL to converge (large ALTERs, many shards), raise the timeout per query:
ALTER TABLE my_table ON CLUSTER c ADD COLUMN x Int32
SETTINGS distributed_ddl_task_timeout = 600;
Or globally in users.xml. Increasing the timeout only hides slow execution; it does not fix nodes that never pick up the task.
Common Pitfalls
- Treating the error as fatal. The DDL often completes on every node within a few minutes after the timeout fires. Re-check
system.distributed_ddl_queuebefore re-running. - Re-running
ON CLUSTERimmediately, which queues the same change a second time and can fail withTable already existsonce the original lands. - Setting
distributed_ddl_task_timeoutto a huge value to mask hostname mismatches. The query still hangs; you just notice later. - Forgetting that ClickHouse Keeper uses the same task paths. The diagnostic queries above work identically on Keeper-backed clusters.
Frequently Asked Questions
Q: Will the DDL eventually succeed if I ignore the error?
A: Often yes, provided the unfinished hosts come back online and the task has not been purged. Check system.distributed_ddl_queue for that entry's status before assuming it failed.
Q: How do I confirm exactly which hosts didn't finish?
A: Compare the children of /clickhouse/task_queue/ddl/query-XXXXX/finished/ against the expected host list from system.clusters for the cluster you targeted.
Q: Is it safe to delete a DDL task znode manually?
A: Only as a last resort. Stop affected clickhouse-server processes first, snapshot the queue, then use zkCli.sh deleteall /clickhouse/task_queue/ddl/query-NNNN. Restart afterwards.
Q: Does this error happen with ClickHouse Keeper too? A: Yes. Keeper is API-compatible with ZooKeeper, so DDLWorker behaves identically and the same failure modes apply.
Q: How do I prevent this in the future?
A: Validate hostname resolution on every node, remove dead replicas promptly with SYSTEM DROP REPLICA, avoid flooding the DDL queue from ETL jobs, and monitor system.distributed_ddl_queue for tasks that stay in non-Finished states.