ClickHouse Error: There are N Unfinished Hosts (0 Currently Active)

Q: How do I confirm exactly which hosts didn't finish?

Compare the children of /clickhouse/task_queue/ddl/query-XXXXX/finished/ against the expected host list from system.clusters for the cluster you targeted.

Q: Is it safe to delete a DDL task znode manually?

Only as a last resort. Stop affected clickhouse-server processes first, snapshot the queue, then use zkCli.sh deleteall /clickhouse/task_queue/ddl/query-NNNN . Restart afterwards.

The error There are N unfinished hosts (0 of them are currently active) is raised by the initiator of an ON CLUSTER query after distributed_ddl_task_timeout elapses without all replicas reporting Finished. The "0 active" part is important: it means none of the lagging nodes are even attempting to execute the task right now. This is almost always a symptom of hostname resolution problems, dead replicas, or a DDL queue that has moved past entries some nodes never processed.

What the Error Actually Means

When you issue ALTER TABLE t ON CLUSTER c ..., the initiator writes a task to /clickhouse/task_queue/ddl/query-NNNNNNNNNN and waits up to distributed_ddl_task_timeout seconds (default 180) for every expected host to register itself under the task's finished/ child znode. If the timeout fires before completion, ClickHouse returns this error and counts how many hosts:

unfinished: never wrote a result into finished/
currently active: wrote into active/ (meaning a DDLWorker is processing) but have not finished

0 active means the DDLWorker on the unfinished nodes did not even pick up the task. The query may still complete later if you just wait, but the initiator has already returned.

Root Causes

Node cannot identify itself as local

A node only executes a task if it recognises one of the cluster entries as itself. Check:

SELECT cluster, shard_num, replica_num, host_name, host_address, is_local
FROM system.clusters
WHERE cluster = 'your_cluster';

is_local must be 1 on exactly one row per node. The classic offender is Debian/Ubuntu mapping the hostname to 127.0.1.1 in /etc/hosts, so the resolved address never matches the network interface. Fix the host entry or set an explicit <interserver_http_host> in config.xml.

Heavy operations blocking the DDLWorker

DDLWorker runs DDL tasks serially per node. A long ALTER or OPTIMIZE blocks everything behind it. Look for in-flight work:

SHOW PROCESSLIST;
SELECT * FROM system.merges;
SELECT * FROM system.mutations WHERE is_done = 0;

Stuck task ahead of yours in the queue

DDLWorker processes tasks in order. A poisoned earlier task stalls every node behind it. Inspect the queue directly in ZooKeeper:

SELECT path, name, ctime, mtime
FROM system.zookeeper
WHERE path = '/clickhouse/task_queue/ddl/'
ORDER BY ctime
SETTINGS allow_unrestricted_reads_from_keeper = 1;

For a specific task, count completed hosts:

SELECT name, numChildren AS finished_nodes
FROM system.zookeeper
WHERE path = '/clickhouse/task_queue/ddl/query-0000001000/'
  AND name = 'finished'
SETTINGS allow_unrestricted_reads_from_keeper = 1;

Obsolete or dead replicas

Replicas that were decommissioned without SYSTEM DROP REPLICA still appear in the cluster's expected host list and will never finish DDL tasks:

SELECT database, table, zookeeper_path, total_replicas, active_replicas
FROM system.replicas
WHERE total_replicas != active_replicas;

SYSTEM DROP REPLICA 'dead_replica_name' FROM ZKPATH '/clickhouse/tables/01/my_table';

DDL queue path changed without a restart

Editing <distributed_ddl><path> in config.xml and reloading the config silently keeps DDLWorker watching the old path. New tasks land in the new path, get ignored, and time out. Always restart clickhouse-server when changing this setting.

Task purged by `task_max_lifetime`

If a node was offline longer than task_max_lifetime (default 7 days), tasks it should have run no longer exist in ZooKeeper. It cannot catch up on its own. Restore the node from a backup or recreate the affected tables manually.

Diagnostic Queries

Track DDL progress across the whole cluster in one query:

SELECT FQDN(), metric, value
FROM clusterAllReplicas('your_cluster', system.metrics)
WHERE metric LIKE '%MaxDDLEntryID%';

Find log entries for a specific task:

grep -C 40 "query-0000001000" /var/log/clickhouse-server/clickhouse-server*.log

Check the active list (hosts currently executing the task):

SELECT name, ctime, mtime
FROM system.zookeeper
WHERE path = '/clickhouse/task_queue/ddl/query-0000001000/active/'
SETTINGS allow_unrestricted_reads_from_keeper = 1;

Recovery Steps

Identify which hosts are unfinished by comparing the cluster's expected hosts against the finished/ children of the task znode.
On each unfinished host, confirm is_local = 1 for its row in system.clusters and that DDLWorker is running (ps -eL | grep DDL).
Drop dead replicas with SYSTEM DROP REPLICA.
If the DDL queue is blocked by a single bad task, you can remove that znode from ZooKeeper with zkCli.sh after stopping affected nodes, then restart them. This is destructive, so back up the queue first with CREATE TABLE ... AS SELECT * FROM system.distributed_ddl_queue.
Re-issue the failed query on individual nodes without ON CLUSTER if you need it to land immediately:
```
ALTER TABLE my_table ADD COLUMN c String;
```
As a last resort for severely lagging replicas, restore from backup rather than waiting for catch-up.

Increasing the Timeout

If your cluster legitimately needs more than 180 seconds for DDL to converge (large ALTERs, many shards), raise the timeout per query:

ALTER TABLE my_table ON CLUSTER c ADD COLUMN x Int32
SETTINGS distributed_ddl_task_timeout = 600;

Or globally in users.xml. Increasing the timeout only hides slow execution; it does not fix nodes that never pick up the task.

Common Pitfalls

Treating the error as fatal. The DDL often completes on every node within a few minutes after the timeout fires. Re-check system.distributed_ddl_queue before re-running.
Re-running ON CLUSTER immediately, which queues the same change a second time and can fail with Table already exists once the original lands.
Setting distributed_ddl_task_timeout to a huge value to mask hostname mismatches. The query still hangs; you just notice later.
Forgetting that ClickHouse Keeper uses the same task paths. The diagnostic queries above work identically on Keeper-backed clusters.

Frequently Asked Questions

Q: Will the DDL eventually succeed if I ignore the error? A: Often yes, provided the unfinished hosts come back online and the task has not been purged. Check system.distributed_ddl_queue for that entry's status before assuming it failed.

Q: How do I confirm exactly which hosts didn't finish? A: Compare the children of /clickhouse/task_queue/ddl/query-XXXXX/finished/ against the expected host list from system.clusters for the cluster you targeted.

Q: Is it safe to delete a DDL task znode manually? A: Only as a last resort. Stop affected clickhouse-server processes first, snapshot the queue, then use zkCli.sh deleteall /clickhouse/task_queue/ddl/query-NNNN. Restart afterwards.

Q: Does this error happen with ClickHouse Keeper too? A: Yes. Keeper is API-compatible with ZooKeeper, so DDLWorker behaves identically and the same failure modes apply.

Q: How do I prevent this in the future? A: Validate hostname resolution on every node, remove dead replicas promptly with SYSTEM DROP REPLICA, avoid flooding the DDL queue from ETL jobs, and monitor system.distributed_ddl_queue for tasks that stay in non-Finished states.