ClickHouse OPTIMIZE FINAL Timeout: Causes and Fixes

Q: If the client times out, do I need to re-run OPTIMIZE FINAL?

Almost never. The merge keeps running on the server. Query system.merges to confirm, then wait or come back later.

Q: Which timeout matters for `OPTIMIZE TABLE ... ON CLUSTER FINAL`?

distributed_ddl_task_timeout bounds how long the client waits for every shard to finish. Defaults to 180 seconds; raise it for any non-trivial ON CLUSTER operation.

Q: How do I tell whether the merge is still running after a timeout?

SELECT * FROM system.merges WHERE table = ' ' . If a row exists with non-zero progress and elapsed , the merge is alive.

Q: Does the same advice apply to `INSERT ... SELECT` timeouts?

Yes. INSERT ... SELECT , CREATE TABLE ... AS SELECT , and CREATE MATERIALIZED VIEW ... POPULATE all keep running after client timeout. Same timeouts apply, same monitoring approach.

OPTIMIZE TABLE ... FINAL rewrites every part of a table (or partition) into a single merged part. On large tables this can take minutes or hours, and almost all clients time out long before the work finishes. The errors look alarming (Timeout exceeded, Cancelling query, executing longer than distributed_ddl_task_timeout) but they describe the client connection, not the merge itself. This guide explains which timeout you are hitting, how to raise it, and how to monitor the operation that is still running on the server.

The Key Behavior: Timeouts Cancel the Client, Not the Merge

When a client times out waiting for OPTIMIZE FINAL, the server keeps merging. The connection drops, the client raises an exception, but the merge thread on the server continues until completion or failure. Cancelling the query in the client (Ctrl+C) does not cancel the underlying merge either.

This means two things:

Re-running OPTIMIZE FINAL after a timeout often hits Cannot start mutation/merge ... because the previous one is still in flight.
The right response is usually to look at system.merges and wait, not to retry.

The same behavior applies to other long-running write operations: INSERT ... SELECT, CREATE TABLE ... AS SELECT, and CREATE MATERIALIZED VIEW ... POPULATE.

Typical Error Messages

Timeout exceeded while receiving data from server

Code: 159. DB::Exception: Watching task ... is executing longer
than distributed_ddl_task_timeout (=180) seconds.

Cancelling query. ... Query was cancelled.

The third message is misleading. The query was cancelled in the client; the merge on the server is still running.

Timeouts That Apply

There are two layers of timeout to consider.

Client connection timeouts

For native TCP clients (clickhouse-client, JDBC, Go driver, Python driver):

send_timeout: how long the server waits to send a chunk to the client.
receive_timeout: how long the server waits to receive from the client.
tcp_keep_alive_timeout: keep-alive interval for the TCP connection.
Driver-specific socket timeouts (each library has its own).

For HTTP clients (curl, REST APIs, most BI tools):

http_send_timeout
http_receive_timeout
tcp_keep_alive_timeout
HTTP client read timeout on the caller side.

Defaults are typically 30 to 300 seconds, far less than a real OPTIMIZE FINAL.

Raise them per session:

SET send_timeout = 7200;
SET receive_timeout = 7200;
OPTIMIZE TABLE events FINAL;

Or per clickhouse-client invocation:

clickhouse-client \
  --send_timeout=7200 \
  --receive_timeout=7200 \
  -q 'OPTIMIZE TABLE events FINAL'

For HTTP:

curl --max-time 7200 \
  'http://localhost:8123/?http_receive_timeout=7200&http_send_timeout=7200' \
  --data-binary 'OPTIMIZE TABLE events FINAL'

Distributed DDL timeout

When OPTIMIZE is issued with ON CLUSTER, ClickHouse waits for every shard to acknowledge completion. This wait is bounded by distributed_ddl_task_timeout (default 180 seconds).

SET distributed_ddl_task_timeout = 7200;
OPTIMIZE TABLE events ON CLUSTER my_cluster FINAL;

Or set it in config.xml to apply globally. Note: the timeout only affects the client wait. The DDL task remains in ZooKeeper/Keeper and shards continue executing. Use system.distributed_ddl_queue to watch progress.

Monitoring an In-Progress OPTIMIZE FINAL

After the client times out, verify the merge is still running:

SELECT
    database,
    table,
    elapsed,
    progress,
    num_parts,
    formatReadableSize(total_size_bytes_compressed) AS size,
    formatReadableSize(memory_usage) AS mem
FROM system.merges
WHERE table = 'events';

progress is between 0 and 1. elapsed is in seconds. If the row is there, the merge is alive.

Check the processes view too:

SHOW PROCESSLIST;

Or with detail:

SELECT query_id, user, elapsed, query
FROM system.processes
WHERE query ILIKE '%OPTIMIZE%';

After completion, find the result in system.query_log:

SELECT
    event_time,
    query_duration_ms,
    type,
    exception
FROM system.query_log
WHERE query ILIKE '%OPTIMIZE%events%FINAL%'
ORDER BY event_time DESC
LIMIT 10;

On clusters, watch the DDL queue:

SELECT host, status, exception_text
FROM system.distributed_ddl_queue
ORDER BY entry DESC
LIMIT 20;

Running OPTIMIZE FINAL Without Blocking on the Client

To return immediately and let the merge run in the background, use mutations_sync = 0 style behavior with the alter_sync setting, or simply let the client timeout deliberately. The cleanest pattern is to issue from a clickhouse-client session inside a screen/tmux or via nohup and check system.merges.

To force synchronous execution (and raise the timeouts to match), use:

SET mutations_sync = 2;     -- wait for all replicas
SET alter_sync = 2;
SET receive_timeout = 14400;
OPTIMIZE TABLE events FINAL;

Common Pitfalls

Retrying after a client timeout. The first merge is still running. The retry queues another (or fails immediately). Look at system.merges first.
Tuning only receive_timeout when on HTTP. HTTP clients use http_receive_timeout. Native TCP clients use receive_timeout. Wrong knob, no effect.
Forgetting distributed_ddl_task_timeout on ON CLUSTER. The local timeout is irrelevant; the cluster-wide wait is the binding one.
Assuming Cancelling query. Query was cancelled killed the merge. It did not. The server keeps merging.
Running OPTIMIZE FINAL on Replicated tables without coordinating. Multiple replicas can try to merge the same partition; one wins, the others wait. Use execute_merges_on_single_replica_time_threshold if this matters.

Frequently Asked Questions

Q: Why does OPTIMIZE FINAL time out even on small tables? A: Usually because the default client timeout is short (often 30 seconds) and the table has more parts or replication queue lag than expected. Raise receive_timeout and check system.merges to see what is actually happening.

Q: If the client times out, do I need to re-run OPTIMIZE FINAL? A: Almost never. The merge keeps running on the server. Query system.merges to confirm, then wait or come back later.

Q: Which timeout matters for OPTIMIZE TABLE ... ON CLUSTER FINAL? A: distributed_ddl_task_timeout bounds how long the client waits for every shard to finish. Defaults to 180 seconds; raise it for any non-trivial ON CLUSTER operation.

Q: How do I tell whether the merge is still running after a timeout? A: SELECT * FROM system.merges WHERE table = '<your_table>'. If a row exists with non-zero progress and elapsed, the merge is alive.

Q: Can I cancel a long OPTIMIZE FINAL? A: Cancelling the client disconnects but does not stop the merge. To actually stop background merges on a replica, use SYSTEM STOP MERGES <table>. Note this affects all merges, not just the one you started.

Q: Does the same advice apply to INSERT ... SELECT timeouts? A: Yes. INSERT ... SELECT, CREATE TABLE ... AS SELECT, and CREATE MATERIALIZED VIEW ... POPULATE all keep running after client timeout. Same timeouts apply, same monitoring approach.