The "DB::Exception: Timeout exceeded" error in ClickHouse is a generic timeout indicator that can arise in several contexts: query execution, connection establishment, lock acquisition, or inter-node communication. The TIMEOUT_EXCEEDED error code signals that an operation did not complete within its allotted time, regardless of the specific timeout type.
Impact
The affected operation is terminated, and no result is returned. Depending on the context, this can manifest as failed queries, dropped client connections, stalled replication, or blocked DDL operations. In distributed setups, a timeout on one node can cascade into failures across the entire query if not handled properly.
Common Causes
- Query execution exceeding
max_execution_timeor related time limits - Connection timeouts when ClickHouse cannot reach a remote shard or ZooKeeper within the configured window
- Lock wait timeouts when DDL operations or mutations wait for exclusive access to a table
- Distributed query timeouts caused by slow or unresponsive shards
- ZooKeeper session timeouts in replicated table environments
- Network issues causing inter-node communication delays
- Heavy server load causing operations to stall beyond their time limits
Troubleshooting and Resolution Steps
Identify the timeout type from the full error message. The context varies significantly:
-- Check timeout-related settings SELECT name, value FROM system.settings WHERE name LIKE '%timeout%';For query execution timeouts, increase
max_execution_time:SET max_execution_time = 300; SELECT ...;For connection timeouts, check network connectivity and adjust settings:
-- Key connection timeout settings SELECT name, value FROM system.settings WHERE name IN ('connect_timeout', 'receive_timeout', 'send_timeout', 'connect_timeout_with_failover_ms');Increase if network latency is high:
SET connect_timeout = 30; SET receive_timeout = 300; SET send_timeout = 300;For distributed query timeouts, check shard health:
-- Check cluster status SELECT * FROM system.clusters; -- Check for slow replicas SELECT * FROM system.replicas WHERE is_session_expired = 1 OR future_parts > 10;For lock wait timeouts, identify blocking operations:
-- Check active processes that might hold locks SELECT query_id, query, elapsed FROM system.processes ORDER BY elapsed DESC; -- Check pending mutations SELECT * FROM system.mutations WHERE is_done = 0;For ZooKeeper timeouts, verify ZooKeeper cluster health:
echo ruok | nc zookeeper-host 2181 echo stat | nc zookeeper-host 2181Monitor network health between ClickHouse nodes and between ClickHouse and ZooKeeper:
ping -c 5 remote-clickhouse-node traceroute remote-clickhouse-node
Best Practices
- Configure timeout values appropriate for your network topology and workload characteristics.
- Use different timeout profiles for interactive queries (short) and batch operations (long).
- Monitor ZooKeeper latency and session health proactively, as ZooKeeper issues often manifest as timeouts.
- Implement retry logic in client applications for transient timeout errors.
- Use
max_execution_time_leafto set per-shard timeouts in distributed queries, preventing a single slow shard from blocking the entire query. - Keep ClickHouse nodes and ZooKeeper in the same network zone to minimize latency.
Frequently Asked Questions
Q: How do I distinguish between different types of timeouts?
A: The full error message and stack trace provide context. Query timeouts typically mention max_execution_time, connection timeouts reference specific host addresses, and lock timeouts mention the table or operation being waited on. Checking system.query_log also helps identify the timeout source.
Q: Can a timeout on one shard cause the entire distributed query to fail?
A: Yes. By default, a distributed query fails if any shard times out. You can use skip_unavailable_shards = 1 to continue with available shards, though results will be incomplete.
Q: Should I set very high timeouts to avoid this error?
A: Not as a general practice. Timeouts exist to prevent runaway operations from consuming resources indefinitely. Instead, investigate why operations are slow and address the root cause, while setting timeouts to reasonable values for your environment.
Q: Do timeouts apply to INSERT operations?
A: Yes. Insert operations can time out due to send_timeout, connection issues, or lock waits if another operation holds an exclusive lock on the target table. Long-running inserts into replicated tables may also time out waiting for ZooKeeper acknowledgment.