ClickHouse DB::Exception: Read timeout / Timeout exceeded (Code 209)

Code: 209. DB::NetException: Timeout exceeded while reading from socket (... 300000 ms) or Timeout exceeded while receiving data from client. This ClickHouse error fires when a TCP read on the native or interserver connection blocks longer than the configured receive_timeout (default 300 seconds). It is distinct from max_execution_time, which kills the query itself with a different error - code 209 is purely a network-level wait. The query may still complete on the server even after the client times out.

What This Error Means

ClickHouse imposes timeouts at three different layers: socket-level (receive_timeout, send_timeout, default 300s each), per-query execution (max_execution_time, default 0 = unlimited), and HTTP-level (http_receive_timeout, http_send_timeout; 30s by default in recent releases and longer in older ones, so check the value for your version). Code 209 (SOCKET_TIMEOUT) maps to the socket layer - either the native TCP read or the inter-server hop between replicas/shards exceeded its wait window.

The most common situations are: a long-running query whose result block did not arrive within receive_timeout (the server is still working but the client gave up); a distributed sub-query where one shard is slow and the coordinator's read from that shard times out; or an INSERT from a slow producer whose silence on the socket exceeds the server-side receive_timeout. In all three the underlying problem is "data did not flow within the deadline" - not "the query failed."
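To make "data did not flow within the deadline" concrete, here is a minimal Python sketch that uses only the standard library. It is not ClickHouse code: a local socket pair stands in for the client/server connection, and the function name and the short deadline are illustrative assumptions. The peer stays open the whole time; the read fails only because no bytes arrived in time.

```python
import socket


def read_with_deadline(timeout_s: float, peer_sends: bool = False) -> str:
    """Read once from a local peer, giving up after timeout_s seconds.

    Illustrative stand-in for a client waiting on a result block:
    the peer is alive, it may simply not have produced data yet.
    """
    client, server = socket.socketpair()
    try:
        client.settimeout(timeout_s)  # plays the role of receive_timeout
        if peer_sends:
            server.sendall(b"block")
        try:
            client.recv(1024)
        except socket.timeout:
            # No bytes arrived within the deadline; the peer is still open.
            return "timed out"
        return "got data"
    finally:
        client.close()
        server.close()


if __name__ == "__main__":
    print(read_with_deadline(0.2))                   # silent peer
    print(read_with_deadline(0.2, peer_sends=True))  # peer sends a block
```

The silent-peer call reports a timeout even though nothing on the other side failed, which is the same situation a client is in when the server is still computing the first result block.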

Common Causes

  1. A slow SELECT whose result blocks take longer to compute than receive_timeout. Confirm with system.query_log after the timeout - the query may still have a QueryStart row without a matching QueryFinish row.
  2. A distributed query where one shard is overloaded. Confirm with system.processes on each shard - one will show an elapsed time matching the timeout.
  3. Network packet loss or path MTU issues between client and server, dropping reads silently. Confirm with tcpdump, mtr, or by checking the connection-error and retransmit metrics exposed by your proxy or load balancer.
  4. An HTTP client hitting the 30-second http_receive_timeout on / rather than the 300-second native TCP receive_timeout. Confirm by checking which port the client connected to (8123 vs 9000) and the message wording.
  5. A streaming INSERT from a producer that pauses longer than receive_timeout between blocks. Confirm by tracing the producer's flush cadence.
  6. Inter-replica replication fetch timing out while pulling parts. Confirm with last_exception in system.replication_queue.
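
Cause 5 can be checked offline once you have the producer's flush timestamps. The sketch below is a hypothetical helper, not part of any ClickHouse client: the function names and the sample timestamps are assumptions for illustration. It reports the longest silent gap between flushes and whether that gap exceeds receive_timeout.

```python
from typing import Sequence

RECEIVE_TIMEOUT_S = 300  # server-side receive_timeout default cited above


def max_silence(flush_times_s: Sequence[float]) -> float:
    """Return the longest gap, in seconds, between consecutive flushes."""
    ordered = sorted(flush_times_s)
    gaps = [later - earlier for earlier, later in zip(ordered, ordered[1:])]
    return max(gaps, default=0.0)


def exceeds_receive_timeout(
    flush_times_s: Sequence[float],
    receive_timeout_s: float = RECEIVE_TIMEOUT_S,
) -> bool:
    """True when the producer stayed silent longer than the timeout."""
    return max_silence(flush_times_s) > receive_timeout_s


if __name__ == "__main__":
    # Hypothetical flush times, in seconds since the INSERT started.
    flushes = [0, 20, 45, 400, 410]
    print(max_silence(flushes))
    print(exceeds_receive_timeout(flushes))
```

If the longest gap is close to or above the timeout, either shorten the producer's flush interval or raise receive_timeout for that ingest path.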

How to Fix Read Timeout

  1. Identify the timeout layer. Look at the wording of the message: "from socket" points to the native TCP layer; "while receiving data from client" points to the server-side receive_timeout; an HTTP error page with status 504 points to an HTTP-layer timeout.

  2. Check whether the query is still running:

    SELECT query_id, elapsed, user, query
    FROM system.processes
    WHERE elapsed > 30 ORDER BY elapsed DESC;
    
  3. Raise the client timeout if the query genuinely needs longer than 300 seconds. For clickhouse-client:

    clickhouse-client --receive_timeout 1800 --send_timeout 1800 --query="..."
    

    For JDBC: socket_timeout=1800000 (ms). For clickhouse-driver: connect_timeout=10, send_receive_timeout=1800.

  4. Cap the query with max_execution_time so it fails cleanly server-side rather than orphaning a long query when the client gives up:

    SET max_execution_time = 600;  -- 10 minutes
    
  5. Optimize the slow query. Use EXPLAIN PIPELINE to find blocking operators; add a partition filter, reduce join size, or push aggregation into AggregatingMergeTree. The read_rows and read_bytes columns in system.query_log show whether the query scanned more than necessary.

  6. Investigate distributed-query slowness. Use system.clusters to confirm shard health and run the same query against each shard's local table to find the laggard.

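  7. For HTTP clients: switch to the native TCP protocol (port 9000) or increase the HTTP timeouts. These are query-level settings, so set them in the relevant settings profile in users.xml (or pass them per request) rather than in the server section of config.xml: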

    <http_receive_timeout>1800</http_receive_timeout>
    <http_send_timeout>1800</http_send_timeout>
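
Step 1 above (identifying the timeout layer from the message wording) can be sketched as a small classifier. This is a hypothetical helper: the function name and return labels are assumptions, and the exact message text can vary between ClickHouse versions and client libraries, so treat the substring checks as a starting point.

```python
from typing import Optional


def classify_timeout(message: str, http_status: Optional[int] = None) -> str:
    """Map an error message (and optional HTTP status) to a timeout layer."""
    text = message.lower()
    if http_status == 504:
        # Gateway timeout reported by the HTTP path
        return "http"
    if "while receiving data from client" in text:
        # The server gave up waiting for the client (e.g. a slow INSERT)
        return "server-side receive_timeout"
    if "from socket" in text:
        # A read on the native TCP or inter-server connection timed out
        return "native tcp"
    return "unknown"


if __name__ == "__main__":
    print(classify_timeout("Code: 209. DB::NetException: Timeout exceeded while reading from socket"))
    print(classify_timeout("Timeout exceeded while receiving data from client"))
    print(classify_timeout("Gateway Timeout", http_status=504))
```

Logging the returned label next to the query_id makes it easier to decide which of the later steps applies before changing any timeout.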
    

Root-Cause Analysis

To find which queries are timing out and why, correlate client errors with the server's query log:

-- Queries that started but never finished in the last day - likely client timeouts
-- (queries that are still running right now also match)
SELECT query_id, user, event_time AS started, query
FROM system.query_log
WHERE event_date >= today() - 1 AND type = 'QueryStart'
  AND query_id NOT IN (
      -- Uncorrelated subquery: correlated EXISTS is not supported on many ClickHouse versions
      SELECT query_id FROM system.query_log
      WHERE event_date >= today() - 1
        AND type IN ('QueryFinish', 'ExceptionWhileProcessing')
  )
ORDER BY started DESC LIMIT 50;

-- Slowest finished queries (potential next-timeout candidates)
SELECT query_duration_ms, read_rows, memory_usage, query
FROM system.query_log
WHERE event_date = today() AND type = 'QueryFinish'
ORDER BY query_duration_ms DESC LIMIT 20;
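
If the query log has already been exported (for example into an application's own monitoring job), the same "started but never finished" check can be done client-side. The following is a minimal sketch under that assumption: the function name is hypothetical and the sample rows are made up, using only the query_id and type columns referenced above.

```python
from typing import Iterable, Mapping, Set

# Row types that mark the end of a query in system.query_log
TERMINAL_TYPES = {"QueryFinish", "ExceptionWhileProcessing"}


def orphaned_query_ids(rows: Iterable[Mapping[str, str]]) -> Set[str]:
    """Return query_ids that have a QueryStart row but no terminal row."""
    started: Set[str] = set()
    finished: Set[str] = set()
    for row in rows:
        if row["type"] == "QueryStart":
            started.add(row["query_id"])
        elif row["type"] in TERMINAL_TYPES:
            finished.add(row["query_id"])
    return started - finished


if __name__ == "__main__":
    # Made-up rows for illustration only.
    sample = [
        {"query_id": "a", "type": "QueryStart"},
        {"query_id": "a", "type": "QueryFinish"},
        {"query_id": "b", "type": "QueryStart"},
        {"query_id": "c", "type": "QueryStart"},
        {"query_id": "c", "type": "ExceptionWhileProcessing"},
    ]
    print(orphaned_query_ids(sample))
```

As with the SQL version, a query that is still running will also show up here, so cross-check the result against system.processes before treating an id as a client timeout.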

Preventive Measures

  • Always set max_execution_time on user-facing query paths. Without it, a slow query stays alive on the server long after the client has timed out, occupying a query slot and risking a too many simultaneous queries error.
  • Configure client timeouts longer than max_execution_time so the server is the one that decides whether to kill a query.
  • Monitor system.metric_log for CurrentMetric_TCPConnection and CurrentMetric_HTTPConnection to catch connection storms before they cause queueing-induced timeouts.
  • Watch system.events for NetworkReceiveElapsedMicroseconds and NetworkSendElapsedMicroseconds - sustained growth signals network or upstream issues.
  • For distributed clusters, tune distributed_connections_pool_size and connect_timeout_with_failover_ms so that a dead shard fails fast rather than making the coordinator wait the full 300 seconds.
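
The rule "client timeouts longer than max_execution_time" is easy to get wrong because the clients use different units. The sketch below derives consistent values from a single server-side cap. The function name and the 60-second safety margin are assumptions for illustration; the option names (--receive_timeout, --send_timeout, socket_timeout, send_receive_timeout) are the ones mentioned earlier in this article.

```python
def client_timeouts(max_execution_time_s: int, margin_s: int = 60) -> dict:
    """Derive client-side timeouts that outlast the server-side cap.

    margin_s is an assumed safety margin so the server, not the client,
    is the side that ends an overlong query.
    """
    if max_execution_time_s <= 0:
        raise ValueError("set a positive max_execution_time first")
    timeout_s = max_execution_time_s + margin_s
    return {
        # clickhouse-client flags take seconds
        "clickhouse_client": [
            "--receive_timeout", str(timeout_s),
            "--send_timeout", str(timeout_s),
        ],
        # JDBC socket_timeout is expressed in milliseconds
        "jdbc": {"socket_timeout": timeout_s * 1000},
        # clickhouse-driver keyword argument is expressed in seconds
        "clickhouse_driver": {"send_receive_timeout": timeout_s},
    }


if __name__ == "__main__":
    for client, options in client_timeouts(600).items():
        print(client, options)
```

Keeping this derivation in one place means a change to max_execution_time propagates to every client configuration instead of leaving one client with a shorter deadline than the server.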

Resolve Code 209 SOCKET_TIMEOUT Automatically with Pulse

Pulse is an AI DBA for ClickHouse (and Kafka and Elasticsearch). When Code: 209. DB::NetException: Timeout exceeded while reading from socket fires in your environment, the underlying cause can be socket-level (receive_timeout, default 300s), execution-level (max_execution_time), or HTTP-level (http_receive_timeout). Pulse:

  • Continuously tracks per-query query_duration_ms from system.query_log, orphaned QueryStart rows without a matching QueryFinish, and the settings behind the three timeout layers (receive_timeout and send_timeout, http_receive_timeout, max_execution_time)
  • Correlates client-side 209 errors with the corresponding query_id server-side, per-replica CurrentMetric_Query saturation, NetworkReceiveElapsedMicroseconds/NetworkSendElapsedMicroseconds trends, and distributed sub-query elapsed time across shards
  • Identifies which of the six causes above applies - slow SELECT, overloaded shard in a distributed query, network packet loss, HTTP-vs-native protocol mismatch on port 8123 vs 9000, streaming INSERT pause, or inter-replica replication fetch lag
  • Recommends the precise fix - raise receive_timeout/send_timeout on the client, set max_execution_time = 600 server-side, add a partition filter or route via AggregatingMergeTree, or tune distributed_connections_pool_size and connect_timeout_with_failover_ms
  • Applies low-risk fixes automatically with your approval (rerouting traffic away from a saturated replica while diagnostics run) or generates a one-click config PR

Pulse turns the manual orphan-query and per-shard triage above into an agentic SRE workflow.

Frequently Asked Questions

Q: What is the fastest way to diagnose Code 209 read timeouts in production ClickHouse?
A: First identify the timeout layer from the wording: "from socket" is native TCP, "while receiving data from client" is the server-side receive_timeout, and HTTP 504 is the HTTP layer. Then check system.processes for the orphaned query_id still running server-side. For continuous coverage, Pulse is an AI DBA for ClickHouse that correlates client 209 errors with orphan queries in system.query_log, per-replica saturation, and network elapsed-time metrics, and recommends whether to raise client timeouts, set max_execution_time, or rewrite the query.

Q: What does "DB::Exception: Read timeout" mean in ClickHouse?
A: It means a TCP read between the client and ClickHouse (or between two ClickHouse servers) exceeded receive_timeout (default 300 seconds). The error code is 209 (SOCKET_TIMEOUT). It does not necessarily mean the query failed - the server may still be running it.

Q: How do I increase the read timeout in ClickHouse?
A: For clickhouse-client, pass --receive_timeout 1800 and --send_timeout 1800 (seconds). For JDBC, set socket_timeout in milliseconds. For HTTP clients, raise http_receive_timeout/http_send_timeout in the settings profile (users.xml) or per request. Always pair longer client timeouts with max_execution_time server-side so queries do not run forever.

Q: What is the difference between receive_timeout and max_execution_time?
A: receive_timeout is a socket-level wait - "no bytes arrived for N seconds." max_execution_time is a server-side query duration cap - "this query has been running too long." A query can hit max_execution_time and fail with Code: 159 TIMEOUT_EXCEEDED, or it can be running fine but the client's receive_timeout fires because the result block is just slow to compute.

Q: Why does my query work in clickhouse-client but time out from my application?
A: Almost always the application has a shorter default timeout. JDBC drivers default to 30s, HTTP clients often to 60s, while clickhouse-client defaults to 300s. Check and raise the application's socket_timeout (or equivalent), and confirm the application is hitting the same port (8123 HTTP vs 9000 native).

Q: Does the read timeout kill the query on the server?
A: No. A receive_timeout is a client-side or socket-side error - the server keeps running the query. To make sure the server cancels the query when the client gives up, set cancel_http_readonly_queries_on_client_close = 1 for HTTP, and rely on max_execution_time for native TCP.

Q: Can I get partial results from ClickHouse when a timeout fires?
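A: Only in limited cases. When the server-side limit is what fires, SET timeout_overflow_mode = 'break' makes a query that reaches max_execution_time stop and return the rows produced so far instead of raising an error. SET partial_result_on_first_cancel = 1, where your version supports it, applies to an explicit cancel rather than to a timeout. You can also use LIMIT to bound the result. When the client's own socket timeout fires (Code 209), the client reports an error and any partial result is discarded.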
