The "DB::Exception: Cannot write to socket" error in ClickHouse occurs when the server or client is unable to send data over a network socket. This CANNOT_WRITE_TO_SOCKET error typically means the connection between ClickHouse and its peer (a client, another replica, or a remote server) was interrupted during an active data transfer. Unlike query logic errors, this one is rooted in network or connection-layer problems.
Impact
Socket write failures affect ClickHouse operations in several ways:
- Queries in progress are aborted; the client receives an error only if its own connection is still usable
- Distributed queries across shards may fail if inter-node communication is disrupted
- Replication traffic between replicas can be interrupted, causing sync delays
- Client applications will receive incomplete or no results for the affected query
- Repeated failures may indicate a systemic network problem affecting overall cluster health
Common Causes
- Client disconnected prematurely (closed the connection before the server finished sending data)
- Network interruption between ClickHouse nodes or between server and client
- Firewall or load balancer timeout closing idle or long-running connections
- Overly aggressive TCP keepalive or timeout settings, causing connections that are merely slow to be dropped
- Remote peer crashed or became unreachable during data transfer
- Network interface saturation or packet loss under heavy load
- MTU mismatch causing packet fragmentation and delivery failures
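Several of the causes above come down to connections being silently dropped while idle. One client-side mitigation is enabling TCP keepalive on the socket so dead peers are detected early. A minimal Python sketch; the idle/interval/count values are illustrative assumptions, not ClickHouse defaults:

```python
import socket

def enable_keepalive(sock: socket.socket, idle: int = 60,
                     interval: int = 10, count: int = 3) -> None:
    """Enable TCP keepalive so a dead peer is detected early.

    The idle/interval/count values here are illustrative only.
    """
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    # The fine-grained knobs are Linux-specific; guard for portability.
    if hasattr(socket, "TCP_KEEPIDLE"):
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, idle)      # seconds idle before first probe
    if hasattr(socket, "TCP_KEEPINTVL"):
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, interval) # seconds between probes
    if hasattr(socket, "TCP_KEEPCNT"):
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPCNT, count)      # failed probes before the kernel drops the connection

sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
enable_keepalive(sock)
sock.close()
```

Keepalive must be short enough to fire before any intermediate load balancer's idle timeout, otherwise the balancer drops the connection first.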
Troubleshooting and Resolution Steps
Check whether the client is still connected: If the client application disconnected (crashed, timed out, or user cancelled), this error is expected. Verify client-side logs for connection closures.
Test network connectivity between nodes:
```shell
ping -c 5 remote-clickhouse-host
traceroute remote-clickhouse-host
```
Look for packet loss or high latency.
Check for firewall or load balancer timeouts: If queries run longer than the load balancer's idle timeout, the connection will be severed. Increase timeout values or configure TCP keepalive:
```xml
<!-- In ClickHouse server config -->
<tcp_keep_alive_timeout>60</tcp_keep_alive_timeout>
```
Inspect network interface statistics:
```shell
netstat -s | grep -i "error\|drop\|overflow"
ip -s link show eth0
```
High error or drop counts suggest network-level issues.
Review ClickHouse server logs for context:
```shell
grep "Cannot write to socket" /var/log/clickhouse-server/clickhouse-server.err.log | tail -10
```
Correlate timestamps with client or network events.
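If you need to correlate these errors with client or load-balancer events programmatically, a small parser can pull the timestamps out of the error log. A sketch, assuming the standard `YYYY.MM.DD HH:MM:SS` prefix of ClickHouse log lines (the sample line is synthetic):

```python
import re

# Matches the leading timestamp of a typical ClickHouse log line, e.g.
# "2024.01.15 10:30:45.123456 [ 42 ] {} <Error> ... Cannot write to socket"
LINE_RE = re.compile(r"^(\d{4}\.\d{2}\.\d{2} \d{2}:\d{2}:\d{2})")

def socket_error_timestamps(lines):
    """Return timestamps of log lines mentioning 'Cannot write to socket'."""
    out = []
    for line in lines:
        if "Cannot write to socket" in line:
            m = LINE_RE.match(line)
            if m:
                out.append(m.group(1))
    return out

sample = [
    "2024.01.15 10:30:45.123456 [ 42 ] {} <Error> DB::Exception: Cannot write to socket",
    "2024.01.15 10:30:46.000001 [ 42 ] {} <Information> unrelated line",
]
print(socket_error_timestamps(sample))  # ['2024.01.15 10:30:45']
```

A spike of timestamps at a fixed interval (say, every 60 seconds after a query starts) is a strong hint of an idle-timeout somewhere in the path.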
For distributed queries, check shard connectivity:
```sql
SELECT * FROM system.clusters;
```
Verify that all shards and replicas are reachable. Test connectivity to each node individually.
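The per-node check can be automated with a short probe that attempts a TCP connection to each shard's host and port. A sketch; the host names below are placeholders to be replaced with the entries from `system.clusters` (9000 is the default native-protocol port):

```python
import socket

def can_connect(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# Hypothetical node list; substitute hosts from system.clusters.
nodes = [("shard1.example.internal", 9000), ("shard2.example.internal", 9000)]
for host, port in nodes:
    status = "reachable" if can_connect(host, port) else "UNREACHABLE"
    print(f"{host}:{port} {status}")
```

Running this from each node in turn (not just from one coordinator) catches asymmetric failures such as one-way firewall rules.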
Increase send/receive buffer sizes if needed:
```shell
sysctl net.core.wmem_max
sysctl net.core.rmem_max
```
Larger buffers can help absorb transient network slowdowns:
```shell
sudo sysctl -w net.core.wmem_max=16777216
sudo sysctl -w net.core.rmem_max=16777216
```
Best Practices
- Implement retry logic with exponential backoff in client applications for transient network errors
- Configure TCP keepalive on both ClickHouse and any intermediate load balancers to detect dead connections early
- Use connection pooling in client drivers to manage connections efficiently
- Monitor network health between all ClickHouse cluster nodes with regular checks
- Set appropriate query timeouts so that clients do not wait indefinitely
- For long-running queries over unreliable networks, consider breaking the work into smaller batches
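The first best practice above can be sketched as a generic retry wrapper. This is illustrative only: a real client would catch its driver's specific network error class rather than the built-in `ConnectionError` used here, and the backoff parameters are arbitrary:

```python
import random
import time

def retry_with_backoff(fn, retries=5, base_delay=0.5, max_delay=8.0):
    """Call fn(), retrying transient failures with exponential backoff and jitter."""
    for attempt in range(retries):
        try:
            return fn()
        except ConnectionError:  # stand-in for the driver's network error class
            if attempt == retries - 1:
                raise  # out of attempts; surface the error to the caller
            delay = min(max_delay, base_delay * 2 ** attempt)
            time.sleep(delay * random.uniform(0.5, 1.0))  # jitter avoids thundering herd

# Usage: a function that fails twice, then succeeds.
attempts = {"n": 0}
def flaky_insert():
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise ConnectionError("Cannot write to socket")
    return "ok"

print(retry_with_backoff(flaky_insert, base_delay=0.01))  # ok
```

Retrying only transient network errors (and not, say, syntax errors) keeps the wrapper from masking real bugs.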
Frequently Asked Questions
Q: Is data lost when this error occurs during an INSERT?
A: If the socket fails mid-insert, the entire insert block is typically rolled back. ClickHouse inserts are atomic at the block level, so you will not end up with partial data. Simply retry the insert; for Replicated tables, identical retried blocks are additionally deduplicated, so a retry will not create duplicates.
Q: Why do I see this error when the network seems fine?
A: The most common hidden cause is a load balancer or proxy silently closing the connection after its idle timeout expires. Check all intermediate network devices, not just the endpoints.
Q: Can this error occur between replicas?
A: Yes. Replication data transfer and distributed query execution both use TCP sockets. Network issues between replicas will produce this error and may delay replication until connectivity is restored.
Q: How is this different from CANNOT_READ_FROM_SOCKET?
A: CANNOT_WRITE_TO_SOCKET means the sending side failed, while CANNOT_READ_FROM_SOCKET means the receiving side failed. Both point to network problems, but from different perspectives. The troubleshooting approach is similar for both.