The "DB::Exception: Cannot write to socket" error in ClickHouse occurs when the server or client is unable to send data over a network socket. This CANNOT_WRITE_TO_SOCKET error typically means the connection between ClickHouse and its peer (a client, another replica, or a remote server) was interrupted during an active data transfer. Unlike query logic errors, this one is rooted in network or connection-layer problems.
Impact
Socket write failures affect ClickHouse operations in several ways:
- Queries in progress are aborted; the client receives an error only if its own connection is still usable
- Distributed queries across shards may fail if inter-node communication is disrupted
- Replication traffic between replicas can be interrupted, causing sync delays
- Client applications will receive incomplete or no results for the affected query
- Repeated failures may indicate a systemic network problem affecting overall cluster health
Common Causes
- Client disconnected prematurely (closed the connection before the server finished sending data)
- Network interruption between ClickHouse nodes or between server and client
- Firewall or load balancer timeout closing idle or long-running connections
- Overly aggressive TCP keepalive or timeout settings, causing connections that are merely slow to be dropped
- Remote peer crashed or became unreachable during data transfer
- Network interface saturation or packet loss under heavy load
- MTU mismatch causing packet fragmentation and delivery failures
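Several of the causes above come down to connections being silently dropped while idle. One client-side mitigation is enabling TCP keepalive on the socket so dead peers are detected early. A minimal Python sketch; the idle/interval/count values are illustrative assumptions, not ClickHouse defaults:

```python
import socket

def enable_keepalive(sock: socket.socket, idle: int = 60,
                     interval: int = 10, count: int = 3) -> None:
    """Enable TCP keepalive so a dead peer is detected early.

    The idle/interval/count values here are illustrative only.
    """
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    # The fine-grained knobs are Linux-specific; guard for portability.
    if hasattr(socket, "TCP_KEEPIDLE"):
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, idle)      # seconds idle before first probe
    if hasattr(socket, "TCP_KEEPINTVL"):
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, interval) # seconds between probes
    if hasattr(socket, "TCP_KEEPCNT"):
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPCNT, count)      # failed probes before the kernel drops the connection

sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
enable_keepalive(sock)
sock.close()
```

Keepalive must be short enough to fire before any intermediate load balancer's idle timeout, otherwise the balancer drops the connection first.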
Troubleshooting and Resolution Steps
Check whether the client is still connected: If the client application disconnected (crashed, timed out, or user cancelled), this error is expected. Verify client-side logs for connection closures.
Test network connectivity between nodes:
```shell
ping -c 5 remote-clickhouse-host
traceroute remote-clickhouse-host
```
Look for packet loss or high latency.
Check for firewall or load balancer timeouts: If queries run longer than the load balancer's idle timeout, the connection will be severed. Increase timeout values or configure TCP keepalive:
```xml
<!-- In ClickHouse server config -->
<tcp_keep_alive_timeout>60</tcp_keep_alive_timeout>
```
Inspect network interface statistics:
```shell
netstat -s | grep -i "error\|drop\|overflow"
ip -s link show eth0
```
High error or drop counts suggest network-level issues.
Review ClickHouse server logs for context:
```shell
grep "Cannot write to socket" /var/log/clickhouse-server/clickhouse-server.err.log | tail -10
```
Correlate timestamps with client or network events.
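If you need to correlate these errors with client or load-balancer events programmatically, a small parser can pull the timestamps out of the error log. A sketch, assuming the standard `YYYY.MM.DD HH:MM:SS` prefix of ClickHouse log lines (the sample line is synthetic):

```python
import re

# Matches the leading timestamp of a typical ClickHouse log line, e.g.
# "2024.01.15 10:30:45.123456 [ 42 ] {} <Error> ... Cannot write to socket"
LINE_RE = re.compile(r"^(\d{4}\.\d{2}\.\d{2} \d{2}:\d{2}:\d{2})")

def socket_error_timestamps(lines):
    """Return timestamps of log lines mentioning 'Cannot write to socket'."""
    out = []
    for line in lines:
        if "Cannot write to socket" in line:
            m = LINE_RE.match(line)
            if m:
                out.append(m.group(1))
    return out

sample = [
    "2024.01.15 10:30:45.123456 [ 42 ] {} <Error> DB::Exception: Cannot write to socket",
    "2024.01.15 10:30:46.000001 [ 42 ] {} <Information> unrelated line",
]
print(socket_error_timestamps(sample))  # ['2024.01.15 10:30:45']
```

A spike of timestamps at a fixed interval (say, every 60 seconds after a query starts) is a strong hint of an idle-timeout somewhere in the path.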
For distributed queries, check shard connectivity:
```sql
SELECT * FROM system.clusters;
```
Verify that all shards and replicas are reachable. Test connectivity to each node individually.
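The per-node check can be automated with a short probe that attempts a TCP connection to each shard's host and port. A sketch; the host names below are placeholders to be replaced with the entries from `system.clusters` (9000 is the default native-protocol port):

```python
import socket

def can_connect(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# Hypothetical node list; substitute hosts from system.clusters.
nodes = [("shard1.example.internal", 9000), ("shard2.example.internal", 9000)]
for host, port in nodes:
    status = "reachable" if can_connect(host, port) else "UNREACHABLE"
    print(f"{host}:{port} {status}")
```

Running this from each node in turn (not just from one coordinator) catches asymmetric failures such as one-way firewall rules.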
Increase send/receive buffer sizes if needed:
```shell
sysctl net.core.wmem_max
sysctl net.core.rmem_max
```
Larger buffers can help absorb transient network slowdowns:
```shell
sudo sysctl -w net.core.wmem_max=16777216
sudo sysctl -w net.core.rmem_max=16777216
```
Best Practices
- Implement retry logic with exponential backoff in client applications for transient network errors
- Configure TCP keepalive on both ClickHouse and any intermediate load balancers to detect dead connections early
- Use connection pooling in client drivers to manage connections efficiently
- Monitor network health between all ClickHouse cluster nodes with regular checks
- Set appropriate query timeouts so that clients do not wait indefinitely
- For long-running queries over unreliable networks, consider breaking the work into smaller batches
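The first best practice above can be sketched as a generic retry wrapper. This is illustrative only: a real client would catch its driver's specific network error class rather than the built-in `ConnectionError` used here, and the backoff parameters are arbitrary:

```python
import random
import time

def retry_with_backoff(fn, retries=5, base_delay=0.5, max_delay=8.0):
    """Call fn(), retrying transient failures with exponential backoff and jitter."""
    for attempt in range(retries):
        try:
            return fn()
        except ConnectionError:  # stand-in for the driver's network error class
            if attempt == retries - 1:
                raise  # out of attempts; surface the error to the caller
            delay = min(max_delay, base_delay * 2 ** attempt)
            time.sleep(delay * random.uniform(0.5, 1.0))  # jitter avoids thundering herd

# Usage: a function that fails twice, then succeeds.
attempts = {"n": 0}
def flaky_insert():
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise ConnectionError("Cannot write to socket")
    return "ok"

print(retry_with_backoff(flaky_insert, base_delay=0.01))  # ok
```

Retrying only transient network errors (and not, say, syntax errors) keeps the wrapper from masking real bugs.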
Frequently Asked Questions
Q: Is data lost when this error occurs during an INSERT?
A: If the socket fails mid-insert, the entire insert block is typically rolled back. ClickHouse inserts are atomic at the block level, so you will not end up with partial data. Simply retry the insert; for Replicated tables, identical retried blocks are additionally deduplicated, so a retry will not create duplicates.
Q: Why do I see this error when the network seems fine?
A: The most common hidden cause is a load balancer or proxy silently closing the connection after its idle timeout expires. Check all intermediate network devices, not just the endpoints.
Q: Can this error occur between replicas?
A: Yes. Replication data transfer and distributed query execution both use TCP sockets. Network issues between replicas will produce this error and may delay replication until connectivity is restored.
Q: How is this different from CANNOT_READ_FROM_SOCKET?
A: CANNOT_WRITE_TO_SOCKET means the sending side failed, while CANNOT_READ_FROM_SOCKET means the receiving side failed. Both point to network problems, but from different perspectives. The troubleshooting approach is similar for both.