ClickHouse Replication and Distributed Connection Error Notes

Distributed queries and replication both rely on TCP connections between ClickHouse nodes. When a node becomes unreachable, the server records the failure in metrics such as ClickHouseDistributedConnectionExceptions, ClickHouseReplicatedPartChecksFailed, and ClickHouseReplicatedPartFailedFetches. Low rates are routine, but sustained or rapidly escalating counts point to a network, DNS, or replica health problem that needs operator attention.

This page collects the diagnostic queries that surface the failing host so you can move from an alert to a fix without guessing.

What the Metrics Mean

Metric	Meaning	Severity
`ClickHouseDistributedConnectionExceptions`	A distributed query could not open a connection to a remote shard	High when sustained, indicates network or node down
`ClickHouseReplicatedPartChecksFailed`	A part checksum check failed between replicas	Low at low rates, investigate above baseline
`ClickHouseReplicatedPartFailedFetches`	A replica could not fetch a part from another replica	Low at low rates, watch for trends

Brief spikes during a deployment, restart, or network blip usually resolve on their own. Persistent growth means a node is unhealthy or a network path is broken.

Confirm the Cluster Topology Responds

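A quick way to verify connectivity is to have every replica run a distributed query across the cluster:

SELECT count() FROM clusterAllReplicas('{cluster}', cluster('{cluster}', system.one));

The inner cluster() call reads from one replica per shard, so on a fully connected cluster the count equals (total replicas) x (number of shards). To exercise every replica pair, use clusterAllReplicas for the inner call as well; the count then equals (number_of_replicas)^2. A lower number, or an exception naming a host when skip_unavailable_shards is off (the default), means some host pair cannot establish a connection. Replace {cluster} with the cluster name from system.clusters.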
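
The expected count depends on how many shards and replicas the cluster defines. As a minimal sketch, both figures can be read from system.clusters on any node; the column aliases are illustrative, and the cluster name in the WHERE clause is a placeholder to substitute by hand:

```sql
-- Assumption: '{cluster}' stands for your cluster name, as elsewhere on this page.
-- This is an ordinary string literal, so type the real name here.
SELECT
    count()              AS total_replicas,  -- system.clusters has one row per replica
    uniqExact(shard_num) AS shards
FROM system.clusters
WHERE cluster = '{cluster}';
```

Comparing these two numbers with the result of the connectivity query shows at a glance whether any host pair is missing.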

Identify the Failing Hosts

system.clusters records per-replica error counts maintained by the connection pool. Query it across every replica to see which node sees the failures and which host it cannot reach:

SELECT hostName(), *
FROM clusterAllReplicas('{cluster}', system.clusters)
WHERE errors_count > 0;

hostName() is the reporter, while host_name in the row is the unreachable target. A one-sided error pattern (host A reports errors talking to B, but B reports none) often points to firewall or routing changes on the path from A to B.
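
To make the reporter/target pairing easier to read than SELECT *, a narrower projection can help. This is a sketch only: the aliases reporter and target are illustrative names, and the cluster filter assumes you substitute the real cluster name.

```sql
-- One row per (reporting node, unreachable target) pair.
SELECT
    hostName()              AS reporter,   -- node that recorded the errors
    host_name               AS target,     -- node it failed to reach
    port,
    errors_count,
    slowdowns_count,
    estimated_recovery_time                -- seconds until the error count is reset
FROM clusterAllReplicas('{cluster}', system.clusters)
WHERE cluster = '{cluster}'                -- plain string literal: use the real name
  AND errors_count > 0
ORDER BY errors_count DESC;
```

If a target appears for several reporters, the target itself is the likely culprit; if only one reporter lists it, look at the path from that reporter.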

Look at Recent Server-Side Errors

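system.errors keeps one row per error code, with a cumulative count in value and the time and message of the most recent occurrence. Filter on last_error_time to focus on codes that fired during the current incident: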

SELECT hostName(), *
FROM clusterAllReplicas('{cluster}', system.errors)
WHERE last_error_time > now() - 3600
ORDER BY value DESC;

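Look for entries whose name is NETWORK_ERROR, SOCKET_TIMEOUT, DNS_ERROR, or KEEPER_EXCEPTION. The value column is cumulative since the server started, so the time filter only shows that a code fired within the last hour, not how often. To separate an ongoing problem from a one-off blip, run the query twice a few minutes apart and compare value: a counter that keeps growing means the failure is still happening.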
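
A narrower variant restricted to the connection-related codes named above might look like the following sketch. The reporter alias is illustrative; the columns come from system.errors.

```sql
-- Only connection-related error codes that fired in the last hour,
-- with the most recent message for context.
SELECT
    hostName() AS reporter,
    name,
    code,
    value,                 -- cumulative count since server start
    last_error_time,
    last_error_message
FROM clusterAllReplicas('{cluster}', system.errors)
WHERE name IN ('NETWORK_ERROR', 'SOCKET_TIMEOUT', 'DNS_ERROR', 'KEEPER_EXCEPTION')
  AND last_error_time > now() - INTERVAL 1 HOUR
ORDER BY value DESC;
```

The last_error_message text often contains the address of the peer that could not be reached.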

Inspect Replication State

For ReplicatedMergeTree tables, replica status lives in system.replicas and the pending task queue in system.replication_queue:

SELECT hostName() AS reporter, database, table, queue_size, inserts_in_queue,
       merges_in_queue, log_max_index, log_pointer, absolute_delay
FROM clusterAllReplicas('{cluster}', system.replicas)
WHERE queue_size > 0 OR absolute_delay > 60
ORDER BY queue_size DESC;

SELECT hostName() AS reporter, database, table, type, num_tries,
       last_exception, create_time
FROM clusterAllReplicas('{cluster}', system.replication_queue)
WHERE num_tries > 3
ORDER BY num_tries DESC;

A high num_tries with a populated last_exception is the most actionable signal. The exception text usually identifies whether the issue is a missing part, a checksum mismatch, or a connection failure.
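
When many tasks are failing, grouping them can show whether one table or one replica accounts for most of the retries. The following is a sketch under that assumption; the aliases reporter, failing_tasks, max_tries, and sample_exception are illustrative.

```sql
-- Summarise failing replication tasks per replica, table, and task type.
SELECT
    hostName()          AS reporter,
    database,
    table,
    type,
    count()             AS failing_tasks,
    max(num_tries)      AS max_tries,
    any(last_exception) AS sample_exception  -- one representative message per group
FROM clusterAllReplicas('{cluster}', system.replication_queue)
WHERE last_exception != ''
GROUP BY reporter, database, table, type
ORDER BY max_tries DESC;
```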

Standard Triage Steps

Confirm every cluster node is up and the ClickHouse process is responding to SELECT 1.
Validate DNS resolution and routing between nodes, especially after a recent network change.
Review system.replicas for absolute_delay and queue_size outliers.
Tail clickhouse-server.log and clickhouse-server.err.log on the affected host for stack traces.
If a single replica is stuck and the others are healthy, run SYSTEM RESTART REPLICA db_name.table_name on the affected node to reinitialize the replica session.

SYSTEM RESTART REPLICA db_name.table_name;

If the queue is empty cluster-wide but ClickHouseDistributedConnectionExceptions is still climbing, the failure is on the distributed query path rather than in replication. Check firewall rules, kernel limits (for example open file descriptors or connection tracking tables), and any sidecar proxy between hosts.
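
The first triage step can be approximated from a single session, assuming the node you are connected to can itself reach the others. This sketch asks every replica to answer a trivial query and report its uptime; the aliases are illustrative.

```sql
-- Each replica that responds returns one row.
-- A small uptime_seconds value points to a recent restart.
SELECT
    hostName() AS host,
    version()  AS server_version,
    uptime()   AS uptime_seconds
FROM clusterAllReplicas('{cluster}', system.one)
ORDER BY host;
```

A host that is missing from the result, or an exception that names it, is where to start. Because this only tests reachability from the node running the query, a clean result here does not rule out a broken path between two other nodes.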

Common Pitfalls

SYSTEM RESTART REPLICA clears the replica's in-memory state and forces it to re-read metadata from ZooKeeper. It is a recovery action, not a first response.
clusterAllReplicas requires the user to have access on every replica with the same credentials. A missing user on one node looks like a connection error.
system.errors resets when the server restarts, so the absence of recent errors right after a restart is not proof the issue is fixed.
Rapid fetch retries, which show up as growth in ClickHouseReplicatedPartFailedFetches, can saturate the network. If a single failing part is consuming bandwidth, throttle fetches by lowering background_fetches_pool_size.
DNS caching in containers can keep stale IPs alive after a node moves. Confirm the resolver returns the current IP.
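
ClickHouse also keeps its own DNS cache. As a sketch, assuming the stale address is held by the server rather than by the container's resolver, the cache can be dropped on the reporting node and the resolved addresses checked afterwards:

```sql
-- Drop ClickHouse's internal DNS cache on this node (requires the matching SYSTEM privilege).
SYSTEM DROP DNS CACHE;

-- Check which address each cluster host now resolves to.
SELECT host_name, host_address, port
FROM system.clusters
WHERE cluster = '{cluster}';   -- plain string literal: use the real name
```

If host_address still shows the old IP, the stale record is coming from the resolver outside ClickHouse.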

Frequently Asked Questions

Q: Are ClickHouseReplicatedPartChecksFailed and ClickHouseReplicatedPartFailedFetches always a problem? A: No. At low volumes they are routine, because replicas check parts continuously and occasionally retry. Persistent growth, or correlation with a specific table or replica, is the signal that requires investigation.

Q: How do I tell if the problem is the network or ClickHouse itself? A: Test TCP connectivity from the reporting host to the target on port 9000 (or your configured native port) with nc -vz. Replication fetches use the interserver HTTP port instead (9009 by default), so test that port as well when failed fetches are the symptom. If TCP succeeds, the issue is inside ClickHouse, ZooKeeper, or authentication. If TCP fails, the issue is network or firewall.

Q: Will the queue catch up on its own once the network is fixed? A: Yes, in most cases. ClickHouse retries replication tasks automatically. If a referenced part is permanently gone from every replica, or if num_tries keeps climbing on the same task with a fatal last_exception, manual intervention is needed.

Q: When should I run SYSTEM RESTART REPLICA? A: When a single replica is stuck after the underlying network issue is resolved and the queue is not draining. The command re-establishes the ZooKeeper session and refreshes part metadata.

Q: How do I clear the error counters in system.errors? A: Counters reset only on server restart, but they are informational. Watch the trend in the last hour rather than absolute values.