Elasticsearch is a distributed system where every search query, indexed document, shard recovery, and cluster state update travels over the network. When bandwidth saturates, the symptoms are deceptive - they look like node failures, master instability, or slow queries rather than a network problem. Understanding where bandwidth goes and how to control it is the difference between a stable cluster and one that cascades into repeated master elections.
How Elasticsearch Consumes Network Bandwidth
Bandwidth consumption comes from four main sources: shard replication, shard recovery, snapshots, and search scatter-gather. Shard replication is the constant background load: every document indexed into a primary shard gets forwarded to each replica over the transport layer. A bulk request hitting a coordinating node fans out to every data node holding a target primary, and each of those forwards to replica holders. With a replication factor of 1, effective write bandwidth doubles. With 2 replicas, it triples.
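The replication multiplier above can be sketched as a quick back-of-the-envelope estimate (a minimal illustration; the function name and inputs are hypothetical):

```python
def effective_write_bandwidth(ingest_bytes_per_sec: float, replicas: int) -> float:
    """Estimate transport-layer write traffic generated by indexing.

    Every document lands on a primary shard and is then forwarded to
    each replica, so wire traffic scales with (1 + replicas).
    """
    return ingest_bytes_per_sec * (1 + replicas)

# 100 MB/s of bulk ingest with 1 replica -> ~200 MB/s on the wire
print(effective_write_bandwidth(100e6, 1) / 1e6)  # 200.0
```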
Shard recovery is the largest burst consumer. When a node rejoins or shards rebalance, Elasticsearch copies entire Lucene segment files between nodes. A 50GB shard means 50GB of network transfer per recovery, and multiple shards recover in parallel by default. Snapshot operations to remote repositories (S3, GCS, Azure Blob) also compete for bandwidth.
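A rough transfer-time estimate makes the burst cost concrete; this is a sketch that assumes recovery is limited only by the slower of the link and the recovery throttle (ignoring disk I/O and compression):

```python
GB = 1024 ** 3
MB = 1024 ** 2

def recovery_time_seconds(shard_bytes: float,
                          link_bytes_per_sec: float,
                          throttle_bytes_per_sec: float) -> float:
    """Estimate wall-clock time to copy one shard's segment files,
    bounded by the slower of the network link and the recovery throttle."""
    rate = min(link_bytes_per_sec, throttle_bytes_per_sec)
    return shard_bytes / rate

# A 50GB shard at the default 40MB/s recovery throttle -> ~21 minutes,
# even on a 10Gbps (~1250MB/s) link
print(recovery_time_seconds(50 * GB, 1250 * MB, 40 * MB) / 60)
```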
Search scatter-gather adds up under high query rates. A search against an index with 20 shards fans out to all 20, each returning partial results to the coordinating node. At thousands of queries per second the aggregate bandwidth is substantial.
Symptoms of Network Saturation
The first sign is usually transport-level timeouts in the logs: [transport] [node-3] transport connection timed out or similar messages. The default transport.connect_timeout is 30 seconds, so if these appear, the network is severely congested or experiencing packet loss.
Master instability is the most damaging symptom. The elected master sends periodic follower check pings to every node. If a node fails to respond within cluster.fault_detection.follower_check.timeout (default 10s) for cluster.fault_detection.follower_check.retry_count consecutive attempts (default 3), the master removes it from the cluster. On a saturated network, these pings get queued behind large recovery or replication payloads. The node is healthy, but the master cannot reach it. The removed node triggers shard recovery, generating more traffic, causing more follower check failures. This feedback loop can take down an entire cluster.
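A rough worst-case removal window follows from those defaults. The sketch below treats each failed check as waiting out its full timeout, spaced by the check interval (the 1s value is the documented default for cluster.fault_detection.follower_check.interval); treat the formula as an approximation, not the exact internal schedule:

```python
def removal_window_seconds(timeout_s: float = 10.0,
                           retry_count: int = 3,
                           interval_s: float = 1.0) -> float:
    """Approximate worst-case time before the master removes an
    unresponsive follower: retry_count consecutive checks, each
    waiting out its timeout, spaced by the check interval."""
    return retry_count * (timeout_s + interval_s)

# With defaults, a node unreachable for ~33 seconds is removed
print(removal_window_seconds())  # 33.0
```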
Slow shard recovery is another indicator. If recoveries stretch from minutes to hours, network throughput is likely the bottleneck rather than disk I/O. Check _cat/recovery and compare the bytes_recovered rate against expected network throughput.
Monitoring with _nodes/stats/transport
The _nodes/stats/transport endpoint exposes cumulative byte counters for each node's transport layer traffic. The two fields to watch are tx_size_in_bytes and rx_size_in_bytes, which track total bytes sent and received since the node started.
GET _nodes/stats/transport
Poll this endpoint at regular intervals and compute the delta to get throughput. If a data node's tx_size grows by roughly 120MB/s during recovery and your interface is 1Gbps (roughly 125MB/s effective), you have found the bottleneck. Comparing tx and rx across nodes reveals asymmetry - a node with high tx is likely serving as a recovery source for multiple shards.
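The delta computation can be sketched as follows. The dictionary shape matches the _nodes/stats/transport response, but the node IDs in any test data are made up, and fetching the two snapshots (via curl or an HTTP client) is left out:

```python
def transport_throughput(prev: dict, curr: dict, interval_s: float) -> dict:
    """Compute per-node tx/rx throughput (MB/s) from two snapshots of
    GET _nodes/stats/transport taken interval_s seconds apart."""
    rates = {}
    for node_id, stats in curr["nodes"].items():
        if node_id not in prev["nodes"]:
            continue  # node joined the cluster between samples
        tx = (stats["transport"]["tx_size_in_bytes"]
              - prev["nodes"][node_id]["transport"]["tx_size_in_bytes"])
        rx = (stats["transport"]["rx_size_in_bytes"]
              - prev["nodes"][node_id]["transport"]["rx_size_in_bytes"])
        rates[node_id] = {"tx_mb_s": tx / interval_s / 1e6,
                          "rx_mb_s": rx / interval_s / 1e6}
    return rates
```

Feeding it two samples ten seconds apart gives per-node MB/s figures you can compare directly against interface capacity.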
Pair this with OS-level tools like iftop, nload, or Prometheus node_exporter's node_network_transmit_bytes_total for full interface utilization. Elasticsearch transport stats only cover cluster-internal traffic; HTTP client traffic and snapshot uploads use the same physical interface unless you have dedicated NICs.
Throttling Recovery and Rebalancing
The primary control for recovery bandwidth is indices.recovery.max_bytes_per_sec, which limits per-node recovery throughput. The default is 40MB/s. On 10Gbps clusters, raise it to 100-200MB/s to speed up recoveries. On clusters where recovery competes with production traffic on a single 1Gbps interface, lowering to 20MB/s prevents recovery from starving search and indexing.
PUT _cluster/settings
{
  "persistent": {
    "indices.recovery.max_bytes_per_sec": "100mb"
  }
}
The setting cluster.routing.allocation.node_concurrent_recoveries controls how many shards recover simultaneously per node (default: 2). Note that indices.recovery.max_bytes_per_sec caps total recovery traffic per node, shared across all concurrent recoveries - four concurrent recoveries under a 40MB/s limit still total at most 40MB/s on that node, with each individual recovery proceeding more slowly. Raising concurrency speeds things up only if the rate limit has headroom, so adjust these two settings together.
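Both settings can be changed in one call; the values below are illustrative, sized for a cluster where recovery has a 10Gbps interface mostly to itself:

```
PUT _cluster/settings
{
  "persistent": {
    "indices.recovery.max_bytes_per_sec": "160mb",
    "cluster.routing.allocation.node_concurrent_recoveries": 4
  }
}
```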
Bulk request sizing also matters. A 100MB bulk request generates roughly 100MB of fan-out traffic plus another 100MB per replica. Smaller bulk requests (5-15MB) spread the load more evenly. If bulk queue rejects correlate with network spikes, your payloads are too large for available bandwidth.
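A byte-bounded batching helper is one way to keep payloads in the 5-15MB range. This is a sketch, assuming docs is an iterable of pre-serialized byte strings (e.g. NDJSON action/source pairs):

```python
def chunk_bulk(docs, max_bytes=15 * 1024 * 1024):
    """Yield lists of serialized documents whose combined size stays
    at or under max_bytes, so each _bulk request fits the target payload."""
    batch, size = [], 0
    for doc in docs:
        if batch and size + len(doc) > max_bytes:
            yield batch
            batch, size = [], 0
        batch.append(doc)
        size += len(doc)
    if batch:
        yield batch
```

Each yielded batch becomes one _bulk request, spreading a large ingest run into evenly sized payloads instead of a few huge ones.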
Network Architecture Recommendations
Production clusters should run on 10Gbps networking at minimum. A three-node cluster with 1Gbps interfaces can saturate during a single node recovery while serving production traffic. With 10Gbps, recovery and production traffic coexist without contention in most workloads.
Dedicated network interfaces for transport traffic separate cluster-internal communication from client-facing HTTP traffic. Configure transport.bind_host and http.bind_host to point to different interfaces. This prevents bulk indexing bursts from delaying follower check pings on the transport layer.
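As an elasticsearch.yml fragment, the split might look like this; the addresses are placeholders for whatever each NIC is assigned in your environment:

```yaml
# elasticsearch.yml - example addresses, assuming two NICs per node
transport.bind_host: 10.0.1.5   # dedicated cluster-internal interface
http.bind_host: 10.0.2.5        # client-facing interface
```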
Place all nodes in the same network segment when possible. For multi-zone deployments, use shard allocation awareness (cluster.routing.allocation.awareness.attributes) so that replicas land in different zones while search traffic stays zone-local. Monitor inter-zone bandwidth costs on cloud providers - cross-AZ transfer charges add up quickly when recovery traffic flows between zones.
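Zone awareness takes a node attribute plus one cluster setting; the attribute name "zone" and its value below are examples:

```
# each node declares its zone in elasticsearch.yml, e.g. node.attr.zone: us-east-1a
PUT _cluster/settings
{
  "persistent": {
    "cluster.routing.allocation.awareness.attributes": "zone"
  }
}
```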