Elasticsearch Cluster Unstable Due to Network Timeouts

Network timeouts cause Elasticsearch cluster instability through node disconnections, master election failures, and shard reallocation storms. This guide helps diagnose and resolve network-related cluster problems.

Symptoms of Network Timeout Issues

  • Nodes frequently leaving and rejoining the cluster
  • Repeated master elections
  • "NodeDisconnectedException" in logs
  • "failed to ping" or "connect timeout" errors
  • Shard relocation without apparent cause
  • Cluster status flapping between green/yellow/red

Diagnosing Network Issues

Check Cluster Logs

grep -i "timeout\|disconnect\|failed to ping\|transport\|master.*changed" /var/log/elasticsearch/*.log

Verify Node Connectivity

# From each node, test connectivity to others
ping -c 100 <other_node_ip>

# Check for packet loss
mtr --report <other_node_ip>

# Test Elasticsearch ports
nc -zv <other_node_ip> 9300  # Transport
nc -zv <other_node_ip> 9200  # HTTP
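
The same port checks can be scripted when there are many nodes to test. The following is a minimal sketch, not a definitive tool: the function names and the placeholder addresses are assumptions for illustration, and it only confirms that a TCP connection can be opened, not that Elasticsearch is healthy behind the port.

```python
import socket

def port_is_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False

def check_nodes(hosts, ports=(9300, 9200), timeout=3.0):
    """Map each (host, port) pair to its reachability."""
    return {(h, p): port_is_open(h, p, timeout) for h in hosts for p in ports}

# Example with placeholder addresses (replace with your node IPs):
# for (host, port), ok in check_nodes(["192.168.1.10", "192.168.1.11"]).items():
#     print(f"{host}:{port} {'open' if ok else 'UNREACHABLE'}")
```

Running this from every node, rather than from one admin host, helps reveal one-directional firewall rules.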

Check Cluster State

GET /_cluster/health
GET /_cat/nodes?v
GET /_cat/master?v&h=id,host,ip,node

Review Transport Statistics

GET /_nodes/stats/transport
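
The transport statistics response is large, so it can help to reduce it to a few fields per node. The sketch below assumes the response has already been loaded into a Python dict and reads only the server_open, rx_count, and tx_count fields; the function name and the sample data in the usage comment are illustrative, not real cluster output.

```python
def summarize_transport(stats):
    """Return one row per node, sorted by open inbound transport connections."""
    rows = []
    for node_id, node in stats.get("nodes", {}).items():
        transport = node.get("transport", {})
        rows.append({
            "name": node.get("name", node_id),
            "server_open": transport.get("server_open", 0),
            "rx_count": transport.get("rx_count", 0),
            "tx_count": transport.get("tx_count", 0),
        })
    # Nodes with the fewest open connections come first.
    return sorted(rows, key=lambda row: row["server_open"])

# Example with made-up data:
# sample = {"nodes": {"abc": {"name": "es-node-1",
#                             "transport": {"server_open": 26, "rx_count": 100, "tx_count": 90}}}}
# print(summarize_transport(sample))
```

A node that holds far fewer open connections than its peers may be worth checking first, although that alone does not prove a network fault.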

Common Causes and Fixes

Cause 1: Network Latency Between Nodes

Symptoms: Intermittent timeouts, especially during high load

Diagnosis:

# Measure latency between nodes
ping -c 1000 <node_ip> | grep -E "min|avg|max|loss"
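
To record these figures over time rather than read them by eye, the ping summary can be parsed. This is a sketch that assumes the Linux iputils summary format ("% packet loss" and "rtt min/avg/max/mdev = ... ms"); other ping implementations print different text, and the function name is hypothetical.

```python
import re

def parse_ping_summary(output):
    """Extract packet loss and round-trip times from Linux ping summary text."""
    loss = re.search(r"([\d.]+)% packet loss", output)
    rtt = re.search(r"= ([\d.]+)/([\d.]+)/([\d.]+)/([\d.]+) ms", output)
    result = {"loss_pct": float(loss.group(1)) if loss else None}
    if rtt:
        result["min_ms"] = float(rtt.group(1))
        result["avg_ms"] = float(rtt.group(2))
        result["max_ms"] = float(rtt.group(3))
    return result

# Example with illustrative text (not captured from a real run):
# text = ("1000 packets transmitted, 998 received, 0.2% packet loss, time 999ms\n"
#         "rtt min/avg/max/mdev = 0.210/0.450/12.300/0.800 ms")
# print(parse_ping_summary(text))
```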

Solutions:

  1. Ensure nodes are in the same network segment:

    • Avoid cross-datacenter clusters without proper configuration
    • Use dedicated network for cluster traffic
  2. Increase fault detection timeouts (defaults: 10s timeout, 1s interval, 3 retries):

# elasticsearch.yml
cluster.fault_detection.leader_check.timeout: 30s
cluster.fault_detection.leader_check.interval: 2s
cluster.fault_detection.leader_check.retry_count: 5
cluster.fault_detection.follower_check.timeout: 30s
cluster.fault_detection.follower_check.interval: 2s
cluster.fault_detection.follower_check.retry_count: 5
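
Raising these values makes the cluster more tolerant of slow links, but also slower to notice a node that has hung. The sketch below estimates that delay under a simplified model, which is an assumption rather than a documented formula: every check runs to its full timeout, and the next check starts one interval after the previous one fails. Nodes whose connection is closed outright are detected without waiting for these timers.

```python
def worst_case_detection_seconds(timeout_s, interval_s, retry_count):
    """Approximate time to declare an unresponsive (but connected) node failed.

    Simplified model: retry_count checks each time out in full, with one
    interval between consecutive checks.
    """
    return retry_count * timeout_s + (retry_count - 1) * interval_s

# Defaults (10s timeout, 1s interval, 3 retries):
print(worst_case_detection_seconds(10, 1, 3))
# Values from the configuration above (30s timeout, 2s interval, 5 retries):
print(worst_case_detection_seconds(30, 2, 5))
```

Under this model the defaults give about 32 seconds and the settings above about 158 seconds, so the larger values trade faster failure detection for fewer false removals.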

Cause 2: Garbage Collection Causing Timeouts

Symptoms: Timeouts correlate with GC pauses in logs

Diagnosis:

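# Elasticsearch tags GC monitor lines with [gc]; keep overhead warnings and collections of 1s or longer
grep "\[gc\]" /var/log/elasticsearch/*.log | grep -E "overhead|duration \[[0-9.]+(s|m)\]"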

Solutions:

  • Reduce heap pressure
  • Tune GC settings
  • Increase fault detection timeouts to tolerate GC pauses

Cause 3: Firewall or Security Group Issues

Symptoms: Nodes cannot connect at all, or connect only intermittently

Diagnosis:

# Check firewall rules
iptables -L -n
# or
firewall-cmd --list-all

# Also review cloud security group rules (provider console or CLI)

Solutions:

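  • Ensure the transport port (9300 by default, range 9300-9400) is open between all nodes, and the HTTP port (9200 by default, range 9200-9300) is open to clients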
  • Check for connection limits or rate limiting

Cause 4: DNS Resolution Problems

Symptoms: Intermittent failures, especially after IP changes

Solutions:

  1. Use IP addresses in configuration:
# elasticsearch.yml
discovery.seed_hosts:
  - 192.168.1.10
  - 192.168.1.11
  - 192.168.1.12
  2. Or make name resolution reliable, for example with static host entries:
# /etc/hosts on each node
192.168.1.10 es-node-1
192.168.1.11 es-node-2
192.168.1.12 es-node-3
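
Whichever option is used, it is worth confirming that every node resolves its peers to the same addresses. The following sketch compares resolved addresses against an expected mapping; the function names are hypothetical, the hostnames and IPs are the example values from above, and the resolver is passed in as a parameter so the logic can be exercised without real DNS.

```python
import socket

def resolve_hosts(hostnames, resolver=socket.gethostbyname):
    """Resolve each hostname; unresolvable names map to None."""
    results = {}
    for name in hostnames:
        try:
            results[name] = resolver(name)
        except OSError:  # socket.gaierror is a subclass of OSError
            results[name] = None
    return results

def find_mismatches(resolved, expected):
    """Return {hostname: (actual, expected)} for names that resolve unexpectedly."""
    return {
        name: (resolved.get(name), ip)
        for name, ip in expected.items()
        if resolved.get(name) != ip
    }

# Example (run on each node and compare the results):
# expected = {"es-node-1": "192.168.1.10", "es-node-2": "192.168.1.11"}
# print(find_mismatches(resolve_hosts(expected), expected))
```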

Cause 5: Network Congestion

Symptoms: Timeouts during high traffic periods

Solutions:

  1. Separate cluster traffic:
# elasticsearch.yml
network.host: 0.0.0.0
transport.host: 192.168.2.10  # Dedicated network interface
http.host: 192.168.1.10       # Client-facing interface
  2. Limit recovery bandwidth:
PUT /_cluster/settings
{
  "persistent": {
    "indices.recovery.max_bytes_per_sec": "50mb"
  }
}

Cause 6: MTU Mismatch

Symptoms: Large packets fail, small requests succeed

Diagnosis:

# Test different packet sizes
ping -M do -s 1472 <other_node_ip>  # Standard MTU
ping -M do -s 8972 <other_node_ip>  # Jumbo frames

Solution: Ensure consistent MTU across all nodes and network equipment.
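
The payload sizes in the ping commands above come from subtracting the IPv4 and ICMP headers from the MTU. A short worked calculation, assuming IPv4 without options (the function name is illustrative):

```python
IPV4_HEADER_BYTES = 20  # IPv4 header without options
ICMP_HEADER_BYTES = 8   # ICMP echo header

def ping_payload_for_mtu(mtu):
    """Largest ping -s payload that fits in one unfragmented IPv4 packet."""
    return mtu - IPV4_HEADER_BYTES - ICMP_HEADER_BYTES

print(ping_payload_for_mtu(1500))  # standard Ethernet MTU
print(ping_payload_for_mtu(9000))  # jumbo frames
```

If a path includes tunnels or overlays with a smaller MTU, the same subtraction gives the payload size to test for that MTU.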

Cause 7: TCP Keep-Alive Settings

Symptoms: Long-idle connections being dropped

Solutions:

# elasticsearch.yml
transport.tcp.keep_alive: true
transport.tcp.keep_idle: 300
transport.tcp.keep_interval: 60
transport.tcp.keep_count: 6
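
These three values together decide how long a silently dead idle connection can survive. As a rough sketch using standard TCP keepalive behaviour (first probe after the idle period, then one probe per interval until the probe count is exhausted), with a hypothetical function name:

```python
def keepalive_detection_seconds(idle_s, interval_s, probe_count):
    """Approximate time for TCP keepalive to declare an idle connection dead."""
    return idle_s + interval_s * probe_count

# Values from the configuration above: 300s idle, 60s interval, 6 probes.
print(keepalive_detection_seconds(300, 60, 6))
```

With the values above this comes to 660 seconds. The idle value should also be shorter than the idle timeout of any firewall or load balancer between nodes, otherwise that device may drop the connection before the first probe is sent.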

Elasticsearch Settings

# elasticsearch.yml

# Discovery
discovery.seed_hosts:
  - 192.168.1.10:9300
  - 192.168.1.11:9300
  - 192.168.1.12:9300
# Only used when bootstrapping a brand-new cluster; remove once the cluster has formed
cluster.initial_master_nodes:
  - master-1
  - master-2
  - master-3

# Fault detection (ES 7.x+)
cluster.fault_detection.leader_check.timeout: 30s
cluster.fault_detection.leader_check.interval: 2s
cluster.fault_detection.leader_check.retry_count: 5
cluster.fault_detection.follower_check.timeout: 30s
cluster.fault_detection.follower_check.interval: 2s
cluster.fault_detection.follower_check.retry_count: 5

# Transport
# (the older transport.tcp.connect_timeout and transport.tcp.compress names were removed in 8.0)
transport.connect_timeout: 30s
transport.compress: true
transport.tcp.keep_alive: true

# Network
network.tcp.no_delay: true
network.tcp.keep_alive: true

System-Level Settings

# /etc/sysctl.conf

# TCP keepalive settings
net.ipv4.tcp_keepalive_time = 300
net.ipv4.tcp_keepalive_intvl = 60
net.ipv4.tcp_keepalive_probes = 6

# Connection tracking
net.netfilter.nf_conntrack_max = 1048576
net.netfilter.nf_conntrack_tcp_timeout_established = 86400

# Apply changes
sysctl -p

Monitoring Network Health

Key Metrics to Track

  • Ping latency between nodes
  • Packet loss percentage
  • Transport layer errors
  • Master election frequency
  • Node join/leave events

Alerting Thresholds

Metric                   Warning   Critical
Ping latency             > 5ms     > 20ms
Packet loss              > 0.1%    > 1%
Master elections/hour    > 1       > 5
Node disconnects/hour    > 2       > 10
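
The thresholds above can be encoded directly when wiring them into a monitoring script. This is a minimal sketch: the metric keys and the function name are assumptions chosen for illustration, and the numbers are taken from the table.

```python
# (warning, critical) thresholds from the table above
THRESHOLDS = {
    "ping_latency_ms": (5, 20),
    "packet_loss_pct": (0.1, 1),
    "master_elections_per_hour": (1, 5),
    "node_disconnects_per_hour": (2, 10),
}

def classify(metric, value):
    """Return 'ok', 'warning', or 'critical' for a measured value."""
    warning, critical = THRESHOLDS[metric]
    if value > critical:
        return "critical"
    if value > warning:
        return "warning"
    return "ok"

# Example:
# classify("ping_latency_ms", 7.5)
```

The comparisons are strict, matching the ">" in the table, so a value exactly on a threshold stays in the lower band.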

Monitoring Commands

# Continuous ping monitoring
ping -i 1 <node_ip> | while read -r line; do echo "$(date): $line"; done >> /var/log/ping.log

# Network statistics
netstat -s | grep -i retrans

Cross-Datacenter Considerations

For geographically distributed clusters:

  1. Use dedicated cross-datacenter links
  2. Consider CCR (Cross-Cluster Replication) instead of single cluster
  3. Increase all timeout values significantly:
cluster.fault_detection.leader_check.timeout: 60s
cluster.fault_detection.follower_check.timeout: 60s
  4. Monitor WAN link health separately

Recovery After Network Event

After resolving network issues:

// 1. Check cluster health
GET /_cluster/health

// 2. Review pending tasks
GET /_cluster/pending_tasks

// 3. If shards stay unassigned after repeated allocation failures, retry them
POST /_cluster/reroute?retry_failed=true

// 4. Verify all nodes present
GET /_cat/nodes?v
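
Step 1 is usually repeated until the cluster settles, so it can be scripted. The sketch below polls the cluster health endpoint until a wanted status appears. It assumes an unsecured cluster at http://localhost:9200 (no TLS or authentication), and the function names, attempt count, and delay are illustrative choices, not recommended values.

```python
import json
import time
import urllib.request

def fetch_health(base_url="http://localhost:9200"):
    """Fetch GET /_cluster/health and return the parsed JSON body."""
    with urllib.request.urlopen(f"{base_url}/_cluster/health", timeout=10) as resp:
        return json.load(resp)

def wait_for_status(fetch=fetch_health, wanted="green", attempts=30,
                    delay_s=10.0, sleep=time.sleep):
    """Poll until the cluster reports the wanted status or attempts run out.

    Returns the last health document; the caller checks its "status" field.
    """
    health = {}
    for attempt in range(attempts):
        health = fetch()
        if health.get("status") == wanted:
            return health
        if attempt < attempts - 1:
            sleep(delay_s)
    return health

# Example:
# health = wait_for_status()
# print(health.get("status"), health.get("unassigned_shards"))
```

The health API can also block server-side with its wait_for_status and timeout query parameters, which may be simpler when a single request is enough.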
