Elasticsearch Cluster Unstable Due to Network Timeouts

Network timeouts cause Elasticsearch cluster instability through node disconnections, master election failures, and shard reallocation storms. This guide helps diagnose and resolve network-related cluster problems.

Symptoms of Network Timeout Issues

  • Nodes frequently leaving and rejoining cluster
  • Repeated master elections
  • "NodeDisconnectedException" in logs
  • "failed to ping" or "connect timeout" errors
  • Shard relocation without apparent cause
  • Cluster status flapping between green/yellow/red

Diagnosing Network Issues

Check Cluster Logs

grep -i "timeout\|disconnect\|failed to ping\|transport\|master.*changed" /var/log/elasticsearch/*.log

Verify Node Connectivity

# From each node, test connectivity to others
ping -c 100 <other_node_ip>

# Check for packet loss
mtr --report <other_node_ip>

# Test Elasticsearch ports
nc -zv <other_node_ip> 9300  # Transport
nc -zv <other_node_ip> 9200  # HTTP
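
To run these checks against every peer in one pass, a small loop helps; the node list below is a placeholder for your own cluster's addresses.

# Sketch: test transport-port reachability to each peer (replace NODES with your addresses)
NODES="192.168.1.10 192.168.1.11 192.168.1.12"
for ip in $NODES; do
  if nc -z -w 3 "$ip" 9300; then
    echo "$(date -Is) $ip: transport port 9300 reachable"
  else
    echo "$(date -Is) $ip: transport port 9300 UNREACHABLE"
  fi
done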

Check Cluster State

GET /_cluster/health
GET /_cat/nodes?v
GET /_cat/master?v&h=id,host,ip,node

Review Transport Statistics

GET /_nodes/stats/transport
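
To focus on the relevant counters, the response can be filtered; filter_path is a standard Elasticsearch query parameter. Steadily rising rx/tx counts paired with repeated disconnects in the logs point at the transport layer rather than the HTTP layer.

# Per-node transport counters only (curl form, run against any node)
curl -s "localhost:9200/_nodes/stats/transport?filter_path=nodes.*.name,nodes.*.transport&pretty"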

Common Causes and Fixes

Cause 1: Network Latency Between Nodes

Symptoms: Intermittent timeouts, especially during high load

Diagnosis:

# Measure latency between nodes
ping -c 1000 <node_ip> | grep -E "min|avg|max|loss"

Solutions:

  1. Ensure nodes are in the same network segment:

    • Avoid cross-datacenter clusters without proper configuration
    • Use dedicated network for cluster traffic
  2. Increase timeout settings:

# elasticsearch.yml
cluster.fault_detection.leader_check.timeout: 30s
cluster.fault_detection.leader_check.interval: 2s
cluster.fault_detection.leader_check.retry_count: 5
cluster.fault_detection.follower_check.timeout: 30s
cluster.fault_detection.follower_check.interval: 2s
cluster.fault_detection.follower_check.retry_count: 5
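
These fault-detection settings are static, so they require a rolling restart to take effect. One way to confirm they were picked up is the node info API, which reports settings set explicitly in elasticsearch.yml; a minimal check:

# Verify the overrides are in effect on every node
curl -s "localhost:9200/_nodes/settings?filter_path=nodes.*.name,nodes.*.settings.cluster.fault_detection&pretty"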

Cause 2: Garbage Collection Causing Timeouts

Symptoms: Timeouts correlate with GC pauses in logs

Diagnosis:

grep "gc\[" /var/log/elasticsearch/*.log | grep -E "[0-9]{4,}ms"

Solutions:

  • Reduce heap pressure
  • Tune GC settings
  • Increase fault detection timeouts to tolerate GC pauses
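
To correlate the timeouts with heap pressure, heap usage and cumulative GC activity per node are available from the nodes stats API:

# Heap usage and GC counters per node
curl -s "localhost:9200/_nodes/stats/jvm?filter_path=nodes.*.name,nodes.*.jvm.mem.heap_used_percent,nodes.*.jvm.gc.collectors&pretty"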

Cause 3: Firewall or Security Group Issues

Symptoms: Nodes cannot connect at all, or connectivity is intermittent

Diagnosis:

# Check firewall rules
iptables -L -n
# or
firewall-cmd --list-all

# Check cloud security groups

Solutions:

  • Ensure the transport ports (9300-9399) are open between nodes and the HTTP ports (9200-9299) are open to clients
  • Check for connection limits or rate limiting
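
On hosts using firewalld, opening the default Elasticsearch port ranges between nodes looks roughly like this; adjust zones and source restrictions to your environment:

# Open transport and HTTP port ranges (example; restrict sources where possible)
firewall-cmd --permanent --add-port=9300-9399/tcp   # node-to-node transport
firewall-cmd --permanent --add-port=9200-9299/tcp   # HTTP / client traffic
firewall-cmd --reload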

Cause 4: DNS Resolution Problems

Symptoms: Intermittent failures, especially after IP changes

Solutions:

  1. Use IP addresses in configuration:
# elasticsearch.yml
discovery.seed_hosts:
  - 192.168.1.10
  - 192.168.1.11
  - 192.168.1.12
  2. Or ensure reliable DNS:
# /etc/hosts on each node
192.168.1.10 es-node-1
192.168.1.11 es-node-2
192.168.1.12 es-node-3
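
Either way, confirm that every hostname resolves to the same address from every node; getent follows the system resolver order (typically the hosts file, then DNS):

# Run on each node; output should be identical everywhere
for host in es-node-1 es-node-2 es-node-3; do
  getent hosts "$host"
done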

Cause 5: Network Congestion

Symptoms: Timeouts during high traffic periods

Solutions:

  1. Separate cluster traffic:
# elasticsearch.yml
network.host: 0.0.0.0
transport.host: 192.168.2.10  # Dedicated network interface
http.host: 192.168.1.10       # Client-facing interface
  2. Limit recovery bandwidth:
PUT /_cluster/settings
{
  "persistent": {
    "indices.recovery.max_bytes_per_sec": "50mb"
  }
}
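
After splitting the interfaces, confirm where each layer is actually bound; the node info API reports the publish address for both transport and HTTP:

# Transport and HTTP publish addresses per node
curl -s "localhost:9200/_nodes/transport,http?filter_path=nodes.*.name,nodes.*.transport.publish_address,nodes.*.http.publish_address&pretty"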

Cause 6: MTU Mismatch

Symptoms: Large packets fail, small requests succeed

Diagnosis:

# Test different packet sizes
ping -M do -s 1472 <other_node_ip>  # 1500-byte standard MTU (1472 + 28 bytes of headers)
ping -M do -s 8972 <other_node_ip>  # 9000-byte jumbo frames (8972 + 28 bytes of headers)

Solution: Ensure consistent MTU across all nodes and network equipment.
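
The configured MTU can be inspected and aligned with the ip tool; the interface name below is a placeholder, and any change must also match the switches and routers along the path:

# List the MTU of every interface
ip -o link show | awk '{print $2, $4, $5}'

# Align the cluster-facing interface (eth0 is a placeholder)
ip link set dev eth0 mtu 9000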

Cause 7: TCP Keep-Alive Settings

Symptoms: Long-idle connections are dropped, often by firewalls or NAT devices between nodes

Solutions:

# elasticsearch.yml
transport.tcp.keep_alive: true
transport.tcp.keep_idle: 300
transport.tcp.keep_interval: 60
transport.tcp.keep_count: 6
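
Whether keepalive timers are actually armed on transport connections can be checked with ss; look for "keepalive" in the timer column:

# Established connections on the transport port, with timers
ss -to state established '( sport = :9300 or dport = :9300 )'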

Elasticsearch Settings

# elasticsearch.yml

# Discovery
discovery.seed_hosts:
  - 192.168.1.10:9300
  - 192.168.1.11:9300
  - 192.168.1.12:9300
cluster.initial_master_nodes:
  - master-1
  - master-2
  - master-3

# Fault detection (ES 7.x+)
cluster.fault_detection.leader_check.timeout: 30s
cluster.fault_detection.leader_check.interval: 2s
cluster.fault_detection.leader_check.retry_count: 5
cluster.fault_detection.follower_check.timeout: 30s
cluster.fault_detection.follower_check.interval: 2s
cluster.fault_detection.follower_check.retry_count: 5

# Transport
transport.tcp.connect_timeout: 30s
transport.tcp.compress: true
transport.tcp.keep_alive: true

# Network
network.tcp.no_delay: true
network.tcp.keep_alive: true

System-Level Settings

# /etc/sysctl.conf

# TCP keepalive settings
net.ipv4.tcp_keepalive_time = 300
net.ipv4.tcp_keepalive_intvl = 60
net.ipv4.tcp_keepalive_probes = 6

# Connection tracking
net.netfilter.nf_conntrack_max = 1048576
net.netfilter.nf_conntrack_tcp_timeout_established = 86400

# Apply changes
sysctl -p
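
A quick read-back confirms the kernel accepted the new values:

# Verify the keepalive settings took effect
sysctl net.ipv4.tcp_keepalive_time net.ipv4.tcp_keepalive_intvl net.ipv4.tcp_keepalive_probes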

Monitoring Network Health

Key Metrics to Track

  • Ping latency between nodes
  • Packet loss percentage
  • Transport layer errors
  • Master election frequency
  • Node join/leave events

Alerting Thresholds

Metric                  | Warning  | Critical
Ping latency            | > 5ms    | > 20ms
Packet loss             | > 0.1%   | > 1%
Master elections/hour   | > 1      | > 5
Node disconnects/hour   | > 2      | > 10

Monitoring Commands

# Continuous ping monitoring
ping -i 1 <node_ip> | while read line; do echo "$(date): $line"; done >> /var/log/ping.log

# Network statistics
netstat -s | grep -i retrans
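
Master election frequency and node departures can also be tracked straight from the cluster log; the patterns below are assumptions based on typical 7.x log messages and may need adjusting for your version:

# Count master elections and node-left events recorded in the logs
grep -c "elected-as-master" /var/log/elasticsearch/*.log
grep -c "node-left" /var/log/elasticsearch/*.log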

Cross-Datacenter Considerations

For geographically distributed clusters:

  1. Use dedicated cross-datacenter links
  2. Consider CCR (Cross-Cluster Replication) instead of a single stretched cluster (see the sketch after this list)
  3. Increase all timeout values significantly:
cluster.fault_detection.leader_check.timeout: 60s
cluster.fault_detection.follower_check.timeout: 60s
  4. Monitor WAN link health separately
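
If CCR is the route chosen, the first step is registering the remote cluster (note that CCR requires a non-basic license); a minimal sketch, where the alias "dc2" and the seed address are placeholders:

# Register a remote cluster as the basis for cross-cluster replication
curl -s -X PUT "localhost:9200/_cluster/settings" -H 'Content-Type: application/json' -d '
{
  "persistent": {
    "cluster": {
      "remote": {
        "dc2": { "seeds": ["10.2.0.10:9300"] }
      }
    }
  }
}'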

Recovery After Network Event

After resolving network issues:

// 1. Check cluster health
GET /_cluster/health

// 2. Review pending tasks
GET /_cluster/pending_tasks

// 3. If recovery is slow, trigger reroute
POST /_cluster/reroute?retry_failed=true

// 4. Verify all nodes present
GET /_cat/nodes?v