Elasticsearch Cluster Unstable Due to Network Timeouts

Network timeouts cause Elasticsearch cluster instability through node disconnections, master election failures, and shard reallocation storms. This guide helps diagnose and resolve network-related cluster problems.

Symptoms of Network Timeout Issues

Nodes frequently leaving and rejoining the cluster
Repeated master elections
"NodeDisconnectedException" in logs
"failed to ping" or "connect timeout" errors
Shard relocation without apparent cause
Cluster status flapping between green/yellow/red

Diagnosing Network Issues

Check Cluster Logs

grep -i "timeout\|disconnect\|failed to ping\|transport\|master.*changed" /var/log/elasticsearch/*.log
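To see whether these events cluster around particular times, the matches can be bucketed by hour. The following is a minimal sketch, assuming the default Elasticsearch log layout in which each line starts with a bracketed ISO timestamp such as [2024-01-15T10:23:45,123]. The function name events_per_hour is hypothetical, and the pattern omits "transport" because that word also appears in ordinary logger names.

```shell
# Count timeout/disconnect-related log lines per hour (reads log lines on stdin).
# Assumption: each line begins with "[YYYY-MM-DDTHH:MM:SS,mmm]".
events_per_hour() {
  grep -iE 'timeout|disconnect|failed to ping|master.*changed' |
    awk '/^\[[0-9]/ { count[substr($0, 2, 13)]++ }   # key = "YYYY-MM-DDTHH"
         END { for (h in count) print h, count[h] }' |
    sort
}
```

Example use: cat /var/log/elasticsearch/*.log | events_per_hour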

Verify Node Connectivity

# From each node, test connectivity to others
ping -c 100 <other_node_ip>

# Check for packet loss
mtr --report <other_node_ip>

# Test Elasticsearch ports
nc -zv <other_node_ip> 9300  # Transport
nc -zv <other_node_ip> 9200  # HTTP

Check Cluster State

GET /_cluster/health
GET /_cat/nodes?v
GET /_cat/master?v&h=id,host,ip,node

Review Transport Statistics

GET /_nodes/stats/transport

Common Causes and Fixes

Cause 1: Network Latency Between Nodes

Symptoms: Intermittent timeouts, especially during high load

Diagnosis:

# Measure latency between nodes
ping -c 1000 <node_ip> | grep -E "min|avg|max|loss"
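For scripted checks, the loss percentage and average round-trip time can be pulled out of the ping summary. This sketch assumes the summary format printed by Linux iputils ping ("X% packet loss" and "rtt min/avg/max/mdev = ..."); other ping implementations may differ. The function name ping_summary is hypothetical.

```shell
# Print "<loss_percent> <avg_rtt_ms>" from ping output supplied on stdin.
# Assumption: Linux iputils ping summary format.
ping_summary() {
  awk '
    /packet loss/ {
      # The loss figure is the field that ends in "%".
      for (i = 1; i <= NF; i++)
        if ($i ~ /%$/) { loss = $i; sub(/%/, "", loss) }
    }
    /min\/avg\/max/ {
      # "rtt min/avg/max/mdev = a/b/c/d ms": take the second value.
      split($0, halves, " = ")
      split(halves[2], rtt, "/")
      avg = rtt[2]
    }
    END { print loss, avg }
  '
}
```

Example use: ping -c 1000 <node_ip> | ping_summary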

Solutions:

Ensure nodes are in the same network segment:
- Avoid cross-datacenter clusters without proper configuration
- Use dedicated network for cluster traffic
Increase timeout settings:

# elasticsearch.yml
cluster.fault_detection.leader_check.timeout: 30s
cluster.fault_detection.leader_check.interval: 2s
cluster.fault_detection.leader_check.retry_count: 5
cluster.fault_detection.follower_check.timeout: 30s
cluster.fault_detection.follower_check.interval: 2s
cluster.fault_detection.follower_check.retry_count: 5
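Larger timeouts tolerate slow links, but they also lengthen the time before a genuinely hung node is removed from the cluster. A rough upper bound can be estimated under the assumption that every check runs to its full timeout and the next check starts one interval later. The function name detection_window is hypothetical, and the formula is an approximation rather than Elasticsearch's exact scheduling.

```shell
# Rough worst-case seconds before an unresponsive node is declared faulty.
# Assumption: bound = retry_count * (timeout + interval).
detection_window() {
  timeout_s=$1; interval_s=$2; retries=$3
  echo $(( retries * (timeout_s + interval_s) ))
}
```

With the values above, detection_window 30 2 5 gives 160 seconds, compared with 33 seconds for the documented defaults (10s timeout, 1s interval, 3 retries). Nodes whose connection is closed outright are still detected immediately.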

Cause 2: Garbage Collection Causing Timeouts

Symptoms: Timeouts correlate with GC pauses in logs

Diagnosis:

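# Assumes the default GC monitor log format, e.g. "[gc][young][12][3] duration [1.2s]"
# Matches pauses reported in seconds or minutes (1s or longer)
grep -F "[gc]" /var/log/elasticsearch/*.log | grep -E "(duration|spent) \[[0-9.]+(s|m)\]"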
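To filter GC entries by pause length rather than by pattern alone, the duration can be parsed and compared against a threshold. This sketch assumes log lines contain "duration [<n>ms]", "duration [<n>s]", or "duration [<n>m]" as written by the JVM GC monitor; the function name gc_pauses_over is hypothetical.

```shell
# Print GC log lines (from stdin) whose pause is at least min_seconds (default 1).
# Assumption: lines contain "duration [<n>ms|s|m]"; other layouts are ignored.
gc_pauses_over() {
  awk -v min="${1:-1}" '
    match($0, /duration \[[0-9.]+(ms|s|m)\]/) {
      d = substr($0, RSTART + 10, RLENGTH - 11)   # text inside the brackets
      secs = d + 0                                # numeric prefix of the value
      if (d ~ /ms$/) secs = secs / 1000
      else if (d ~ /m$/) secs = secs * 60
      if (secs >= min) print
    }
  '
}
```

Example use: cat /var/log/elasticsearch/*.log | gc_pauses_over 5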

Solutions:

Reduce heap pressure
Tune GC settings
Increase fault detection timeouts to tolerate GC pauses

Cause 3: Firewall or Security Group Issues

Symptoms: Nodes cannot connect at all, or connect only intermittently

Diagnosis:

# Check firewall rules
iptables -L -n
# or
firewall-cmd --list-all

# Check cloud security groups in your provider's console or CLI

Solutions:

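Ensure the transport port (9300 by default, range 9300-9400) is open between all nodes, and the HTTP port (9200 by default, range 9200-9300) is open to clients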
Check for connection limits or rate limiting

Cause 4: DNS Resolution Problems

Symptoms: Intermittent failures, especially after IP changes

Solutions:

Use IP addresses in configuration:

# elasticsearch.yml
discovery.seed_hosts:
  - 192.168.1.10
  - 192.168.1.11
  - 192.168.1.12

Or make hostname resolution independent of DNS with static entries:

# /etc/hosts on each node
192.168.1.10 es-node-1
192.168.1.11 es-node-2
192.168.1.12 es-node-3

Cause 5: Network Congestion

Symptoms: Timeouts during high traffic periods

Solutions:

Separate cluster traffic:

# elasticsearch.yml
transport.host: 192.168.2.10  # Dedicated network interface for node-to-node traffic
http.host: 192.168.1.10       # Client-facing interface

Limit recovery bandwidth:

PUT /_cluster/settings
{
  "persistent": {
    "indices.recovery.max_bytes_per_sec": "50mb"
  }
}
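A lower limit protects cluster traffic but stretches recovery time, so it helps to estimate the trade-off first. The sketch below assumes the throttle is the only bottleneck; real recoveries also depend on disk, CPU, and how many recoveries run concurrently on a node. The function name recovery_seconds is hypothetical.

```shell
# Estimate whole seconds needed to copy a shard at a given recovery rate.
# Assumption: throughput equals the configured limit for the entire copy.
recovery_seconds() {
  shard_gb=$1; rate_mb_per_sec=$2
  echo $(( shard_gb * 1024 / rate_mb_per_sec ))
}
```

For example, recovery_seconds 50 50 gives 1024, so a 50 GB shard needs roughly 17 minutes at 50mb per second.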

Cause 6: MTU Mismatch

Symptoms: Large packets fail, small requests succeed

Diagnosis:

# Test different packet sizes
ping -M do -s 1472 <other_node_ip>  # Standard MTU
ping -M do -s 8972 <other_node_ip>  # Jumbo frames

Solution: Ensure consistent MTU across all nodes and network equipment.
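The payload sizes used in the test come from subtracting the IPv4 header (20 bytes) and the ICMP header (8 bytes) from the MTU. A small helper makes this explicit for other MTU values; the name ping_payload_for_mtu is hypothetical, and the calculation assumes IPv4 without IP options.

```shell
# Largest ICMP payload that fits in one unfragmented IPv4 packet:
# MTU minus the 20-byte IPv4 header and the 8-byte ICMP header.
ping_payload_for_mtu() {
  echo $(( $1 - 28 ))
}
```

Example use: ping -M do -c 3 -s "$(ping_payload_for_mtu 1500)" <other_node_ip>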

Cause 7: TCP Keep-Alive Settings

Symptoms: Long-idle connections being dropped

Solutions:

# elasticsearch.yml
transport.tcp.keep_alive: true
transport.tcp.keep_idle: 300
transport.tcp.keep_interval: 60
transport.tcp.keep_count: 6
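These three values together determine how long a dead peer can go unnoticed on an idle connection: the idle time before the first probe, plus one interval for each unanswered probe. The sketch below works through that arithmetic under standard TCP keepalive semantics; the function name keepalive_detection_seconds is hypothetical.

```shell
# Seconds from the last traffic until TCP keepalive gives up on a dead peer:
# idle time before the first probe, plus one interval per unanswered probe.
keepalive_detection_seconds() {
  idle=$1; interval=$2; count=$3
  echo $(( idle + interval * count ))
}
```

With the settings above, keepalive_detection_seconds 300 60 6 gives 660 seconds. If a firewall or load balancer drops idle connections sooner than the idle value (300 seconds here), the idle value needs to be set below that device's timeout.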

Recommended Network Configuration

Elasticsearch Settings

# elasticsearch.yml

# Discovery
discovery.seed_hosts:
  - 192.168.1.10:9300
  - 192.168.1.11:9300
  - 192.168.1.12:9300
cluster.initial_master_nodes:
  - master-1
  - master-2
  - master-3

# Fault detection (ES 7.x+)
cluster.fault_detection.leader_check.timeout: 30s
cluster.fault_detection.leader_check.interval: 2s
cluster.fault_detection.leader_check.retry_count: 5
cluster.fault_detection.follower_check.timeout: 30s
cluster.fault_detection.follower_check.interval: 2s
cluster.fault_detection.follower_check.retry_count: 5

# Transport
transport.connect_timeout: 30s
transport.compress: true
transport.tcp.keep_alive: true

# Network
network.tcp.no_delay: true
network.tcp.keep_alive: true

System-Level Settings

# /etc/sysctl.conf

# TCP keepalive settings
net.ipv4.tcp_keepalive_time = 300
net.ipv4.tcp_keepalive_intvl = 60
net.ipv4.tcp_keepalive_probes = 6

# Connection tracking
net.netfilter.nf_conntrack_max = 1048576
net.netfilter.nf_conntrack_tcp_timeout_established = 86400

# Apply changes (run from a shell as root; this line is not part of sysctl.conf)
sysctl -p

Monitoring Network Health

Key Metrics to Track

Ping latency between nodes
Packet loss percentage
Transport layer errors
Master election frequency
Node join/leave events

Alerting Thresholds

Metric	Warning	Critical
Ping latency	> 5ms	> 20ms
Packet loss	> 0.1%	> 1%
Master elections/hour	> 1	> 5
Node disconnects/hour	> 2	> 10
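The thresholds in the table can be applied mechanically once a metric has been measured. This is a minimal sketch of such a check, in which a value strictly above a threshold triggers that level; the function name classify and its argument order are hypothetical.

```shell
# Classify a measured value against warning and critical thresholds.
# awk handles the comparison so fractional values such as 0.1 work.
classify() {
  awk -v v="$1" -v warn="$2" -v crit="$3" 'BEGIN {
    if (v > crit) print "critical"
    else if (v > warn) print "warning"
    else print "ok"
  }'
}
```

For example, classify 7 5 20 reports "warning" for a 7 ms ping latency, and classify 0.05 0.1 1 reports "ok" for 0.05% packet loss.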

Monitoring Commands

# Continuous ping monitoring
ping -i 1 <node_ip> | while read line; do echo "$(date): $line"; done >> /var/log/ping.log

# Network statistics
netstat -s | grep -i retrans

Cross-Datacenter Considerations

For geographically distributed clusters:

Use dedicated cross-datacenter links
Consider CCR (Cross-Cluster Replication) instead of single cluster
Increase all timeout values significantly:

cluster.fault_detection.leader_check.timeout: 60s
cluster.fault_detection.follower_check.timeout: 60s

Monitor WAN link health separately

Recovery After Network Event

After resolving network issues:

// 1. Check cluster health
GET /_cluster/health

// 2. Review pending tasks
GET /_cluster/pending_tasks

// 3. If shards stay unassigned after repeated allocation failures, retry them
POST /_cluster/reroute?retry_failed=true

// 4. Verify all nodes present
GET /_cat/nodes?v
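The health and node checks above can be combined into a single pass/fail test for scripts. This sketch assumes the input is the one-line output of GET /_cat/health?h=status,node.total,unassign and that the expected node count is known in advance; the function name cluster_recovered is hypothetical.

```shell
# Succeeds when a "status node.total unassign" line on stdin shows a green
# cluster with the expected node count and no unassigned shards.
# Assumption: input comes from /_cat/health?h=status,node.total,unassign
cluster_recovered() {
  awk -v want="$1" '
    { ok = ($1 == "green" && $2 == want && $3 == 0) }
    END { exit(ok ? 0 : 1) }
  '
}
```

Example use, assuming an unsecured node on localhost:9200 and a three-node cluster: curl -s 'http://localhost:9200/_cat/health?h=status,node.total,unassign' | cluster_recovered 3 && echo "recovered"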
