Network timeouts cause Elasticsearch cluster instability through node disconnections, master election failures, and shard reallocation storms. This guide helps diagnose and resolve network-related cluster problems.
Symptoms of Network Timeout Issues
- Nodes frequently leaving and rejoining cluster
- Repeated master elections
- "NodeDisconnectedException" in logs
- "failed to ping" or "connect timeout" errors
- Shard relocation without apparent cause
- Cluster status flapping between green/yellow/red
Diagnosing Network Issues
Check Cluster Logs
grep -i "timeout\|disconnect\|failed to ping\|transport\|master.*changed" /var/log/elasticsearch/*.log
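To get a per-category count instead of raw lines, the same patterns can be wrapped in a short script. This is a sketch: the log path assumes the default package install location, so adjust LOG_GLOB for your setup.

```shell
#!/bin/sh
# Summarize network-fault events per category instead of dumping raw lines.
# LOG_GLOB assumes the default package log directory; adjust to your install.
LOG_GLOB="/var/log/elasticsearch/*.log"

count_events() {
  # Reads log text on stdin, prints the number of lines matching pattern $1.
  grep -c -i -E "$1"
}

for pattern in "timeout" "disconnect" "failed to ping" "master.*changed"; do
  # Unquoted glob on purpose so multiple log files are concatenated.
  n=$(cat $LOG_GLOB 2>/dev/null | count_events "$pattern" || true)
  printf '%-20s %s\n' "$pattern" "${n:-0}"
done
```

A sudden jump in the "disconnect" or "master.*changed" counts between runs is the clearest sign the problem is ongoing.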
Verify Node Connectivity
# From each node, test connectivity to others
ping -c 100 <other_node_ip>
# Check for packet loss
mtr --report <other_node_ip>
# Test Elasticsearch ports
nc -zv <other_node_ip> 9300 # Transport
nc -zv <other_node_ip> 9200 # HTTP
Check Cluster State
GET /_cluster/health
GET /_cat/nodes?v
GET /_cat/master?v&h=id,host,ip,node
Review Transport Statistics
GET /_nodes/stats/transport
Common Causes and Fixes
Cause 1: Network Latency Between Nodes
Symptoms: Intermittent timeouts, especially during high load
Diagnosis:
# Measure latency between nodes
ping -c 1000 <node_ip> | grep -E "min|avg|max|loss"
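When checking many node pairs, it helps to reduce the verbose ping output to just the two numbers that matter, loss and average RTT. A sketch assuming the standard Linux iputils ping summary format:

```shell
#!/bin/sh
# Reduce iputils ping output to packet loss and average RTT.
# Reads the complete ping output on stdin.
ping_summary() {
  awk '
    /packet loss/   { for (i = 1; i <= NF; i++) if ($i ~ /%$/) loss = $i }
    /min\/avg\/max/ { split($4, t, "/"); avg = t[2] }   # min/avg/max/mdev
    END { printf "loss=%s avg_ms=%s\n", loss, avg }
  '
}

# Usage: ping -c 1000 <node_ip> | ping_summary
```

Feed the resulting numbers into the alerting thresholds later in this guide.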
Solutions:
Keep nodes in the same network segment:
- Avoid cross-datacenter clusters without proper configuration
- Use a dedicated network for cluster traffic
Increase timeout settings:
# elasticsearch.yml
cluster.fault_detection.leader_check.timeout: 30s
cluster.fault_detection.leader_check.interval: 2s
cluster.fault_detection.leader_check.retry_count: 5
cluster.fault_detection.follower_check.timeout: 30s
cluster.fault_detection.follower_check.interval: 2s
cluster.fault_detection.follower_check.retry_count: 5
Cause 2: Garbage Collection Causing Timeouts
Symptoms: Timeouts correlate with GC pauses in logs
Diagnosis:
grep "gc\[" /var/log/elasticsearch/*.log | grep -E "[0-9]{4,}ms"
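To print only the offending pauses, the following helper filters for pauses of one second or more. It is a sketch: it assumes the pause appears as a millisecond value like "2344ms" somewhere in the line, matching the grep pattern above, and the log file name in the usage comment is a placeholder.

```shell
#!/bin/sh
# Print only log lines whose reported GC pause is 1000ms or longer.
# Assumes pauses appear as millisecond values like "2344ms".
long_gc_pauses() {
  awk 'match($0, /[0-9]+ms/) {
    ms = substr($0, RSTART, RLENGTH) + 0   # "2344ms" converts to 2344
    if (ms >= 1000) print
  }'
}

# Usage: long_gc_pauses < /var/log/elasticsearch/<cluster-name>.log
```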
Solutions:
- Reduce heap pressure
- Tune GC settings
- Increase fault detection timeouts to tolerate GC pauses
Cause 3: Firewall or Security Group Issues
Symptoms: Nodes can't connect at all or intermittently
Diagnosis:
# Check firewall rules
iptables -L -n
# or
firewall-cmd --list-all
# Check cloud security groups
Solutions:
- Ensure port 9200 (HTTP) and ports 9300-9400 (transport) are open between nodes
- Check for connection limits or rate limiting
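A quick sweep of both ports against every peer confirms the rules are correct in practice. The node IPs below are placeholders; substitute your cluster's addresses.

```shell
#!/bin/sh
# Check HTTP (9200) and transport (9300) reachability to each peer node.
# Node IPs are placeholders; replace with your cluster's addresses.
check_ports() {
  node=$1
  for port in 9200 9300; do
    if nc -z -w 2 "$node" "$port" 2>/dev/null; then
      echo "OK   $node:$port"
    else
      echo "FAIL $node:$port"
    fi
  done
}

for node in 192.168.1.10 192.168.1.11 192.168.1.12; do
  check_ports "$node"
done
```

Run it from every node, not just one: asymmetric firewall rules are a common cause of one-directional failures.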
Cause 4: DNS Resolution Problems
Symptoms: Intermittent failures, especially after IP changes
Solutions:
- Use IP addresses in configuration:
# elasticsearch.yml
discovery.seed_hosts:
  - 192.168.1.10
  - 192.168.1.11
  - 192.168.1.12
- Or ensure reliable DNS:
# /etc/hosts on each node
192.168.1.10 es-node-1
192.168.1.11 es-node-2
192.168.1.12 es-node-3
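Whichever approach you pick, verify on every node that each hostname resolves to the pinned IP. A sketch using the placeholder names and addresses from the example above:

```shell
#!/bin/sh
# Confirm each cluster hostname resolves to its pinned IP.
# Names and IPs are the placeholder values from the example above.
check_host() {
  # $1 = hostname, $2 = expected IP
  resolved=$(getent hosts "$1" | awk '{ print $1; exit }')
  if [ "$resolved" = "$2" ]; then
    echo "OK       $1 -> $resolved"
  else
    echo "MISMATCH $1 -> ${resolved:-unresolved} (expected $2)"
  fi
}

check_host es-node-1 192.168.1.10
check_host es-node-2 192.168.1.11
check_host es-node-3 192.168.1.12
```

Using getent rather than nslookup means the check goes through the same resolver path (including /etc/hosts) that Elasticsearch itself uses.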
Cause 5: Network Congestion
Symptoms: Timeouts during high traffic periods
Solutions:
- Separate cluster traffic:
# elasticsearch.yml
network.host: 0.0.0.0
transport.host: 192.168.2.10 # Dedicated network interface
http.host: 192.168.1.10 # Client-facing interface
- Limit recovery bandwidth:
PUT /_cluster/settings
{
"persistent": {
"indices.recovery.max_bytes_per_sec": "50mb"
}
}
Cause 6: MTU Mismatch
Symptoms: Large packets fail, small requests succeed
Diagnosis:
# Test different packet sizes
ping -M do -s 1472 <other_node_ip> # Standard MTU
ping -M do -s 8972 <other_node_ip> # Jumbo frames
Solution: Ensure consistent MTU across all nodes and network equipment.
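The odd-looking payload sizes come from subtracting the 20-byte IP header and the 8-byte ICMP header from the target MTU. The arithmetic made explicit:

```shell
#!/bin/sh
# ICMP payload size for a target MTU: MTU minus 20 bytes of IP header
# and 8 bytes of ICMP header, which is why 1472 probes a 1500-byte MTU.
payload_for_mtu() {
  echo $(( $1 - 28 ))
}

echo "MTU 1500 -> ping -M do -s $(payload_for_mtu 1500)"   # -s 1472
echo "MTU 9000 -> ping -M do -s $(payload_for_mtu 9000)"   # -s 8972
```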
Cause 7: TCP Keep-Alive Settings
Symptoms: Long-idle connections being dropped
Solutions:
# elasticsearch.yml
transport.tcp.keep_alive: true
transport.tcp.keep_idle: 300
transport.tcp.keep_interval: 60
transport.tcp.keep_count: 6
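With these values a dead peer is only noticed after the idle time plus interval times probe count, i.e. 300 + 60 * 6 = 660 seconds; lower keep_idle if faster detection matters. The arithmetic as a quick check:

```shell
#!/bin/sh
# Worst-case seconds to detect a dead peer via TCP keepalive:
# idle time before the first probe plus interval * probe count.
detect_seconds() {
  # $1 = keep_idle, $2 = keep_interval, $3 = keep_count
  echo $(( $1 + $2 * $3 ))
}

echo "Dead peer detected after $(detect_seconds 300 60 6)s"   # 660s
```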
Recommended Network Configuration
Elasticsearch Settings
# elasticsearch.yml
# Discovery
discovery.seed_hosts:
  - 192.168.1.10:9300
  - 192.168.1.11:9300
  - 192.168.1.12:9300
cluster.initial_master_nodes:
  - master-1
  - master-2
  - master-3
# Fault detection (ES 7.x+)
cluster.fault_detection.leader_check.timeout: 30s
cluster.fault_detection.leader_check.interval: 2s
cluster.fault_detection.leader_check.retry_count: 5
cluster.fault_detection.follower_check.timeout: 30s
cluster.fault_detection.follower_check.interval: 2s
cluster.fault_detection.follower_check.retry_count: 5
# Transport
transport.tcp.connect_timeout: 30s
transport.tcp.compress: true
transport.tcp.keep_alive: true
# Network
network.tcp.no_delay: true
network.tcp.keep_alive: true
System-Level Settings
# /etc/sysctl.conf
# TCP keepalive settings
net.ipv4.tcp_keepalive_time = 300
net.ipv4.tcp_keepalive_intvl = 60
net.ipv4.tcp_keepalive_probes = 6
# Connection tracking
net.netfilter.nf_conntrack_max = 1048576
net.netfilter.nf_conntrack_tcp_timeout_established = 86400
# Apply changes
sysctl -p
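After `sysctl -p`, confirm the kernel actually reports the desired values. The keys and targets mirror the /etc/sysctl.conf fragment above:

```shell
#!/bin/sh
# Verify the kernel reports the desired values after `sysctl -p`.
# Keys and values mirror the /etc/sysctl.conf fragment above.
check_sysctl() {
  # $1 = key, $2 = desired value
  current=$(sysctl -n "$1" 2>/dev/null || true)
  if [ "$current" = "$2" ]; then
    echo "OK   $1 = $current"
  else
    echo "DIFF $1 = ${current:-unreadable} (want $2)"
  fi
}

check_sysctl net.ipv4.tcp_keepalive_time 300
check_sysctl net.ipv4.tcp_keepalive_intvl 60
check_sysctl net.ipv4.tcp_keepalive_probes 6
```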
Monitoring Network Health
Key Metrics to Track
- Ping latency between nodes
- Packet loss percentage
- Transport layer errors
- Master election frequency
- Node join/leave events
Alerting Thresholds
| Metric | Warning | Critical |
|---|---|---|
| Ping latency | > 5ms | > 20ms |
| Packet loss | > 0.1% | > 1% |
| Master elections/hour | > 1 | > 5 |
| Node disconnects/hour | > 2 | > 10 |
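The latency thresholds in the table translate directly into a tiny classifier that can sit inside an alerting script. Pure arithmetic, no network access needed:

```shell
#!/bin/sh
# Classify a ping-latency sample against the thresholds in the table:
# above 5ms is a warning, above 20ms is critical.
latency_level() {
  # $1 = latency in whole milliseconds
  if [ "$1" -gt 20 ]; then echo CRITICAL
  elif [ "$1" -gt 5 ]; then echo WARNING
  else echo OK
  fi
}

for ms in 2 9 35; do
  echo "${ms}ms -> $(latency_level "$ms")"
done
```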
Monitoring Commands
# Continuous ping monitoring
ping -i 1 <node_ip> | while read line; do echo "$(date): $line"; done >> /var/log/ping.log
# Network statistics
netstat -s | grep -i retrans
Cross-Datacenter Considerations
For geographically distributed clusters:
- Use dedicated cross-datacenter links
- Consider CCR (Cross-Cluster Replication) instead of a single stretched cluster
- Increase all timeout values significantly:
cluster.fault_detection.leader_check.timeout: 60s
cluster.fault_detection.follower_check.timeout: 60s
- Monitor WAN link health separately
Recovery After Network Event
After resolving network issues:
// 1. Check cluster health
GET /_cluster/health
// 2. Review pending tasks
GET /_cluster/pending_tasks
// 3. If recovery is slow, trigger reroute
POST /_cluster/reroute?retry_failed=true
// 4. Verify all nodes present
GET /_cat/nodes?v
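The four steps above chain into one short post-incident script. ES_URL and its localhost default are assumptions; point it at any node's HTTP endpoint.

```shell
#!/bin/sh
# Post-incident check: read cluster status, retry failed shard
# allocations if not green, then list nodes.
# ES_URL is an assumption; point it at any node's HTTP endpoint.
ES_URL="${ES_URL:-http://localhost:9200}"

health_status() {
  # Extract the "status" field from /_cluster/health JSON on stdin.
  sed -n 's/.*"status" *: *"\([a-z]*\)".*/\1/p' | head -n 1
}

status=$(curl -s "$ES_URL/_cluster/health" | health_status)
if [ -n "$status" ] && [ "$status" != "green" ]; then
  # Retry allocations that hit the max failure count during the outage.
  curl -s -X POST "$ES_URL/_cluster/reroute?retry_failed=true" >/dev/null
fi
curl -s "$ES_URL/_cat/nodes?v" || true
echo "cluster status: ${status:-unreachable}"
```

If the node list is shorter than expected after the script runs, go back to the connectivity checks earlier in this guide before touching allocation settings.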