Cluster-wide performance issues in Elasticsearch require a systematic approach to diagnosis and resolution. This guide covers the essential steps to identify bottlenecks and optimize your cluster's performance.
Performance Diagnostic Framework
Phase 1: Cluster Health Assessment
Start with a comprehensive health check:
GET /_cluster/health?level=indices
Key indicators:
- status: green/yellow/red
- number_of_pending_tasks: should be 0 or very low
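- delayed_unassigned_shards: shards whose reallocation is being delayed, typically after a node left the cluster; a value that stays above 0 indicates recovery issues
- active_shards_percent_as_number: should be 100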
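The indicators above can be checked programmatically. The following is a minimal sketch that assumes the health response has already been fetched and parsed into a dict; the function name `assess_health`, the `max_pending_tasks` parameter, and the sample values are illustrative, not part of any Elasticsearch client.

```python
def assess_health(health: dict, max_pending_tasks: int = 0) -> list:
    """Return warnings for a parsed GET /_cluster/health response."""
    warnings = []
    status = health.get("status")
    if status != "green":
        warnings.append(f"cluster status is {status}")
    pending = health.get("number_of_pending_tasks", 0)
    if pending > max_pending_tasks:
        warnings.append(f"{pending} pending tasks")
    delayed = health.get("delayed_unassigned_shards", 0)
    if delayed > 0:
        warnings.append(f"{delayed} delayed unassigned shards")
    active = health.get("active_shards_percent_as_number", 100.0)
    if active < 100.0:
        warnings.append(f"only {active}% of shards are active")
    return warnings


# Made-up response fragment, for illustration only.
sample_health = {
    "status": "yellow",
    "number_of_pending_tasks": 12,
    "delayed_unassigned_shards": 0,
    "active_shards_percent_as_number": 97.5,
}

for warning in assess_health(sample_health):
    print(warning)
```

Since the guide allows a "very low" pending task count, `max_pending_tasks` can be raised to whatever baseline is normal for the cluster.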
Phase 2: Resource Utilization Analysis
Check resource usage across all nodes:
GET /_cat/nodes?v&h=name,cpu,heap.percent,disk.used_percent,load_1m,node.role
Identify nodes with:
- CPU > 80%
- Heap > 85%
- Disk > 85%
- High load average
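As a rough illustration of applying these thresholds, the sketch below assumes the same request was made with `format=json` added, in which case the cat API returns a list of objects whose values are strings. The helper name `flag_nodes` and the sample rows are hypothetical. Load average is left out because the list above gives no numeric cutoff for it.

```python
# Percent thresholds taken from the list above.
LIMITS = {"cpu": 80, "heap.percent": 85, "disk.used_percent": 85}


def flag_nodes(rows):
    """Map node name -> list of columns that exceed their limit."""
    flagged = {}
    for row in rows:
        over = []
        for column, limit in LIMITS.items():
            value = row.get(column)
            if value is None:
                continue  # column missing or not reported for this node
            if float(value) > limit:
                over.append(column)
        if over:
            flagged[row["name"]] = over
    return flagged


# Made-up rows shaped like _cat/nodes?format=json output.
sample_nodes = [
    {"name": "node-1", "cpu": "92", "heap.percent": "71", "disk.used_percent": "64.20"},
    {"name": "node-2", "cpu": "35", "heap.percent": "88", "disk.used_percent": "90.15"},
    {"name": "node-3", "cpu": "20", "heap.percent": "50", "disk.used_percent": "40.00"},
]

print(flag_nodes(sample_nodes))
```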
Phase 3: Thread Pool Analysis
Thread pools indicate where bottlenecks exist:
GET /_cat/thread_pool?v&h=node_name,name,active,queue,rejected
Critical thread pools to monitor:
- search: Search query execution
- write: Indexing operations
- get: Document retrieval
- management: Cluster management tasks
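To see which of these pools is under pressure cluster-wide, the per-node rows can be aggregated. This is a sketch only: it assumes the thread pool request was made with `format=json` (string values), and `summarize_thread_pools` and the sample rows are invented for illustration. Note that `rejected` is a cumulative counter since node start, so a trend over time is more telling than a single reading.

```python
from collections import defaultdict

# The pools called out above.
WATCHED_POOLS = {"search", "write", "get", "management"}


def summarize_thread_pools(rows):
    """Sum queue depth and rejections per watched pool across all nodes."""
    totals = defaultdict(lambda: {"queue": 0, "rejected": 0})
    for row in rows:
        if row["name"] not in WATCHED_POOLS:
            continue
        totals[row["name"]]["queue"] += int(row["queue"])
        totals[row["name"]]["rejected"] += int(row["rejected"])
    return dict(totals)


# Made-up rows shaped like _cat/thread_pool?format=json output.
sample_pools = [
    {"node_name": "node-1", "name": "search", "active": "13", "queue": "40", "rejected": "12"},
    {"node_name": "node-2", "name": "search", "active": "2", "queue": "0", "rejected": "3"},
    {"node_name": "node-1", "name": "write", "active": "4", "queue": "5", "rejected": "0"},
    {"node_name": "node-1", "name": "refresh", "active": "1", "queue": "0", "rejected": "0"},
]

print(summarize_thread_pools(sample_pools))
```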
Common Cluster Performance Issues
Issue 1: Uneven Shard Distribution
Symptoms:
- Some nodes heavily loaded while others idle
- Inconsistent query latency
Diagnosis:
GET /_cat/allocation?v
GET /_cat/shards?v&s=store:desc
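One way to quantify imbalance from the allocation output is to compare shard counts per node. The sketch below assumes `GET /_cat/allocation?format=json` output (string values, with unassigned shards reported under the node name `UNASSIGNED`); the function name and sample rows are hypothetical. Shard count is only a proxy for load, so the size-sorted shard listing is still worth reading alongside it.

```python
def shard_spread(rows):
    """Return (shards per node, difference between busiest and emptiest node)."""
    counts = {
        row["node"]: int(row["shards"])
        for row in rows
        if row["node"] != "UNASSIGNED"  # skip the unassigned-shards row
    }
    if not counts:
        return {}, 0
    return counts, max(counts.values()) - min(counts.values())


# Made-up rows shaped like _cat/allocation?format=json output.
sample_allocation = [
    {"node": "node-1", "shards": "120"},
    {"node": "node-2", "shards": "40"},
    {"node": "node-3", "shards": "44"},
    {"node": "UNASSIGNED", "shards": "3"},
]

counts, spread = shard_spread(sample_allocation)
print(counts, spread)
```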
Solutions:
- Enable shard allocation awareness (this assumes each node defines a matching node.attr.zone attribute):
PUT /_cluster/settings
{
"persistent": {
"cluster.routing.allocation.awareness.attributes": "zone"
}
}
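- Make sure rebalancing is enabled for all shard types ("all" is the default, so this matters only if it was restricted earlier):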
PUT /_cluster/settings
{
"transient": {
"cluster.routing.rebalance.enable": "all"
}
}
Issue 2: Master Node Overload
Symptoms:
- Slow cluster state updates
- High pending tasks
- Master election instability
Diagnosis:
GET /_cluster/pending_tasks
GET /_cat/master?v
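The pending tasks response can be condensed into a count per priority and the age of the oldest task, which makes a backlog easier to spot. This sketch assumes a parsed `GET /_cluster/pending_tasks` response; `summarize_pending` and the sample tasks are illustrative.

```python
from collections import Counter


def summarize_pending(response):
    """Return (task count per priority, longest time in queue in ms)."""
    tasks = response.get("tasks", [])
    by_priority = Counter(task["priority"] for task in tasks)
    oldest_ms = max((task["time_in_queue_millis"] for task in tasks), default=0)
    return dict(by_priority), oldest_ms


# Made-up response, for illustration only.
sample_pending = {
    "tasks": [
        {"insert_order": 101, "priority": "URGENT", "source": "create-index [a]", "time_in_queue_millis": 86},
        {"insert_order": 102, "priority": "HIGH", "source": "put-mapping [b]", "time_in_queue_millis": 4200},
        {"insert_order": 103, "priority": "HIGH", "source": "put-mapping [c]", "time_in_queue_millis": 3900},
    ]
}

print(summarize_pending(sample_pending))
```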
Solutions:
- Use dedicated master nodes (minimum 3)
- Reduce shard count (fewer shards = less cluster state)
- Batch cluster settings updates instead of issuing many small ones, since each update publishes a new cluster state
Issue 3: Slow Cluster State Processing
Symptoms:
- Mapping updates are slow
- Index creation delays
- High pending_tasks count
Diagnosis:
GET /_cluster/state?filter_path=metadata.indices.*.mappings
Solutions:
- Simplify mappings
- Reduce number of indices
- Use index templates efficiently
Issue 4: Network Bottlenecks
Symptoms:
- High latency between nodes
- Frequent node disconnections
- Slow shard recovery
Diagnosis:
GET /_nodes/stats/transport
Solutions:
- Ensure nodes are in same network/availability zone
- Increase network timeouts if needed:
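# elasticsearch.yml (setting names as used in Elasticsearch 6.x and earlier)
discovery.zen.ping_timeout: 10s
transport.tcp.connect_timeout: 30s
# In 7.x and later the discovery.zen.* settings are superseded by the newer
# cluster coordination settings, and the transport setting is transport.connect_timeout.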
- Use dedicated network interfaces for cluster traffic
Issue 5: Disk I/O Bottlenecks
Symptoms:
- High iowait on nodes
- Slow indexing and search
- Segment merging delays
Diagnosis:
GET /_nodes/stats/fs
Check system metrics:
iostat -x 1 10
Solutions:
- Use SSDs for data nodes
- Separate data paths if using multiple disks
- Increase refresh interval for write-heavy workloads
Performance Tuning Strategies
Optimize Shard Configuration
- Target 10-50 GB per shard
- Avoid oversharding (the older rule of thumb of up to 20 shards per GB of heap is deprecated guidance)
- Use ILM for time-series data
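The size target translates into a simple calculation for the number of primary shards. This is a back-of-the-envelope sketch: `primary_shards_for` is a hypothetical helper, and the 30 GB default is just an assumed midpoint of the 10-50 GB range.

```python
import math


def primary_shards_for(index_size_gb, target_shard_gb=30.0):
    """Estimate primary shard count so each shard is near target_shard_gb."""
    if index_size_gb <= 0 or target_shard_gb <= 0:
        raise ValueError("sizes must be positive")
    return max(1, math.ceil(index_size_gb / target_shard_gb))


print(primary_shards_for(600))       # expected index size of 600 GB
print(primary_shards_for(5))         # small index: one shard is enough
print(primary_shards_for(100, 50))   # larger target shard size
```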
Memory Configuration
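- Set heap to no more than 50% of RAM, and keep it below the compressed object pointer threshold (just under 32 GB; the exact cutoff varies by JVM)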
- Leave remaining memory for filesystem cache
- Monitor and tune GC settings
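The heap rule can be expressed as a short calculation. In this sketch the function names are hypothetical and the 31 GB default cap is an assumption, chosen as a conservative value that stays under the roughly 32 GB limit mentioned above; `-Xms` and `-Xmx` are the standard JVM flags for initial and maximum heap.

```python
def recommended_heap_gb(ram_gb, cap_gb=31.0):
    """Half of RAM, capped to stay under the compressed-pointer limit."""
    if ram_gb <= 0:
        raise ValueError("ram_gb must be positive")
    return min(ram_gb / 2, cap_gb)


def jvm_heap_flags(heap_gb):
    """Render equal initial and maximum heap flags, in whole megabytes."""
    megabytes = int(heap_gb * 1024)
    return [f"-Xms{megabytes}m", f"-Xmx{megabytes}m"]


print(recommended_heap_gb(16), jvm_heap_flags(recommended_heap_gb(16)))
print(recommended_heap_gb(128), jvm_heap_flags(recommended_heap_gb(128)))
```

Setting both flags to the same value avoids heap resizing at runtime; the rest of the memory is left to the filesystem cache.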
Query Optimization
- Put yes/no conditions in filter context instead of query context when possible (filters skip relevance scoring and can be cached)
- Implement proper caching strategies
- Avoid expensive operations (wildcards, deep pagination, script scoring)
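To illustrate the first point, the sketch below builds a search body in which only the full-text clause is scored and the exact-match and range conditions sit in the `filter` section of a `bool` query. The field names (`message`, `status`, `@timestamp`) and the helper name are assumptions made for the example.

```python
import json


def filtered_search(text, status, since):
    """Build a bool query: scored full-text match plus unscored filters."""
    return {
        "query": {
            "bool": {
                # Scored clause: contributes to relevance ranking.
                "must": [{"match": {"message": text}}],
                # Filter clauses: yes/no checks, not scored.
                "filter": [
                    {"term": {"status": status}},
                    {"range": {"@timestamp": {"gte": since}}},
                ],
            }
        }
    }


body = filtered_search("timeout", "error", "now-1h")
print(json.dumps(body, indent=2))
```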
Indexing Optimization
- Use bulk APIs for high-volume indexing
- Tune refresh interval based on use case
- Consider index sorting for time-series data
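The first two points can be sketched as request bodies. The bulk API expects newline-delimited JSON with an action line before each document and a trailing newline; the refresh interval is an index setting applied with `PUT /<index>/_settings`. The helper names, the index name `logs`, and the `30s` value are illustrative assumptions.

```python
import json


def build_bulk_body(index, docs):
    """Serialize docs as NDJSON for POST /_bulk (action line, then source)."""
    lines = []
    for doc in docs:
        lines.append(json.dumps({"index": {"_index": index}}))
        lines.append(json.dumps(doc))
    # Every line, including the last, must end with a newline.
    return "".join(line + "\n" for line in lines)


def refresh_interval_settings(interval):
    """Body for PUT /<index>/_settings to change the refresh interval."""
    return {"index": {"refresh_interval": interval}}


bulk_body = build_bulk_body("logs", [{"msg": "a"}, {"msg": "b"}])
print(bulk_body, end="")
print(json.dumps(refresh_interval_settings("30s")))
```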
Monitoring and Alerting
Essential Metrics to Monitor
- Cluster Health: Status changes
- Node Resources: CPU, memory, disk, network
- Thread Pools: Queue depths and rejections
- JVM: Heap usage and GC metrics
- Indices: Indexing rate, search rate, latency
Recommended Alert Thresholds
| Metric | Warning | Critical |
|---|---|---|
| Cluster Status | Yellow | Red |
| Heap Usage | > 75% | > 85% |
| CPU Usage | > 80% | > 90% |
| Disk Usage | > 75% | > 85% |
| Thread Pool Rejections | > 0 | > 100/min |
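The table can be encoded directly as a small classifier for use in a monitoring script. This is a sketch with hypothetical function names; the numeric limits are the ones from the table.

```python
# (warning, critical) percent limits from the table above.
PERCENT_THRESHOLDS = {"heap": (75, 85), "cpu": (80, 90), "disk": (75, 85)}


def classify_percent(metric, value):
    """Classify a heap, cpu, or disk percentage as ok/warning/critical."""
    warning, critical = PERCENT_THRESHOLDS[metric]
    if value > critical:
        return "critical"
    if value > warning:
        return "warning"
    return "ok"


def classify_status(status):
    """Map cluster status colour to an alert level."""
    return {"green": "ok", "yellow": "warning", "red": "critical"}[status]


def classify_rejections(per_minute):
    """Classify thread pool rejections per minute."""
    if per_minute > 100:
        return "critical"
    if per_minute > 0:
        return "warning"
    return "ok"


print(classify_percent("heap", 80), classify_status("yellow"), classify_rejections(0))
```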
Diagnostic Commands Reference
# Cluster health overview
GET /_cluster/health?pretty
# Detailed node statistics
GET /_nodes/stats
# Hot threads across cluster
GET /_nodes/hot_threads
# Shard allocation explanation
GET /_cluster/allocation/explain
# Pending cluster tasks
GET /_cluster/pending_tasks
# Index-level statistics
GET /_stats