Elasticsearch Cluster Performance Troubleshooting

Cluster-wide performance issues in Elasticsearch require a systematic approach to diagnosis and resolution. This guide covers the essential steps to identify bottlenecks and optimize your cluster's performance.

Performance Diagnostic Framework

Phase 1: Cluster Health Assessment

Start with a comprehensive health check:

GET /_cluster/health?level=indices

Key indicators:

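  • status: green (all shards assigned), yellow (some replicas unassigned), or red (at least one primary unassigned)
  • number_of_pending_tasks: should be 0 or very low
  • delayed_unassigned_shards: shards whose allocation is being delayed, typically after a node has left the cluster
  • active_shards_percent_as_number: should be 100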
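
The indicators above can be checked automatically. The following is a minimal sketch, assuming the health response has already been fetched and parsed into a dict; the function name assess_health and the sample values are illustrative, not part of Elasticsearch.

```python
def assess_health(health):
    """Return a list of findings for a parsed GET /_cluster/health response."""
    findings = []
    if health.get("status") != "green":
        findings.append(f"status is {health.get('status')}")
    if health.get("number_of_pending_tasks", 0) > 0:
        findings.append(f"{health['number_of_pending_tasks']} pending tasks")
    if health.get("delayed_unassigned_shards", 0) > 0:
        findings.append(
            f"{health['delayed_unassigned_shards']} delayed unassigned shards"
        )
    if health.get("active_shards_percent_as_number", 100.0) < 100.0:
        findings.append(
            f"only {health['active_shards_percent_as_number']:.1f}% of shards active"
        )
    return findings


# Hypothetical response, trimmed to the fields discussed above.
sample = {
    "status": "yellow",
    "number_of_pending_tasks": 0,
    "delayed_unassigned_shards": 2,
    "active_shards_percent_as_number": 96.5,
}
for finding in assess_health(sample):
    print(finding)
```

An empty list means none of the four indicators needs attention.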

Phase 2: Resource Utilization Analysis

Check resource usage across all nodes:

GET /_cat/nodes?v&h=name,cpu,heap.percent,disk.used_percent,load_1m,node.role

Identify nodes with:

  • CPU > 80%
  • Heap > 85%
  • Disk > 85%
  • Load average consistently above the node's CPU core count
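
As a rough illustration of applying these thresholds, the sketch below assumes the _cat/nodes request was made with format=json added, so each node arrives as a dict of string values. The helper name overloaded_nodes and the sample rows are hypothetical.

```python
# Limits taken from the list above; tune them for your environment.
LIMITS = {"cpu": 80.0, "heap.percent": 85.0, "disk.used_percent": 85.0}


def overloaded_nodes(rows):
    """Map node name to the list of metrics that exceed their limit."""
    flagged = {}
    for row in rows:
        exceeded = []
        for metric, limit in LIMITS.items():
            value = row.get(metric)
            # _cat APIs return values as strings; skip missing ones.
            if value is not None and float(value) > limit:
                exceeded.append(metric)
        if exceeded:
            flagged[row["name"]] = exceeded
    return flagged


# Hypothetical rows, trimmed to the columns requested above.
rows = [
    {"name": "data-1", "cpu": "91", "heap.percent": "62", "disk.used_percent": "40.10"},
    {"name": "data-2", "cpu": "35", "heap.percent": "88", "disk.used_percent": "90.50"},
    {"name": "master-1", "cpu": "5", "heap.percent": "30", "disk.used_percent": "12.00"},
]
print(overloaded_nodes(rows))
```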

Phase 3: Thread Pool Analysis

Thread pool queues and rejections show where work is backing up:

GET /_cat/thread_pool?v&h=node_name,name,active,queue,rejected

Critical thread pools to monitor:

  • search: Search query execution
  • write: Indexing operations
  • get: Document retrieval
  • management: Cluster management tasks
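
One way to read the output is to surface only the monitored pools that have queued or rejected work. This sketch assumes the request was made with format=json; note that the rejected counter is cumulative since node start, so a nonzero value may reflect a past incident. The function name saturated_pools and the sample rows are made up for illustration.

```python
def saturated_pools(rows, watched=("search", "write", "get", "management")):
    """Return (node, pool, queue, rejected) tuples, worst offenders first."""
    hits = []
    for row in rows:
        if row["name"] not in watched:
            continue
        queue, rejected = int(row["queue"]), int(row["rejected"])
        if queue > 0 or rejected > 0:
            hits.append((row["node_name"], row["name"], queue, rejected))
    # Sort by rejections first, then by queue depth.
    return sorted(hits, key=lambda h: (h[3], h[2]), reverse=True)


# Hypothetical rows from GET /_cat/thread_pool?format=json&h=...
pool_rows = [
    {"node_name": "data-1", "name": "write", "active": "8", "queue": "150", "rejected": "42"},
    {"node_name": "data-1", "name": "search", "active": "2", "queue": "0", "rejected": "0"},
    {"node_name": "data-2", "name": "search", "active": "13", "queue": "30", "rejected": "0"},
    {"node_name": "data-2", "name": "flush", "active": "1", "queue": "5", "rejected": "0"},
]
for hit in saturated_pools(pool_rows):
    print(hit)
```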

Common Cluster Performance Issues

Issue 1: Uneven Shard Distribution

Symptoms:

  • Some nodes heavily loaded while others idle
  • Inconsistent query latency

Diagnosis:

GET /_cat/allocation?v
GET /_cat/shards?v&s=store:desc
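
To put a number on the imbalance, each node's shard count can be compared with the cluster average. The sketch below assumes the _cat/allocation output was requested with format=json; shard_imbalance is an illustrative helper, and shard count is only a proxy, since a few very large or very busy shards can skew load even when counts are even.

```python
from statistics import mean


def shard_imbalance(rows):
    """Return each node's shard count as a ratio of the cluster average."""
    counts = {
        row["node"]: int(row["shards"])
        for row in rows
        if row["node"] != "UNASSIGNED"  # _cat/allocation lists unassigned shards as a pseudo-node
    }
    if not counts:
        return {}
    average = mean(counts.values())
    if average == 0:
        return {}
    return {node: round(count / average, 2) for node, count in counts.items()}


# Hypothetical rows, trimmed to the two columns used here.
allocation_rows = [
    {"node": "data-1", "shards": "30"},
    {"node": "data-2", "shards": "10"},
    {"node": "data-3", "shards": "20"},
    {"node": "UNASSIGNED", "shards": "4"},
]
print(shard_imbalance(allocation_rows))
```

Ratios well above 1.0 point at nodes carrying more than their share of shards.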

Solutions:

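  • Enable shard allocation awareness (each node must also define the matching attribute, here node.attr.zone, in elasticsearch.yml):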
PUT /_cluster/settings
{
  "persistent": {
    "cluster.routing.allocation.awareness.attributes": "zone"
  }
}
  • Make sure rebalancing is enabled (all is the default, so this only helps if it was restricted earlier):
PUT /_cluster/settings
{
  "transient": {
    "cluster.routing.rebalance.enable": "all"
  }
}

Issue 2: Master Node Overload

Symptoms:

  • Slow cluster state updates
  • High pending tasks
  • Master election instability

Diagnosis:

GET /_cluster/pending_tasks
GET /_cat/master?v

Solutions:

  • Use dedicated master nodes (minimum 3)
  • Reduce shard count (fewer shards = less cluster state)
  • Batch settings, mapping, and template changes instead of issuing many small cluster state updates
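
The pending tasks response is easier to judge when it is summarized. This is a small sketch, assuming the response from GET /_cluster/pending_tasks has been parsed into a dict; summarize_pending is a hypothetical helper and the sample tasks are invented.

```python
from collections import Counter


def summarize_pending(response):
    """Summarize a parsed GET /_cluster/pending_tasks response."""
    tasks = response.get("tasks", [])
    by_priority = Counter(task["priority"] for task in tasks)
    oldest_ms = max((task["time_in_queue_millis"] for task in tasks), default=0)
    return {
        "count": len(tasks),
        "by_priority": dict(by_priority),
        "oldest_ms": oldest_ms,
    }


# Hypothetical response with two queued tasks.
pending = {
    "tasks": [
        {"insert_order": 101, "priority": "URGENT", "source": "create-index [logs-1]",
         "time_in_queue_millis": 86, "executing": True},
        {"insert_order": 102, "priority": "HIGH", "source": "put-mapping",
         "time_in_queue_millis": 4200, "executing": False},
    ]
}
print(summarize_pending(pending))
```

A growing count together with a rising oldest_ms suggests the master is not keeping up with cluster state updates.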

Issue 3: Slow Cluster State Processing

Symptoms:

  • Mapping updates are slow
  • Index creation delays
  • High pending_tasks count

Diagnosis:

GET /_cluster/state?filter_path=metadata.indices.*.mappings

Solutions:

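  • Simplify mappings (fewer fields; restrict dynamic mapping where it causes field explosion)
  • Reduce number of indices
  • Use index templates so new indices start with their mappings in place instead of triggering repeated dynamic mapping updates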
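
To find which indices carry the heaviest mappings, the fields in the cluster state response can be counted. The sketch below assumes the response of the diagnosis request above has been parsed into a dict; count_fields and fields_per_index are illustrative names, and the handling of a type wrapper such as "_doc" is an assumption about the response layout.

```python
def count_fields(mapping):
    """Count fields under "properties", including object sub-fields and multi-fields."""
    total = 0
    for spec in mapping.get("properties", {}).values():
        total += 1                            # the field (or object) itself
        total += count_fields(spec)           # nested object sub-fields
        total += len(spec.get("fields", {}))  # multi-fields such as .keyword
    return total


def fields_per_index(state):
    """Map index name to mapped field count for a parsed cluster state response."""
    result = {}
    for index, meta in state["metadata"]["indices"].items():
        mappings = meta.get("mappings", {})
        # The mapping may be wrapped in a type name such as "_doc".
        bodies = [mappings] if "properties" in mappings else list(mappings.values())
        result[index] = sum(count_fields(b) for b in bodies if isinstance(b, dict))
    return result


# Hypothetical, heavily trimmed cluster state.
state = {
    "metadata": {
        "indices": {
            "logs-1": {
                "mappings": {
                    "_doc": {
                        "properties": {
                            "ts": {"type": "date"},
                            "user": {
                                "properties": {
                                    "name": {
                                        "type": "text",
                                        "fields": {"keyword": {"type": "keyword"}},
                                    }
                                }
                            },
                        }
                    }
                }
            },
            "empty-1": {"mappings": {}},
        }
    }
}
print(fields_per_index(state))
```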

Issue 4: Network Bottlenecks

Symptoms:

  • High latency between nodes
  • Frequent node disconnections
  • Slow shard recovery

Diagnosis:

GET /_nodes/stats/transport

Solutions:

  • Keep all nodes on the same low-latency network (spread them across availability zones only when inter-zone latency is low)
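  • Increase network timeouts if needed (setting names depend on the Elasticsearch version):
# elasticsearch.yml (6.x and earlier, legacy Zen discovery)
discovery.zen.ping_timeout: 10s
transport.tcp.connect_timeout: 30s
# elasticsearch.yml (7.x and later; Zen discovery was replaced and transport.tcp.* was renamed)
transport.connect_timeout: 30s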
  • Use dedicated network interfaces for cluster traffic

Issue 5: Disk I/O Bottlenecks

Symptoms:

  • High iowait on nodes
  • Slow indexing and search
  • Segment merging delays

Diagnosis:

GET /_nodes/stats/fs

Check system metrics:

iostat -x 1 10

Solutions:

  • Use SSDs for data nodes
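  • Spread data across multiple disks where available (multiple path.data entries have been deprecated since 7.13, so striping the disks into one volume is the safer choice on recent versions)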
  • Increase index.refresh_interval (default 1s) for write-heavy workloads

Performance Tuning Strategies

Optimize Shard Configuration

  • Target 10-50 GB per shard
  • Avoid oversharding: every shard adds heap and cluster state overhead (the older "20 shards per GB of heap" rule is deprecated guidance)
  • Use ILM for time-series data
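
The size target translates into simple arithmetic when planning a new index. The sketch below assumes a target of 30 GB per shard, a value picked from the middle of the range above; primary_shard_count is a hypothetical helper, not an Elasticsearch API.

```python
import math


def primary_shard_count(index_size_gb, target_shard_gb=30.0):
    """Estimate primary shards so each stays near the target size."""
    if index_size_gb <= 0:
        return 1
    return max(1, math.ceil(index_size_gb / target_shard_gb))


# Expected index sizes in GB (illustrative values).
for size_gb in (10, 45, 600):
    print(size_gb, "GB ->", primary_shard_count(size_gb), "primary shard(s)")
```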

Memory Configuration

  • Set heap to no more than 50% of RAM, and keep it below roughly 32 GB so the JVM can continue to use compressed object pointers
  • Leave remaining memory for filesystem cache
  • Monitor and tune GC settings
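
The heap rule can be expressed as a small calculation. This sketch assumes a ceiling of 31 GB as a conservative stand-in for the compressed object pointer threshold, which varies by JVM; heap_flags is an illustrative helper that produces the standard -Xms and -Xmx JVM flags.

```python
def heap_flags(ram_gb, ceiling_gb=31):
    """Return matching -Xms/-Xmx flags: half of RAM, capped at the ceiling."""
    heap = max(1, min(ram_gb // 2, ceiling_gb))
    # Minimum and maximum heap are set to the same value to avoid resizing.
    return f"-Xms{heap}g\n-Xmx{heap}g"


print(heap_flags(16))
print(heap_flags(128))
```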

Query Optimization

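  • Use filter context (bool.filter) instead of query context for yes/no conditions; filters skip scoring and can be cached
  • Take advantage of the node query cache and the shard request cache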
  • Avoid expensive operations (wildcards, deep pagination, script scoring)
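
As an example of the first point, a search body can keep only the scored full-text clause in must and move exact-match and range conditions into filter. This is a sketch; filtered_search and the field names message, status, and @timestamp are placeholders for your own schema.

```python
import json


def filtered_search(text, field, filters, size=20):
    """Build a bool query: one scored match clause plus unscored filters."""
    return {
        "size": size,
        "query": {
            "bool": {
                "must": [{"match": {field: text}}],  # scored
                "filter": filters,                   # not scored, cacheable
            }
        },
    }


body = filtered_search(
    "timeout error",
    "message",
    [
        {"term": {"status": "error"}},
        {"range": {"@timestamp": {"gte": "now-1h"}}},
    ],
)
print(json.dumps(body, indent=2))
```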

Indexing Optimization

  • Use bulk APIs for high-volume indexing
  • Tune refresh interval based on use case
  • Consider index sorting for time-series data
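
For the bulk API, the request body is newline-delimited JSON: an action line followed by the document source, with a trailing newline at the end. The sketch below only builds that body; bulk_body and the index name logs-test are illustrative, and sending it (POST /_bulk with Content-Type application/x-ndjson) is left to your HTTP client.

```python
import json


def bulk_body(index, docs):
    """Build an NDJSON body for POST /_bulk that indexes each document."""
    lines = []
    for doc in docs:
        lines.append(json.dumps({"index": {"_index": index}}))  # action line
        lines.append(json.dumps(doc))                            # source line
    # The bulk API requires the body to end with a newline.
    return "\n".join(lines) + "\n"


ndjson = bulk_body("logs-test", [{"msg": "first"}, {"msg": "second"}])
print(ndjson, end="")
```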

Monitoring and Alerting

Essential Metrics to Monitor

  1. Cluster Health: Status changes
  2. Node Resources: CPU, memory, disk, network
  3. Thread Pools: Queue depths and rejections
  4. JVM: Heap usage and GC metrics
  5. Indices: Indexing rate, search rate, latency

Recommended Alert Thresholds

Metric                  Warning   Critical
Cluster Status          Yellow    Red
Heap Usage              > 75%     > 85%
CPU Usage               > 80%     > 90%
Disk Usage              > 75%     > 85%
Thread Pool Rejections  > 0       > 100/min
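
The percentage-based rows of the table map directly onto a small classifier. This is a sketch covering only heap, CPU, and disk usage; the severity function and the metric keys are hypothetical names, and cluster status and rejection rates would need their own handling.

```python
# (warning, critical) limits in percent, taken from the table above.
THRESHOLDS = {"heap": (75, 85), "cpu": (80, 90), "disk": (75, 85)}


def severity(metric, value):
    """Classify a percentage reading as "ok", "warning", or "critical"."""
    warning, critical = THRESHOLDS[metric]
    if value > critical:
        return "critical"
    if value > warning:
        return "warning"
    return "ok"


print(severity("heap", 80))
print(severity("cpu", 95))
print(severity("disk", 50))
```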

Diagnostic Commands Reference

# Cluster health overview
GET /_cluster/health?pretty

# Detailed node statistics
GET /_nodes/stats

# Hot threads across cluster
GET /_nodes/hot_threads

# Shard allocation explanation
GET /_cluster/allocation/explain

# Pending cluster tasks
GET /_cluster/pending_tasks

# Index-level statistics
GET /_stats