Elasticsearch performance issues can manifest in various ways, from slow queries to high resource utilization. This guide provides a systematic approach to identifying and resolving common performance problems in Elasticsearch clusters.
Common Performance Issue Categories
1. Query Performance Issues
- Slow search responses
- High query latency
- Timeout errors during searches
2. Indexing Performance Issues
- Slow document indexing
- Bulk request failures
- High indexing latency
3. Resource Utilization Issues
- High CPU usage
- Memory pressure
- Disk I/O bottlenecks
- Network saturation
Diagnostic Steps
Step 1: Check Cluster Health
Start by verifying the overall cluster health:
GET /_cluster/health
A yellow or red status indicates underlying issues that may contribute to performance problems.
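To see which indices contribute to a yellow or red status, the same endpoint accepts a level parameter; this drill-down is optional but often saves time:
GET /_cluster/health?level=indices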
Step 2: Identify Hot Threads
Use the hot threads API to identify CPU-intensive operations:
GET /_nodes/hot_threads
This reveals which threads are consuming the most CPU time and what operations they're performing.
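The sampling can be tuned with the threads, interval, and type parameters; the values below are illustrative rather than recommended defaults:
GET /_nodes/hot_threads?threads=5&interval=500ms&type=cpu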
Step 3: Review Node Statistics
Check resource utilization across all nodes:
GET /_nodes/stats
Pay attention to:
- JVM heap usage and garbage collection metrics
- Thread pool queue sizes and rejections
- Disk I/O statistics
- Network metrics
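The full response is large; if you only care about the metrics above, you can restrict it to specific stats groups, for example:
GET /_nodes/stats/jvm,thread_pool,fs,transport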
Step 4: Analyze Slow Queries
Enable slow query logging to identify problematic queries:
PUT /my-index/_settings
{
  "index.search.slowlog.threshold.query.warn": "10s",
  "index.search.slowlog.threshold.query.info": "5s",
  "index.search.slowlog.threshold.query.debug": "2s",
  "index.search.slowlog.threshold.query.trace": "500ms"
}
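If indexing rather than search latency is the concern, the analogous indexing slow log thresholds can be set on the same index (the thresholds shown are illustrative):
PUT /my-index/_settings
{
  "index.indexing.slowlog.threshold.index.warn": "10s",
  "index.indexing.slowlog.threshold.index.info": "5s"
}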
Step 5: Check Pending Tasks
Review pending cluster tasks that may indicate bottlenecks:
GET /_cluster/pending_tasks
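The _cat variant of the same data is easier to scan at a glance (the v parameter adds column headers):
GET /_cat/pending_tasks?v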
Common Causes and Solutions
Too Many Shards
Symptoms: High memory usage, slow cluster state updates, degraded search performance
Solution: Reduce shard count by:
- Using appropriate shard sizing (10-50 GB per shard)
- Implementing index lifecycle management (ILM)
- Consolidating small indices
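To see whether shards fall outside the 10-50 GB target, list them sorted by store size; the column selection below is one reasonable choice:
GET /_cat/shards?v&h=index,shard,prirep,store&s=store:desc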
Inefficient Queries
Symptoms: Slow query responses, high CPU usage during searches
Solution:
- Avoid leading wildcards (for example *error), which cannot use the inverted index efficiently
- Use filter context instead of query context when relevance scoring is not needed; filter clauses are cacheable (see the example after this list)
- Paginate with search_after rather than deep from/size pagination
- Reduce aggregation bucket sizes
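As a sketch of the filter-context advice above, the bool query below scores only the full-text clause and pushes the exact-match and time-range clauses into filter context, where they can be cached. The field names (message, status.code, @timestamp) are illustrative placeholders, not part of any real mapping:
GET /my-index/_search
{
  "query": {
    "bool": {
      "must": [
        { "match": { "message": "error" } }
      ],
      "filter": [
        { "term": { "status.code": 500 } },
        { "range": { "@timestamp": { "gte": "now-1h" } } }
      ]
    }
  }
}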
Insufficient Resources
Symptoms: High resource utilization, frequent garbage collection
Solution:
- Scale vertically (more memory, faster disks)
- Scale horizontally (add more nodes)
- Use SSDs instead of HDDs
- Ensure heap is sized appropriately (no more than 50% of RAM, and below the ~32 GB compressed-oops threshold)
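A quick way to spot nodes under pressure is the _cat/nodes API with a handful of resource columns:
GET /_cat/nodes?v&h=name,heap.percent,ram.percent,cpu,load_1m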
Disk I/O Bottlenecks
Symptoms: High iowait, slow indexing and searches
Solution:
- Use SSDs for data nodes
- Increase the refresh interval for write-heavy workloads (see the settings example after this list)
- Ensure adequate filesystem cache (leave roughly 50% of RAM to the OS page cache rather than the JVM heap)
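As an example of the refresh-interval advice above, relaxing the default of 1s to something like 30s reduces segment churn on write-heavy indices; treat the value as a starting point to tune, not a prescription:
PUT /my-index/_settings
{
  "index.refresh_interval": "30s"
}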
Monitoring Best Practices
- Set up continuous monitoring using tools like Kibana Stack Monitoring, Prometheus, or Datadog
- Create alerts for key metrics:
- JVM heap usage > 85%
- Thread pool rejections
- Cluster status changes
- Disk usage > 80%
- Establish baselines to understand normal performance patterns
- Monitor queue depths; thread pool queues should ideally stay near empty (see the check below)
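A lightweight way to check queue depths and rejections across the cluster is the _cat/thread_pool API:
GET /_cat/thread_pool?v&h=node_name,name,active,queue,rejected&s=rejected:desc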
Performance Tuning Checklist
- Heap size is 50% of RAM (max 32 GB)
- Using SSDs for data storage
- Shards sized between 10 and 50 GB
- Slow query logging enabled
- Monitoring and alerting configured
- Index lifecycle management implemented
- Query patterns optimized
- Bulk indexing used for high-volume writes
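As a sketch of the last checklist item, a bulk request packs many index operations into one call; the index name and document fields here are illustrative:
POST /_bulk
{ "index": { "_index": "my-index" } }
{ "@timestamp": "2024-05-01T00:00:00Z", "message": "first event" }
{ "index": { "_index": "my-index" } }
{ "@timestamp": "2024-05-01T00:00:01Z", "message": "second event" }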