Elasticsearch performance issues can manifest in various ways, from slow queries to high resource utilization. This guide provides a systematic approach to identifying and resolving common performance problems in Elasticsearch clusters.
Common Performance Issue Categories
1. Query Performance Issues
- Slow search responses
- High query latency
- Timeout errors during searches
2. Indexing Performance Issues
- Slow document indexing
- Bulk request failures
- High indexing latency
3. Resource Utilization Issues
- High CPU usage
- Memory pressure
- Disk I/O bottlenecks
- Network saturation
Diagnostic Steps
Step 1: Check Cluster Health
Start by verifying the overall cluster health:
GET /_cluster/health
A yellow or red status indicates underlying issues that may contribute to performance problems.
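To see which indices contribute to a yellow or red status, the same endpoint accepts a level parameter; this drill-down is optional but often saves time:
GET /_cluster/health?level=indices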
Step 2: Identify Hot Threads
Use the hot threads API to identify CPU-intensive operations:
GET /_nodes/hot_threads
This reveals which threads are consuming the most CPU time and what operations they're performing.
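The sampling can be tuned with the threads, interval, and type parameters; the values below are illustrative rather than recommended defaults:
GET /_nodes/hot_threads?threads=5&interval=500ms&type=cpu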
Step 3: Review Node Statistics
Check resource utilization across all nodes:
GET /_nodes/stats
Pay attention to:
- JVM heap usage and garbage collection metrics
- Thread pool queue sizes and rejections
- Disk I/O statistics
- Network metrics
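The full response is large; if you only care about the metrics above, you can restrict it to specific stats groups, for example:
GET /_nodes/stats/jvm,thread_pool,fs,transport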
Step 4: Analyze Slow Queries
Enable slow query logging to identify problematic queries:
PUT /my-index/_settings
{
  "index.search.slowlog.threshold.query.warn": "10s",
  "index.search.slowlog.threshold.query.info": "5s",
  "index.search.slowlog.threshold.query.debug": "2s",
  "index.search.slowlog.threshold.query.trace": "500ms"
}
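If indexing rather than search latency is the concern, the analogous indexing slow log thresholds can be set on the same index (the thresholds shown are illustrative):
PUT /my-index/_settings
{
  "index.indexing.slowlog.threshold.index.warn": "10s",
  "index.indexing.slowlog.threshold.index.info": "5s"
}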
Step 5: Check Pending Tasks
Review pending cluster tasks that may indicate bottlenecks:
GET /_cluster/pending_tasks
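The _cat variant of the same data is easier to scan at a glance (the v parameter adds column headers):
GET /_cat/pending_tasks?v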
Common Causes and Solutions
Too Many Shards
Symptoms: High memory usage, slow cluster state updates, degraded search performance
Solution: Reduce shard count by:
- Using appropriate shard sizing (10-50 GB per shard)
- Implementing index lifecycle management (ILM)
- Consolidating small indices
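To see whether shards fall outside the 10-50 GB target, list them sorted by store size; the column selection below is one reasonable choice:
GET /_cat/shards?v&h=index,shard,prirep,store&s=store:desc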
Inefficient Queries
Symptoms: Slow query responses, high CPU usage during searches
Solution:
- Avoid leading wildcards (for example *error), which cannot use the inverted index efficiently
- Use filter context instead of query context when relevance scoring is not needed; filter clauses are cacheable (see the example after this list)
- Paginate with search_after rather than deep from/size pagination
- Reduce aggregation bucket sizes
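As a sketch of the filter-context advice above, the bool query below scores only the full-text clause and pushes the exact-match and time-range clauses into filter context, where they can be cached. The field names (message, status.code, @timestamp) are illustrative placeholders, not part of any real mapping:
GET /my-index/_search
{
  "query": {
    "bool": {
      "must": [
        { "match": { "message": "error" } }
      ],
      "filter": [
        { "term": { "status.code": 500 } },
        { "range": { "@timestamp": { "gte": "now-1h" } } }
      ]
    }
  }
}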
Insufficient Resources
Symptoms: High resource utilization, frequent garbage collection
Solution:
- Scale vertically (more memory, faster disks)
- Scale horizontally (add more nodes)
- Use SSDs instead of HDDs
- Ensure heap is sized appropriately (no more than 50% of RAM, and below the ~32 GB compressed-oops threshold)
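A quick way to spot nodes under pressure is the _cat/nodes API with a handful of resource columns:
GET /_cat/nodes?v&h=name,heap.percent,ram.percent,cpu,load_1m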
Disk I/O Bottlenecks
Symptoms: High iowait, slow indexing and searches
Solution:
- Use SSDs for data nodes
- Increase the refresh interval for write-heavy workloads (see the settings example after this list)
- Ensure adequate filesystem cache (leave roughly 50% of RAM to the OS page cache rather than the JVM heap)
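As an example of the refresh-interval advice above, relaxing the default of 1s to something like 30s reduces segment churn on write-heavy indices; treat the value as a starting point to tune, not a prescription:
PUT /my-index/_settings
{
  "index.refresh_interval": "30s"
}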
Monitoring Best Practices
- Set up continuous monitoring using tools like Kibana Stack Monitoring, Prometheus, or Datadog
- Create alerts for key metrics:
- JVM heap usage > 85%
- Thread pool rejections
- Cluster status changes
- Disk usage > 80%
- Establish baselines to understand normal performance patterns
- Monitor queue depths; thread pool queues should ideally stay near empty (see the check below)
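A lightweight way to check queue depths and rejections across the cluster is the _cat/thread_pool API:
GET /_cat/thread_pool?v&h=node_name,name,active,queue,rejected&s=rejected:desc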
Performance Tuning Checklist
- Heap size is 50% of RAM (max 32 GB)
- Using SSDs for data storage
- Shards sized between 10 and 50 GB
- Slow query logging enabled
- Monitoring and alerting configured
- Index lifecycle management implemented
- Query patterns optimized
- Bulk indexing used for high-volume writes
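As a sketch of the last checklist item, a bulk request packs many index operations into one call; the index name and document fields here are illustrative:
POST /_bulk
{ "index": { "_index": "my-index" } }
{ "@timestamp": "2024-05-01T00:00:00Z", "message": "first event" }
{ "index": { "_index": "my-index" } }
{ "@timestamp": "2024-05-01T00:00:01Z", "message": "second event" }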