Elasticsearch crashes due to high CPU usage can severely impact your cluster's availability and performance. This guide helps you identify the root causes of CPU-related crashes and implement effective solutions.
Symptoms of CPU-Related Crashes
- Nodes becoming unresponsive
- Timeout errors in client applications
- Cluster instability or master election issues
- Garbage collection pauses
- Thread pool rejections increasing
Diagnosing High CPU Usage
Step 1: Identify CPU Consumption
Check CPU usage across all nodes:
GET /_nodes/stats/os
Look for the os.cpu.percent field in each node's response to identify nodes with high CPU usage.
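For a quicker at-a-glance view, the cat nodes API can list per-node CPU and load averages sorted by CPU; the column selection here is just one reasonable choice:
GET /_cat/nodes?v&h=name,cpu,load_1m,load_5m,load_15m&s=cpu:desc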
Step 2: Analyze Hot Threads
The hot threads API is crucial for identifying what's consuming CPU:
GET /_nodes/hot_threads?threads=10&interval=500ms
Common findings include:
- Search threads executing complex queries
- Merge threads during segment merging
- Garbage collection threads
- Bulk indexing operations
Step 3: Check Thread Pool Statistics
Review thread pool queues and rejections:
GET /_cat/thread_pool?v&h=node_name,name,active,queue,rejected
High queue values or rejections indicate overload.
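To focus on the pools that most commonly saturate, the same cat API can be limited to specific pools (search and write are just examples):
GET /_cat/thread_pool/search,write?v&h=node_name,name,active,queue,rejected,completed
A steadily growing rejected count on one pool usually points to where the cluster is overloaded.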
Step 4: Review Recent Queries
Use the slow log or tasks API to identify expensive queries:
GET /_tasks?detailed=true&actions=*search
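If you lean on the slow log, thresholds along these lines are one possible starting point (the index name and timings are illustrative, not recommendations):
PUT /my-index/_settings
{
  "index.search.slowlog.threshold.query.warn": "10s",
  "index.search.slowlog.threshold.query.info": "2s",
  "index.search.slowlog.threshold.fetch.warn": "1s"
}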
Common Causes and Solutions
1. Expensive Search Queries
Problem: Complex queries consuming excessive CPU
Indicators:
- Hot threads showing search workers
- High query latency
- Search thread pool saturation
Solutions:
- Enable slow query logging to identify problematic queries
- Optimize queries:
  - Avoid leading wildcards (*term)
  - Use filters instead of queries for non-scoring clauses (see the example after this list)
  - Limit aggregation bucket sizes
  - Avoid deep pagination (use search_after instead)
- Implement query timeouts:
{
"timeout": "30s",
"query": { ... }
}
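As a sketch of the filter-context advice above, the clauses below sit in a bool filter, so they can be cached and skip relevance scoring (the index and field names are hypothetical):
GET /my-index/_search
{
  "query": {
    "bool": {
      "filter": [
        { "term": { "status": "published" } },
        { "range": { "@timestamp": { "gte": "now-1d/d" } } }
      ]
    }
  }
}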
2. Heavy Indexing Load
Problem: Bulk indexing operations overwhelming the cluster
Solutions:
- Reduce bulk request sizes (optimal: 5-15 MB per request; see the sketch after this list)
- Increase refresh interval during heavy indexing:
PUT /my-index/_settings
{
"index.refresh_interval": "30s"
}
- Use multiple indexing clients to distribute load
- Consider dedicated ingest nodes
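As a rough sketch of modest bulk sizing, keep each _bulk body in the 5-15 MB range rather than packing in as many documents as possible (the index and fields are placeholders):
POST /_bulk
{ "index": { "_index": "my-index" } }
{ "message": "first document", "@timestamp": "2024-01-01T00:00:00Z" }
{ "index": { "_index": "my-index" } }
{ "message": "second document", "@timestamp": "2024-01-01T00:00:01Z" }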
3. Segment Merging
Problem: Background merge operations consuming CPU
Solutions:
- Adjust merge policy settings:
PUT /my-index/_settings
{
"index.merge.scheduler.max_thread_count": 1
}
- Schedule force merges during off-peak hours (see the example after this list)
- Use time-based indices with ILM
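For the off-peak force merge suggestion, a call along these lines is safest against indices that are no longer being written to (the index name and segment count are illustrative):
POST /my-index-2024.01/_forcemerge?max_num_segments=1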
4. Garbage Collection Pressure
Problem: Frequent GC cycles consuming CPU
Indicators:
- High GC overhead in logs
- JVM heap pressure > 85%
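To confirm these indicators, the JVM section of node stats reports heap_used_percent plus garbage collector counts and timings for each node:
GET /_nodes/stats/jvm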
Solutions:
- Review and optimize heap size
- Reduce memory-intensive operations
- Add more nodes to distribute load
5. Too Many Shards
Problem: Excessive shards creating coordination overhead
Solutions:
- Consolidate small indices (see the checks after this list)
- Implement proper shard sizing (10-50 GB per shard)
- Use ILM to manage index lifecycle
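To gauge whether shard counts and sizes are reasonable, the cat APIs give a quick overview (the column choices are just one option):
GET /_cat/indices?v&h=index,pri,rep,docs.count,store.size&s=store.size:desc
GET /_cat/shards?v&h=index,shard,prirep,store,node&s=store:desc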
Preventive Measures
Set Resource Limits
Configure circuit breakers to prevent runaway operations:
# elasticsearch.yml
indices.breaker.total.limit: 70%
indices.breaker.request.limit: 60%
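You can check how close each breaker is to its limit at runtime via the breaker section of node stats:
GET /_nodes/stats/breaker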
Implement Query Governance
- Set default timeouts for all queries (see the cluster setting example after this list)
- Use query validation before execution
- Implement rate limiting for search APIs
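For the default timeout recommendation, one option is the cluster-wide search.default_search_timeout setting (the 30s value is only an example):
PUT /_cluster/settings
{
  "persistent": {
    "search.default_search_timeout": "30s"
  }
}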
Monitor Proactively
Set up alerts for:
- CPU usage > 80% sustained for 5+ minutes
- Thread pool rejections
- Slow query frequency increases
Recovery Steps After a Crash
- Check cluster health:
GET /_cluster/health
- Review logs: Check elasticsearch.log for error messages
- Identify the cause: Use hot threads and slow logs
- Stabilize the cluster: Cancel problematic tasks if needed
POST /_tasks/{task_id}/_cancel
- Implement fixes: Address the root cause before resuming normal operations