When an Elasticsearch cluster crashes due to high CPU usage, it's critical to identify the root cause quickly to restore service and prevent future occurrences. This guide provides a systematic approach to diagnosing CPU-related cluster crashes.
Identifying the Crash Pattern
Check Cluster Logs
Review logs from all nodes around the crash time:
# Recent errors in Elasticsearch logs
grep -i "error\|exception\|crash" /var/log/elasticsearch/elasticsearch.log | tail -100
Common patterns indicating CPU-related issues:
- java.lang.OutOfMemoryError (GC overhead limit exceeded)
- Thread pool queue is full
- Transport node disconnected
- Master not discovered
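If GC-related patterns appear, also check the JVM garbage collection log. The path below assumes a standard package install; it may differ on your system, and rotated files (gc.log.0, gc.log.1, ...) may hold the crash window:
# Long GC pauses around the crash window
grep -E "Pause (Young|Full)" /var/log/elasticsearch/gc.log | tail -50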
Analyze Node State Before Crash
If the cluster is still accessible or partially running:
GET /_nodes/hot_threads?threads=10
GET /_cat/thread_pool?v&h=node_name,name,active,queue,rejected&s=rejected:desc
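It is also worth checking which nodes are CPU-bound, for example:
GET /_cat/nodes?v&h=name,cpu,load_1m,heap.percent&s=cpu:desc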
Root Causes of CPU-Induced Crashes
Cause 1: Runaway Queries
Description: Complex or poorly optimized queries consuming all CPU resources.
Indicators:
- Hot threads showing search operations
- Specific indices with high search activity
- Sudden query pattern changes
Investigation:
GET /_tasks?actions=*search*&detailed=true
Solution:
// Enable slow query logging on the affected index to pinpoint expensive queries
PUT /problematic-index/_settings
{
  "index.search.slowlog.threshold.query.warn": "5s",
  "index.search.slowlog.threshold.query.info": "2s"
}
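If a runaway search is still consuming CPU right now, in-flight search tasks can be cancelled through the task management API. Note this aborts the affected searches, so clients will receive errors:
// Cancel all currently running search tasks
POST /_tasks/_cancel?actions=*search*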
Cause 2: Merge Storm
Description: Multiple segment merges happening simultaneously, especially after heavy indexing.
Indicators:
- Hot threads showing Lucene merge operations
- Recent bulk indexing activity
- Many small segments
Investigation:
GET /_cat/segments?v&s=index
GET /_nodes/stats/indices/merge
Solution:
PUT /my-index/_settings
{
"index.merge.scheduler.max_thread_count": 1,
"index.merge.policy.max_merged_segment": "5gb"
}
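Once indexing has quietened down and CPU has recovered, the many small segments can optionally be consolidated during a low-traffic window. The index name is a placeholder; force merge is itself CPU- and I/O-intensive and is best suited to indices that are no longer being written to, so do not run it while the cluster is still struggling:
POST /my-index/_forcemerge?max_num_segments=1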
Cause 3: GC Death Spiral
Description: Excessive garbage collection consuming CPU, leading to timeouts and cascading failures.
Indicators:
- GC logs showing long pauses (>30 seconds)
- Heap usage near 100%
- Young generation GC frequency extremely high
Investigation:
GET /_nodes/stats/jvm?filter_path=nodes.*.jvm.gc
Solution:
- Reduce heap pressure (see memory optimization)
- Scale horizontally to distribute load
- Review heap sizing (see the jvm.options sketch below)
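Elasticsearch reads heap settings from files in jvm.options.d. A minimal sketch, assuming a package install and using 8 GB as a placeholder value: keep -Xms and -Xmx equal, at or below roughly 50% of available RAM, and below about 31 GB so compressed object pointers stay enabled.
# /etc/elasticsearch/jvm.options.d/heap.options
-Xms8g
-Xmx8g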
Cause 4: Cluster State Processing
Description: Large cluster state updates (mapping changes, index creation) overwhelming master and data nodes.
Indicators:
- High pending tasks count
- Master node CPU spike
- Recent bulk index or mapping operations
Investigation:
GET /_cluster/pending_tasks
GET /_cluster/state?filter_path=metadata.cluster_uuid,version
Solution:
- Reduce number of indices and shards
- Use index templates to avoid runtime mapping updates (see the template sketch below)
- Ensure dedicated master nodes
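A minimal composable index template sketch (requires Elasticsearch 7.8+; the template name, index pattern, and fields are placeholders). Setting "dynamic": "strict" rejects documents with unmapped fields instead of triggering mapping updates that must be propagated through the cluster state:
PUT /_index_template/logs-template
{
  "index_patterns": ["logs-*"],
  "template": {
    "settings": {
      "number_of_shards": 1,
      "number_of_replicas": 1
    },
    "mappings": {
      "dynamic": "strict",
      "properties": {
        "@timestamp": { "type": "date" },
        "message": { "type": "text" }
      }
    }
  }
}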
Cause 5: Split Brain or Master Election Storm
Description: Network issues causing repeated master elections, each consuming significant CPU.
Indicators:
- Multiple master changes in logs
- Network timeout errors
- Inconsistent cluster state across nodes
Investigation:
GET /_cat/master?v
# Check logs for master election events
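For example, on a package install (the exact log wording varies by Elasticsearch version):
grep -iE "elected-as-master|master node changed" /var/log/elasticsearch/*.log | tail -50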
Solution:
# elasticsearch.yml - ensure proper discovery
discovery.seed_hosts: ["node1", "node2", "node3"]
# cluster.initial_master_nodes is only used when bootstrapping a brand-new cluster; remove it afterwards
cluster.initial_master_nodes: ["master1", "master2", "master3"]
Immediate Recovery Steps
Step 1: Stabilize the Cluster
If nodes are still running but struggling:
// Temporarily disable shard allocation
PUT /_cluster/settings
{
"transient": {
"cluster.routing.allocation.enable": "none"
}
}
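Then confirm the cluster settles before moving on, for example:
GET /_cluster/health?filter_path=status,number_of_nodes,relocating_shards,unassigned_shards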
Step 2: Identify and Cancel Problematic Tasks
GET /_tasks?detailed=true&group_by=parents
// Cancel specific task
POST /_tasks/{task_id}/_cancel
Step 3: Reduce Incoming Load
- Pause indexing clients
- Add query throttling at the application layer
- Redirect traffic away from struggling nodes
Step 4: Restart Affected Nodes
After identifying the root cause:
# Graceful node restart
systemctl stop elasticsearch
# Wait for node to leave cluster
systemctl start elasticsearch
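After the node rejoins, confirm it is visible and healthy before touching the next one:
GET /_cat/nodes?v&h=name,uptime,cpu,heap.percent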
Step 5: Re-enable Normal Operations
PUT /_cluster/settings
{
"transient": {
"cluster.routing.allocation.enable": "all"
}
}
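Shard recovery then resumes; its progress can be followed with:
GET /_cat/recovery?v&active_only=true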
Prevention Strategies
Implement Resource Limits
# elasticsearch.yml
indices.breaker.total.limit: 70%
indices.breaker.request.limit: 60%
thread_pool.search.queue_size: 1000
thread_pool.write.queue_size: 500
Set Query Timeouts
// Default timeout for all searches
PUT /_cluster/settings
{
"persistent": {
"search.default_search_timeout": "30s"
}
}
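Timeouts can also be set per request. Note that search timeouts are best-effort and applied per shard, so partial results may still be returned (the index name and query here are placeholders):
GET /my-index/_search
{
  "timeout": "10s",
  "query": { "match_all": {} }
}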
Monitor and Alert
Set up alerts for:
- CPU usage > 85% for more than 5 minutes
- Thread pool rejections increasing
- GC time > 10% of total time
- Master node changes
Capacity Planning
- Maintain headroom for traffic spikes (target 50-60% average CPU)
- Plan for failure scenarios (N+1 or N+2 capacity)
- Regularly review and optimize query patterns
Post-Mortem Analysis
After recovering from a crash:
- Collect all relevant logs from the crash period
- Identify the trigger - what changed before the crash?
- Review metrics - CPU, memory, disk I/O leading up to the crash
- Document findings and implement preventive measures
- Test fixes in a staging environment