CPU spikes on Elasticsearch nodes can cause query timeouts, indexing delays, and cluster instability. This guide provides a systematic approach to investigating and resolving sudden CPU increases.
Identifying CPU Spikes
Monitor Current CPU Usage
GET /_cat/nodes?v&h=name,cpu,load_1m,load_5m,load_15m,heap.percent
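From a shell, the same view can be pulled with curl; sorting on the cpu column surfaces the hottest nodes first (this and later examples assume the cluster answers on localhost:9200):
# Same columns, sorted so the busiest nodes appear first
curl -s "localhost:9200/_cat/nodes?v&h=name,cpu,load_1m,load_5m,load_15m,heap.percent&s=cpu:desc"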
Historical Pattern Analysis
Check if spikes correlate with:
- Specific times of day
- Application deployments
- User traffic patterns
- Scheduled jobs (backups, index operations)
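If no metrics stack is capturing this history, a simple scheduled sampler is enough to build a timeline to correlate against. A minimal sketch (the CSV path is an arbitrary placeholder; run it from cron, e.g. once a minute):
# Append one timestamped CPU/load sample per node on each run
curl -s "localhost:9200/_cat/nodes?h=name,cpu,load_1m,heap.percent" \
  | awk -v ts="$(date -u +%FT%TZ)" '{print ts","$1","$2","$3","$4}' >> /var/log/es_cpu_samples.csv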
Investigation Workflow
Step 1: Capture Hot Threads
Run this as soon as the spike occurs:
GET /_nodes/hot_threads?threads=10&interval=500ms
Multiple captures help identify patterns:
for i in {1..5}; do
  # each iteration writes one timestamped snapshot to its own file
  curl -s "localhost:9200/_nodes/hot_threads?threads=10" >> cpu_spike_$(date +%s).txt
  sleep 10
done
Step 2: Check Running Tasks
GET /_tasks?detailed=true&group_by=parents
Look for:
- Long-running searches
- Large bulk operations
- Force merge operations
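To pull out only the long-running entries, the task list can be filtered client-side. A sketch using jq, flagging anything that has been running for more than 30 seconds (threshold chosen arbitrarily):
# Print action, runtime in seconds, and description for tasks running > 30s
curl -s "localhost:9200/_tasks?detailed=true" \
  | jq -r '.nodes[].tasks[]
           | select(.running_time_in_nanos > 30e9)
           | [.action, (.running_time_in_nanos / 1e9 | floor), .description]
           | @tsv'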
Step 3: Review Thread Pools
GET /_cat/thread_pool?v&h=node_name,name,active,queue,rejected&s=active:desc
High active counts, growing queues, or rejections point to the operations consuming CPU.
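It can help to keep this view refreshing while the spike is in progress, focused on the pools that matter most:
# Refresh search/write pool activity and rejections every 5 seconds
watch -n 5 'curl -s "localhost:9200/_cat/thread_pool/search,write?v&h=node_name,name,active,queue,rejected"'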
Step 4: Check Recent Queries
# Enable search slow logs if not already enabled
PUT /_all/_settings
{
  "index.search.slowlog.threshold.query.warn": "5s"
}
# Check slow query log
tail -f /var/log/elasticsearch/*_index_search_slowlog.log
Common Causes of CPU Spikes
Cause 1: Expensive Search Queries
Indicators:
- Hot threads showing search workers
- High search thread pool activity
- Slow query log entries
Common culprits:
- Leading wildcards: *term
- Complex regex patterns
- Deep pagination
- Large aggregations
- Script queries on every document
Investigation:
GET /_tasks?actions=*search*&detailed=true
Solution:
# Cancel the problematic query
POST /_tasks/{task_id}/_cancel
# Add a timeout to the search request
POST /my-index/_search
{
  "timeout": "30s",
  "query": { ... }
}
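Beyond fixing individual queries, two dynamic cluster settings act as guardrails: search.default_search_timeout applies a timeout to searches that do not set one, and search.allow_expensive_queries (available from 7.7 onward) rejects wildcard, regexp, and script-style queries outright. A sketch; confirm both against the docs for your version:
PUT /_cluster/settings
{
  "persistent": {
    "search.default_search_timeout": "30s",
    "search.allow_expensive_queries": false
  }
}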
Cause 2: Bulk Indexing Operations
Indicators:
- Hot threads showing write or bulk operations
- High write thread pool activity
- Recent application deployments
Investigation:
GET /_tasks?actions=indices:data/write*&detailed=true
Solutions:
- Reduce bulk batch size
- Add delays between bulk requests
- Use dedicated ingest nodes
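A minimal sketch of the first two points: re-send an oversized NDJSON payload as smaller, paced bulk requests. The file name and chunk size are placeholders, and the even line count assumes each document occupies exactly two lines (action plus source) so the split never separates a pair:
# Split into 1000-line chunks (500 docs each) and pause between requests
split -l 1000 docs.ndjson bulk_chunk_
for f in bulk_chunk_*; do
  curl -s -H "Content-Type: application/x-ndjson" \
    -XPOST "localhost:9200/_bulk" --data-binary "@$f" > /dev/null
  sleep 1
done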
Cause 3: Segment Merging
Indicators:
- Hot threads showing Lucene merge operations
- Generic thread pool busy
- Recent heavy indexing
Investigation:
GET /_nodes/stats/indices?filter_path=nodes.*.indices.merges
GET /_cat/indices?v&h=index,segments.count,docs.count,store.size&s=segments.count:desc
Solutions:
PUT /my-index/_settings
{
  "index.merge.scheduler.max_thread_count": 1
}
A single merge thread mainly helps on spinning disks; on SSD-backed nodes, leave this setting at its default.
Cause 4: Shard Recovery or Rebalancing
Indicators:
- Recent node restart or addition
- Cluster status yellow
- Recovery operations in progress
Investigation:
GET /_cat/recovery?v&active_only=true
Solutions:
PUT /_cluster/settings
{
  "persistent": {
    "indices.recovery.max_bytes_per_sec": "50mb"
  }
}
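If throttling bandwidth is not enough, you can also cap how many shard recoveries run concurrently on each node (another dynamic cluster setting):
PUT /_cluster/settings
{
  "persistent": {
    "cluster.routing.allocation.node_concurrent_recoveries": 1
  }
}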
Cause 5: Garbage Collection
Indicators:
- Hot threads showing GC
- High heap usage
- GC entries in logs
Investigation:
GET /_nodes/stats/jvm?filter_path=nodes.*.jvm.gc
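The JVM's own GC log tells the same story in more detail. The path below assumes a default DEB/RPM package install; adjust it for other layouts:
# Tail the JVM GC log
tail -f /var/log/elasticsearch/gc.log
# Surface recent G1 pauses
grep -E "Pause (Young|Full)" /var/log/elasticsearch/gc.log | tail -20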
Solutions:
- Reduce heap pressure
- Scale the cluster
- Review memory-intensive operations
Cause 6: Cluster State Updates
Indicators:
- Hot threads on master node
- Pending tasks increasing
- Recent mapping or index changes
Investigation:
GET /_cluster/pending_tasks
GET /_cat/master?v
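To see whether the backlog is actually growing, count the queue and inspect its head; a small jq sketch:
# Count pending cluster-state tasks and show the first five in the queue
curl -s "localhost:9200/_cluster/pending_tasks" | jq '.tasks | length'
curl -s "localhost:9200/_cluster/pending_tasks" | jq '.tasks[:5]'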
Solutions:
- Reduce number of indices/shards
- Batch index template updates
- Use dedicated master nodes
Real-Time Monitoring During Spikes
System Metrics
# Real-time CPU per process
top -p $(pgrep -d',' -f elasticsearch)
# CPU breakdown
mpstat -P ALL 1
# Per-thread CPU inside the Elasticsearch process
top -H -p $(pgrep -d',' -f elasticsearch)
Elasticsearch APIs
# Combined script for investigation
watch -n 5 'curl -s "localhost:9200/_cat/nodes?v&h=name,cpu,heap.percent,load_1m"'
Proactive Monitoring
Set Up Alerts
Alert when:
- CPU > 80% for more than 5 minutes
- Load average > 2x CPU core count
- Search thread pool rejections > 0
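Where no alerting stack exists yet, even a cron-driven check approximates the first condition (per-sample, without the 5-minute persistence). The threshold, log path, and notification hook are placeholders:
# Hypothetical cron check: record when any node reports CPU above 80%
max_cpu=$(curl -s "localhost:9200/_cat/nodes?h=cpu" | sort -nr | head -1 | tr -d ' ')
if [ "${max_cpu:-0}" -gt 80 ]; then
  echo "$(date -u) Elasticsearch node CPU at ${max_cpu}%" >> /var/log/es_cpu_alerts.log
  # send to mail/Slack/PagerDuty here
fi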
Capacity Planning
- Maintain 40-50% CPU headroom for spikes
- Plan for 2x expected traffic
- Regular load testing
Investigation Checklist
When CPU spike occurs:
- Capture hot threads (multiple times)
- Check running tasks
- Review thread pool statistics
- Check slow query logs
- Check GC metrics
- Review recovery operations
- Check pending cluster tasks
- Correlate with application events
- Document findings for future reference
Post-Spike Analysis
After resolving the spike:
- Root cause analysis: Document what caused the spike
- Implement prevention: Query governance, resource limits
- Update monitoring: Add alerts for early detection
- Review capacity: Ensure adequate headroom