CPU spikes on Elasticsearch nodes can cause query timeouts, indexing delays, and cluster instability. This guide provides a systematic approach to investigating and resolving sudden CPU increases.
Identifying CPU Spikes
Monitor Current CPU Usage
GET /_cat/nodes?v&h=name,cpu,load_1m,load_5m,load_15m,heap.percent
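From a shell, the same view can be pulled with curl; sorting on the cpu column surfaces the hottest nodes first (this and later examples assume the cluster answers on localhost:9200):
# Same columns, sorted so the busiest nodes appear first
curl -s "localhost:9200/_cat/nodes?v&h=name,cpu,load_1m,load_5m,load_15m,heap.percent&s=cpu:desc"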
Historical Pattern Analysis
Check if spikes correlate with:
- Specific times of day
- Application deployments
- User traffic patterns
- Scheduled jobs (backups, index operations)
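If no metrics stack is capturing this history, a simple scheduled sampler is enough to build a timeline to correlate against. A minimal sketch (the CSV path is an arbitrary placeholder; run it from cron, e.g. once a minute):
# Append one timestamped CPU/load sample per node on each run
curl -s "localhost:9200/_cat/nodes?h=name,cpu,load_1m,heap.percent" \
  | awk -v ts="$(date -u +%FT%TZ)" '{print ts","$1","$2","$3","$4}' >> /var/log/es_cpu_samples.csv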
Investigation Workflow
Step 1: Capture Hot Threads
Run this as soon as the spike occurs:
GET /_nodes/hot_threads?threads=10&interval=500ms
Multiple captures help identify patterns:
for i in {1..5}; do
  # each iteration writes one timestamped snapshot to its own file
  curl -s "localhost:9200/_nodes/hot_threads?threads=10" >> cpu_spike_$(date +%s).txt
  sleep 10
done
Step 2: Check Running Tasks
GET /_tasks?detailed=true&group_by=parents
Look for:
- Long-running searches
- Large bulk operations
- Force merge operations
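To pull out only the long-running entries, the task list can be filtered client-side. A sketch using jq, flagging anything that has been running for more than 30 seconds (threshold chosen arbitrarily):
# Print action, runtime in seconds, and description for tasks running > 30s
curl -s "localhost:9200/_tasks?detailed=true" \
  | jq -r '.nodes[].tasks[]
           | select(.running_time_in_nanos > 30e9)
           | [.action, (.running_time_in_nanos / 1e9 | floor), .description]
           | @tsv'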
Step 3: Review Thread Pools
GET /_cat/thread_pool?v&h=node_name,name,active,queue,rejected&s=active:desc
High active counts, growing queues, or rejections point to the operations consuming CPU.
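It can help to keep this view refreshing while the spike is in progress, focused on the pools that matter most:
# Refresh search/write pool activity and rejections every 5 seconds
watch -n 5 'curl -s "localhost:9200/_cat/thread_pool/search,write?v&h=node_name,name,active,queue,rejected"'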
Step 4: Check Recent Queries
# Enable search slow logs if not already enabled
PUT /_all/_settings
{
  "index.search.slowlog.threshold.query.warn": "5s"
}
# Check slow query log
tail -f /var/log/elasticsearch/*_index_search_slowlog.log
Common Causes of CPU Spikes
Cause 1: Expensive Search Queries
Indicators:
- Hot threads showing search workers
- High search thread pool activity
- Slow query log entries
Common culprits:
- Leading wildcards: *term
- Complex regex patterns
- Deep pagination
- Large aggregations
- Script queries on every document
Investigation:
GET /_tasks?actions=*search*&detailed=true
Solution:
# Cancel the problematic query
POST /_tasks/{task_id}/_cancel
# Add a timeout to the search request
POST /my-index/_search
{
  "timeout": "30s",
  "query": { ... }
}
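Beyond fixing individual queries, two dynamic cluster settings act as guardrails: search.default_search_timeout applies a timeout to searches that do not set one, and search.allow_expensive_queries (available from 7.7 onward) rejects wildcard, regexp, and script-style queries outright. A sketch; confirm both against the docs for your version:
PUT /_cluster/settings
{
  "persistent": {
    "search.default_search_timeout": "30s",
    "search.allow_expensive_queries": false
  }
}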
Cause 2: Bulk Indexing Operations
Indicators:
- Hot threads showing write or bulk operations
- High write thread pool activity
- Recent application deployments
Investigation:
GET /_tasks?actions=indices:data/write*&detailed=true
Solutions:
- Reduce bulk batch size
- Add delays between bulk requests
- Use dedicated ingest nodes
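A minimal sketch of the first two points: re-send an oversized NDJSON payload as smaller, paced bulk requests. The file name and chunk size are placeholders, and the even line count assumes each document occupies exactly two lines (action plus source) so the split never separates a pair:
# Split into 1000-line chunks (500 docs each) and pause between requests
split -l 1000 docs.ndjson bulk_chunk_
for f in bulk_chunk_*; do
  curl -s -H "Content-Type: application/x-ndjson" \
    -XPOST "localhost:9200/_bulk" --data-binary "@$f" > /dev/null
  sleep 1
done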
Cause 3: Segment Merging
Indicators:
- Hot threads showing Lucene merge operations
- Generic thread pool busy
- Recent heavy indexing
Investigation:
GET /_nodes/stats/indices?filter_path=nodes.*.indices.merges
GET /_cat/indices?v&h=index,segments.count,docs.count,store.size&s=segments.count:desc
Solutions:
PUT /my-index/_settings
{
  "index.merge.scheduler.max_thread_count": 1
}
A single merge thread mainly helps on spinning disks; on SSD-backed nodes, leave this setting at its default.
Cause 4: Shard Recovery or Rebalancing
Indicators:
- Recent node restart or addition
- Cluster status yellow
- Recovery operations in progress
Investigation:
GET /_cat/recovery?v&active_only=true
Solutions:
PUT /_cluster/settings
{
  "persistent": {
    "indices.recovery.max_bytes_per_sec": "50mb"
  }
}
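If throttling bandwidth is not enough, you can also cap how many shard recoveries run concurrently on each node (another dynamic cluster setting):
PUT /_cluster/settings
{
  "persistent": {
    "cluster.routing.allocation.node_concurrent_recoveries": 1
  }
}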
Cause 5: Garbage Collection
Indicators:
- Hot threads showing GC
- High heap usage
- GC entries in logs
Investigation:
GET /_nodes/stats/jvm?filter_path=nodes.*.jvm.gc
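The JVM's own GC log tells the same story in more detail. The path below assumes a default DEB/RPM package install; adjust it for other layouts:
# Tail the JVM GC log
tail -f /var/log/elasticsearch/gc.log
# Surface recent G1 pauses
grep -E "Pause (Young|Full)" /var/log/elasticsearch/gc.log | tail -20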
Solutions:
- Reduce heap pressure
- Scale the cluster
- Review memory-intensive operations
Cause 6: Cluster State Updates
Indicators:
- Hot threads on master node
- Pending tasks increasing
- Recent mapping or index changes
Investigation:
GET /_cluster/pending_tasks
GET /_cat/master?v
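To see whether the backlog is actually growing, count the queue and inspect its head; a small jq sketch:
# Count pending cluster-state tasks and show the first five in the queue
curl -s "localhost:9200/_cluster/pending_tasks" | jq '.tasks | length'
curl -s "localhost:9200/_cluster/pending_tasks" | jq '.tasks[:5]'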
Solutions:
- Reduce number of indices/shards
- Batch index template updates
- Use dedicated master nodes
Real-Time Monitoring During Spikes
System Metrics
# Real-time CPU per process
top -p $(pgrep -d',' -f elasticsearch)
# CPU breakdown
mpstat -P ALL 1
# Per-thread CPU inside the Elasticsearch process
top -H -p $(pgrep -d',' -f elasticsearch)
Elasticsearch APIs
# Combined script for investigation
watch -n 5 'curl -s "localhost:9200/_cat/nodes?v&h=name,cpu,heap.percent,load_1m"'
Proactive Monitoring
Set Up Alerts
Alert when:
- CPU > 80% for more than 5 minutes
- Load average > 2x CPU core count
- Search thread pool rejections > 0
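Where no alerting stack exists yet, even a cron-driven check approximates the first condition (per-sample, without the 5-minute persistence). The threshold, log path, and notification hook are placeholders:
# Hypothetical cron check: record when any node reports CPU above 80%
max_cpu=$(curl -s "localhost:9200/_cat/nodes?h=cpu" | sort -nr | head -1 | tr -d ' ')
if [ "${max_cpu:-0}" -gt 80 ]; then
  echo "$(date -u) Elasticsearch node CPU at ${max_cpu}%" >> /var/log/es_cpu_alerts.log
  # send to mail/Slack/PagerDuty here
fi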
Capacity Planning
- Maintain 40-50% CPU headroom for spikes
- Plan for 2x expected traffic
- Regular load testing
Investigation Checklist
When CPU spike occurs:
- Capture hot threads (multiple times)
- Check running tasks
- Review thread pool statistics
- Check slow query logs
- Check GC metrics
- Review recovery operations
- Check pending cluster tasks
- Correlate with application events
- Document findings for future reference
Post-Spike Analysis
After resolving the spike:
- Root cause analysis: Document what caused the spike
- Implement prevention: Query governance, resource limits
- Update monitoring: Add alerts for early detection
- Review capacity: Ensure adequate headroom