Elasticsearch Node CPU Spikes Investigation

CPU spikes on Elasticsearch nodes can cause query timeouts, indexing delays, and cluster instability. This guide provides a systematic approach to investigating and resolving sudden CPU increases.

Identifying CPU Spikes

Monitor Current CPU Usage

GET /_cat/nodes?v&h=name,cpu,load_1m,load_5m,load_15m,heap.percent

Historical Pattern Analysis

Check if spikes correlate with:

  • Specific times of day
  • Application deployments
  • User traffic patterns
  • Scheduled jobs (backups, index operations)
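
If no monitoring history is available, a lightweight sampler can build one. A minimal sketch, assuming a local node on port 9200 and a writable node_cpu_history.log, that appends one timestamped CPU/load line per node every minute for later correlation with deployments, traffic, and scheduled jobs:

# sample node CPU and load once a minute with a UTC timestamp
while true; do
  curl -s "localhost:9200/_cat/nodes?h=name,cpu,load_1m,heap.percent" \
    | awk -v ts="$(date -u +%FT%TZ)" '{print ts, $0}' >> node_cpu_history.log
  sleep 60
done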

Investigation Workflow

Step 1: Capture Hot Threads

Run this immediately when the spike occurs:

GET /_nodes/hot_threads?threads=10&interval=500ms

Multiple captures help identify patterns:

# take five hot-threads snapshots, 10 seconds apart, into a single timestamped file
OUT="cpu_spike_$(date +%s).txt"
for i in {1..5}; do
  echo "=== capture $i at $(date -u +%FT%TZ) ===" >> "$OUT"
  curl -s "localhost:9200/_nodes/hot_threads?threads=10" >> "$OUT"
  sleep 10
done

Step 2: Check Running Tasks

GET /_tasks?detailed=true&group_by=parents

Look for:

  • Long-running searches
  • Large bulk operations
  • Force merge operations
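
To surface the longest-running of these without reading the raw JSON, the task list can be sorted by runtime, for example with jq (a sketch assuming jq and a local node):

# tasks sorted by runtime, longest first (seconds, action, task id)
curl -s "localhost:9200/_tasks?detailed=true" \
  | jq -r '.nodes[].tasks[] | [(.running_time_in_nanos / 1000000000 | floor), .action, "\(.node):\(.id)"] | @tsv' \
  | sort -rn | head -20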

Step 3: Review Thread Pools

GET /_cat/thread_pool?v&h=node_name,name,active,queue,rejected&s=active:desc

High active counts, deep queues, or rejections point to the thread pools, and therefore the operations, that are consuming resources.
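
To watch this live, the busiest pools can be polled on a short interval (a sketch assuming a local node):

# poll the search and write pools every 5 seconds; growing queue or rejected counts are the signal
watch -n 5 'curl -s "localhost:9200/_cat/thread_pool/search,write?v&h=node_name,name,active,queue,rejected"'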

Step 4: Check Recent Queries

# Enable slow logs if not already enabled
PUT /_all/_settings
{
  "index.search.slowlog.threshold.query.warn": "5s"
}

# Check slow query log
tail -f /var/log/elasticsearch/*_index_search_slowlog.log
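
When the investigation is over, the temporary threshold can be removed by setting it back to null (a curl sketch against the same local endpoint):

# reset the slowlog threshold to its default
curl -s -X PUT "localhost:9200/_all/_settings" \
  -H 'Content-Type: application/json' \
  -d '{"index.search.slowlog.threshold.query.warn": null}'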

Common Causes of CPU Spikes

Cause 1: Expensive Search Queries

Indicators:

  • Hot threads showing search workers
  • High search thread pool activity
  • Slow query log entries

Common culprits:

  • Leading wildcards: *term
  • Complex regex patterns
  • Deep pagination
  • Large aggregations
  • Script queries on every document

Investigation:

GET /_tasks?actions=*search*&detailed=true

Solution:

// Cancel problematic query
POST /_tasks/{task_id}/_cancel

// Add query timeout
{
  "timeout": "30s",
  "query": { ... }
}
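
As a concrete example of the timeout option, here is a curl sketch (the index name my-index and the match_all query are placeholders) that also sets the standard terminate_after option to cap the number of documents examined per shard:

# bounded search: hard timeout plus a per-shard document cap
curl -s "localhost:9200/my-index/_search" \
  -H 'Content-Type: application/json' \
  -d '{
    "timeout": "30s",
    "terminate_after": 100000,
    "query": { "match_all": {} }
  }'

Both options return partial results and flag them via timed_out and terminated_early in the response instead of failing the request.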

Cause 2: Bulk Indexing Operations

Indicators:

  • Hot threads showing write or bulk operations
  • High write thread pool activity
  • Recent application deployments

Investigation:

GET /_tasks?actions=indices:data/write*&detailed=true

Solutions:

  • Reduce bulk batch size
  • Add delays between bulk requests
  • Use dedicated ingest nodes
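
A minimal sketch of the first two ideas, assuming a bulk-ready NDJSON file (alternating action and source lines, here called bulk_payload.ndjson) and a local node: split it into smaller batches and pause between requests.

# split into 2,000-line batches (keep the line count even so action/source pairs stay together)
split -l 2000 bulk_payload.ndjson batch_

for f in batch_*; do
  curl -s -X POST "localhost:9200/_bulk" \
    -H 'Content-Type: application/x-ndjson' \
    --data-binary "@$f" > /dev/null
  sleep 2   # breathing room between batches
done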

Cause 3: Segment Merging

Indicators:

  • Hot threads showing Lucene merge operations
  • Generic thread pool busy
  • Recent heavy indexing

Investigation:

GET /_nodes/stats/indices/merges
GET /_cat/shards?v&h=index,shard,prirep,node,segments.count&s=segments.count:desc

Solutions:

PUT /my-index/_settings
{
  "index.merge.scheduler.max_thread_count": 1
}
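
To see how much merge work each node is doing right now, the merge stats can be summarized per node, for example with jq (a sketch assuming jq and a local node):

# current merges, bytes being merged, and cumulative merge time per node
curl -s "localhost:9200/_nodes/stats/indices/merges" \
  | jq -r '.nodes[] | [.name, .indices.merges.current, .indices.merges.current_size_in_bytes, .indices.merges.total_time_in_millis] | @tsv'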

Cause 4: Shard Recovery or Rebalancing

Indicators:

  • Recent node restart or addition
  • Cluster status yellow
  • Recovery operations in progress

Investigation:

GET /_cat/recovery?v&active_only=true

Solutions:

PUT /_cluster/settings
{
  "persistent": {
    "indices.recovery.max_bytes_per_sec": "50mb"
  }
}
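
If throttling bandwidth is not enough, the number of concurrent recoveries each node handles can also be capped (a sketch using the standard cluster.routing.allocation.node_concurrent_recoveries setting; the value 1 is illustrative):

# allow only one concurrent shard recovery per node while the cluster is under pressure
curl -s -X PUT "localhost:9200/_cluster/settings" \
  -H 'Content-Type: application/json' \
  -d '{"persistent": {"cluster.routing.allocation.node_concurrent_recoveries": 1}}'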

Cause 5: Garbage Collection

Indicators:

  • Hot threads showing GC
  • High heap usage
  • GC entries in logs

Investigation:

GET /_nodes/stats/jvm?filter_path=nodes.*.jvm.gc
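
To compare GC pressure across nodes, the raw stats can be reduced to old-generation counts and times, for example with jq (a sketch assuming jq; field names follow the standard node stats layout):

# old-gen GC count and total time per node
curl -s "localhost:9200/_nodes/stats/jvm?filter_path=nodes.*.name,nodes.*.jvm.gc" \
  | jq -r '.nodes[] | [.name, .jvm.gc.collectors.old.collection_count, .jvm.gc.collectors.old.collection_time_in_millis] | @tsv'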

Solutions:

  • Reduce heap pressure
  • Scale the cluster
  • Review memory-intensive operations

Cause 6: Cluster State Updates

Indicators:

  • Hot threads on master node
  • Pending tasks increasing
  • Recent mapping or index changes

Investigation:

GET /_cluster/pending_tasks
GET /_cat/master?v
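
A quick way to see whether the queue is actually growing (a sketch assuming jq and a local node):

# count of pending cluster state tasks, refreshed every 5 seconds
watch -n 5 'curl -s "localhost:9200/_cluster/pending_tasks" | jq ".tasks | length"'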

Solutions:

  • Reduce number of indices/shards
  • Batch index template updates
  • Use dedicated master nodes

Real-Time Monitoring During Spikes

System Metrics

# Real-time CPU per process
top -p $(pgrep -d',' -f elasticsearch)

# CPU breakdown
mpstat -P ALL 1

# Per-thread CPU for the Elasticsearch process (pidstat is part of sysstat)
pidstat -t -p $(pgrep -d',' -f elasticsearch) 1
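
If one operating-system thread is pinned, it can be mapped back to a Java thread name with a thread dump. This is a hedged sketch: the TID is a placeholder, and it assumes jstack from a compatible JDK is available and run as the same user as Elasticsearch.

# map a busy native thread id (decimal TID from pidstat or top -H) to a Java thread name
TID=12345   # placeholder: replace with the hot thread id you observed
ES_PID=$(pgrep -f org.elasticsearch.bootstrap.Elasticsearch)
jstack "$ES_PID" | grep "nid=$(printf '0x%x' "$TID")"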

Elasticsearch APIs

# Refresh node CPU, heap, and load every 5 seconds
watch -n 5 'curl -s "localhost:9200/_cat/nodes?v&h=name,cpu,heap.percent,load_1m"'

Proactive Monitoring

Set Up Alerts

Alert when:

  • CPU > 80% for more than 5 minutes
  • Load average > 2x CPU core count
  • Search thread pool rejections > 0
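
Until proper alerting is in place, even a cron-friendly one-liner helps with the first condition (the 80% threshold and local endpoint are placeholders for your own setup):

# warn about any node reporting CPU above 80%
curl -s "localhost:9200/_cat/nodes?h=name,cpu" | awk '$2 > 80 {print "HIGH CPU:", $1, $2 "%"}'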

Capacity Planning

  • Maintain 40-50% CPU headroom for spikes
  • Plan for 2x expected traffic
  • Regular load testing

Investigation Checklist

When CPU spike occurs:

  • Capture hot threads (multiple times)
  • Check running tasks
  • Review thread pool statistics
  • Check slow query logs
  • Check GC metrics
  • Review recovery operations
  • Check pending cluster tasks
  • Correlate with application events
  • Document findings for future reference
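
Most of this checklist can be captured in one pass with a small script kept ready for incidents. A minimal sketch, assuming a local node and write access to the current directory:

# one-shot evidence capture during a CPU spike
DIR="cpu_spike_$(date +%s)"
mkdir -p "$DIR"
curl -s "localhost:9200/_nodes/hot_threads?threads=10" > "$DIR/hot_threads.txt"
curl -s "localhost:9200/_tasks?detailed=true&group_by=parents" > "$DIR/tasks.json"
curl -s "localhost:9200/_cat/thread_pool?v&h=node_name,name,active,queue,rejected" > "$DIR/thread_pools.txt"
curl -s "localhost:9200/_nodes/stats/jvm?filter_path=nodes.*.jvm.gc" > "$DIR/gc.json"
curl -s "localhost:9200/_cat/recovery?v&active_only=true" > "$DIR/recovery.txt"
curl -s "localhost:9200/_cluster/pending_tasks" > "$DIR/pending_tasks.json"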

Post-Spike Analysis

After resolving the spike:

  1. Root cause analysis: Document what caused the spike
  2. Implement prevention: Query governance, resource limits
  3. Update monitoring: Add alerts for early detection
  4. Review capacity: Ensure adequate headroom