Elasticsearch Crashing Due to High CPU Usage

Elasticsearch crashes caused by high CPU usage can severely impact your cluster's availability and performance. This guide helps you identify the root causes of CPU-related crashes and implement effective solutions.

Common symptoms include:

  • Nodes becoming unresponsive
  • Timeout errors in client applications
  • Cluster instability or master election issues
  • Garbage collection pauses
  • Thread pool rejections increasing

Diagnosing High CPU Usage

Step 1: Identify CPU Consumption

Check CPU usage across all nodes:

GET /_nodes/stats/os

Look for the os.cpu.percent field in each node's response to identify nodes with sustained high CPU usage.
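
For a quick tabular comparison across nodes, the same information is available through the _cat API:

GET /_cat/nodes?v&h=name,cpu,load_1m,load_5m,heap.percent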

Step 2: Analyze Hot Threads

The hot threads API is crucial for identifying what's consuming CPU:

GET /_nodes/hot_threads?threads=10&interval=500ms

Common findings include:

  • Search threads executing complex queries
  • Merge threads during segment merging
  • Garbage collection threads
  • Bulk indexing operations
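
If Step 1 already points at a specific node, you can scope the request to it (data-node-1 below is a placeholder for your node's name or ID):

GET /_nodes/data-node-1/hot_threads?threads=10&interval=500ms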

Step 3: Check Thread Pool Statistics

Review thread pool queues and rejections:

GET /_cat/thread_pool?v&h=node_name,name,active,queue,rejected

High queue values or rejections indicate overload.
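
Note that active and queue are point-in-time snapshots, while rejected is a counter accumulated since node startup, so sample it twice and compare. The node stats API exposes the same counters as JSON for scripting:

GET /_nodes/stats/thread_pool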

Step 4: Review Recent Queries

Use the slow log or tasks API to identify expensive queries:

GET /_tasks?detailed=true&actions=*search
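
If slow logging is not yet enabled, you can set per-index thresholds dynamically; the values below are starting points to tune for your workload:

PUT /my-index/_settings
{
  "index.search.slowlog.threshold.query.warn": "10s",
  "index.search.slowlog.threshold.query.info": "5s",
  "index.search.slowlog.threshold.fetch.warn": "1s"
}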

Common Causes and Solutions

1. Expensive Search Queries

Problem: Complex queries consuming excessive CPU

Indicators:

  • Hot threads showing search workers
  • High query latency
  • Search thread pool saturation

Solutions:

  • Enable slow query logging to identify problematic queries
  • Optimize queries:
    • Avoid leading wildcards (*term)
    • Use filters instead of queries for non-scoring clauses (example after this list)
    • Limit aggregation bucket sizes
    • Avoid deep pagination (use search_after instead)
  • Implement query timeouts:
{
  "timeout": "30s",
  "query": { ... }
}
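
To illustrate the filter-context point above, here is a sketch of a bool query that keeps full-text scoring on one clause while moving an exact-match clause into filter context (the title and status fields are hypothetical):

GET /my-index/_search
{
  "query": {
    "bool": {
      "must": [
        { "match": { "title": "elasticsearch" } }
      ],
      "filter": [
        { "term": { "status": "published" } }
      ]
    }
  }
}

Clauses in filter context are not scored and can be cached, which typically lowers the CPU cost per query.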

2. Heavy Indexing Load

Problem: Bulk indexing operations overwhelming the cluster

Solutions:

  • Reduce bulk request sizes (optimal: 5-15 MB per request; format sketched after this list)
  • Increase refresh interval during heavy indexing:
PUT /my-index/_settings
{
  "index.refresh_interval": "30s"
}
  • Use multiple indexing clients to distribute load
  • Consider dedicated ingest nodes
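
A minimal sketch of the bulk NDJSON format; batch documents until the request body approaches the 5-15 MB range rather than targeting a fixed document count (my-index and the message field are placeholders):

POST /_bulk
{ "index": { "_index": "my-index" } }
{ "message": "first document" }
{ "index": { "_index": "my-index" } }
{ "message": "second document" }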

3. Segment Merging

Problem: Background merge operations consuming CPU

Solutions:

  • Adjust merge policy settings:
PUT /my-index/_settings
{
  "index.merge.scheduler.max_thread_count": 1
}
  • Schedule force merges during off-peak hours (example below)
  • Use time-based indices with ILM
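
Force merging is itself CPU- and I/O-intensive, so run it only against indices that are no longer being written to. A typical off-peak invocation:

POST /my-index/_forcemerge?max_num_segments=1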

4. Garbage Collection Pressure

Problem: Frequent GC cycles consuming CPU

Indicators:

  • High GC overhead in logs
  • JVM heap pressure > 85%

Solutions:

  • Review and optimize heap size (sketch after this list)
  • Reduce memory-intensive operations
  • Add more nodes to distribute load
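
A common heap-sizing sketch, assuming a dedicated data node with 16 GB of RAM: keep Xms and Xmx equal, at or below 50% of system memory, and under the ~32 GB compressed-oops threshold. On Elasticsearch 7.7+ this goes in a file under config/jvm.options.d/; on older versions, edit config/jvm.options directly.

# config/jvm.options.d/heap.options
-Xms8g
-Xmx8g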

5. Too Many Shards

Problem: Excessive shards creating coordination overhead

Solutions:

  • Consolidate small indices
  • Implement proper shard sizing (10-50 GB per shard)
  • Use ILM to manage index lifecycle (minimal policy sketched below)
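
A minimal ILM policy sketch that rolls an index over before its shards grow past the recommended range (the policy name cpu-guide-policy is hypothetical):

PUT /_ilm/policy/cpu-guide-policy
{
  "policy": {
    "phases": {
      "hot": {
        "actions": {
          "rollover": {
            "max_size": "50gb",
            "max_age": "7d"
          }
        }
      }
    }
  }
}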

Preventive Measures

Set Resource Limits

Configure circuit breakers to prevent runaway operations:

# elasticsearch.yml
indices.breaker.total.limit: 70%
indices.breaker.request.limit: 60%
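
Both limits are dynamic settings, so they can also be applied to a running cluster without a restart:

PUT /_cluster/settings
{
  "persistent": {
    "indices.breaker.total.limit": "70%",
    "indices.breaker.request.limit": "60%"
  }
}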

Implement Query Governance

  • Set default timeouts for all queries (example after this list)
  • Use query validation before execution
  • Implement rate limiting for search APIs
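
For a cluster-wide default, the dynamic search.default_search_timeout setting applies to any search request that does not specify its own timeout:

PUT /_cluster/settings
{
  "persistent": {
    "search.default_search_timeout": "30s"
  }
}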

Monitor Proactively

Set up alerts for:

  • CPU usage > 80% sustained for 5+ minutes
  • Thread pool rejections
  • Slow query frequency increases

Recovery Steps After a Crash

  1. Check cluster health: GET /_cluster/health
  2. Review logs: Check elasticsearch.log for error messages
  3. Identify the cause: Use hot threads and slow logs
  4. Stabilize the cluster: Cancel problematic tasks if needed
POST /_tasks/{task_id}/_cancel
  5. Implement fixes: Address the root cause before resuming normal operations