Elasticsearch crashes due to high CPU usage can severely impact your cluster's availability and performance. This guide helps you identify the root causes of CPU-related crashes and implement effective solutions.
Symptoms of CPU-Related Crashes
- Nodes becoming unresponsive
- Timeout errors in client applications
- Cluster instability or master election issues
- Garbage collection pauses
- Thread pool rejections increasing
Diagnosing High CPU Usage
Step 1: Identify CPU Consumption
Check CPU usage across all nodes:
GET /_nodes/stats/os
Look for the os.cpu.percent field in each node's response to identify nodes with high CPU usage.
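For a quicker at-a-glance view, the cat nodes API can list per-node CPU and load averages sorted by CPU; the column selection here is just one reasonable choice:
GET /_cat/nodes?v&h=name,cpu,load_1m,load_5m,load_15m&s=cpu:desc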
Step 2: Analyze Hot Threads
The hot threads API is crucial for identifying what's consuming CPU:
GET /_nodes/hot_threads?threads=10&interval=500ms
Common findings include:
- Search threads executing complex queries
- Merge threads during segment merging
- Garbage collection threads
- Bulk indexing operations
Step 3: Check Thread Pool Statistics
Review thread pool queues and rejections:
GET /_cat/thread_pool?v&h=node_name,name,active,queue,rejected
High queue values or rejections indicate overload.
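To focus on the pools that most commonly saturate, the same cat API can be limited to specific pools (search and write are just examples):
GET /_cat/thread_pool/search,write?v&h=node_name,name,active,queue,rejected,completed
A steadily growing rejected count on one pool usually points to where the cluster is overloaded.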
Step 4: Review Recent Queries
Use the slow log or tasks API to identify expensive queries:
GET /_tasks?detailed=true&actions=*search
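If you lean on the slow log, thresholds along these lines are one possible starting point (the index name and timings are illustrative, not recommendations):
PUT /my-index/_settings
{
  "index.search.slowlog.threshold.query.warn": "10s",
  "index.search.slowlog.threshold.query.info": "2s",
  "index.search.slowlog.threshold.fetch.warn": "1s"
}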
Common Causes and Solutions
1. Expensive Search Queries
Problem: Complex queries consuming excessive CPU
Indicators:
- Hot threads showing search workers
- High query latency
- Search thread pool saturation
Solutions:
- Enable slow query logging to identify problematic queries
- Optimize queries:
  - Avoid leading wildcards (*term)
  - Use filters instead of queries for non-scoring clauses (see the example after this list)
  - Limit aggregation bucket sizes
  - Avoid deep pagination (use search_after instead)
- Implement query timeouts:
{
"timeout": "30s",
"query": { ... }
}
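As a sketch of the filter-context advice above, the clauses below sit in a bool filter, so they can be cached and skip relevance scoring (the index and field names are hypothetical):
GET /my-index/_search
{
  "query": {
    "bool": {
      "filter": [
        { "term": { "status": "published" } },
        { "range": { "@timestamp": { "gte": "now-1d/d" } } }
      ]
    }
  }
}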
2. Heavy Indexing Load
Problem: Bulk indexing operations overwhelming the cluster
Solutions:
- Reduce bulk request sizes (optimal: 5-15 MB per request; see the sketch after this list)
- Increase refresh interval during heavy indexing:
PUT /my-index/_settings
{
"index.refresh_interval": "30s"
}
- Use multiple indexing clients to distribute load
- Consider dedicated ingest nodes
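As a rough sketch of modest bulk sizing, keep each _bulk body in the 5-15 MB range rather than packing in as many documents as possible (the index and fields are placeholders):
POST /_bulk
{ "index": { "_index": "my-index" } }
{ "message": "first document", "@timestamp": "2024-01-01T00:00:00Z" }
{ "index": { "_index": "my-index" } }
{ "message": "second document", "@timestamp": "2024-01-01T00:00:01Z" }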
3. Segment Merging
Problem: Background merge operations consuming CPU
Solutions:
- Adjust merge policy settings:
PUT /my-index/_settings
{
"index.merge.scheduler.max_thread_count": 1
}
- Schedule force merges during off-peak hours (see the example after this list)
- Use time-based indices with ILM
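For the off-peak force merge suggestion, a call along these lines is safest against indices that are no longer being written to (the index name and segment count are illustrative):
POST /my-index-2024.01/_forcemerge?max_num_segments=1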
4. Garbage Collection Pressure
Problem: Frequent GC cycles consuming CPU
Indicators:
- High GC overhead in logs
- JVM heap pressure > 85%
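To confirm these indicators, the JVM section of node stats reports heap_used_percent plus garbage collector counts and timings for each node:
GET /_nodes/stats/jvm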
Solutions:
- Review and optimize heap size
- Reduce memory-intensive operations
- Add more nodes to distribute load
5. Too Many Shards
Problem: Excessive shards creating coordination overhead
Solutions:
- Consolidate small indices (see the checks after this list)
- Implement proper shard sizing (10-50 GB per shard)
- Use ILM to manage index lifecycle
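To gauge whether shard counts and sizes are reasonable, the cat APIs give a quick overview (the column choices are just one option):
GET /_cat/indices?v&h=index,pri,rep,docs.count,store.size&s=store.size:desc
GET /_cat/shards?v&h=index,shard,prirep,store,node&s=store:desc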
Preventive Measures
Set Resource Limits
Configure circuit breakers to prevent runaway operations:
# elasticsearch.yml
indices.breaker.total.limit: 70%
indices.breaker.request.limit: 60%
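You can check how close each breaker is to its limit at runtime via the breaker section of node stats:
GET /_nodes/stats/breaker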
Implement Query Governance
- Set default timeouts for all queries (see the cluster setting example after this list)
- Use query validation before execution
- Implement rate limiting for search APIs
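For the default timeout recommendation, one option is the cluster-wide search.default_search_timeout setting (the 30s value is only an example):
PUT /_cluster/settings
{
  "persistent": {
    "search.default_search_timeout": "30s"
  }
}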
Monitor Proactively
Set up alerts for:
- CPU usage > 80% sustained for 5+ minutes
- Thread pool rejections
- Slow query frequency increases
Recovery Steps After a Crash
- Check cluster health:
GET /_cluster/health
- Review logs: Check elasticsearch.log for error messages
- Identify the cause: Use hot threads and slow logs
- Stabilize the cluster: Cancel problematic tasks if needed
POST /_tasks/{task_id}/_cancel
- Implement fixes: Address the root cause before resuming normal operations