When an Elasticsearch cluster crashes due to high CPU usage, it's critical to identify the root cause quickly to restore service and prevent future occurrences. This guide provides a systematic approach to diagnosing CPU-related cluster crashes.
Identifying the Crash Pattern
Check Cluster Logs
Review logs from all nodes around the crash time:
# Recent errors in Elasticsearch logs
grep -i "error\|exception\|crash" /var/log/elasticsearch/elasticsearch.log | tail -100
Common patterns indicating CPU-related issues:
- java.lang.OutOfMemoryError (GC overhead limit exceeded)
- Thread pool queue is full
- Transport node disconnected
- Master not discovered
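If GC-related patterns appear, also check the JVM garbage collection log. The path below assumes a standard package install; it may differ on your system, and rotated files (gc.log.0, gc.log.1, ...) may hold the crash window:
# Long GC pauses around the crash window
grep -E "Pause (Young|Full)" /var/log/elasticsearch/gc.log | tail -50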
Analyze Node State Before Crash
If the cluster is still accessible or partially running:
GET /_nodes/hot_threads?threads=10
GET /_cat/thread_pool?v&h=node_name,name,active,queue,rejected&s=rejected:desc
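It is also worth checking which nodes are CPU-bound, for example:
GET /_cat/nodes?v&h=name,cpu,load_1m,heap.percent&s=cpu:desc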
Root Causes of CPU-Induced Crashes
Cause 1: Runaway Queries
Description: Complex or poorly optimized queries consuming all CPU resources.
Indicators:
- Hot threads showing search operations
- Specific indices with high search activity
- Sudden query pattern changes
Investigation:
GET /_tasks?actions=*search*&detailed=true
Solution:
// Enable slow query logging on the affected index to pinpoint expensive queries
PUT /problematic-index/_settings
{
  "index.search.slowlog.threshold.query.warn": "5s",
  "index.search.slowlog.threshold.query.info": "2s"
}
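If a runaway search is still consuming CPU right now, in-flight search tasks can be cancelled through the task management API. Note this aborts the affected searches, so clients will receive errors:
// Cancel all currently running search tasks
POST /_tasks/_cancel?actions=*search*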
Cause 2: Merge Storm
Description: Multiple segment merges happening simultaneously, especially after heavy indexing.
Indicators:
- Hot threads showing Lucene merge operations
- Recent bulk indexing activity
- Many small segments
Investigation:
GET /_cat/segments?v&s=index
GET /_nodes/stats/indices/merge
Solution:
PUT /my-index/_settings
{
"index.merge.scheduler.max_thread_count": 1,
"index.merge.policy.max_merged_segment": "5gb"
}
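Once indexing has quietened down and CPU has recovered, the many small segments can optionally be consolidated during a low-traffic window. The index name is a placeholder; force merge is itself CPU- and I/O-intensive and is best suited to indices that are no longer being written to, so do not run it while the cluster is still struggling:
POST /my-index/_forcemerge?max_num_segments=1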
Cause 3: GC Death Spiral
Description: Excessive garbage collection consuming CPU, leading to timeouts and cascading failures.
Indicators:
- GC logs showing long pauses (>30 seconds)
- Heap usage near 100%
- Young generation GC frequency extremely high
Investigation:
GET /_nodes/stats/jvm?filter_path=nodes.*.jvm.gc
Solution:
- Reduce heap pressure (see memory optimization)
- Scale horizontally to distribute load
- Review heap sizing (see the jvm.options sketch below)
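Elasticsearch reads heap settings from files in jvm.options.d. A minimal sketch, assuming a package install and using 8 GB as a placeholder value: keep -Xms and -Xmx equal, at or below roughly 50% of available RAM, and below about 31 GB so compressed object pointers stay enabled.
# /etc/elasticsearch/jvm.options.d/heap.options
-Xms8g
-Xmx8g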
Cause 4: Cluster State Processing
Description: Large cluster state updates (mapping changes, index creation) overwhelming master and data nodes.
Indicators:
- High pending tasks count
- Master node CPU spike
- Recent bulk index or mapping operations
Investigation:
GET /_cluster/pending_tasks
GET /_cluster/state?filter_path=metadata.cluster_uuid,version
Solution:
- Reduce number of indices and shards
- Use index templates to avoid runtime mapping updates (see the template sketch below)
- Ensure dedicated master nodes
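A minimal composable index template sketch (requires Elasticsearch 7.8+; the template name, index pattern, and fields are placeholders). Setting "dynamic": "strict" rejects documents with unmapped fields instead of triggering mapping updates that must be propagated through the cluster state:
PUT /_index_template/logs-template
{
  "index_patterns": ["logs-*"],
  "template": {
    "settings": {
      "number_of_shards": 1,
      "number_of_replicas": 1
    },
    "mappings": {
      "dynamic": "strict",
      "properties": {
        "@timestamp": { "type": "date" },
        "message": { "type": "text" }
      }
    }
  }
}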
Cause 5: Split Brain or Master Election Storm
Description: Network issues causing repeated master elections, each consuming significant CPU.
Indicators:
- Multiple master changes in logs
- Network timeout errors
- Inconsistent cluster state across nodes
Investigation:
GET /_cat/master?v
# Check logs for master election events
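For example, on a package install (the exact log wording varies by Elasticsearch version):
grep -iE "elected-as-master|master node changed" /var/log/elasticsearch/*.log | tail -50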
Solution:
# elasticsearch.yml - ensure proper discovery
discovery.seed_hosts: ["node1", "node2", "node3"]
# cluster.initial_master_nodes is only used when bootstrapping a brand-new cluster; remove it afterwards
cluster.initial_master_nodes: ["master1", "master2", "master3"]
Immediate Recovery Steps
Step 1: Stabilize the Cluster
If nodes are still running but struggling:
// Temporarily disable shard allocation
PUT /_cluster/settings
{
"transient": {
"cluster.routing.allocation.enable": "none"
}
}
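Then confirm the cluster settles before moving on, for example:
GET /_cluster/health?filter_path=status,number_of_nodes,relocating_shards,unassigned_shards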
Step 2: Identify and Cancel Problematic Tasks
GET /_tasks?detailed=true&group_by=parents
// Cancel specific task
POST /_tasks/{task_id}/_cancel
Step 3: Reduce Incoming Load
- Pause indexing clients
- Add query throttling at the application layer
- Redirect traffic away from struggling nodes
Step 4: Restart Affected Nodes
After identifying the root cause:
# Graceful node restart
systemctl stop elasticsearch
# Wait for node to leave cluster
systemctl start elasticsearch
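After the node rejoins, confirm it is visible and healthy before touching the next one:
GET /_cat/nodes?v&h=name,uptime,cpu,heap.percent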
Step 5: Re-enable Normal Operations
PUT /_cluster/settings
{
"transient": {
"cluster.routing.allocation.enable": "all"
}
}
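Shard recovery then resumes; its progress can be followed with:
GET /_cat/recovery?v&active_only=true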
Prevention Strategies
Implement Resource Limits
# elasticsearch.yml
indices.breaker.total.limit: 70%
indices.breaker.request.limit: 60%
thread_pool.search.queue_size: 1000
thread_pool.write.queue_size: 500
Set Query Timeouts
// Default timeout for all searches
PUT /_cluster/settings
{
"persistent": {
"search.default_search_timeout": "30s"
}
}
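Timeouts can also be set per request. Note that search timeouts are best-effort and applied per shard, so partial results may still be returned (the index name and query here are placeholders):
GET /my-index/_search
{
  "timeout": "10s",
  "query": { "match_all": {} }
}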
Monitor and Alert
Set up alerts for:
- CPU usage > 85% for more than 5 minutes
- Thread pool rejections increasing
- GC time > 10% of total time
- Master node changes
Capacity Planning
- Maintain headroom for traffic spikes (target 50-60% average CPU)
- Plan for failure scenarios (N+1 or N+2 capacity)
- Regularly review and optimize query patterns
Post-Mortem Analysis
After recovering from a crash:
- Collect all relevant logs from the crash period
- Identify the trigger - what changed before the crash?
- Review metrics - CPU, memory, disk I/O leading up to the crash
- Document findings and implement preventive measures
- Test fixes in a staging environment