Elasticsearch Cluster Crashing Due to High CPU - Causes and Fixes

When an Elasticsearch cluster crashes due to high CPU usage, it's critical to identify the root cause quickly to restore service and prevent future occurrences. This guide provides a systematic approach to diagnosing CPU-related cluster crashes.

Identifying the Crash Pattern

Check Cluster Logs

Review logs from all nodes around the crash time:

# Recent errors in Elasticsearch logs (the main log file is named after the cluster, e.g. my-cluster.log)
grep -i "error\|exception\|crash" /var/log/elasticsearch/*.log | tail -100

Common patterns indicating CPU-related issues:

  • java.lang.OutOfMemoryError (GC overhead)
  • ThreadPool queue is full
  • Transport node disconnected
  • Master not discovered

Analyze Node State Before Crash

If the cluster is still accessible or partially running:

GET /_nodes/hot_threads?threads=10
GET /_cat/thread_pool?v&h=node_name,name,active,queue,rejected&s=rejected:desc

Root Causes of CPU-Induced Crashes

Cause 1: Runaway Queries

Description: Complex or poorly optimized queries consuming all CPU resources.

Indicators:

  • Hot threads showing search operations
  • Specific indices with high search activity
  • Sudden query pattern changes

Investigation:

GET /_tasks?actions=*search*&detailed=true

Solution:

// Enable the search slow log to pinpoint the offending queries
PUT /problematic-index/_settings
{
  "index.search.slowlog.threshold.query.warn": "5s",
  "index.search.slowlog.threshold.fetch.warn": "1s"
}

Long-running searches that are already consuming CPU can be cancelled through the task management API (see Step 2 under Immediate Recovery Steps).

Cause 2: Merge Storm

Description: Multiple segment merges happening simultaneously, especially after heavy indexing.

Indicators:

  • Hot threads showing Lucene merge operations
  • Recent bulk indexing activity
  • Many small segments

Investigation:

GET /_cat/segments?v&s=index
GET /_nodes/stats/indices/merge

Solution:

PUT /my-index/_settings
{
  "index.merge.scheduler.max_thread_count": 1,
  "index.merge.policy.max_merged_segment": "5gb"
}

Cause 3: GC Death Spiral

Description: Excessive garbage collection consuming CPU, leading to timeouts and cascading failures.

Indicators:

  • GC logs showing long pauses (>30 seconds)
  • Heap usage near 100%
  • Young generation GC frequency extremely high

Investigation:

GET /_nodes/stats/jvm?filter_path=nodes.*.jvm.gc

Solution:

  • Reduce heap pressure (see memory optimization)
  • Scale horizontally to distribute load
  • Review heap sizing
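
Before resizing anything, it helps to confirm how much heap pressure each node is actually under; a quick point-in-time check using the _cat API:

GET /_cat/nodes?v&h=name,heap.percent,heap.max,ram.percent

As a general rule, keep the heap at no more than 50% of available RAM and below the ~32 GB compressed-oops threshold, with -Xms equal to -Xmx in jvm.options.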

Cause 4: Cluster State Processing

Description: Large cluster state updates (mapping changes, index creation) overwhelming master and data nodes.

Indicators:

  • High pending tasks count
  • Master node CPU spike
  • Recent bulk index or mapping operations

Investigation:

GET /_cluster/pending_tasks
GET /_cluster/state?filter_path=metadata.cluster_uuid,version

Solution:

  • Reduce number of indices and shards
  • Use index templates to avoid runtime mapping updates (see the example below)
  • Ensure dedicated master nodes
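
As an example of the template approach, a composable index template can pre-define settings and mappings so new indices do not trigger ad-hoc mapping updates (the template name, pattern, and fields below are illustrative):

PUT /_index_template/logs-template
{
  "index_patterns": ["logs-*"],
  "template": {
    "settings": {
      "number_of_shards": 1,
      "number_of_replicas": 1
    },
    "mappings": {
      "properties": {
        "@timestamp": { "type": "date" },
        "message": { "type": "text" }
      }
    }
  }
}

Composable index templates are available from Elasticsearch 7.8; older clusters use the legacy _template API instead.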

Cause 5: Split Brain or Master Election Storm

Description: Network issues causing repeated master elections, each consuming significant CPU.

Indicators:

  • Multiple master changes in logs
  • Network timeout errors
  • Inconsistent cluster state across nodes

Investigation:

GET /_cat/master?v
# Check logs for master election events

Solution:

# elasticsearch.yml - ensure proper discovery
discovery.seed_hosts: ["node1", "node2", "node3"]
cluster.initial_master_nodes: ["master1", "master2", "master3"]
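# Note: cluster.initial_master_nodes is only used when bootstrapping a brand-new cluster; remove it once the cluster has formed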

Immediate Recovery Steps

Step 1: Stabilize the Cluster

If nodes are still running but struggling:

// Temporarily disable shard allocation
PUT /_cluster/settings
{
  "transient": {
    "cluster.routing.allocation.enable": "none"
  }
}

Step 2: Identify and Cancel Problematic Tasks

GET /_tasks?detailed=true&group_by=parents

// Cancel specific task
POST /_tasks/{task_id}/_cancel

Step 3: Reduce Incoming Load

  • Pause indexing clients (or apply a temporary write block, as sketched below)
  • Add query throttling at the application layer
  • Redirect traffic away from struggling nodes
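
If indexing clients cannot be paused quickly, a temporary write block on the busiest index has a similar effect from the cluster side; the index name below is a placeholder:

// Reject writes while keeping the index searchable
PUT /problematic-index/_settings
{
  "index.blocks.write": true
}

Set index.blocks.write back to false once the cluster has stabilized.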

Step 4: Restart Affected Nodes

After identifying the root cause:

# Graceful node restart
systemctl stop elasticsearch
# Wait for the process to stop cleanly, then start the node again
systemctl start elasticsearch
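
Before moving on to the next node, confirm that the restarted node has rejoined and that shard recovery has completed; a quick check using the _cat APIs:

GET /_cat/nodes?v
GET /_cat/health?v
GET /_cat/recovery?v&active_only=true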

Step 5: Re-enable Normal Operations

PUT /_cluster/settings
{
  "transient": {
    "cluster.routing.allocation.enable": "all"
  }
}

Prevention Strategies

Implement Resource Limits

# elasticsearch.yml
indices.breaker.total.limit: 70%
indices.breaker.request.limit: 60%
thread_pool.search.queue_size: 1000
thread_pool.write.queue_size: 500

Set Query Timeouts

// Default timeout for all searches
PUT /_cluster/settings
{
  "persistent": {
    "search.default_search_timeout": "30s"
  }
}

Monitor and Alert

Set up alerts for:

  • CPU usage > 85% for more than 5 minutes
  • Thread pool rejections increasing (see the check sketched after this list)
  • GC time > 10% of total time
  • Master node changes
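
Without a dedicated monitoring stack, the rejection and GC figures can be sampled directly from the cluster, for example:

GET /_cat/thread_pool/search,write?v&h=node_name,name,active,queue,rejected
GET /_nodes/stats/jvm?filter_path=nodes.*.name,nodes.*.jvm.gc.collectors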

Capacity Planning

  • Maintain headroom for traffic spikes (target 50-60% average CPU; a quick utilization check is sketched after this list)
  • Plan for failure scenarios (N+1 or N+2 capacity)
  • Regularly review and optimize query patterns
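
Current CPU utilization and load averages can be reviewed with a _cat snapshot; this is a point-in-time view, so trend data from your monitoring system is more reliable for planning:

GET /_cat/nodes?v&h=name,cpu,load_1m,load_5m,load_15m,heap.percent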

Post-Mortem Analysis

After recovering from a crash:

  1. Collect all relevant logs from the crash period (a collection sketch follows this list)
  2. Identify the trigger - what changed before the crash?
  3. Review metrics - CPU, memory, disk I/O leading up to the crash
  4. Document findings and implement preventive measures
  5. Test fixes in a staging environment
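
A simple way to capture the relevant API state alongside the logs is to snapshot a few diagnostic endpoints with curl; the localhost:9200 endpoint and file names below are just an example, and authentication options may be needed if security is enabled:

# Capture diagnostic API output for the post-mortem
curl -s "localhost:9200/_nodes/hot_threads?threads=10" > hot_threads.txt
curl -s "localhost:9200/_nodes/stats?pretty" > nodes_stats.json
curl -s "localhost:9200/_cluster/pending_tasks?pretty" > pending_tasks.json
curl -s "localhost:9200/_cat/thread_pool?v&h=node_name,name,active,queue,rejected" > thread_pools.txt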