OutOfMemoryError (OOM) in Elasticsearch causes node crashes and cluster instability. This guide covers the different types of OOM errors, their causes, and how to resolve them.
Types of OutOfMemoryError
1. Java Heap Space
java.lang.OutOfMemoryError: Java heap space
The JVM cannot allocate an object because the heap is full and garbage collection cannot free enough memory.
2. GC Overhead Limit Exceeded
java.lang.OutOfMemoryError: GC overhead limit exceeded
The JVM is spending more than 98% of time doing garbage collection while recovering less than 2% of memory.
3. Unable to Create Native Thread
java.lang.OutOfMemoryError: unable to create native thread
The system cannot create more threads, often due to OS limits or memory exhaustion outside the heap.
4. Direct Buffer Memory
java.lang.OutOfMemoryError: Direct buffer memory
Off-heap direct memory allocation failed.
Immediate Actions When OOM Occurs
Step 1: Check Which Nodes Are Affected
GET /_cat/nodes?v&h=name,heap.percent,heap.current,heap.max
Step 2: Review Elasticsearch Logs
grep -i "OutOfMemoryError\|heap space\|GC overhead" /var/log/elasticsearch/*.log
Step 3: Reduce Cluster Load
// Disable allocation to prevent shard movements
PUT /_cluster/settings
{
  "transient": {
    "cluster.routing.allocation.enable": "none"
  }
}
Step 4: Restart Affected Nodes
After addressing the cause, restart nodes one at a time.
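Between restarts, wait for the cluster to stabilize before moving to the next node; a simple check is:

GET /_cluster/health?wait_for_status=yellow&timeout=120s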
Troubleshooting Java Heap Space OOM
Diagnose the Cause
- Check heap configuration:
GET /_nodes/stats/jvm?filter_path=nodes.*.jvm.mem
- Analyze what's consuming memory:
GET /_nodes/stats/indices?filter_path=nodes.*.indices.fielddata,nodes.*.indices.query_cache,nodes.*.indices.request_cache,nodes.*.indices.segments
- Review active tasks:
GET /_tasks?detailed=true
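If the task list is long, the same API can be narrowed to search activity using its actions filter, for example:

GET /_tasks?detailed=true&actions=*search*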
Common Causes and Fixes
Cause: Heap size too small
# jvm.options.d/heap.options
# Set heap to no more than half of available RAM, and keep it below ~31 GB so compressed object pointers stay enabled
-Xms16g
-Xmx16g
Cause: Too many shards
GET /_cluster/stats?filter_path=indices.shards.total
Consolidate shards; as a rule of thumb, aim for 10-50 GB per shard (see the shrink sketch below).
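One way to consolidate is the shrink API; a rough sketch, using a hypothetical my-index reduced to one primary shard (the index must first be made read-only and fully relocated onto a single node, here a hypothetical data-node-1):

PUT /my-index/_settings
{
  "index.routing.allocation.require._name": "data-node-1",
  "index.blocks.write": true
}

POST /my-index/_shrink/my-index-shrunk
{
  "settings": {
    "index.number_of_shards": 1
  }
}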
Cause: Fielddata on text fields
GET /_cat/fielddata?v&fields=*
// Disable fielddata on text fields
PUT /my-index/_mapping
{
  "properties": {
    "my_field": {
      "type": "text",
      "fielddata": false
    }
  }
}
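If aggregations or sorting are what drives fielddata, a common alternative is to target a keyword sub-field rather than enabling fielddata on the text field; a minimal sketch, assuming the (hypothetical) my_field has a my_field.keyword sub-field as created by default dynamic mapping:

GET /my-index/_search
{
  "size": 0,
  "aggs": {
    "by_field": {
      "terms": { "field": "my_field.keyword" }
    }
  }
}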
Cause: Large aggregations
Reduce aggregation sizes:
{
  "aggs": {
    "my_terms": {
      "terms": {
        "field": "category",
        "size": 100  // Instead of 10000
      }
    }
  }
}
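When a large number of buckets is genuinely required, a composite aggregation can page through them instead of building them all in one response; a minimal sketch, assuming category is a keyword field:

GET /my-index/_search
{
  "size": 0,
  "aggs": {
    "categories": {
      "composite": {
        "size": 100,
        "sources": [
          { "category": { "terms": { "field": "category" } } }
        ]
      }
    }
  }
}

Subsequent pages are requested by passing the after_key from the previous response as the "after" parameter.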
Troubleshooting GC Overhead OOM
Diagnose
GET /_nodes/stats/jvm?filter_path=nodes.*.jvm.gc
Look for:
- Very high collection_count
- Long collection_time_in_millis
Fixes
Increase heap (if under 31 GB)
Reduce memory pressure:
- Clear caches: POST /_cache/clear
- Cancel expensive operations (see the task-cancel example below)
- Reduce concurrent operations
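For cancelling expensive operations, the task management API can cancel matching tasks; for example, cancelling all running searches (use with care, since it interrupts client requests):

POST /_tasks/_cancel?actions=*search*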
Tune GC:
# jvm.options.d/gc.options
-XX:+UseG1GC
-XX:InitiatingHeapOccupancyPercent=30
-XX:G1ReservePercent=25
Troubleshooting Native Thread OOM
Diagnose
# Check current thread count
cat /proc/<es_pid>/status | grep Threads
# Check system limits
ulimit -u
cat /proc/sys/kernel/threads-max
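On the Elasticsearch side, the thread pool stats show which pools are busiest and whether work is being rejected:

GET /_cat/thread_pool?v&h=node_name,name,active,queue,rejected&s=rejected:desc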
Fixes
- Increase system limits:
# /etc/security/limits.conf
elasticsearch - nproc 4096
- Reduce Elasticsearch thread pools:
# elasticsearch.yml
thread_pool.search.size: 13
thread_pool.write.size: 5
- Add more nodes to distribute workload
Troubleshooting Direct Buffer OOM
Diagnose
Direct memory is used for network I/O and some internal operations.
# Check direct memory settings
grep "MaxDirectMemorySize" /etc/elasticsearch/jvm.options.d/*
Fixes
- Set direct memory explicitly (by default Elasticsearch sets this to half of the heap when unset; the value below is illustrative):
# jvm.options.d/memory.options
-XX:MaxDirectMemorySize=8g
- Ensure that heap plus direct memory stays well within available RAM
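As an illustrative calculation: on a 64 GB node with -Xms16g/-Xmx16g and direct memory at half the heap (8 GB), roughly 24 GB is committed to the JVM, leaving about 40 GB for the OS page cache and other processes.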
Preventing Future OOM Errors
Configure Circuit Breakers
PUT /_cluster/settings
{
  "persistent": {
    "indices.breaker.total.limit": "70%",
    "indices.breaker.request.limit": "40%",
    "indices.breaker.fielddata.limit": "40%"
  }
}
Enable Heap Dumps
# jvm.options.d/heap_dump.options
-XX:+HeapDumpOnOutOfMemoryError
-XX:HeapDumpPath=/var/log/elasticsearch/
-XX:+ExitOnOutOfMemoryError
Note: ExitOnOutOfMemoryError causes the JVM to exit on OOM, which is often better than running in a degraded state.
Set Up Monitoring
Monitor and alert on:
- Heap usage > 85%
- GC time > 10% of total time
- Circuit breaker trips
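Circuit breaker trip counts are exposed in the node stats; a non-zero tripped value identifies which breaker is firing:

GET /_nodes/stats/breaker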
Memory Lock
Prevent the OS from swapping Elasticsearch memory:
# elasticsearch.yml
bootstrap.memory_lock: true
# /etc/security/limits.conf
elasticsearch - memlock unlimited
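After a restart you can verify the lock took effect; mlockall should report true on every node:

GET /_nodes?filter_path=**.mlockall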
Analyzing Heap Dumps
Generate Heap Dump
jmap -dump:format=b,file=/tmp/heap.hprof <elasticsearch_pid>
Analyze with Eclipse MAT
- Download Eclipse Memory Analyzer
- Open the heap dump
- Run "Leak Suspects Report"
- Look at "Dominator Tree" for largest objects
Common Findings
| Finding | Likely Cause |
|---|---|
| Large char[] arrays | Fielddata or large text fields |
| Many Segment objects | Too many shards |
| Query cache objects | Complex queries being cached |
| Aggregation buckets | Large aggregations |
Recovery Checklist
After experiencing OOM:
- Node restarted successfully
- Root cause identified
- Configuration updated to prevent recurrence
- Circuit breakers adjusted
- Monitoring alerts set up
- Cluster allocation re-enabled
- Cluster health returned to green
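For the last two items, re-enable allocation by clearing the transient setting from Step 3 and then confirm cluster health:

PUT /_cluster/settings
{
  "transient": {
    "cluster.routing.allocation.enable": null
  }
}

GET /_cluster/health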