OutOfMemoryError (OOM) in Elasticsearch causes node crashes and cluster instability. This guide covers the different types of OOM errors, their causes, and how to resolve them.
Types of OutOfMemoryError
1. Java Heap Space
java.lang.OutOfMemoryError: Java heap space
The JVM cannot allocate an object because the heap is full and garbage collection cannot free enough memory.
2. GC Overhead Limit Exceeded
java.lang.OutOfMemoryError: GC overhead limit exceeded
The JVM is spending more than 98% of time doing garbage collection while recovering less than 2% of memory.
3. Unable to Create Native Thread
java.lang.OutOfMemoryError: unable to create native thread
The system cannot create more threads, often due to OS limits or memory exhaustion outside the heap.
4. Direct Buffer Memory
java.lang.OutOfMemoryError: Direct buffer memory
Off-heap direct memory allocation failed.
Immediate Actions When OOM Occurs
Step 1: Check Which Nodes Are Affected
GET /_cat/nodes?v&h=name,heap.percent,heap.current,heap.max
Step 2: Review Elasticsearch Logs
grep -i "OutOfMemoryError\|heap space\|GC overhead" /var/log/elasticsearch/*.log
Step 3: Reduce Cluster Load
// Disable allocation to prevent shard movements
PUT /_cluster/settings
{
  "transient": {
    "cluster.routing.allocation.enable": "none"
  }
}
Step 4: Restart Affected Nodes
After addressing the cause, restart nodes one at a time.
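Between restarts, wait for the cluster to stabilize before moving to the next node; a simple check is:

GET /_cluster/health?wait_for_status=yellow&timeout=120s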
Troubleshooting Java Heap Space OOM
Diagnose the Cause
- Check heap configuration:
GET /_nodes/stats/jvm?filter_path=nodes.*.jvm.mem
- Analyze what's consuming memory:
GET /_nodes/stats/indices?filter_path=nodes.*.indices.fielddata,nodes.*.indices.query_cache,nodes.*.indices.request_cache,nodes.*.indices.segments
- Review active tasks:
GET /_tasks?detailed=true
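If the task list is long, the same API can be narrowed to search activity using its actions filter, for example:

GET /_tasks?detailed=true&actions=*search*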
Common Causes and Fixes
Cause: Heap size too small
# jvm.options.d/heap.options
# Set heap to no more than half of available RAM, and keep it below ~31 GB so compressed object pointers stay enabled
-Xms16g
-Xmx16g
Cause: Too many shards
GET /_cluster/stats?filter_path=indices.shards.total
Consolidate shards; as a rule of thumb, aim for 10-50 GB per shard (see the shrink sketch below).
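One way to consolidate is the shrink API; a rough sketch, using a hypothetical my-index reduced to one primary shard (the index must first be made read-only and fully relocated onto a single node, here a hypothetical data-node-1):

PUT /my-index/_settings
{
  "index.routing.allocation.require._name": "data-node-1",
  "index.blocks.write": true
}

POST /my-index/_shrink/my-index-shrunk
{
  "settings": {
    "index.number_of_shards": 1
  }
}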
Cause: Fielddata on text fields
GET /_cat/fielddata?v&fields=*
// Disable fielddata on text fields
PUT /my-index/_mapping
{
  "properties": {
    "my_field": {
      "type": "text",
      "fielddata": false
    }
  }
}
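If aggregations or sorting are what drives fielddata, a common alternative is to target a keyword sub-field rather than enabling fielddata on the text field; a minimal sketch, assuming the (hypothetical) my_field has a my_field.keyword sub-field as created by default dynamic mapping:

GET /my-index/_search
{
  "size": 0,
  "aggs": {
    "by_field": {
      "terms": { "field": "my_field.keyword" }
    }
  }
}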
Cause: Large aggregations
Reduce aggregation sizes:
{
  "aggs": {
    "my_terms": {
      "terms": {
        "field": "category",
        "size": 100  // Instead of 10000
      }
    }
  }
}
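When a large number of buckets is genuinely required, a composite aggregation can page through them instead of building them all in one response; a minimal sketch, assuming category is a keyword field:

GET /my-index/_search
{
  "size": 0,
  "aggs": {
    "categories": {
      "composite": {
        "size": 100,
        "sources": [
          { "category": { "terms": { "field": "category" } } }
        ]
      }
    }
  }
}

Subsequent pages are requested by passing the after_key from the previous response as the "after" parameter.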
Troubleshooting GC Overhead OOM
Diagnose
GET /_nodes/stats/jvm?filter_path=nodes.*.jvm.gc
Look for:
- Very high collection_count
- Long collection_time_in_millis
Fixes
Increase heap (if under 31 GB)
Reduce memory pressure:
- Clear caches: POST /_cache/clear
- Cancel expensive operations (see the task-cancel example below)
- Reduce concurrent operations
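For cancelling expensive operations, the task management API can cancel matching tasks; for example, cancelling all running searches (use with care, since it interrupts client requests):

POST /_tasks/_cancel?actions=*search*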
Tune GC:
# jvm.options.d/gc.options
-XX:+UseG1GC
-XX:InitiatingHeapOccupancyPercent=30
-XX:G1ReservePercent=25
Troubleshooting Native Thread OOM
Diagnose
# Check current thread count
cat /proc/<es_pid>/status | grep Threads
# Check system limits
ulimit -u
cat /proc/sys/kernel/threads-max
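On the Elasticsearch side, the thread pool stats show which pools are busiest and whether work is being rejected:

GET /_cat/thread_pool?v&h=node_name,name,active,queue,rejected&s=rejected:desc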
Fixes
- Increase system limits:
# /etc/security/limits.conf
elasticsearch - nproc 4096
- Reduce Elasticsearch thread pools:
# elasticsearch.yml
thread_pool.search.size: 13
thread_pool.write.size: 5
- Add more nodes to distribute workload
Troubleshooting Direct Buffer OOM
Diagnose
Direct memory is used for network I/O and some internal operations.
# Check direct memory settings
grep "MaxDirectMemorySize" /etc/elasticsearch/jvm.options.d/*
Fixes
- Set direct memory explicitly (by default Elasticsearch sets this to half of the heap when unset; the value below is illustrative):
# jvm.options.d/memory.options
-XX:MaxDirectMemorySize=8g
- Ensure that heap plus direct memory stays well within available RAM
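As an illustrative calculation: on a 64 GB node with -Xms16g/-Xmx16g and direct memory at half the heap (8 GB), roughly 24 GB is committed to the JVM, leaving about 40 GB for the OS page cache and other processes.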
Preventing Future OOM Errors
Configure Circuit Breakers
PUT /_cluster/settings
{
  "persistent": {
    "indices.breaker.total.limit": "70%",
    "indices.breaker.request.limit": "40%",
    "indices.breaker.fielddata.limit": "40%"
  }
}
Enable Heap Dumps
# jvm.options.d/heap_dump.options
-XX:+HeapDumpOnOutOfMemoryError
-XX:HeapDumpPath=/var/log/elasticsearch/
-XX:+ExitOnOutOfMemoryError
Note: ExitOnOutOfMemoryError causes the JVM to exit on OOM, which is often better than running in a degraded state.
Set Up Monitoring
Monitor and alert on:
- Heap usage > 85%
- GC time > 10% of total time
- Circuit breaker trips
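Circuit breaker trip counts are exposed in the node stats; a non-zero tripped value identifies which breaker is firing:

GET /_nodes/stats/breaker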
Memory Lock
Prevent the OS from swapping Elasticsearch memory:
# elasticsearch.yml
bootstrap.memory_lock: true
# /etc/security/limits.conf
elasticsearch - memlock unlimited
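After a restart you can verify the lock took effect; mlockall should report true on every node:

GET /_nodes?filter_path=**.mlockall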
Analyzing Heap Dumps
Generate Heap Dump
jmap -dump:format=b,file=/tmp/heap.hprof <elasticsearch_pid>
Analyze with Eclipse MAT
- Download Eclipse Memory Analyzer
- Open the heap dump
- Run "Leak Suspects Report"
- Look at "Dominator Tree" for largest objects
Common Findings
| Finding | Likely Cause |
|---|---|
| Large char[] arrays | Fielddata or large text fields |
| Many Segment objects | Too many shards |
| Query cache objects | Complex queries being cached |
| Aggregation buckets | Large aggregations |
Recovery Checklist
After experiencing OOM:
- Node restarted successfully
- Root cause identified
- Configuration updated to prevent recurrence
- Circuit breakers adjusted
- Monitoring alerts set up
- Cluster allocation re-enabled
- Cluster health returned to green
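For the last two items, re-enable allocation by clearing the transient setting from Step 3 and then confirm cluster health:

PUT /_cluster/settings
{
  "transient": {
    "cluster.routing.allocation.enable": null
  }
}

GET /_cluster/health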