JVM garbage collection freezes occur when the garbage collector pauses all application threads to reclaim memory. In Elasticsearch, these "stop-the-world" pauses can cause node timeouts, cluster instability, and degraded performance. This guide helps you analyze and resolve GC freeze issues.
Understanding GC Freezes
Stop-the-World Pauses
During certain GC phases, the JVM must pause all application threads. These pauses are called "stop-the-world" (STW) events.
Impact on Elasticsearch:
- Node appears unresponsive during pause
- Other nodes may consider it disconnected
- Queries and indexing operations time out
- Master election can be triggered
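Elasticsearch's built-in GC monitor warns about this in the main node log. A line like the following (illustrative, not from a real cluster) means freezes are already happening:

```
[2024-01-15T10:30:47,123][WARN ][o.e.m.j.JvmGcMonitorService] [node-1] [gc][4512] overhead, spent [2.1s] collecting in the last [2.3s]
```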
Types of Pauses
| GC Type | Typical Duration | Cause |
|---|---|---|
| Young GC | 10-100 ms | Normal, frequent |
| Mixed GC (G1) | 100-500 ms | Normal, less frequent |
| Full GC | 1-60+ seconds | Problematic |
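In JDK unified GC logs (Java 9+), the three types are distinguishable by their pause labels. The lines below are illustrative examples of the format, not real measurements:

```
[gc] GC(12) Pause Young (Normal) (G1 Evacuation Pause) 120M->80M(1024M) 35.214ms
[gc] GC(13) Pause Young (Mixed) (G1 Evacuation Pause) 300M->150M(1024M) 182.407ms
[gc] GC(14) Pause Full (G1 Compaction Pause) 900M->400M(1024M) 4123.772ms
```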
Enabling GC Analysis
Configure GC Logging
# jvm.options.d/gc_logging.options
-Xlog:gc*:file=/var/log/elasticsearch/gc.log:time,pid,tags:filecount=10,filesize=64m
For older JVMs (Java 8):
-XX:+PrintGCDetails
-XX:+PrintGCDateStamps
-Xloggc:/var/log/elasticsearch/gc.log
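After restarting the node, confirm that pause events are actually being written (path matches the configuration above):

```bash
# Watch for pause events as they happen
tail -f /var/log/elasticsearch/gc.log | grep -i pause
```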
Monitor via API
GET /_nodes/stats/jvm?filter_path=nodes.*.jvm.gc
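The same stats can be pulled from the command line. A minimal sketch, assuming the node listens on localhost:9200 and `jq` is installed:

```bash
# Per-collector GC counters for every node in the cluster
curl -s "localhost:9200/_nodes/stats/jvm?filter_path=nodes.*.jvm.gc" | jq '
  .nodes[].jvm.gc.collectors | to_entries[] |
  {collector: .key,
   count: .value.collection_count,
   total_ms: .value.collection_time_in_millis}'
```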
Analyzing GC Logs
Key Metrics to Extract
- Pause time: Duration of each STW event
- Frequency: How often GC occurs
- Memory reclaimed: How much memory is freed
- Promotion rate: Objects moving from young to old generation
Sample G1GC Log Analysis
[2024-01-15T10:30:45.123+0000][gc] GC(1234) Pause Young (G1 Evacuation Pause) 24M->18M(256M) 45.678ms
Breakdown:
- `Pause Young`: young generation collection
- `24M->18M`: heap used before -> after the collection
- `(256M)`: total heap size
- `45.678ms`: pause duration
Identifying Long Pauses
# Find pauses longer than 1 second (unified logging reports durations like "1234.567ms")
grep -iE "pause.*[0-9]{4,}\.[0-9]+ms" /var/log/elasticsearch/gc.log
# Count pauses by duration range (requires gawk for match() with a capture array;
# the \. anchor captures the whole-millisecond part, not the fractional digits)
awk '/[Pp]ause/ && match($0, /([0-9]+)\.[0-9]+ms/, a) {
  if (a[1] + 0 > 1000)     print "LONG:  ", $0
  else if (a[1] + 0 > 500) print "MEDIUM:", $0
}' /var/log/elasticsearch/gc.log
Full GC Detection
Full GCs indicate severe memory pressure:
grep -i "Full GC\|Pause Full" /var/log/elasticsearch/gc.log
Common Causes of GC Freezes
Cause 1: Heap Too Large for Compressed Oops
If the heap exceeds ~32 GB, the JVM disables compressed ordinary object pointers (oops): every object reference doubles in size, so the same data occupies more heap and gives the collector more work.
Diagnosis:
grep "compressed ordinary object pointers" /var/log/elasticsearch/*.log
# Logged at node startup; should show [true]
Solution: Keep heap under 31 GB.
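You can also test a candidate heap size directly against the JVM using the standard `-XX:+PrintFlagsFinal` diagnostic; run it with the same `java` binary Elasticsearch uses:

```bash
# 31g keeps compressed oops enabled; 33g silently turns them off
java -Xmx31g -XX:+PrintFlagsFinal -version 2>/dev/null | grep UseCompressedOops
java -Xmx33g -XX:+PrintFlagsFinal -version 2>/dev/null | grep UseCompressedOops
```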
Cause 2: High Allocation Rate
Too many objects being created too quickly.
Diagnosis: Young GC running constantly (many times per second).
Solutions:
- Reduce bulk request sizes
- Optimize queries to create fewer intermediate objects
- Reduce aggregation complexity
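To quantify the churn before and after these changes, sum what young GCs reclaim from the GC log itself. A sketch, assuming gawk and the unified log format shown earlier with sizes reported in MB:

```bash
# Total memory reclaimed by young GCs, as a rough proxy for allocation rate
awk '/Pause Young/ && match($0, /([0-9]+)M->([0-9]+)M\(/, a) {
  reclaimed += a[1] - a[2]; n++
} END {
  if (n) printf "young GCs: %d, approx. reclaimed: %d MB\n", n, reclaimed
}' /var/log/elasticsearch/gc.log
```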
Cause 3: Memory Leak
Objects not being released, filling old generation.
Diagnosis: Old generation keeps growing, Full GCs become frequent.
Solution:
- Take heap dump and analyze with tools like Eclipse MAT
- Update Elasticsearch if bug-related
- Check for plugin issues
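Before taking a heap dump, you can confirm the pattern cheaply by sampling old-generation occupancy over time. A sketch assuming localhost:9200 and `jq`; a floor that keeps rising after each collection points to a leak:

```bash
# Print old-gen pool usage once a minute, per node
while true; do
  curl -s 'localhost:9200/_nodes/stats/jvm?filter_path=nodes.*.jvm.mem.pools.old' |
    jq -r '.nodes[].jvm.mem.pools.old | "\(now | todate) used=\(.used_in_bytes) max=\(.max_in_bytes)"'
  sleep 60
done
```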
Cause 4: Humongous Object Allocation
G1GC treats any object larger than half a region as "humongous": it is allocated directly in contiguous old-generation regions, bypassing the young generation, and can fragment the heap.
Diagnosis:
grep -i "humongous" /var/log/elasticsearch/gc.log
Solution:
# Increase G1 region size (32m is the maximum G1 supports)
-XX:G1HeapRegionSize=32m
Tuning GC for Fewer Freezes
G1GC Tuning (Recommended)
# jvm.options.d/gc_tuning.options
-XX:+UseG1GC
-XX:G1HeapRegionSize=32m
-XX:G1ReservePercent=25
-XX:InitiatingHeapOccupancyPercent=30
-XX:MaxGCPauseMillis=200
-XX:+ParallelRefProcEnabled
-XX:G1HeapWastePercent=5
Explanation of Key Settings
| Setting | Purpose |
|---|---|
| `G1HeapRegionSize` | Larger regions reduce humongous allocations |
| `G1ReservePercent` | Reserve memory to avoid promotion failures |
| `InitiatingHeapOccupancyPercent` | Start concurrent marking earlier |
| `MaxGCPauseMillis` | Target maximum pause time |
Cluster Settings for Tolerance
Make the cluster more tolerant of GC pauses by lengthening the fault-detection timeouts (both default to 10s):
# elasticsearch.yml
cluster.fault_detection.leader_check.timeout: 30s
cluster.fault_detection.follower_check.timeout: 30s
Capturing Heap Dumps for Analysis
During High Memory
# Generate heap dump
jmap -dump:format=b,file=/tmp/heap_dump.hprof <elasticsearch_pid>
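jmap must run as the same OS user that owns the Elasticsearch process. A sketch assuming a standard package install where that user is `elasticsearch`:

```bash
# Find the PID via the Elasticsearch main class, then dump as the owning user
ES_PID=$(pgrep -f org.elasticsearch.bootstrap.Elasticsearch)
sudo -u elasticsearch jmap -dump:format=b,file=/tmp/heap_dump.hprof "$ES_PID"
```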
On OOM
# jvm.options.d/heap_dump.options
-XX:+HeapDumpOnOutOfMemoryError
-XX:HeapDumpPath=/var/log/elasticsearch/
Analyzing Heap Dumps
Use Eclipse Memory Analyzer (MAT):
- Open heap dump
- Run "Leak Suspects Report"
- Analyze dominator tree
- Look for unexpected large objects
Monitoring and Alerting
Key Metrics
GET /_nodes/stats/jvm?filter_path=nodes.*.jvm.gc.collectors
Track:
- `collection_count`: should be stable, not rapidly increasing
- `collection_time_in_millis`: cumulative time spent collecting; divide its delta by the count delta to get the average time per collection
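Because both counters are cumulative, rates must be computed from two samples. A minimal sketch assuming localhost:9200, `jq`, and a single-node response:

```bash
# Average old-gen pause over a one-minute window
read c1 t1 < <(curl -s 'localhost:9200/_nodes/stats/jvm' |
  jq -r '.nodes[].jvm.gc.collectors.old | "\(.collection_count) \(.collection_time_in_millis)"')
sleep 60
read c2 t2 < <(curl -s 'localhost:9200/_nodes/stats/jvm' |
  jq -r '.nodes[].jvm.gc.collectors.old | "\(.collection_count) \(.collection_time_in_millis)"')
dc=$((c2 - c1))
[ "$dc" -gt 0 ] && echo "old GCs/min: $dc, avg pause: $(( (t2 - t1) / dc )) ms" \
               || echo "no old-gen collections in the window"
```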
Alert Thresholds
| Condition | Severity |
|---|---|
| GC pause > 500 ms | Warning |
| GC pause > 2 s | Critical |
| Full GC occurring | Critical |
| GC overhead > 10% | Critical |
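These thresholds translate directly into a log check. A cron-able sketch against the GC log path configured earlier:

```bash
#!/usr/bin/env bash
LOG=/var/log/elasticsearch/gc.log
# Any Full GC is critical
grep -qi "pause full" "$LOG" && echo "CRITICAL: Full GC detected"
# Pauses over 2000 ms (matches 2000-9999 or any 5+ digit millisecond value)
grep -qiE "pause.*([2-9][0-9]{3}|[0-9]{5,})\.[0-9]+ms" "$LOG" && echo "CRITICAL: pause > 2s"
```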
Quick Reference: GC Freeze Checklist
- GC logging enabled
- Heap under 31 GB (compressed oops)
- Heap no more than 50% of RAM (the rest is left for the filesystem cache)
- G1GC in use (default for modern ES)
- No Full GCs in logs
- Average GC pause < 200 ms
- Cluster timeouts accommodate occasional longer pauses