JVM garbage collection freezes occur when the garbage collector pauses all application threads to reclaim memory. In Elasticsearch, these "stop-the-world" pauses can cause node timeouts, cluster instability, and degraded performance. This guide helps you analyze and resolve GC freeze issues.
Understanding GC Freezes
Stop-the-World Pauses
During certain GC phases, the JVM must pause all application threads. These pauses are called "stop-the-world" (STW) events.
Impact on Elasticsearch:
- Node appears unresponsive during pause
- Other nodes may consider it disconnected
- Queries and indexing operations time out
- Master election can be triggered
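Elasticsearch's built-in GC monitor warns about this in the main node log. A line like the following (illustrative, not from a real cluster) means freezes are already happening:

```
[2024-01-15T10:30:47,123][WARN ][o.e.m.j.JvmGcMonitorService] [node-1] [gc][4512] overhead, spent [2.1s] collecting in the last [2.3s]
```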
Types of Pauses
| GC Type | Typical Duration | Cause |
|---|---|---|
| Young GC | 10-100 ms | Normal, frequent |
| Mixed GC (G1) | 100-500 ms | Normal, less frequent |
| Full GC | 1-60+ seconds | Problematic |
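In JDK unified GC logs (Java 9+), the three types are distinguishable by their pause labels. The lines below are illustrative examples of the format, not real measurements:

```
[gc] GC(12) Pause Young (Normal) (G1 Evacuation Pause) 120M->80M(1024M) 35.214ms
[gc] GC(13) Pause Young (Mixed) (G1 Evacuation Pause) 300M->150M(1024M) 182.407ms
[gc] GC(14) Pause Full (G1 Compaction Pause) 900M->400M(1024M) 4123.772ms
```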
Enabling GC Analysis
Configure GC Logging
# jvm.options.d/gc_logging.options
-Xlog:gc*:file=/var/log/elasticsearch/gc.log:time,pid,tags:filecount=10,filesize=64m
For older JVMs (Java 8):
-XX:+PrintGCDetails
-XX:+PrintGCDateStamps
-Xloggc:/var/log/elasticsearch/gc.log
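After restarting the node, confirm that pause events are actually being written (path matches the configuration above):

```bash
# Watch for pause events as they happen
tail -f /var/log/elasticsearch/gc.log | grep -i pause
```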
Monitor via API
GET /_nodes/stats/jvm?filter_path=nodes.*.jvm.gc
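The same stats can be pulled from the command line. A minimal sketch, assuming the node listens on localhost:9200 and `jq` is installed:

```bash
# Per-collector GC counters for every node in the cluster
curl -s "localhost:9200/_nodes/stats/jvm?filter_path=nodes.*.jvm.gc" | jq '
  .nodes[].jvm.gc.collectors | to_entries[] |
  {collector: .key,
   count: .value.collection_count,
   total_ms: .value.collection_time_in_millis}'
```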
Analyzing GC Logs
Key Metrics to Extract
- Pause time: Duration of each STW event
- Frequency: How often GC occurs
- Memory reclaimed: How much memory is freed
- Promotion rate: Objects moving from young to old generation
Sample G1GC Log Analysis
[2024-01-15T10:30:45.123+0000][gc] GC(1234) Pause Young (G1 Evacuation Pause) 24M->18M(256M) 45.678ms
Breakdown:
- `Pause Young`: young generation collection
- `24M->18M`: heap used before -> after the collection
- `(256M)`: total heap size
- `45.678ms`: pause duration
Identifying Long Pauses
# Find pauses longer than 1 second (unified logging reports durations like "1234.567ms")
grep -iE "pause.*[0-9]{4,}\.[0-9]+ms" /var/log/elasticsearch/gc.log
# Count pauses by duration range (requires gawk for match() with a capture array;
# the \. anchor captures the whole-millisecond part, not the fractional digits)
awk '/[Pp]ause/ && match($0, /([0-9]+)\.[0-9]+ms/, a) {
  if (a[1] + 0 > 1000)     print "LONG:  ", $0
  else if (a[1] + 0 > 500) print "MEDIUM:", $0
}' /var/log/elasticsearch/gc.log
Full GC Detection
Full GCs indicate severe memory pressure:
grep -i "Full GC\|Pause Full" /var/log/elasticsearch/gc.log
Common Causes of GC Freezes
Cause 1: Heap Too Large for Compressed Oops
If the heap exceeds ~32 GB, the JVM disables compressed ordinary object pointers (oops): every object reference doubles in size, so the same data occupies more heap and gives the collector more work.
Diagnosis:
grep "compressed ordinary object pointers" /var/log/elasticsearch/*.log
# Logged at node startup; should show [true]
Solution: Keep heap under 31 GB.
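You can also test a candidate heap size directly against the JVM using the standard `-XX:+PrintFlagsFinal` diagnostic; run it with the same `java` binary Elasticsearch uses:

```bash
# 31g keeps compressed oops enabled; 33g silently turns them off
java -Xmx31g -XX:+PrintFlagsFinal -version 2>/dev/null | grep UseCompressedOops
java -Xmx33g -XX:+PrintFlagsFinal -version 2>/dev/null | grep UseCompressedOops
```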
Cause 2: High Allocation Rate
Too many objects being created too quickly.
Diagnosis: Young GC running constantly (many times per second).
Solutions:
- Reduce bulk request sizes
- Optimize queries to create fewer intermediate objects
- Reduce aggregation complexity
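To quantify the churn before and after these changes, sum what young GCs reclaim from the GC log itself. A sketch, assuming gawk and the unified log format shown earlier with sizes reported in MB:

```bash
# Total memory reclaimed by young GCs, as a rough proxy for allocation rate
awk '/Pause Young/ && match($0, /([0-9]+)M->([0-9]+)M\(/, a) {
  reclaimed += a[1] - a[2]; n++
} END {
  if (n) printf "young GCs: %d, approx. reclaimed: %d MB\n", n, reclaimed
}' /var/log/elasticsearch/gc.log
```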
Cause 3: Memory Leak
Objects not being released, filling old generation.
Diagnosis: Old generation keeps growing, Full GCs become frequent.
Solution:
- Take heap dump and analyze with tools like Eclipse MAT
- Update Elasticsearch if bug-related
- Check for plugin issues
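Before taking a heap dump, you can confirm the pattern cheaply by sampling old-generation occupancy over time. A sketch assuming localhost:9200 and `jq`; a floor that keeps rising after each collection points to a leak:

```bash
# Print old-gen pool usage once a minute, per node
while true; do
  curl -s 'localhost:9200/_nodes/stats/jvm?filter_path=nodes.*.jvm.mem.pools.old' |
    jq -r '.nodes[].jvm.mem.pools.old | "\(now | todate) used=\(.used_in_bytes) max=\(.max_in_bytes)"'
  sleep 60
done
```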
Cause 4: Humongous Object Allocation
G1GC treats any object larger than half a region as "humongous": it is allocated directly in contiguous old-generation regions, bypassing the young generation, and can fragment the heap.
Diagnosis:
grep -i "humongous" /var/log/elasticsearch/gc.log
Solution:
# Increase G1 region size (32m is the maximum G1 supports)
-XX:G1HeapRegionSize=32m
Tuning GC for Fewer Freezes
G1GC Tuning (Recommended)
# jvm.options.d/gc_tuning.options
-XX:+UseG1GC
-XX:G1HeapRegionSize=32m
-XX:G1ReservePercent=25
-XX:InitiatingHeapOccupancyPercent=30
-XX:MaxGCPauseMillis=200
-XX:+ParallelRefProcEnabled
-XX:G1HeapWastePercent=5
Explanation of Key Settings
| Setting | Purpose |
|---|---|
| `G1HeapRegionSize` | Larger regions reduce humongous allocations |
| `G1ReservePercent` | Reserve memory to avoid promotion failures |
| `InitiatingHeapOccupancyPercent` | Start concurrent marking earlier |
| `MaxGCPauseMillis` | Target maximum pause time |
Cluster Settings for Tolerance
Make the cluster more tolerant of GC pauses by lengthening the fault-detection timeouts (both default to 10s):
# elasticsearch.yml
cluster.fault_detection.leader_check.timeout: 30s
cluster.fault_detection.follower_check.timeout: 30s
Capturing Heap Dumps for Analysis
During High Memory
# Generate heap dump
jmap -dump:format=b,file=/tmp/heap_dump.hprof <elasticsearch_pid>
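jmap must run as the same OS user that owns the Elasticsearch process. A sketch assuming a standard package install where that user is `elasticsearch`:

```bash
# Find the PID via the Elasticsearch main class, then dump as the owning user
ES_PID=$(pgrep -f org.elasticsearch.bootstrap.Elasticsearch)
sudo -u elasticsearch jmap -dump:format=b,file=/tmp/heap_dump.hprof "$ES_PID"
```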
On OOM
# jvm.options.d/heap_dump.options
-XX:+HeapDumpOnOutOfMemoryError
-XX:HeapDumpPath=/var/log/elasticsearch/
Analyzing Heap Dumps
Use Eclipse Memory Analyzer (MAT):
- Open heap dump
- Run "Leak Suspects Report"
- Analyze dominator tree
- Look for unexpected large objects
Monitoring and Alerting
Key Metrics
GET /_nodes/stats/jvm?filter_path=nodes.*.jvm.gc.collectors
Track:
- `collection_count`: should be stable, not rapidly increasing
- `collection_time_in_millis`: cumulative time spent collecting; divide its delta by the count delta to get the average time per collection
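Because both counters are cumulative, rates must be computed from two samples. A minimal sketch assuming localhost:9200, `jq`, and a single-node response:

```bash
# Average old-gen pause over a one-minute window
read c1 t1 < <(curl -s 'localhost:9200/_nodes/stats/jvm' |
  jq -r '.nodes[].jvm.gc.collectors.old | "\(.collection_count) \(.collection_time_in_millis)"')
sleep 60
read c2 t2 < <(curl -s 'localhost:9200/_nodes/stats/jvm' |
  jq -r '.nodes[].jvm.gc.collectors.old | "\(.collection_count) \(.collection_time_in_millis)"')
dc=$((c2 - c1))
[ "$dc" -gt 0 ] && echo "old GCs/min: $dc, avg pause: $(( (t2 - t1) / dc )) ms" \
               || echo "no old-gen collections in the window"
```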
Alert Thresholds
| Condition | Severity |
|---|---|
| GC pause > 500 ms | Warning |
| GC pause > 2 s | Critical |
| Full GC occurring | Critical |
| GC overhead > 10% | Critical |
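These thresholds translate directly into a log check. A cron-able sketch against the GC log path configured earlier:

```bash
#!/usr/bin/env bash
LOG=/var/log/elasticsearch/gc.log
# Any Full GC is critical
grep -qi "pause full" "$LOG" && echo "CRITICAL: Full GC detected"
# Pauses over 2000 ms (matches 2000-9999 or any 5+ digit millisecond value)
grep -qiE "pause.*([2-9][0-9]{3}|[0-9]{5,})\.[0-9]+ms" "$LOG" && echo "CRITICAL: pause > 2s"
```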
Quick Reference: GC Freeze Checklist
- GC logging enabled
- Heap under 31 GB (compressed oops)
- Heap no more than 50% of RAM (the rest is left for the filesystem cache)
- G1GC in use (default for modern ES)
- No Full GCs in logs
- Average GC pause < 200 ms
- Cluster timeouts accommodate occasional longer pauses