Elasticsearch JVM GC Freeze Analysis

JVM garbage collection freezes occur when the garbage collector pauses all application threads to reclaim memory. In Elasticsearch, these "stop-the-world" pauses can cause node timeouts, cluster instability, and degraded performance. This guide helps you analyze and resolve GC freeze issues.

Understanding GC Freezes

Stop-the-World Pauses

During certain GC phases, the JVM must pause all application threads. These pauses are called "stop-the-world" (STW) events.

Impact on Elasticsearch:

  • Node appears unresponsive during pause
  • Other nodes may consider it disconnected
  • Queries and indexing operations time out
  • Master election can be triggered

Types of Pauses

GC Type         Typical Duration   Cause
Young GC        10-100 ms          Normal, frequent
Mixed GC (G1)   100-500 ms         Normal, less frequent
Full GC         1-60+ seconds      Problematic

Enabling GC Analysis

Configure GC Logging

# jvm.options.d/gc_logging.options
-Xlog:gc*:file=/var/log/elasticsearch/gc.log:time,pid,tags:filecount=10,filesize=64m

For older JVMs (Java 8):

-XX:+PrintGCDetails
-XX:+PrintGCDateStamps
-Xloggc:/var/log/elasticsearch/gc.log

Monitor via API

GET /_nodes/stats/jvm?filter_path=nodes.*.jvm.gc
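
The same data can be fetched from a shell. A minimal sketch, assuming Elasticsearch listens on localhost:9200 without authentication (adjust the URL and credentials for your deployment):

# Per-node GC collection counts and cumulative collection times
curl -s "localhost:9200/_nodes/stats/jvm?filter_path=nodes.*.name,nodes.*.jvm.gc.collectors&pretty"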

Analyzing GC Logs

Key Metrics to Extract

  1. Pause time: Duration of each STW event
  2. Frequency: How often GC occurs
  3. Memory reclaimed: How much memory is freed
  4. Promotion rate: Objects moving from young to old generation
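
The first three metrics can be summarized straight from a unified-logging gc.log. A sketch, assuming pause lines look like the G1 sample in the next section and that GNU awk is available:

# Pause count, total, average, and worst pause (all in ms)
awk '/Pause/ && match($0, /([0-9]+\.[0-9]+)ms/, a) {
  n++; total += a[1]; if (a[1] + 0 > max) max = a[1]
}
END {
  if (n) printf "pauses: %d  total: %.1f ms  avg: %.2f ms  max: %.1f ms\n", n, total, total/n, max
}' /var/log/elasticsearch/gc.log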

Sample G1GC Log Analysis

[2024-01-15T10:30:45.123+0000][gc] GC(1234) Pause Young (G1 Evacuation Pause) 24M->18M(256M) 45.678ms

Breakdown:

  • Pause Young: Young generation collection
  • 24M->18M: Heap before -> after GC
  • (256M): Total heap size
  • 45.678ms: Pause duration
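
To quantify how much memory each collection reclaims (the third key metric above), the before/after figures can be parsed from the same field. A sketch, again assuming GNU awk and the log format shown above:

# Memory reclaimed per collection, parsed from the "24M->18M(256M)" field
awk 'match($0, /([0-9]+)M->([0-9]+)M\(([0-9]+)M\)/, a) {
  printf "reclaimed %4d MB (heap %sM -> %sM of %sM)\n", a[1] - a[2], a[1], a[2], a[3]
}' /var/log/elasticsearch/gc.log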

Identifying Long Pauses

# Find pauses longer than 1 second (unified logging reports pause times such as 1234.567ms)
grep -E "Pause.*[0-9]{4,}\.[0-9]+ms" /var/log/elasticsearch/gc.log

# Count pauses by duration range (the 3-argument match() requires GNU awk)
awk '/Pause/ && match($0, /([0-9]+)\.[0-9]+ms/, a) {
  if (a[1] >= 1000) print "LONG:", $0
  else if (a[1] >= 500) print "MEDIUM:", $0
}' /var/log/elasticsearch/gc.log

Full GC Detection

Full GCs indicate severe memory pressure:

grep -i "Full GC\|Pause Full" /var/log/elasticsearch/gc.log

Common Causes of GC Freezes

Cause 1: Heap Too Large for Compressed Oops

If the heap exceeds roughly 32 GB, the JVM disables compressed ordinary object pointers (oops); object references then take 8 bytes instead of 4, increasing per-object overhead and memory pressure.

Diagnosis:

grep "compressed ordinary object pointers" /var/log/elasticsearch/*.log
# Should show [true]
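
The same check is exposed through the nodes info API. A sketch, assuming Elasticsearch on localhost:9200 (the field is reported by recent versions):

# Should report "true" for every node
curl -s "localhost:9200/_nodes?filter_path=nodes.*.name,nodes.*.jvm.using_compressed_ordinary_object_pointers&pretty"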

Solution: Keep heap under 31 GB.

Cause 2: High Allocation Rate

Too many objects being created too quickly.

Diagnosis: Young GC running constantly (many times per second).

Solutions:

  • Reduce bulk request sizes
  • Optimize queries to create fewer intermediate objects
  • Reduce aggregation complexity
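
To confirm a high allocation rate before and after applying these changes, sample the young-collection counter twice and compare. A rough sketch, assuming localhost:9200 and the jq utility; the 60-second interval is illustrative:

# Cluster-wide young collections over one minute
count() {
  curl -s "localhost:9200/_nodes/stats/jvm" \
    | jq '[.nodes[].jvm.gc.collectors.young.collection_count] | add'
}
before=$(count); sleep 60; after=$(count)
echo "young collections in the last 60s: $((after - before))"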

Cause 3: Memory Leak

Objects not being released, filling old generation.

Diagnosis: Old generation keeps growing, Full GCs become frequent.

Solution:

  • Take heap dump and analyze with tools like Eclipse MAT
  • Update Elasticsearch if bug-related
  • Check for plugin issues
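
One way to watch for this growth pattern is to track old-generation pool usage over time. A sketch, assuming Elasticsearch on localhost:9200; usage that climbs steadily and never drops after collections points to a leak:

# Old-generation heap usage per node
curl -s "localhost:9200/_nodes/stats/jvm?filter_path=nodes.*.name,nodes.*.jvm.mem.pools.old&pretty"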

Cause 4: Humongous Object Allocation

G1GC struggles with objects larger than half a region.

Diagnosis:

grep -i "humongous" /var/log/elasticsearch/gc.log

Solution:

# Increase G1 region size
-XX:G1HeapRegionSize=32m
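
To see which region size G1 actually selects for a given heap (it is chosen ergonomically from the heap size unless overridden), the JVM's final flag values can be printed. A sketch; the heap size shown is illustrative:

# Print the region size G1 would pick for a 31 GB heap
java -Xms31g -Xmx31g -XX:+UseG1GC -XX:+PrintFlagsFinal -version | grep -w G1HeapRegionSize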

Tuning GC for Fewer Freezes

G1GC Tuning (Recommended)

# jvm.options.d/gc_tuning.options
-XX:+UseG1GC
-XX:G1HeapRegionSize=32m
-XX:G1ReservePercent=25
-XX:InitiatingHeapOccupancyPercent=30
-XX:MaxGCPauseMillis=200
-XX:+ParallelRefProcEnabled
-XX:G1HeapWastePercent=5

Explanation of Key Settings

Setting                          Purpose
G1HeapRegionSize                 Larger regions reduce humongous allocations
G1ReservePercent                 Reserves heap headroom to avoid evacuation failures
InitiatingHeapOccupancyPercent   Starts concurrent marking earlier
MaxGCPauseMillis                 Target maximum pause time (a goal, not a guarantee)

Cluster Settings for Tolerance

Make the cluster more tolerant of GC pauses. Note that longer timeouts also delay detection of genuinely failed nodes, so raise them conservatively:

# elasticsearch.yml
cluster.fault_detection.leader_check.timeout: 30s
cluster.fault_detection.follower_check.timeout: 30s
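
Since these are node-level settings read at startup, you can confirm each node picked them up after a restart. A sketch, assuming localhost:9200:

# Show the fault-detection settings each node is running with
curl -s "localhost:9200/_nodes?filter_path=nodes.*.settings.cluster.fault_detection&pretty"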

Capturing Heap Dumps for Analysis

During High Memory

# Generate heap dump
jmap -dump:format=b,file=/tmp/heap_dump.hprof <elasticsearch_pid>

On OOM

# jvm.options.d/heap_dump.options
-XX:+HeapDumpOnOutOfMemoryError
-XX:HeapDumpPath=/var/log/elasticsearch/

Analyzing Heap Dumps

Use Eclipse Memory Analyzer (MAT):

  1. Open heap dump
  2. Run "Leak Suspects Report"
  3. Analyze dominator tree
  4. Look for unexpected large objects

Monitoring and Alerting

Key Metrics

GET /_nodes/stats/jvm?filter_path=nodes.*.jvm.gc.collectors

Track:

  • collection_count: Cumulative number of collections; watch the rate of increase, not the absolute value
  • collection_time_in_millis: Cumulative time spent collecting; divide the delta by the sampling interval to estimate GC overhead
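
The GC overhead percentage (time spent in GC versus wall-clock time) can be estimated by sampling the cumulative counter twice. A sketch, assuming localhost:9200 plus the jq and bc utilities; the 300-second interval is illustrative:

# Percent of a 300 s window spent in GC, summed across the cluster
gc_ms() {
  curl -s "localhost:9200/_nodes/stats/jvm" \
    | jq '[.nodes[].jvm.gc.collectors[].collection_time_in_millis] | add'
}
t1=$(gc_ms); sleep 300; t2=$(gc_ms)
# delta_ms / 300000 ms * 100 = delta_ms / 3000; divide by node count too for a per-node average
echo "scale=2; ($t2 - $t1) / 3000" | bc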

Alert Thresholds

Condition            Severity
GC pause > 500 ms    Warning
GC pause > 2 s       Critical
Full GC occurring    Critical
GC overhead > 10%    Critical

Quick Reference: GC Freeze Checklist

  • GC logging enabled
  • Heap under 31 GB (compressed oops)
  • Heap no more than 50% of RAM (leave the rest for the filesystem cache)
  • G1GC in use (default for modern ES)
  • No Full GCs in logs
  • Average GC pause < 200 ms
  • Cluster timeouts accommodate occasional longer pauses