Elasticsearch JVM G1GC Tuning and Garbage Collection Troubleshooting

Elasticsearch runs on the JVM, and garbage collection behavior directly affects query latency, indexing throughput, and cluster stability. Since version 7.x (and specifically when running on JDK 14+), Elasticsearch defaults to the G1 garbage collector. G1GC replaced CMS as the default because it handles large heaps more predictably, targets configurable pause times, and avoids the catastrophic full-GC stalls that CMS was prone to under memory pressure. Understanding how G1GC works inside Elasticsearch helps you diagnose pause-related issues without resorting to cargo-cult JVM flag changes.

G1GC Concepts That Matter for Elasticsearch

G1GC divides the heap into equally sized regions (typically 1-32 MB each, auto-sized based on heap). Unlike older collectors that split the heap into fixed young and old generations, G1 assigns regions dynamically to eden, survivor, or old roles. This region-based layout lets G1 perform incremental collection - it identifies the regions with the most garbage (the "garbage-first" name) and collects those first during mixed collections.

Concurrent marking runs alongside application threads to identify live objects in old-generation regions. When concurrent marking completes, G1 performs mixed collections that evacuate live objects from selected old regions while also collecting young regions. The pause-time target (-XX:MaxGCPauseMillis, defaulting to 200ms in Elasticsearch) controls how many regions G1 attempts to collect per cycle. G1 trades throughput for pause predictability - it will collect fewer regions per pause to stay within the target.

Humongous allocations deserve special attention. Any object at least half the size of a G1 region is allocated as a "humongous" object, consuming one or more contiguous regions. These regions sit outside normal collection cycles and are only reclaimed during concurrent marking or full GC. In Elasticsearch, large aggregation buffers, bulk indexing batches, and oversized field values can trigger humongous allocations. When humongous regions fragment the heap, G1 may resort to a full GC to reclaim them - exactly the kind of long pause you want to avoid.
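
The threshold arithmetic is simple enough to sketch. The snippet below just illustrates the half-region rule described above for the power-of-two region sizes G1 chooses from; it is not an Elasticsearch API:

```python
MB = 1024 * 1024

def humongous_threshold(region_size_mb: int) -> int:
    """Smallest allocation (in bytes) G1 treats as humongous:
    half the region size."""
    return (region_size_mb * MB) // 2

for region_mb in (4, 8, 16, 32):
    print(f"{region_mb:>2} MB regions -> humongous at >= "
          f"{humongous_threshold(region_mb) // MB} MB")
```

With 16 MB regions, for example, any single allocation of 8 MB or more bypasses the young generation and lands in dedicated humongous regions.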

Default JVM Options in Elasticsearch

Elasticsearch ships a config/jvm.options file with minimal GC configuration. The relevant defaults are:

-XX:+UseG1GC

Heap size is determined automatically based on available system memory and node roles. To override it, create a file in jvm.options.d/ (not by editing jvm.options directly) with matching -Xms and -Xmx values. GC logging is enabled by default:

-Xlog:gc*,gc+age=trace,safepoint:file=gc.log:utctime,level,pid,tags:filecount=32,filesize=64m

This writes up to 32 rotated GC log files at 64 MB each into the Elasticsearch logs directory. The JVM is also configured with -XX:+HeapDumpOnOutOfMemoryError and -XX:+ExitOnOutOfMemoryError, so an OOM kills the node immediately and leaves a heap dump for analysis. Elasticsearch intentionally keeps G1GC tuning flags minimal - no custom region sizes, no initiating heap occupancy overrides, no reserve percent changes.
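
As an example of the jvm.options.d override mechanism mentioned above, a drop-in file for pinning the heap might look like the following (the filename and the 8 GB value are purely illustrative - size the heap for your own nodes):

# config/jvm.options.d/heap.options
-Xms8g
-Xmx8g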

Diagnosing GC Issues

Start with the _nodes/stats API rather than diving into raw GC logs. The JVM section exposes heap usage and GC collector statistics:

GET _nodes/stats?filter_path=nodes.*.jvm.mem,nodes.*.jvm.gc

Look at jvm.mem.heap_used_percent for current heap pressure. The jvm.gc.collectors.old object shows collection count and cumulative time spent in old-generation collections. A rising old-gen collection count or high cumulative collection time indicates the node is spending too much time in GC. Elasticsearch also logs warnings when GC overhead exceeds 50% of elapsed time, with messages like [gc][number] overhead, spent [21s] collecting in the last [40s].
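
A sketch of reading those fields programmatically. The sample payload below is a trimmed, made-up _nodes/stats response; the field names (jvm.mem.heap_used_percent, jvm.gc.collectors.old) match the API, but the thresholds (85% heap, 1 s average old-gen pause) are illustrative starting points, not official guidance:

```python
def gc_pressure(stats: dict) -> list:
    """Flag nodes with high heap usage or slow old-generation GC."""
    warnings = []
    for node_id, node in stats["nodes"].items():
        heap_pct = node["jvm"]["mem"]["heap_used_percent"]
        old = node["jvm"]["gc"]["collectors"]["old"]
        count = old["collection_count"]
        avg_ms = old["collection_time_in_millis"] / count if count else 0
        if heap_pct > 85:
            warnings.append(f"{node_id}: heap at {heap_pct}%")
        if avg_ms > 1000:
            warnings.append(f"{node_id}: old-gen collections averaging {avg_ms:.0f} ms")
    return warnings

# Made-up sample response, trimmed to the fields used above.
sample = {
    "nodes": {
        "node-1": {
            "jvm": {
                "mem": {"heap_used_percent": 91},
                "gc": {"collectors": {"old": {
                    "collection_count": 12,
                    "collection_time_in_millis": 30000,
                }}},
            }
        }
    }
}

for warning in gc_pressure(sample):
    print(warning)
```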

For deeper analysis, examine gc.log in the logs directory. G1GC log entries show the type of collection (young, mixed, or full), pause duration, and region evacuation details. Look for To-space exhausted entries - these indicate G1 could not find enough free regions to copy surviving objects during evacuation, forcing a full GC. Also watch for Humongous Allocation log lines, which reveal large object allocations consuming dedicated regions.
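
A minimal sketch of scanning gc.log for the markers described above. The sample entries are simplified - real unified-logging lines carry timestamps and more tags - but the key phrases are the ones to search for:

```python
# Phrases in G1 GC logs that signal trouble, per the discussion above.
MARKERS = ("To-space exhausted", "Pause Full", "Humongous")

# Simplified, illustrative log excerpt (not verbatim ES output).
sample_log = """\
[gc] GC(41) Pause Young (Normal) (G1 Evacuation Pause) 812M->240M(2048M) 18.412ms
[gc] GC(42) To-space exhausted
[gc] GC(43) Pause Full (G1 Compaction Pause) 1980M->610M(2048M) 2410.703ms
"""

flagged = [line for line in sample_log.splitlines()
           if any(marker in line for marker in MARKERS)]
for line in flagged:
    print("flag:", line)
```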

Common GC Problems

Long full GC pauses from humongous allocations. When large objects consume humongous regions faster than concurrent marking can identify them as reclaimable, G1 falls back to a stop-the-world full GC. This manifests as multi-second pauses on nodes handling large bulk requests or running heavy aggregations. Reducing bulk batch sizes or breaking up large aggregation responses can help.

To-space exhaustion. This occurs when the heap is fragmented enough that G1 cannot find contiguous free regions for survivor space during evacuation. The GC log will show To-space exhausted followed by a full GC. This typically happens under sustained high allocation rates when the heap is already above 75-80% utilization. The fix is usually reducing heap pressure - fewer concurrent heavy operations, smaller bulk sizes, or more nodes to distribute the load.

The 31 GB compressed oops boundary. The JVM uses compressed ordinary object pointers (compressed oops) when the heap is below approximately 32 GB, storing 64-bit pointers as 32-bit values. This effectively gives you more usable heap per gigabyte of configured heap. Crossing above ~31.5 GB (the exact threshold varies by JVM version and OS) disables compressed oops, meaning you need roughly 37-38 GB of configured heap to get the same effective capacity as 31 GB with compressed oops. Elasticsearch defaults to a 31 GB maximum for this reason. Setting -Xmx to 32g or 33g actually gives you less usable heap than 31g. If you need more heap than 31 GB, jump to at least 38 GB to compensate for the pointer overhead, and only do this when the workload genuinely demands it.
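
To confirm whether a given node is actually running with compressed oops, the nodes info API exposes a flag:

GET _nodes?filter_path=nodes.*.jvm.using_compressed_ordinary_object_pointers

Each node reports true or false for using_compressed_ordinary_object_pointers, which is a quicker check than parsing JVM startup flags.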

When Tuning G1GC Settings Is Justified

Elastic explicitly discourages changing GC settings. The default configuration works well for the vast majority of workloads, and incorrect tuning causes more problems than it solves. Reducing MaxGCPauseMillis below 200ms, for example, makes G1 collect fewer regions per cycle, which can increase collection frequency without reducing total GC overhead. Setting -XX:G1HeapRegionSize manually can worsen humongous allocation behavior if sized incorrectly.

That said, there are narrow cases where tuning is justified. Nodes running with heaps beyond 31 GB (for very high cardinality aggregation workloads or ML jobs) may benefit from increasing -XX:G1HeapRegionSize to 16m or 32m to reduce the likelihood of humongous allocations, since the humongous threshold is 50% of the region size. Workloads with extremely high allocation rates - heavy ingest nodes processing thousands of bulk requests per second - might benefit from increasing -XX:G1ReservePercent above the default 10% to keep more free regions available for evacuation.
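
If you do reach that point, the overrides belong in jvm.options.d/ just like heap settings. A drop-in file might look like the following - the values are illustrative, and should only be applied after GC logs have confirmed humongous-allocation or evacuation-failure problems:

# config/jvm.options.d/g1-tuning.options
-XX:G1HeapRegionSize=16m
-XX:G1ReservePercent=15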

Before changing any GC flags, verify the problem is actually GC-related. High heap pressure is more often a symptom of too many shards, expensive queries, or insufficient node count than a misconfigured garbage collector. Adding nodes or reducing workload is almost always a better first step than tuning JVM flags.
