Garbage collection (GC) overhead errors occur when the JVM spends too much time collecting garbage, leaving insufficient time for application processing. In Elasticsearch, this manifests as slow responses, timeouts, and potential node disconnections.
Understanding GC Overhead
What Causes GC Overhead
- High heap pressure: Too many objects being created
- Memory leaks: Objects not being released
- Undersized heap: Not enough memory for workload
- Excessive shard counts: Too much metadata
- Large aggregations: Creating many temporary objects
GC Overhead Symptoms
- Log entries: `[gc][warning]` lines with long durations
- `java.lang.OutOfMemoryError: GC overhead limit exceeded`
- Node becoming unresponsive during GC
- Cluster timeouts and node disconnections
Diagnosing GC Issues
Check GC Statistics
GET /_nodes/stats/jvm?filter_path=nodes.*.jvm.gc
Key metrics:
- `collection_count`: number of GC runs
- `collection_time_in_millis`: total time spent in GC
Calculate GC Overhead
GC overhead (%) = (total GC time / JVM uptime) * 100
If GC overhead exceeds 5-10%, you have a problem.
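As a quick check against the stats above, the overhead since JVM start can be computed per node with curl and jq (a minimal sketch assuming a node reachable on localhost:9200; young and old are the collector names Elasticsearch reports under G1GC):
# Per-node GC overhead since JVM start: (young + old GC time) / uptime * 100
curl -s 'localhost:9200/_nodes/stats/jvm' | jq -r '
  .nodes[] |
  "\(.name): \(
    (.jvm.gc.collectors.young.collection_time_in_millis
      + .jvm.gc.collectors.old.collection_time_in_millis)
    / .jvm.uptime_in_millis * 100
  )% GC overhead"'
Note this is a lifetime average; a node that only recently started thrashing can look healthy here, so check recent GC logs as well.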
Review GC Logs
Enable GC logging:
# jvm.options.d/gc_logging.options
-Xlog:gc*:file=/var/log/elasticsearch/gc.log:time,pid,tags:filecount=10,filesize=64m
Check for Long Pauses
# Find GC pauses longer than 1 second
grep -E "pause.*[0-9]{4,}ms|GC.*[0-9]\.[0-9]{2}s" /var/log/elasticsearch/gc.log
Common Causes and Solutions
Cause 1: Heap Size Too Small
Diagnosis:
GET /_nodes/stats/jvm?filter_path=nodes.*.jvm.mem.heap_used_percent
If consistently > 85%, heap is too small.
Solution:
# jvm.options.d/heap.options
# Heap should be about half of RAM, and below ~32 GB so compressed object pointers stay enabled
-Xms16g
-Xmx16g
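After a restart, confirm the new limits took effect; node info reports the configured maximum heap:
GET /_nodes/jvm?filter_path=nodes.*.jvm.mem.heap_max_in_bytes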
Cause 2: Too Many Shards
Diagnosis:
GET /_cluster/stats?filter_path=indices.shards.total
Solution: Reduce shard count:
- Consolidate small indices
- Use ILM to shrink old indices (see the example policy after this list)
- Increase shard size target
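As a sketch of the ILM route (the policy name and timing are illustrative), a warm-phase shrink action collapses an index to a single shard:
PUT /_ilm/policy/shrink-old-indices
{
  "policy": {
    "phases": {
      "warm": {
        "min_age": "7d",
        "actions": {
          "shrink": {
            "number_of_shards": 1
          }
        }
      }
    }
  }
}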
Cause 3: Large Fielddata Usage
Diagnosis:
GET /_nodes/stats/indices/fielddata?fields=*
Solution:
// Clear the fielddata cache across all indices
POST /_cache/clear?fielddata=true

// Keep fielddata disabled on text fields (false is the default)
PUT /my-index/_mapping
{
  "properties": {
    "my_field": {
      "type": "text",
      "fielddata": false
    }
  }
}
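Often the better fix is to avoid fielddata entirely and aggregate on a keyword sub-field, which uses on-disk doc values instead of heap (a sketch; my_field and the raw sub-field name are illustrative):
PUT /my-index/_mapping
{
  "properties": {
    "my_field": {
      "type": "text",
      "fields": {
        "raw": { "type": "keyword" }
      }
    }
  }
}
Aggregations and sorts then target my_field.raw rather than my_field.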
Cause 4: Memory-Intensive Queries
Diagnosis: Check slow logs for expensive queries.
Solution:
- Reduce aggregation bucket sizes
- Add query timeouts
- Use `search_after` instead of deep pagination (see the sketch below)
- Limit result sizes
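A minimal search_after sketch (index, fields, and values are illustrative; the search_after array is copied from the sort values of the last hit of the previous page, and a unique keyword field such as id serves as the tiebreaker):
GET /my-index/_search
{
  "size": 1000,
  "sort": [
    { "@timestamp": "asc" },
    { "id": "asc" }
  ],
  "search_after": ["2024-01-01T00:00:00Z", "doc-41000"]
}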
Cause 5: Suboptimal GC Configuration
Solution - Tune G1GC (default in modern versions):
# jvm.options.d/gc_tuning.options
-XX:+UseG1GC
-XX:G1HeapRegionSize=32m
-XX:G1ReservePercent=25
-XX:InitiatingHeapOccupancyPercent=30
-XX:+ParallelRefProcEnabled
GC Types and Their Impact
Young Generation GC (Minor GC)
- Frequency: Often (many times per minute)
- Duration: Short (milliseconds)
- Impact: Usually acceptable
Old Generation GC (Major GC / Full GC)
- Frequency: Less often
- Duration: Longer (can be seconds)
- Impact: Can cause timeouts
G1GC Mixed Collections
- Behavior: Incrementally collects old generation
- Advantage: Shorter pauses than full GC
- Monitoring: Watch for "pause" times in GC logs
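With JDK unified logging (the format enabled earlier), mixed collections are easy to spot by their pause label:
grep "Pause Young (Mixed)" /var/log/elasticsearch/gc.log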
Immediate Actions for GC Overhead
Step 1: Reduce Memory Pressure
POST /_cache/clear
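The clear-cache API also accepts per-cache flags if you only want to drop the heaviest consumers:
POST /_cache/clear?fielddata=true&request=true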
Step 2: Cancel Resource-Intensive Tasks
GET /_tasks?detailed=true
POST /_tasks/{task_id}/_cancel
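Searches are the usual suspects; the actions filter narrows the task list to them:
GET /_tasks?detailed=true&actions=*search*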
Step 3: Reduce Incoming Load
Temporarily pause bulk indexing or reduce query rate.
Step 4: Force Full GC (Use Sparingly)
This can help if memory is fragmented:
# Via JMX, or with jcmd (ships with the JDK); run as the same user as Elasticsearch:
jcmd <elasticsearch_pid> GC.run
# Note: kill -3 only prints a thread dump; it does not trigger a GC or heap dump
Long-Term Solutions
Optimize Heap Configuration
# jvm.options.d/heap.options
# Always set min and max equal to avoid heap-resize pauses
-Xms16g
-Xmx16g
Configure Circuit Breakers
Prevent operations that would cause GC issues:
PUT /_cluster/settings
{
  "persistent": {
    "indices.breaker.total.limit": "70%",
    "indices.breaker.request.limit": "40%",
    "indices.breaker.fielddata.limit": "40%"
  }
}
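After tightening the limits, breaker trips show up in node stats; a rising tripped count means the breakers are absorbing pressure that would otherwise become GC load:
GET /_nodes/stats/breaker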
Scale Horizontally
Add more nodes to distribute memory load.
Implement Query Governance
- Set default timeouts
- Limit aggregation sizes (a cluster-settings sketch covering both follows this list)
- Validate queries before execution
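The first two items map directly to dynamic cluster settings (a sketch; the values are illustrative):
PUT /_cluster/settings
{
  "persistent": {
    "search.default_search_timeout": "30s",
    "search.max_buckets": 10000
  }
}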
Monitoring Best Practices
Key Metrics to Track
| Metric | Warning | Critical |
|---|---|---|
| Heap Used % | > 75% | > 85% |
| GC Overhead % | > 5% | > 10% |
| Old Gen GC Frequency | > 1/min | > 5/min |
| GC Pause Duration | > 1s | > 5s |
Set Up Alerts
# Prometheus example (metric names vary by exporter)
# GC time is a cumulative counter, so alert on its rate; 0.1 means ~10% of time in GC
rate(elasticsearch_jvm_gc_collection_time_seconds[5m]) > 0.1
elasticsearch_jvm_memory_used_bytes / elasticsearch_jvm_memory_max_bytes > 0.85