Elasticsearch GC Overhead Errors

Garbage collection (GC) overhead errors occur when the JVM spends too much time collecting garbage, leaving insufficient time for application processing. In Elasticsearch, this manifests as slow responses, timeouts, and potential node disconnections.

Understanding GC Overhead

What Causes GC Overhead

  • High heap pressure: Too many objects being created
  • Memory leaks: Objects not being released
  • Undersized heap: Not enough memory for workload
  • Excessive shard counts: Too much metadata
  • Large aggregations: Creating many temporary objects

GC Overhead Symptoms

  • Log warnings from the JVM GC monitor, e.g. [gc][...] overhead, spent [Xs] collecting in the last [Ys]
  • java.lang.OutOfMemoryError: GC overhead limit exceeded
  • Node becoming unresponsive during GC
  • Cluster timeouts and node disconnections

Diagnosing GC Issues

Check GC Statistics

GET /_nodes/stats/jvm?filter_path=nodes.*.jvm.gc

Key metrics:

  • collection_count: Number of GC runs
  • collection_time_in_millis: Total time spent in GC

Calculate GC Overhead

GC overhead = (GC time / Total time) * 100

If GC overhead exceeds 5-10%, you have a problem.
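
You can measure this directly from the node stats API. The following is a rough sketch in Python (assuming an unauthenticated node at http://localhost:9200 and the requests library; adjust the URL and auth for your cluster) that samples each node twice and computes GC overhead from the deltas:

# gc_overhead.py - rough per-node GC overhead check (sketch, not production code)
import time
import requests

ES = "http://localhost:9200"
INTERVAL = 60  # seconds between the two samples

def gc_millis(stats):
    """Sum young + old collection_time_in_millis per node from /_nodes/stats/jvm."""
    totals = {}
    for node_id, node in stats["nodes"].items():
        collectors = node["jvm"]["gc"]["collectors"]
        totals[node_id] = sum(c["collection_time_in_millis"] for c in collectors.values())
    return totals

first = gc_millis(requests.get(f"{ES}/_nodes/stats/jvm").json())
time.sleep(INTERVAL)
second = gc_millis(requests.get(f"{ES}/_nodes/stats/jvm").json())

for node_id, total in second.items():
    delta_ms = total - first.get(node_id, total)
    overhead = 100.0 * delta_ms / (INTERVAL * 1000)
    flag = "  <-- investigate" if overhead > 5 else ""
    print(f"{node_id}: GC overhead {overhead:.1f}%{flag}")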

Review GC Logs

Enable GC logging:

# jvm.options.d/gc_logging.options
-Xlog:gc*:file=/var/log/elasticsearch/gc.log:time,pid,tags:filecount=10,filesize=64m

Check for Long Pauses

# Find GC pauses longer than 1 second
grep -E "pause.*[0-9]{4,}ms|GC.*[0-9]\.[0-9]{2}s" /var/log/elasticsearch/gc.log

Common Causes and Solutions

Cause 1: Heap Size Too Small

Diagnosis:

GET /_nodes/stats/jvm?filter_path=nodes.*.jvm.mem.heap_used_percent

If consistently > 85%, heap is too small.
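
A single reading can be misleading, since heap usage naturally swings between collections. This sketch (same assumptions as the earlier Python example: local unauthenticated node, requests library) takes several samples and flags nodes that stay above 85% the whole time:

# heap_check.py - flag nodes whose heap stays above 85% across several samples (sketch)
import time
import requests

ES = "http://localhost:9200"
SAMPLES, INTERVAL, THRESHOLD = 5, 30, 85

over = {}
for i in range(SAMPLES):
    stats = requests.get(f"{ES}/_nodes/stats/jvm").json()
    for node_id, node in stats["nodes"].items():
        pct = node["jvm"]["mem"]["heap_used_percent"]
        over[node_id] = over.get(node_id, 0) + (pct > THRESHOLD)
    if i < SAMPLES - 1:
        time.sleep(INTERVAL)

for node_id, count in over.items():
    if count == SAMPLES:
        print(f"{node_id}: heap above {THRESHOLD}% in all {SAMPLES} samples - likely undersized")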

Solution:

# jvm.options.d/heap.options
# Set heap to no more than 50% of available RAM, and keep it below ~32 GB (the compressed-oops threshold)
-Xms16g
-Xmx16g

Cause 2: Too Many Shards

Diagnosis:

GET /_cluster/stats?filter_path=indices.shards.total
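
To see where those shards come from, the following sketch (same local-node assumption) uses the _cat/indices API to list indices whose average primary shard is very small, which are the usual consolidation candidates:

# shard_report.py - list indices with small average primary shard size (sketch)
import requests

ES = "http://localhost:9200"
MIN_AVG_GB = 1.0  # flag indices whose average primary shard is under this size

indices = requests.get(f"{ES}/_cat/indices?format=json&bytes=b").json()
for idx in sorted(indices, key=lambda i: i["index"]):
    primaries = int(idx["pri"])
    size_bytes = int(idx["pri.store.size"] or 0)
    avg_gb = size_bytes / primaries / 1024**3 if primaries else 0.0
    if avg_gb < MIN_AVG_GB:
        print(f"{idx['index']}: {primaries} primaries, avg shard {avg_gb:.2f} GB - consider consolidating")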

Solution: Reduce shard count:

  • Consolidate small indices
  • Use ILM to shrink old indices
  • Increase shard size target

Cause 3: Large Fielddata Usage

Diagnosis:

GET /_nodes/stats/indices/fielddata?fields=*

Solution:

// Clear fielddata cache
POST /_cache/clear?fielddata=true

// Prevent fielddata on text fields
PUT /my-index/_mapping
{
  "properties": {
    "my_field": {
      "type": "text",
      "fielddata": false
    }
  }
}

Cause 4: Memory-Intensive Queries

Diagnosis: Check slow logs for expensive queries.

Solution:

  • Reduce aggregation bucket sizes
  • Add query timeouts
  • Use search_after instead of deep pagination (see the sketch below)
  • Limit result sizes
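
For the search_after point, here is a minimal sketch (local node, requests library; my-index, @timestamp, and doc_id are hypothetical names for your index, sort field, and unique tiebreaker field) that pages through results without the heap cost of deep from/size pagination:

# search_after_paging.py - page through results with search_after (sketch)
import requests

ES = "http://localhost:9200"
INDEX = "my-index"  # hypothetical index name

body = {
    "size": 1000,
    "query": {"match_all": {}},
    # Sort must include a unique tiebreaker so pages don't skip or repeat documents.
    "sort": [{"@timestamp": "asc"}, {"doc_id": "asc"}],
}

while True:
    hits = requests.post(f"{ES}/{INDEX}/_search", json=body).json()["hits"]["hits"]
    if not hits:
        break
    for hit in hits:
        pass  # process hit["_source"] here
    # Resume from the last hit's sort values instead of increasing "from".
    body["search_after"] = hits[-1]["sort"]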

Cause 5: Suboptimal GC Configuration

Solution - Tune G1GC (default in modern versions):

# jvm.options.d/gc_tuning.options
-XX:+UseG1GC
-XX:G1HeapRegionSize=32m
-XX:G1ReservePercent=25
-XX:InitiatingHeapOccupancyPercent=30
-XX:+ParallelRefProcEnabled

GC Types and Their Impact

Young Generation GC (Minor GC)

  • Frequency: Often (many times per minute)
  • Duration: Short (milliseconds)
  • Impact: Usually acceptable

Old Generation GC (Major GC / Full GC)

  • Frequency: Less often
  • Duration: Longer (can be seconds)
  • Impact: Can cause timeouts

G1GC Mixed Collections

  • Behavior: Incrementally collects old generation
  • Advantage: Shorter pauses than full GC
  • Monitoring: Watch for "pause" times in GC logs

Immediate Actions for GC Overhead

Step 1: Reduce Memory Pressure

POST /_cache/clear

Step 2: Cancel Resource-Intensive Tasks

GET /_tasks?detailed=true
POST /_tasks/{task_id}/_cancel
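
As a sketch (same local-node assumption), the following lists cancellable search tasks that have been running longer than a threshold and cancels them; review the printed output before lowering the threshold:

# cancel_long_searches.py - cancel long-running cancellable search tasks (sketch)
import requests

ES = "http://localhost:9200"
MAX_SECONDS = 60  # cancel search tasks running longer than this

tasks = requests.get(f"{ES}/_tasks?detailed=true&actions=*search*").json()
for node in tasks["nodes"].values():
    for task_id, task in node["tasks"].items():
        running_s = task["running_time_in_nanos"] / 1e9
        if task.get("cancellable") and running_s > MAX_SECONDS:
            print(f"cancelling {task_id} ({task['action']}, {running_s:.0f}s)")
            requests.post(f"{ES}/_tasks/{task_id}/_cancel")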

Step 3: Reduce Incoming Load

Temporarily pause bulk indexing or reduce query rate.

Step 4: Force Full GC (Use Sparingly)

This can help if memory is fragmented:

# Via JMX (e.g. jconsole), or from the command line:
jcmd <elasticsearch_pid> GC.run  # requests an explicit full GC; a no-op if explicit GC is disabled

Long-Term Solutions

Optimize Heap Configuration

# jvm.options.d/heap.options
# Always set min and max equal
-Xms16g
-Xmx16g

Configure Circuit Breakers

Prevent operations that would cause GC issues:

PUT /_cluster/settings
{
  "persistent": {
    "indices.breaker.total.limit": "70%",
    "indices.breaker.request.limit": "40%",
    "indices.breaker.fielddata.limit": "40%"
  }
}

Scale Horizontally

Add more nodes to distribute memory load.

Implement Query Governance

  • Set default timeouts (see the sketch below)
  • Limit aggregation sizes
  • Validate queries before execution
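
For the first two points, a sketch (same local-node assumption; the values are examples, not recommendations) that applies a cluster-wide default search timeout and a lower aggregation bucket limit:

# query_governance.py - apply a default search timeout and bucket limit (sketch)
import requests

ES = "http://localhost:9200"

settings = {
    "persistent": {
        "search.default_search_timeout": "30s",  # applied to searches that don't set their own timeout
        "search.max_buckets": 10000,             # reject aggregations that would create too many buckets
    }
}
response = requests.put(f"{ES}/_cluster/settings", json=settings)
response.raise_for_status()
print(response.json())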

Monitoring Best Practices

Key Metrics to Track

Metric                 Warning    Critical
Heap Used %            > 75%      > 85%
GC Overhead %          > 5%       > 10%
Old Gen GC Frequency   > 1/min    > 5/min
GC Pause Duration      > 1s       > 5s

Set Up Alerts

# Prometheus example (metric names depend on your exporter)
rate(elasticsearch_jvm_gc_collection_time_seconds[5m]) > 0.05   # more than 5% of time spent in GC
elasticsearch_jvm_memory_used_bytes / elasticsearch_jvm_memory_max_bytes > 0.85