Garbage collection (GC) overhead errors occur when the JVM spends too much time collecting garbage, leaving insufficient time for application processing. In Elasticsearch, this manifests as slow responses, timeouts, and potential node disconnections.
Understanding GC Overhead
What Causes GC Overhead
- High heap pressure: Too many objects being created
- Memory leaks: Objects not being released
- Undersized heap: Not enough memory for workload
- Excessive shard counts: Too much metadata
- Large aggregations: Creating many temporary objects
GC Overhead Symptoms
- Log entries: `[gc][warning]` lines with long durations
- `java.lang.OutOfMemoryError: GC overhead limit exceeded`
- Node becoming unresponsive during GC
- Cluster timeouts and node disconnections
Diagnosing GC Issues
Check GC Statistics
GET /_nodes/stats/jvm?filter_path=nodes.*.jvm.gc
Key metrics:
- `collection_count`: number of GC runs
- `collection_time_in_millis`: total time spent in GC
Calculate GC Overhead
GC overhead (%) = (total GC time / JVM uptime) * 100
If GC overhead exceeds 5-10%, you have a problem.
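As a quick check against the stats above, the overhead since JVM start can be computed per node with curl and jq (a minimal sketch assuming a node reachable on localhost:9200; young and old are the collector names Elasticsearch reports under G1GC):
# Per-node GC overhead since JVM start: (young + old GC time) / uptime * 100
curl -s 'localhost:9200/_nodes/stats/jvm' | jq -r '
  .nodes[] |
  "\(.name): \(
    (.jvm.gc.collectors.young.collection_time_in_millis
      + .jvm.gc.collectors.old.collection_time_in_millis)
    / .jvm.uptime_in_millis * 100
  )% GC overhead"'
Note this is a lifetime average; a node that only recently started thrashing can look healthy here, so check recent GC logs as well.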
Review GC Logs
Enable GC logging:
# jvm.options.d/gc_logging.options
-Xlog:gc*:file=/var/log/elasticsearch/gc.log:time,pid,tags:filecount=10,filesize=64m
Check for Long Pauses
# Find GC pauses longer than 1 second
grep -E "pause.*[0-9]{4,}ms|GC.*[0-9]\.[0-9]{2}s" /var/log/elasticsearch/gc.log
Common Causes and Solutions
Cause 1: Heap Size Too Small
Diagnosis:
GET /_nodes/stats/jvm?filter_path=nodes.*.jvm.mem.heap_used_percent
If consistently > 85%, heap is too small.
Solution:
# jvm.options.d/heap.options
# Heap should be about half of RAM, and below ~32 GB so compressed object pointers stay enabled
-Xms16g
-Xmx16g
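After a restart, confirm the new limits took effect; node info reports the configured maximum heap:
GET /_nodes/jvm?filter_path=nodes.*.jvm.mem.heap_max_in_bytes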
Cause 2: Too Many Shards
Diagnosis:
GET /_cluster/stats?filter_path=indices.shards.total
Solution: Reduce shard count:
- Consolidate small indices
- Use ILM to shrink old indices (see the example policy after this list)
- Increase shard size target
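As a sketch of the ILM route (the policy name and timing are illustrative), a warm-phase shrink action collapses an index to a single shard:
PUT /_ilm/policy/shrink-old-indices
{
  "policy": {
    "phases": {
      "warm": {
        "min_age": "7d",
        "actions": {
          "shrink": {
            "number_of_shards": 1
          }
        }
      }
    }
  }
}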
Cause 3: Large Fielddata Usage
Diagnosis:
GET /_nodes/stats/indices/fielddata?fields=*
Solution:
// Clear the fielddata cache across all indices
POST /_cache/clear?fielddata=true

// Keep fielddata disabled on text fields (false is the default)
PUT /my-index/_mapping
{
  "properties": {
    "my_field": {
      "type": "text",
      "fielddata": false
    }
  }
}
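Often the better fix is to avoid fielddata entirely and aggregate on a keyword sub-field, which uses on-disk doc values instead of heap (a sketch; my_field and the raw sub-field name are illustrative):
PUT /my-index/_mapping
{
  "properties": {
    "my_field": {
      "type": "text",
      "fields": {
        "raw": { "type": "keyword" }
      }
    }
  }
}
Aggregations and sorts then target my_field.raw rather than my_field.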
Cause 4: Memory-Intensive Queries
Diagnosis: Check slow logs for expensive queries.
Solution:
- Reduce aggregation bucket sizes
- Add query timeouts
- Use `search_after` instead of deep pagination (see the sketch below)
- Limit result sizes
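A minimal search_after sketch (index, fields, and values are illustrative; the search_after array is copied from the sort values of the last hit of the previous page, and a unique keyword field such as id serves as the tiebreaker):
GET /my-index/_search
{
  "size": 1000,
  "sort": [
    { "@timestamp": "asc" },
    { "id": "asc" }
  ],
  "search_after": ["2024-01-01T00:00:00Z", "doc-41000"]
}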
Cause 5: Suboptimal GC Configuration
Solution - Tune G1GC (default in modern versions):
# jvm.options.d/gc_tuning.options
-XX:+UseG1GC
-XX:G1HeapRegionSize=32m
-XX:G1ReservePercent=25
-XX:InitiatingHeapOccupancyPercent=30
-XX:+ParallelRefProcEnabled
GC Types and Their Impact
Young Generation GC (Minor GC)
- Frequency: Often (many times per minute)
- Duration: Short (milliseconds)
- Impact: Usually acceptable
Old Generation GC (Major GC / Full GC)
- Frequency: Less often
- Duration: Longer (can be seconds)
- Impact: Can cause timeouts
G1GC Mixed Collections
- Behavior: Incrementally collects old generation
- Advantage: Shorter pauses than full GC
- Monitoring: Watch for "pause" times in GC logs
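With JDK unified logging (the format enabled earlier), mixed collections are easy to spot by their pause label:
grep "Pause Young (Mixed)" /var/log/elasticsearch/gc.log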
Immediate Actions for GC Overhead
Step 1: Reduce Memory Pressure
POST /_cache/clear
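The clear-cache API also accepts per-cache flags if you only want to drop the heaviest consumers:
POST /_cache/clear?fielddata=true&request=true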
Step 2: Cancel Resource-Intensive Tasks
GET /_tasks?detailed=true
POST /_tasks/{task_id}/_cancel
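Searches are the usual suspects; the actions filter narrows the task list to them:
GET /_tasks?detailed=true&actions=*search*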
Step 3: Reduce Incoming Load
Temporarily pause bulk indexing or reduce query rate.
Step 4: Force Full GC (Use Sparingly)
This can help if memory is fragmented:
# Via JMX, or with jcmd (ships with the JDK); run as the same user as Elasticsearch:
jcmd <elasticsearch_pid> GC.run
# Note: kill -3 only prints a thread dump; it does not trigger a GC or heap dump
Long-Term Solutions
Optimize Heap Configuration
# jvm.options.d/heap.options
# Always set min and max equal to avoid heap-resize pauses
-Xms16g
-Xmx16g
Configure Circuit Breakers
Prevent operations that would cause GC issues:
PUT /_cluster/settings
{
  "persistent": {
    "indices.breaker.total.limit": "70%",
    "indices.breaker.request.limit": "40%",
    "indices.breaker.fielddata.limit": "40%"
  }
}
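After tightening the limits, breaker trips show up in node stats; a rising tripped count means the breakers are absorbing pressure that would otherwise become GC load:
GET /_nodes/stats/breaker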
Scale Horizontally
Add more nodes to distribute memory load.
Implement Query Governance
- Set default timeouts
- Limit aggregation sizes (a cluster-settings sketch covering both follows this list)
- Validate queries before execution
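The first two items map directly to dynamic cluster settings (a sketch; the values are illustrative):
PUT /_cluster/settings
{
  "persistent": {
    "search.default_search_timeout": "30s",
    "search.max_buckets": 10000
  }
}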
Monitoring Best Practices
Key Metrics to Track
| Metric | Warning | Critical |
|---|---|---|
| Heap Used % | > 75% | > 85% |
| GC Overhead % | > 5% | > 10% |
| Old Gen GC Frequency | > 1/min | > 5/min |
| GC Pause Duration | > 1s | > 5s |
Set Up Alerts
# Prometheus example (metric names vary by exporter)
# GC time is a cumulative counter, so alert on its rate; 0.1 means ~10% of time in GC
rate(elasticsearch_jvm_gc_collection_time_seconds[5m]) > 0.1
elasticsearch_jvm_memory_used_bytes / elasticsearch_jvm_memory_max_bytes > 0.85