Write thread pool rejections occur when Elasticsearch cannot keep up with indexing requests, causing the write queue to overflow. This results in EsRejectedExecutionException errors (surfaced to clients as HTTP 429 responses) and failed indexing operations.
Understanding Write Thread Pool
Thread Pool Structure
- Size: Number of concurrent workers (default: number of processors)
- Queue: Buffer for pending requests (default: 10000)
- Rejection: When the queue is full, new requests are rejected (the sketch below shows how to check each node's configured limits)
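To see what each node is actually running with, you can read the per-node thread pool configuration. A minimal sketch with the Python client (the localhost URL is an assumption; field names follow the 7.x response format):
from elasticsearch import Elasticsearch

# Assumption: cluster reachable on localhost:9200
es = Elasticsearch("http://localhost:9200")

# Per-node thread pool configuration, including the write pool's size and queue_size
info = es.nodes.info(metric="thread_pool")
for node in info["nodes"].values():
    write_pool = node["thread_pool"]["write"]
    print(node["name"], "size:", write_pool.get("size"), "queue:", write_pool.get("queue_size"))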
Error Message
EsRejectedExecutionException: rejected execution of ...
on EsThreadPoolExecutor[write, queue capacity = 10000, ...]
Diagnosing Write Rejections
Check Thread Pool Status
GET /_cat/thread_pool/write?v&h=node_name,name,active,queue,rejected
Monitor Over Time
GET /_nodes/stats/thread_pool/write
Key metrics (a polling sketch follows this list):
- active: Currently executing threads
- queue: Requests waiting for a free thread
- rejected: Cumulative rejections since the node started
- completed: Successfully processed requests
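Because rejected is a cumulative counter (it only resets when the node restarts), the useful signal is its delta between samples. A minimal polling sketch, reusing the es client created above (the 60-second interval is an assumption):
import time

def write_rejections_per_node():
    """Return {node_name: cumulative rejected count} for the write pool."""
    stats = es.nodes.stats(metric="thread_pool")
    return {
        node["name"]: node["thread_pool"]["write"]["rejected"]
        for node in stats["nodes"].values()
    }

previous = write_rejections_per_node()
while True:
    time.sleep(60)  # sample once per minute
    current = write_rejections_per_node()
    for name, rejected in current.items():
        delta = rejected - previous.get(name, rejected)
        if delta > 0:
            print(f"{name}: {delta} write rejections in the last minute")
    previous = current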
Check Bulk Queue
GET /_cat/thread_pool/bulk?v # ES 6.2 and earlier (pool named bulk)
GET /_cat/thread_pool/write?v # ES 6.3+ (bulk was renamed to write)
Common Causes and Solutions
Cause 1: Indexing Rate Too High
Symptoms:
- Consistent queue near maximum
- High rejection rate
- Multiple clients sending bulk requests
Solutions:
- Reduce client concurrency:
# Python example: reuse a single client (and its connection pool) with retries enabled
from elasticsearch import Elasticsearch, helpers

es = Elasticsearch(
    max_retries=10,
    retry_on_timeout=True
)

# streaming_bulk yields one result per document and provides natural backpressure;
# the generator must be consumed for any indexing to happen
# (actions: an iterable of bulk actions, as in the helpers documentation)
for ok, item in helpers.streaming_bulk(es, actions, chunk_size=500):
    if not ok:
        print(f"Failed: {item}")
- Implement client-side throttling:
import time
from elasticsearch import helpers

def index_with_backoff(es, actions):
    # max_retries makes streaming_bulk retry documents rejected with HTTP 429;
    # raise_on_error=False yields remaining failures instead of raising
    for ok, item in helpers.streaming_bulk(
        es, actions, max_retries=5, raise_on_error=False
    ):
        if not ok:
            time.sleep(1)  # brief backoff after a failed document
Cause 2: Bulk Request Size Too Large
Symptoms:
- Large bulk requests taking too long
- Other requests queuing behind them
Solutions:
- Optimize bulk size (target 5-15 MB per request); a client-side sketch follows this list:
POST /_bulk
# Keep to 500-2000 documents per request
# Or 5-15 MB total payload
- Monitor bulk performance:
GET /_nodes/stats/indices/indexing
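With the Python bulk helpers, both limits can be enforced on the client side: chunk_size caps the document count and max_chunk_bytes caps the payload size. A sketch, with es and actions as in the earlier examples (the 10 MB figure is an assumption within the 5-15 MB target):
from elasticsearch import helpers

# Send at most 1000 documents or ~10 MB per bulk request, whichever limit is hit first
for ok, item in helpers.streaming_bulk(
    es,
    actions,
    chunk_size=1000,
    max_chunk_bytes=10 * 1024 * 1024,
):
    if not ok:
        print(f"Failed: {item}")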
Cause 3: Slow Disk I/O
Symptoms:
- Rejections correlate with disk utilization
- Long indexing times
- High iowait
Solutions (a settings sketch follows this list):
- Use SSDs
- Increase refresh interval
- Reduce merge thread count
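A hedged sketch of the index-level parts of this advice, applied with the Python client (my-index, the 30s interval, and the elasticsearch-py 7.x body= style are assumptions; index.merge.scheduler.max_thread_count: 1 is the usual recommendation for spinning disks):
# Assumption: tuning a write-heavy index named "my-index" on slow disks
es.indices.put_settings(
    index="my-index",
    body={
        "index.refresh_interval": "30s",              # refresh less frequently
        "index.merge.scheduler.max_thread_count": 1,  # fewer concurrent merge threads
    },
)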
Cause 4: Resource Contention
Symptoms:
- High CPU or memory usage during rejections
- GC pauses correlating with rejections
Solutions (a quick diagnostic sketch follows this list):
- Scale the cluster horizontally
- Ensure adequate heap
- Review memory-intensive operations
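To tell CPU saturation apart from memory and GC pressure, the hot threads API and JVM node stats are a quick first check. A minimal sketch using the es client from earlier (the 85% heap threshold is an assumption):
# Plain-text dump of what the busiest threads on each node are doing
print(es.nodes.hot_threads())

# Per-node heap usage; sustained values near the GC ceiling point to memory pressure
stats = es.nodes.stats(metric="jvm")
for node in stats["nodes"].values():
    heap_pct = node["jvm"]["mem"]["heap_used_percent"]
    if heap_pct > 85:  # assumption: alerting threshold
        print(f"{node['name']}: heap at {heap_pct}%")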
Cause 5: Mapping or Settings Issues
Symptoms:
- Rejections on specific indices
- Complex analyzers or mappings
Solutions (an example mapping follows this list):
- Simplify mappings
- Use keyword fields where appropriate
- Avoid dynamic mapping in production
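A hedged sketch of an explicit, strict mapping along those lines (the index name and fields are purely illustrative; es is the client from earlier):
# Assumption: illustrative index and fields only
es.indices.create(
    index="events-example",
    body={
        "mappings": {
            "dynamic": "strict",  # reject documents that introduce unmapped fields
            "properties": {
                "status": {"type": "keyword"},  # exact-match field, no analysis overhead
                "message": {"type": "text"},    # analyzed text only where full-text search is needed
                "timestamp": {"type": "date"},
            },
        }
    },
)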
Tuning Write Thread Pool
Increasing Queue Size (Stopgap Only)
Thread pool settings are static node settings and cannot be changed through the cluster settings API; set the value in elasticsearch.yml on each node and perform a rolling restart:
# elasticsearch.yml
thread_pool.write.queue_size: 20000
Warning: Larger queues increase memory usage and latency.
Increasing Thread Pool Size
For CPU-bound workloads (increasing this beyond the number of allocated processors rarely helps and is not recommended):
# elasticsearch.yml
thread_pool.write.size: 16 # keep in line with available CPU cores
Optimal Bulk Settings
PUT /my-index/_settings
{
"index.refresh_interval": "30s",
"index.translog.durability": "async",
"index.translog.sync_interval": "30s"
}
Warning: With "async" translog durability, writes acknowledged during the last sync_interval can be lost if a node crashes; use it only where that trade-off is acceptable.
Client-Side Best Practices
Implement Retry Logic
import time
from elasticsearch import helpers
from elasticsearch.helpers import BulkIndexError

max_retries = 3
for attempt in range(max_retries):
    try:
        helpers.bulk(es, actions)
        break
    except BulkIndexError:
        if attempt < max_retries - 1:
            time.sleep(2 ** attempt)  # exponential backoff: 1s, then 2s
        else:
            raise
Use Bulk Helpers with Threading
from elasticsearch.helpers import parallel_bulk
# Automatically handles threading and batching
for success, info in parallel_bulk(es, actions, thread_count=4, chunk_size=500):
if not success:
print(f"Failed: {info}")
Monitor Rejection Errors
from elasticsearch.exceptions import TransportError

try:
    response = es.bulk(body=actions)
    if response['errors']:
        for item in response['items']:
            error = item.get('index', {}).get('error', {})
            if error.get('type') == 'es_rejected_execution_exception':
                # This document hit a full write queue: back off and retry it
                pass
except TransportError as e:
    if e.status_code == 429:
        # Whole request rejected (HTTP 429 Too Many Requests): back off and retry
        pass
Monitoring and Alerting
Key Metrics
| Metric | Warning | Critical |
|---|---|---|
| Queue size | > 50% of capacity | > 80% of capacity |
| Rejections/min | > 10 | > 100 |
| Active threads | = pool size (brief spikes) | = pool size (sustained) |
Prometheus/Metrics Query
# Alert on rejection rate
rate(elasticsearch_thread_pool_rejected_count{name="write"}[5m]) > 1
Elasticsearch Monitoring
GET /_cat/thread_pool/write?v&format=json
Prevention Strategies
1. Right-Size Cluster
- Ensure adequate data nodes for indexing load
- Dedicated ingest nodes for heavy processing
- Plan for peak load, not average
2. Client-Side Rate Limiting
from ratelimit import limits, sleep_and_retry

@sleep_and_retry              # block until the rate window allows another call
@limits(calls=100, period=1)  # at most 100 bulk requests per second
def send_bulk(es, actions):
    return es.bulk(body=actions)
3. Queue Monitoring
Set up alerts before queues fill (a Python sketch follows the command below):
# Check current depth against configured capacity
GET /_cat/thread_pool/write?v&h=node_name,queue,queue_size
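A minimal version of that check with the Python client (the 50% threshold is an assumption; the cat API returns strings, hence the int conversions; es is the client from earlier):
# Warn when any node's write queue is more than half full
rows = es.cat.thread_pool(
    thread_pool_patterns="write",
    format="json",
    h="node_name,queue,queue_size",
)
for row in rows:
    depth, capacity = int(row["queue"]), int(row["queue_size"])
    if capacity and depth / capacity > 0.5:
        print(f"{row['node_name']}: write queue {depth}/{capacity}")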
4. Load Testing
Before production (a rough baseline sketch follows this list):
- Test maximum indexing rate
- Identify bottleneck (CPU, disk, network)
- Establish baseline metrics
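A rough sketch of measuring a throughput baseline (the synthetic documents, index name, and sizes are assumptions; for serious load testing a dedicated tool such as Rally is a better fit):
import time
from elasticsearch import helpers

def generate_docs(n):
    # Assumption: synthetic payloads; replace with production-like documents
    for i in range(n):
        yield {"_index": "loadtest-example", "_source": {"id": i, "body": "x" * 512}}

total = 100_000
start = time.time()
indexed = 0
for ok, _ in helpers.parallel_bulk(es, generate_docs(total), thread_count=4, chunk_size=1000):
    indexed += int(ok)
elapsed = time.time() - start
print(f"Indexed {indexed}/{total} docs in {elapsed:.1f}s ({indexed / elapsed:.0f} docs/s)")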
Recovery Procedure
When experiencing rejections:
1. Reduce incoming load:
- Pause non-critical indexing
- Reduce client concurrency
2. Check cluster health and thread pool state:
GET /_cluster/health
GET /_cat/thread_pool/write?v
3. Identify the bottleneck:
- CPU: Hot threads, scaling needed
- Disk: I/O optimization
- Memory: Heap tuning
4. Implement fixes:
- Short-term: Queue size, refresh interval
- Long-term: Scaling, client optimization
5. Monitor recovery:
GET /_nodes/stats/thread_pool/write