Write thread pool rejections occur when Elasticsearch cannot keep up with indexing requests, causing the write queue to overflow. This results in EsRejectedExecutionException errors (surfaced to clients as HTTP 429 responses) and failed indexing operations.
Understanding Write Thread Pool
Thread Pool Structure
- Size: Number of concurrent workers (default: number of processors)
- Queue: Buffer for pending requests (default: 10000)
- Rejection: When the queue is full, new requests are rejected (the sketch below shows how to check each node's configured limits)
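To see what each node is actually running with, you can read the per-node thread pool configuration. A minimal sketch with the Python client (the localhost URL is an assumption; field names follow the 7.x response format):
from elasticsearch import Elasticsearch

# Assumption: cluster reachable on localhost:9200
es = Elasticsearch("http://localhost:9200")

# Per-node thread pool configuration, including the write pool's size and queue_size
info = es.nodes.info(metric="thread_pool")
for node in info["nodes"].values():
    write_pool = node["thread_pool"]["write"]
    print(node["name"], "size:", write_pool.get("size"), "queue:", write_pool.get("queue_size"))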
Error Message
EsRejectedExecutionException: rejected execution of ...
on EsThreadPoolExecutor[write, queue capacity = 10000, ...]
Diagnosing Write Rejections
Check Thread Pool Status
GET /_cat/thread_pool/write?v&h=node_name,name,active,queue,rejected
Monitor Over Time
GET /_nodes/stats/thread_pool/write
Key metrics (a polling sketch follows this list):
- active: Currently executing threads
- queue: Requests waiting for a free thread
- rejected: Cumulative rejections since the node started
- completed: Successfully processed requests
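Because rejected is a cumulative counter (it only resets when the node restarts), the useful signal is its delta between samples. A minimal polling sketch, reusing the es client created above (the 60-second interval is an assumption):
import time

def write_rejections_per_node():
    """Return {node_name: cumulative rejected count} for the write pool."""
    stats = es.nodes.stats(metric="thread_pool")
    return {
        node["name"]: node["thread_pool"]["write"]["rejected"]
        for node in stats["nodes"].values()
    }

previous = write_rejections_per_node()
while True:
    time.sleep(60)  # sample once per minute
    current = write_rejections_per_node()
    for name, rejected in current.items():
        delta = rejected - previous.get(name, rejected)
        if delta > 0:
            print(f"{name}: {delta} write rejections in the last minute")
    previous = current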
Check Bulk Queue
GET /_cat/thread_pool/bulk?v # ES 6.2 and earlier (pool named bulk)
GET /_cat/thread_pool/write?v # ES 6.3+ (bulk was renamed to write)
Common Causes and Solutions
Cause 1: Indexing Rate Too High
Symptoms:
- Consistent queue near maximum
- High rejection rate
- Multiple clients sending bulk requests
Solutions:
- Reduce client concurrency:
# Python example: reuse a single client (and its connection pool) with retries enabled
from elasticsearch import Elasticsearch, helpers

es = Elasticsearch(
    max_retries=10,
    retry_on_timeout=True
)

# streaming_bulk yields one result per document and provides natural backpressure;
# the generator must be consumed for any indexing to happen
# (actions: an iterable of bulk actions, as in the helpers documentation)
for ok, item in helpers.streaming_bulk(es, actions, chunk_size=500):
    if not ok:
        print(f"Failed: {item}")
- Implement client-side throttling:
import time
from elasticsearch import helpers

def index_with_backoff(es, actions):
    # max_retries makes streaming_bulk retry documents rejected with HTTP 429;
    # raise_on_error=False yields remaining failures instead of raising
    for ok, item in helpers.streaming_bulk(
        es, actions, max_retries=5, raise_on_error=False
    ):
        if not ok:
            time.sleep(1)  # brief backoff after a failed document
Cause 2: Bulk Request Size Too Large
Symptoms:
- Large bulk requests taking too long
- Other requests queuing behind them
Solutions:
- Optimize bulk size (target 5-15 MB per request); a client-side sketch follows this list:
POST /_bulk
# Keep to 500-2000 documents per request
# Or 5-15 MB total payload
- Monitor bulk performance:
GET /_nodes/stats/indices/indexing
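With the Python bulk helpers, both limits can be enforced on the client side: chunk_size caps the document count and max_chunk_bytes caps the payload size. A sketch, with es and actions as in the earlier examples (the 10 MB figure is an assumption within the 5-15 MB target):
from elasticsearch import helpers

# Send at most 1000 documents or ~10 MB per bulk request, whichever limit is hit first
for ok, item in helpers.streaming_bulk(
    es,
    actions,
    chunk_size=1000,
    max_chunk_bytes=10 * 1024 * 1024,
):
    if not ok:
        print(f"Failed: {item}")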
Cause 3: Slow Disk I/O
Symptoms:
- Rejections correlate with disk utilization
- Long indexing times
- High iowait
Solutions (a settings sketch follows this list):
- Use SSDs
- Increase refresh interval
- Reduce merge thread count
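A hedged sketch of the index-level parts of this advice, applied with the Python client (my-index, the 30s interval, and the elasticsearch-py 7.x body= style are assumptions; index.merge.scheduler.max_thread_count: 1 is the usual recommendation for spinning disks):
# Assumption: tuning a write-heavy index named "my-index" on slow disks
es.indices.put_settings(
    index="my-index",
    body={
        "index.refresh_interval": "30s",              # refresh less frequently
        "index.merge.scheduler.max_thread_count": 1,  # fewer concurrent merge threads
    },
)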
Cause 4: Resource Contention
Symptoms:
- High CPU or memory usage during rejections
- GC pauses correlating with rejections
Solutions (a quick diagnostic sketch follows this list):
- Scale the cluster horizontally
- Ensure adequate heap
- Review memory-intensive operations
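To tell CPU saturation apart from memory and GC pressure, the hot threads API and JVM node stats are a quick first check. A minimal sketch using the es client from earlier (the 85% heap threshold is an assumption):
# Plain-text dump of what the busiest threads on each node are doing
print(es.nodes.hot_threads())

# Per-node heap usage; sustained values near the GC ceiling point to memory pressure
stats = es.nodes.stats(metric="jvm")
for node in stats["nodes"].values():
    heap_pct = node["jvm"]["mem"]["heap_used_percent"]
    if heap_pct > 85:  # assumption: alerting threshold
        print(f"{node['name']}: heap at {heap_pct}%")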
Cause 5: Mapping or Settings Issues
Symptoms:
- Rejections on specific indices
- Complex analyzers or mappings
Solutions (an example mapping follows this list):
- Simplify mappings
- Use keyword fields where appropriate
- Avoid dynamic mapping in production
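A hedged sketch of an explicit, strict mapping along those lines (the index name and fields are purely illustrative; es is the client from earlier):
# Assumption: illustrative index and fields only
es.indices.create(
    index="events-example",
    body={
        "mappings": {
            "dynamic": "strict",  # reject documents that introduce unmapped fields
            "properties": {
                "status": {"type": "keyword"},  # exact-match field, no analysis overhead
                "message": {"type": "text"},    # analyzed text only where full-text search is needed
                "timestamp": {"type": "date"},
            },
        }
    },
)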
Tuning Write Thread Pool
Increasing Queue Size (Stopgap Only)
Thread pool settings are static node settings and cannot be changed through the cluster settings API; set the value in elasticsearch.yml on each node and perform a rolling restart:
# elasticsearch.yml
thread_pool.write.queue_size: 20000
Warning: Larger queues increase memory usage and latency.
Increasing Thread Pool Size
For CPU-bound workloads (increasing this beyond the number of allocated processors rarely helps and is not recommended):
# elasticsearch.yml
thread_pool.write.size: 16 # keep in line with available CPU cores
Optimal Bulk Settings
PUT /my-index/_settings
{
"index.refresh_interval": "30s",
"index.translog.durability": "async",
"index.translog.sync_interval": "30s"
}
Warning: With "async" translog durability, writes acknowledged during the last sync_interval can be lost if a node crashes; use it only where that trade-off is acceptable.
Client-Side Best Practices
Implement Retry Logic
import time
from elasticsearch import helpers
from elasticsearch.helpers import BulkIndexError

max_retries = 3
for attempt in range(max_retries):
    try:
        helpers.bulk(es, actions)
        break
    except BulkIndexError:
        if attempt < max_retries - 1:
            time.sleep(2 ** attempt)  # exponential backoff: 1s, then 2s
        else:
            raise
Use Bulk Helpers with Threading
from elasticsearch.helpers import parallel_bulk
# Automatically handles threading and batching
for success, info in parallel_bulk(es, actions, thread_count=4, chunk_size=500):
if not success:
print(f"Failed: {info}")
Monitor Rejection Errors
from elasticsearch.exceptions import TransportError

try:
    response = es.bulk(body=actions)
    if response['errors']:
        for item in response['items']:
            error = item.get('index', {}).get('error', {})
            if error.get('type') == 'es_rejected_execution_exception':
                # This document hit a full write queue: back off and retry it
                pass
except TransportError as e:
    if e.status_code == 429:
        # Whole request rejected (HTTP 429 Too Many Requests): back off and retry
        pass
Monitoring and Alerting
Key Metrics
| Metric | Warning | Critical |
|---|---|---|
| Queue size | > 50% of capacity | > 80% of capacity |
| Rejections/min | > 10 | > 100 |
| Active threads | = pool size (brief spikes) | = pool size (sustained) |
Prometheus/Metrics Query
# Alert on rejection rate
rate(elasticsearch_thread_pool_rejected_count{name="write"}[5m]) > 1
Elasticsearch Monitoring
GET /_cat/thread_pool/write?v&format=json
Prevention Strategies
1. Right-Size Cluster
- Ensure adequate data nodes for indexing load
- Dedicated ingest nodes for heavy processing
- Plan for peak load, not average
2. Client-Side Rate Limiting
from ratelimit import limits, sleep_and_retry

@sleep_and_retry              # block until the rate window allows another call
@limits(calls=100, period=1)  # at most 100 bulk requests per second
def send_bulk(es, actions):
    return es.bulk(body=actions)
3. Queue Monitoring
Set up alerts before queues fill (a Python sketch follows the command below):
# Check current depth against configured capacity
GET /_cat/thread_pool/write?v&h=node_name,queue,queue_size
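A minimal version of that check with the Python client (the 50% threshold is an assumption; the cat API returns strings, hence the int conversions; es is the client from earlier):
# Warn when any node's write queue is more than half full
rows = es.cat.thread_pool(
    thread_pool_patterns="write",
    format="json",
    h="node_name,queue,queue_size",
)
for row in rows:
    depth, capacity = int(row["queue"]), int(row["queue_size"])
    if capacity and depth / capacity > 0.5:
        print(f"{row['node_name']}: write queue {depth}/{capacity}")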
4. Load Testing
Before production (a rough baseline sketch follows this list):
- Test maximum indexing rate
- Identify bottleneck (CPU, disk, network)
- Establish baseline metrics
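A rough sketch of measuring a throughput baseline (the synthetic documents, index name, and sizes are assumptions; for serious load testing a dedicated tool such as Rally is a better fit):
import time
from elasticsearch import helpers

def generate_docs(n):
    # Assumption: synthetic payloads; replace with production-like documents
    for i in range(n):
        yield {"_index": "loadtest-example", "_source": {"id": i, "body": "x" * 512}}

total = 100_000
start = time.time()
indexed = 0
for ok, _ in helpers.parallel_bulk(es, generate_docs(total), thread_count=4, chunk_size=1000):
    indexed += int(ok)
elapsed = time.time() - start
print(f"Indexed {indexed}/{total} docs in {elapsed:.1f}s ({indexed / elapsed:.0f} docs/s)")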
Recovery Procedure
When experiencing rejections:
1. Reduce incoming load:
- Pause non-critical indexing
- Reduce client concurrency
2. Check cluster health and thread pool state:
GET /_cluster/health
GET /_cat/thread_pool/write?v
3. Identify the bottleneck:
- CPU: Hot threads, scaling needed
- Disk: I/O optimization
- Memory: Heap tuning
4. Implement fixes:
- Short-term: Queue size, refresh interval
- Long-term: Scaling, client optimization
5. Monitor recovery:
GET /_nodes/stats/thread_pool/write