Elasticsearch Threadpool Write Queue Rejected Execution

Write thread pool rejections occur when Elasticsearch cannot keep up with indexing requests, causing the write queue to overflow. This results in EsRejectedExecutionException errors and failed indexing operations.

Understanding Write Thread Pool

Thread Pool Structure

  • Size: Number of concurrent workers (default: number of processors)
  • Queue: Buffer for pending requests (default: 10000)
  • Rejection: When queue is full, new requests are rejected

Error Message

EsRejectedExecutionException: rejected execution of ...
on EsThreadPoolExecutor[write, queue capacity = 10000, ...]

Diagnosing Write Rejections

Check Thread Pool Status

GET /_cat/thread_pool/write?v&h=node_name,name,active,queue,rejected

Monitor Over Time

GET /_nodes/stats/thread_pool/write

Key metrics:

  • active: Currently executing threads
  • queue: Requests waiting
  • rejected: Cumulative rejections
  • completed: Successfully processed
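
Because rejected is a cumulative counter, the useful signal is how fast it grows between samples. A minimal polling sketch with the official Python client (the 10-second interval and the output format are illustrative choices):

import time

from elasticsearch import Elasticsearch

es = Elasticsearch()

def watch_write_rejections(interval=10):
    """Print the per-interval growth of write thread pool rejections."""
    last = {}
    while True:
        stats = es.nodes.stats(metric="thread_pool")
        for node_id, node in stats["nodes"].items():
            write = node["thread_pool"]["write"]
            delta = write["rejected"] - last.get(node_id, write["rejected"])
            last[node_id] = write["rejected"]
            print(f'{node["name"]}: active={write["active"]} '
                  f'queue={write["queue"]} rejected=+{delta}')
        time.sleep(interval)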

Check Bulk Queue

GET /_cat/thread_pool/bulk?v  # ES 6.2 and earlier (bulk pool)
GET /_cat/thread_pool/write?v # ES 6.3 and later (write pool)

Common Causes and Solutions

Cause 1: Indexing Rate Too High

Symptoms:

  • Consistent queue near maximum
  • High rejection rate
  • Multiple clients sending bulk requests

Solutions:

  1. Reduce client concurrency (fewer simultaneous bulk requests per client):
# Python example: reuse one client with retries enabled
from elasticsearch import Elasticsearch, helpers

es = Elasticsearch(
    max_retries=10,
    retry_on_timeout=True
)

# streaming_bulk is a generator: it applies backpressure as you consume it
for ok, info in helpers.streaming_bulk(es, actions, chunk_size=500):
    if not ok:
        print(f"Failed: {info}")
  2. Implement client-side throttling and backoff:
import time

from elasticsearch import helpers

def index_with_backoff(es, actions):
    # max_retries makes the helper retry documents rejected with HTTP 429
    # (EsRejectedExecutionException), backing off exponentially between attempts
    for ok, info in helpers.streaming_bulk(es, actions, chunk_size=500,
                                           max_retries=5, initial_backoff=2,
                                           raise_on_error=False):
        if not ok:
            time.sleep(1)  # extra pause when a document still fails

Cause 2: Bulk Request Size Too Large

Symptoms:

  • Large bulk requests taking too long
  • Other requests queuing behind them

Solutions:

  1. Optimize bulk size (target 5-15 MB per request; see the chunking sketch below):
POST /_bulk
# Keep to 500-2000 documents per request
# or 5-15 MB total payload
  2. Monitor bulk performance:
GET /_nodes/stats/indices/indexing
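
If document sizes vary widely, cap both the document count and the payload size of each request. The Python bulk helpers support both limits; a minimal sketch (the index name, document generator, and 10 MB cap are illustrative values inside the 5-15 MB range above):

from elasticsearch import Elasticsearch, helpers

es = Elasticsearch()

# Replace with your real documents
actions = ({"_index": "my-index", "_source": {"field": i}} for i in range(10_000))

# chunk_size caps documents per request, max_chunk_bytes caps payload size
for ok, info in helpers.streaming_bulk(
    es,
    actions,
    chunk_size=1000,                   # within the 500-2000 document range
    max_chunk_bytes=10 * 1024 * 1024,  # ~10 MB per request
):
    if not ok:
        print(f"Failed: {info}")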

Cause 3: Slow Disk I/O

Symptoms:

  • Rejections correlate with disk utilization
  • Long indexing times
  • High iowait

Solutions:

  • Use SSDs
  • Increase refresh interval
  • Reduce merge thread count
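
On I/O-bound nodes, the last two points translate into index settings: refresh less often and limit merge concurrency. A hedged example via the Python client (my-index and the values are assumptions; a max_thread_count of 1 is the usual guidance for spinning disks, and depending on your version it may need to be set at index creation):

from elasticsearch import Elasticsearch

es = Elasticsearch()

es.indices.put_settings(
    index="my-index",
    body={
        "index.refresh_interval": "30s",              # refresh less frequently
        "index.merge.scheduler.max_thread_count": 1,  # limit concurrent merges
    },
)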

Cause 4: Resource Contention

Symptoms:

  • High CPU or memory usage during rejections
  • GC pauses correlating with rejections

Solutions:

  • Scale the cluster horizontally
  • Ensure adequate heap
  • Review memory-intensive operations
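
To tell CPU saturation from GC pressure, compare the hot threads output with JVM heap usage while rejections are happening. A quick sketch with the Python client:

from elasticsearch import Elasticsearch

es = Elasticsearch()

# What the busiest threads on each node are doing right now
print(es.nodes.hot_threads())

# Heap pressure per node; sustained high values point at GC trouble
for node in es.nodes.stats(metric="jvm")["nodes"].values():
    heap = node["jvm"]["mem"]["heap_used_percent"]
    print(f'{node["name"]}: heap_used={heap}%')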

Cause 5: Mapping or Settings Issues

Symptoms:

  • Rejections on specific indices
  • Complex analyzers or mappings

Solutions:

  • Simplify mappings
  • Use keyword fields where appropriate
  • Avoid dynamic mapping in production
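
For example, a mapping that disables dynamic field creation and uses keyword fields for identifiers keeps per-document analysis cheap. A sketch with illustrative index and field names:

from elasticsearch import Elasticsearch

es = Elasticsearch()

es.indices.create(
    index="events-v1",
    body={
        "mappings": {
            "dynamic": "strict",  # reject unmapped fields instead of guessing types
            "properties": {
                "user_id": {"type": "keyword"},  # exact match only, no analysis
                "message": {"type": "text"},
                "timestamp": {"type": "date"},
            },
        }
    },
)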

Tuning Write Thread Pool

Increasing Queue Size (Stopgap)

Thread pool settings are static node settings, so they go in elasticsearch.yml and take effect after a node restart:

# elasticsearch.yml
thread_pool.write.queue_size: 20000

Warning: Larger queues increase memory usage and latency; they absorb bursts but do not fix a sustained throughput shortfall.

Increasing Thread Pool Size

If nodes have spare CPU capacity during indexing:

# elasticsearch.yml
thread_pool.write.size: 16  # Adjust based on CPU cores

Optimal Bulk Settings

PUT /my-index/_settings
{
  "index.refresh_interval": "30s",
  "index.translog.durability": "async",
  "index.translog.sync_interval": "30s"
}
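
Note that async translog durability trades safety for speed: writes acknowledged since the last sync can be lost if a node crashes. A common pattern is to relax the settings only for the duration of a bulk load and restore the defaults afterwards; a sketch (the index name and the run_bulk_load placeholder are assumptions):

from elasticsearch import Elasticsearch

es = Elasticsearch()

BULK_LOAD = {
    "index.refresh_interval": "30s",
    "index.translog.durability": "async",
}
DEFAULTS = {
    "index.refresh_interval": "1s",          # default refresh
    "index.translog.durability": "request",  # default: fsync before acknowledging
}

es.indices.put_settings(index="my-index", body=BULK_LOAD)
try:
    run_bulk_load()  # placeholder for your indexing job
finally:
    es.indices.put_settings(index="my-index", body=DEFAULTS)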

Client-Side Best Practices

Implement Retry Logic

import time

from elasticsearch import helpers
from elasticsearch.helpers import BulkIndexError

max_retries = 3
for attempt in range(max_retries):
    try:
        helpers.bulk(es, actions)
        break
    except BulkIndexError:
        if attempt < max_retries - 1:
            time.sleep(2 ** attempt)  # Exponential backoff
        else:
            raise

Use Bulk Helpers with Threading

from elasticsearch.helpers import parallel_bulk

# Batches and threads the bulk requests; raise_on_error=False yields
# failed items instead of raising, so they can be handled here
for success, info in parallel_bulk(es, actions, thread_count=4, chunk_size=500,
                                   raise_on_error=False):
    if not success:
        print(f"Failed: {info}")

Monitor Rejection Errors

from elasticsearch.exceptions import TransportError

try:
    response = es.bulk(body=actions)
    if response['errors']:
        for item in response['items']:
            if 'error' in item.get('index', {}):
                error = item['index']['error']
                if 'rejected_execution' in str(error):
                    # Backoff and retry
                    pass
except TransportError as e:
    if 'rejected_execution' in str(e):
        # Full queue rejection
        pass

Monitoring and Alerting

Key Metrics

Metric            Warning               Critical
Queue size        > 50% of capacity     > 80% of capacity
Rejections/min    > 10                  > 100
Active threads    = thread_pool.size    sustained at thread_pool.size

Prometheus/Metrics Query

# Alert on rejection rate
rate(elasticsearch_thread_pool_rejected_count{name="write"}[5m]) > 1

Elasticsearch Monitoring

GET /_cat/thread_pool/write?v&format=json

Prevention Strategies

1. Right-Size Cluster

  • Ensure adequate data nodes for indexing load
  • Dedicated ingest nodes for heavy processing
  • Plan for peak load, not average

2. Client-Side Rate Limiting

from ratelimit import limits, sleep_and_retry

@sleep_and_retry               # sleep until the next period instead of raising
@limits(calls=100, period=1)   # at most 100 bulk requests per second
def send_bulk(es, actions):
    return es.bulk(body=actions)

3. Queue Monitoring

Set up alerts before queues fill:

# Check queue depth against queue capacity
GET /_cat/thread_pool/write?v&h=node_name,queue,queue_size
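
A small script can turn that output into an early warning by comparing queue depth with queue capacity on each node. A minimal sketch (the 50% threshold mirrors the table above):

from elasticsearch import Elasticsearch

es = Elasticsearch()

rows = es.cat.thread_pool(
    thread_pool_patterns="write",
    h="node_name,queue,queue_size",
    format="json",
)
for row in rows:
    fill = int(row["queue"]) / int(row["queue_size"])
    if fill > 0.5:
        print(f'WARNING {row["node_name"]}: write queue {fill:.0%} full')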

4. Load Testing

Before production:

  • Test maximum indexing rate
  • Identify bottleneck (CPU, disk, network)
  • Establish baseline metrics

Recovery Procedure

When experiencing rejections:

  1. Reduce incoming load:

    • Pause non-critical indexing
    • Reduce client concurrency
  2. Check cluster health:

    GET /_cluster/health
    GET /_cat/thread_pool/write?v
    
  3. Identify bottleneck:

    • CPU: Hot threads, scaling needed
    • Disk: I/O optimization
    • Memory: Heap tuning
  4. Implement fixes:

    • Short-term: Queue size, refresh interval
    • Long-term: Scaling, client optimization
  5. Monitor recovery:

    GET /_nodes/stats/thread_pool/write
    