Elasticsearch Cluster Performance Checklist

Use this checklist to verify your Elasticsearch cluster is configured for optimal performance. Each item includes the rationale and how to verify compliance.

Hardware and Infrastructure

Storage

  • Using SSDs for data nodes

    • HDDs significantly limit performance
    • Verify: Check disk type on each node
  • Dedicated storage for Elasticsearch

    • Avoid shared storage or network filesystems
    • Exception: Frozen tier can use object storage
  • Adequate disk space with headroom

    • Target: < 80% utilization
    • Check: GET /_cat/allocation?v

Memory

  • Heap size is 50% of RAM (capped at ~31 GB)
    • Remaining memory is left for the operating system's filesystem cache
    • Check: GET /_nodes/stats/jvm

Important: Heap should be about half of RAM, but must stay below the ~32 GB compressed object pointers threshold; 31 GB is a common safe ceiling.

  • Min heap equals max heap (-Xms = -Xmx)

    • Prevents heap resizing pauses
    • Check: GET /_nodes/jvm
  • Memory lock enabled

    • Prevents swapping
    • bootstrap.memory_lock: true
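
As a concrete sketch of the heap and memory-lock items above (a 64 GB node is assumed here purely for illustration):

    # jvm.options (or a file under jvm.options.d/):
    # min heap = max heap, ~50% of RAM, below the ~32 GB compressed-oops threshold
    -Xms31g
    -Xmx31g

    # elasticsearch.yml: lock the heap in memory so it can never be swapped out
    bootstrap.memory_lock: true

Note that memory_lock only takes effect if the elasticsearch user's memlock ulimit allows it (unlimited is typical).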

CPU

  • Adequate cores for workload
    • Minimum: 8 cores per data node
    • Check: GET /_cat/nodes?v&h=name,cpu,load_1m

Network

  • Low latency between nodes (< 1ms)

    • Same availability zone preferred
    • Test: ping between nodes
  • Adequate bandwidth

    • 10 Gbps recommended for large clusters

Cluster Configuration

Node Roles

  • Dedicated master nodes (for clusters > 5 data nodes)

    • Minimum 3 master-eligible nodes
    • node.roles: [master]
  • Appropriate node roles assigned

    • Hot/warm/cold tiers configured
    • Coordinating nodes for heavy query loads
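
For illustration, these elasticsearch.yml role assignments correspond to the items above:

    # Dedicated master-eligible node
    node.roles: [ master ]

    # Hot-tier data node that also runs ingest pipelines
    node.roles: [ data_hot, ingest ]

    # Dedicated coordinating node (an empty roles list)
    node.roles: [ ]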

Shard Configuration

  • Shard sizes between 10 and 50 GB

    • Check: GET /_cat/shards?v&h=index,shard,store&s=store:desc
  • Total shards manageable

    • < 1000 shards per node
    • Check: GET /_cluster/stats?filter_path=indices.shards.total
  • Appropriate replica count

    • Usually 1 for production
    • Check: GET /_cat/indices?v&h=index,pri,rep
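
A minimal index template sketch that bakes these targets in (the template name and index pattern are hypothetical):

    PUT /_index_template/logs-template
    {
      "index_patterns": ["logs-*"],
      "template": {
        "settings": {
          "index.number_of_shards": 3,
          "index.number_of_replicas": 1
        }
      }
    }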

Discovery and Recovery

  • Discovery properly configured

    discovery.seed_hosts: [...]
    
  • Recovery settings tuned

    indices.recovery.max_bytes_per_sec
    cluster.routing.allocation.node_concurrent_recoveries
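
Both recovery settings are dynamic, so they can be adjusted on a live cluster; the values below are illustrative, not recommendations:

    PUT /_cluster/settings
    {
      "persistent": {
        "indices.recovery.max_bytes_per_sec": "100mb",
        "cluster.routing.allocation.node_concurrent_recoveries": 2
      }
    }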
    

Index Settings

Refresh Interval

  • Appropriate refresh interval
    • Default 1s, increase for write-heavy workloads
    • Check: GET /my-index/_settings?filter_path=*.settings.index.refresh_interval
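
For a write-heavy index, the refresh interval can be relaxed with a dynamic settings update (30s is an illustrative value):

    PUT /my-index/_settings
    {
      "index.refresh_interval": "30s"
    }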

Translog

  • Translog durability appropriate
    • request (the default) fsyncs every operation for durability; async fsyncs on an interval, trading a small window of potential data loss for throughput
    • Check: GET /my-index/_settings?filter_path=*.settings.index.translog
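
A sketch of opting an index into async durability; with this setting, up to index.translog.sync_interval (default 5s) of acknowledged writes can be lost if a node crashes:

    PUT /my-index/_settings
    {
      "index.translog.durability": "async"
    }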

Replicas

  • Replicas configured for availability
    • At least 1 for production
    • 0 during bulk loading, then increase
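
A common bulk-load pattern, sketched against a hypothetical my-index:

    # Before the bulk load: drop replicas
    PUT /my-index/_settings
    { "index.number_of_replicas": 0 }

    # After the load completes and availability matters again
    PUT /my-index/_settings
    { "index.number_of_replicas": 1 }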

Query and Indexing Performance

Query Optimization

  • Slow query logging enabled

    PUT /my-index/_settings
    {
      "index.search.slowlog.threshold.query.warn": "10s"
    }
    
  • Using filter context instead of query context where relevance scoring is not needed (see the sketch after this list)

    • Filter clauses skip scoring and can be cached; scored query clauses are recomputed on every request
  • No deep pagination (use search_after)

  • Query timeouts configured
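
As a sketch, non-scoring conditions belong in a bool query's filter clause; the field names here are hypothetical:

    GET /my-index/_search
    {
      "query": {
        "bool": {
          "filter": [
            { "term": { "status": "published" } },
            { "range": { "@timestamp": { "gte": "now-1d/d" } } }
          ]
        }
      }
    }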

Indexing Optimization

  • Bulk API used for high-volume indexing (see the sketch after this list)

    • Target 5-15 MB per bulk request
  • Indexing client has retry logic

    • Handle 429 (too many requests) gracefully
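
For reference, a minimal _bulk request body is newline-delimited JSON, alternating action lines and document lines (the index name and fields are hypothetical):

    POST /_bulk
    { "index": { "_index": "my-index" } }
    { "@timestamp": "2024-01-01T00:00:00Z", "message": "first event" }
    { "index": { "_index": "my-index" } }
    { "@timestamp": "2024-01-01T00:00:01Z", "message": "second event" }

Batch documents until the body reaches roughly 5-15 MB, and back off and retry when the response, or individual items within it, come back with status 429.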

Monitoring and Alerting

Essential Monitoring

  • Cluster health monitored

    • Alert on yellow/red status
  • Node metrics collected

    • CPU, memory, disk, network
  • JVM metrics tracked

    • Heap usage, GC frequency/duration
  • Thread pool metrics

    • Queue sizes, rejections

Alert Thresholds

  • Heap usage alert at 75% (warning), 85% (critical)
  • Disk usage alert at 75% (warning), 80% (critical)
  • CPU usage alert at 80% (warning), 90% (critical)
  • Thread pool rejections: alert on any rejection

Data Management

Index Lifecycle Management

  • ILM policy configured

    • Rollover, shrink, delete phases
    • Check: GET _ilm/policy
  • Data retention policy

    • Delete old data automatically
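
A minimal ILM policy sketch with rollover and delete phases (the policy name and thresholds are examples; max_primary_shard_size requires a reasonably recent 7.x release):

    PUT /_ilm/policy/logs-policy
    {
      "policy": {
        "phases": {
          "hot": {
            "actions": {
              "rollover": {
                "max_primary_shard_size": "50gb",
                "max_age": "7d"
              }
            }
          },
          "delete": {
            "min_age": "30d",
            "actions": { "delete": {} }
          }
        }
      }
    }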

Backup and Recovery

  • Snapshot repository configured

    • Check: GET /_snapshot
  • Regular snapshots scheduled

    • Check: GET /_slm/policy
  • Restore tested

    • Verify backups actually work
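
For example, a nightly SLM policy; the policy name, schedule, and repository are illustrative, and the repository must already be registered:

    PUT /_slm/policy/nightly-snapshots
    {
      "schedule": "0 30 1 * * ?",
      "name": "<nightly-snap-{now/d}>",
      "repository": "my_repository",
      "config": { "indices": "*" },
      "retention": {
        "expire_after": "30d",
        "min_count": 5,
        "max_count": 50
      }
    }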

Security

  • Authentication enabled

    • Check: Security features active
  • TLS configured

    • Transport and HTTP layers
  • Minimal privileges

    • Users have only required permissions
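
On the configuration side, the first two items map to elasticsearch.yml settings like these (certificate paths are placeholders):

    xpack.security.enabled: true

    # TLS on the transport layer (node-to-node traffic)
    xpack.security.transport.ssl.enabled: true
    xpack.security.transport.ssl.keystore.path: certs/elastic-certificates.p12
    xpack.security.transport.ssl.truststore.path: certs/elastic-certificates.p12

    # TLS on the HTTP layer (client traffic)
    xpack.security.http.ssl.enabled: true
    xpack.security.http.ssl.keystore.path: certs/http.p12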

Circuit Breakers

  • Circuit breakers configured

    PUT /_cluster/settings
    {
      "persistent": {
        "indices.breaker.total.limit": "70%"
      }
    }
    
  • Real memory tracking enabled (the default in 7.x+)

    "indices.breaker.total.use_real_memory": true
    

Performance Verification

Run These Checks

# Cluster health
GET /_cluster/health?pretty

# Node stats
GET /_cat/nodes?v&h=name,heap.percent,cpu,load_1m,disk.used_percent

# Thread pools
GET /_cat/thread_pool?v&h=node_name,name,active,queue,rejected

# Hot threads
GET /_nodes/hot_threads

# Pending tasks
GET /_cluster/pending_tasks

Healthy Cluster Indicators

Metric               Healthy    Investigate
------               -------    -----------
Cluster status       Green      Yellow / Red
Heap usage           < 75%      > 85%
CPU usage            < 70%      > 85%
Disk usage           < 80%      > 85%
GC time              < 5%       > 10%
Thread rejections    0          > 0
Pending tasks        0          > 10