Elasticsearch Cluster Performance Checklist

Use this checklist to verify your Elasticsearch cluster is configured for optimal performance. Each item includes the rationale and how to verify compliance.

Hardware and Infrastructure

Storage

  • Using SSDs for data nodes

    • HDDs significantly limit performance
    • Verify: Check disk type on each node
  • Dedicated storage for Elasticsearch

    • Avoid shared storage or network filesystems
    • Exception: Frozen tier can use object storage
  • Adequate disk space with headroom

    • Target: < 80% utilization
    • Check: GET /_cat/allocation?v

Memory

  • Heap size is 50% of RAM (capped at ~31 GB)
    • Remaining memory is left for the operating system's filesystem cache
    • Check: GET /_nodes/stats/jvm

Important: Heap should be about half of RAM, but must stay below the ~32 GB compressed object pointers threshold; 31 GB is a common safe ceiling.

  • Min heap equals max heap (-Xms = -Xmx)

    • Prevents heap resizing pauses
    • Check: GET /_nodes/jvm
  • Memory lock enabled

    • Prevents swapping
    • bootstrap.memory_lock: true
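
As a concrete sketch of the heap and memory-lock items above (a 64 GB node is assumed here purely for illustration):

    # jvm.options (or a file under jvm.options.d/):
    # min heap = max heap, ~50% of RAM, below the ~32 GB compressed-oops threshold
    -Xms31g
    -Xmx31g

    # elasticsearch.yml: lock the heap in memory so it can never be swapped out
    bootstrap.memory_lock: true

Note that memory_lock only takes effect if the elasticsearch user's memlock ulimit allows it (unlimited is typical).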

CPU

  • Adequate cores for workload
    • Minimum: 8 cores per data node
    • Check: GET /_cat/nodes?v&h=name,cpu,load_1m

Network

  • Low latency between nodes (< 1ms)

    • Same availability zone preferred
    • Test: ping between nodes
  • Adequate bandwidth

    • 10 Gbps recommended for large clusters

Cluster Configuration

Node Roles

  • Dedicated master nodes (for clusters > 5 data nodes)

    • Minimum 3 master-eligible nodes
    • node.roles: [master]
  • Appropriate node roles assigned

    • Hot/warm/cold tiers configured
    • Coordinating nodes for heavy query loads
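
For illustration, these elasticsearch.yml role assignments correspond to the items above:

    # Dedicated master-eligible node
    node.roles: [ master ]

    # Hot-tier data node that also runs ingest pipelines
    node.roles: [ data_hot, ingest ]

    # Dedicated coordinating node (an empty roles list)
    node.roles: [ ]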

Shard Configuration

  • Shard sizes between 10 and 50 GB

    • Check: GET /_cat/shards?v&h=index,shard,store&s=store:desc
  • Total shards manageable

    • < 1000 shards per node
    • Check: GET /_cluster/stats?filter_path=indices.shards.total
  • Appropriate replica count

    • Usually 1 for production
    • Check: GET /_cat/indices?v&h=index,pri,rep
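
A minimal index template sketch that bakes these targets in (the template name and index pattern are hypothetical):

    PUT /_index_template/logs-template
    {
      "index_patterns": ["logs-*"],
      "template": {
        "settings": {
          "index.number_of_shards": 3,
          "index.number_of_replicas": 1
        }
      }
    }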

Discovery and Recovery

  • Discovery properly configured

    discovery.seed_hosts: [...]
    
  • Recovery settings tuned

    indices.recovery.max_bytes_per_sec
    cluster.routing.allocation.node_concurrent_recoveries
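
Both recovery settings are dynamic, so they can be adjusted on a live cluster; the values below are illustrative, not recommendations:

    PUT /_cluster/settings
    {
      "persistent": {
        "indices.recovery.max_bytes_per_sec": "100mb",
        "cluster.routing.allocation.node_concurrent_recoveries": 2
      }
    }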
    

Index Settings

Refresh Interval

  • Appropriate refresh interval
    • Default 1s, increase for write-heavy workloads
    • Check: GET /my-index/_settings?filter_path=*.settings.index.refresh_interval
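
For a write-heavy index, the refresh interval can be relaxed with a dynamic settings update (30s is an illustrative value):

    PUT /my-index/_settings
    {
      "index.refresh_interval": "30s"
    }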

Translog

  • Translog durability appropriate
    • request (the default) fsyncs every operation for durability; async fsyncs on an interval, trading a small window of potential data loss for throughput
    • Check: GET /my-index/_settings?filter_path=*.settings.index.translog
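
A sketch of opting an index into async durability; with this setting, up to index.translog.sync_interval (default 5s) of acknowledged writes can be lost if a node crashes:

    PUT /my-index/_settings
    {
      "index.translog.durability": "async"
    }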

Replicas

  • Replicas configured for availability
    • At least 1 for production
    • 0 during bulk loading, then increase
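
A common bulk-load pattern, sketched against a hypothetical my-index:

    # Before the bulk load: drop replicas
    PUT /my-index/_settings
    { "index.number_of_replicas": 0 }

    # After the load completes and availability matters again
    PUT /my-index/_settings
    { "index.number_of_replicas": 1 }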

Query and Indexing Performance

Query Optimization

  • Slow query logging enabled

    PUT /my-index/_settings
    {
      "index.search.slowlog.threshold.query.warn": "10s"
    }
    
  • Using filter context instead of query context where relevance scoring is not needed (see the sketch after this list)

    • Filter clauses skip scoring and can be cached; scored query clauses are recomputed on every request
  • No deep pagination (use search_after)

  • Query timeouts configured
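
As a sketch, non-scoring conditions belong in a bool query's filter clause; the field names here are hypothetical:

    GET /my-index/_search
    {
      "query": {
        "bool": {
          "filter": [
            { "term": { "status": "published" } },
            { "range": { "@timestamp": { "gte": "now-1d/d" } } }
          ]
        }
      }
    }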

Indexing Optimization

  • Bulk API used for high-volume indexing (see the sketch after this list)

    • Target 5-15 MB per bulk request
  • Indexing client has retry logic

    • Handle 429 (too many requests) gracefully
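
For reference, a minimal _bulk request body is newline-delimited JSON, alternating action lines and document lines (the index name and fields are hypothetical):

    POST /_bulk
    { "index": { "_index": "my-index" } }
    { "@timestamp": "2024-01-01T00:00:00Z", "message": "first event" }
    { "index": { "_index": "my-index" } }
    { "@timestamp": "2024-01-01T00:00:01Z", "message": "second event" }

Batch documents until the body reaches roughly 5-15 MB, and back off and retry when the response, or individual items within it, come back with status 429.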

Monitoring and Alerting

Essential Monitoring

  • Cluster health monitored

    • Alert on yellow/red status
  • Node metrics collected

    • CPU, memory, disk, network
  • JVM metrics tracked

    • Heap usage, GC frequency/duration
  • Thread pool metrics

    • Queue sizes, rejections

Alert Thresholds

  • Heap usage alert at 75% (warning), 85% (critical)
  • Disk usage alert at 75% (warning), 80% (critical)
  • CPU usage alert at 80% (warning), 90% (critical)
  • Thread pool rejections: alert on any rejection

Data Management

Index Lifecycle Management

  • ILM policy configured

    • Rollover, shrink, delete phases
    • Check: GET _ilm/policy
  • Data retention policy

    • Delete old data automatically
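
A minimal ILM policy sketch with rollover and delete phases (the policy name and thresholds are examples; max_primary_shard_size requires a reasonably recent 7.x release):

    PUT /_ilm/policy/logs-policy
    {
      "policy": {
        "phases": {
          "hot": {
            "actions": {
              "rollover": {
                "max_primary_shard_size": "50gb",
                "max_age": "7d"
              }
            }
          },
          "delete": {
            "min_age": "30d",
            "actions": { "delete": {} }
          }
        }
      }
    }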

Backup and Recovery

  • Snapshot repository configured

    • Check: GET /_snapshot
  • Regular snapshots scheduled

    • Check: GET /_slm/policy
  • Restore tested

    • Verify backups actually work
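
For example, a nightly SLM policy; the policy name, schedule, and repository are illustrative, and the repository must already be registered:

    PUT /_slm/policy/nightly-snapshots
    {
      "schedule": "0 30 1 * * ?",
      "name": "<nightly-snap-{now/d}>",
      "repository": "my_repository",
      "config": { "indices": "*" },
      "retention": {
        "expire_after": "30d",
        "min_count": 5,
        "max_count": 50
      }
    }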

Security

  • Authentication enabled

    • Check: Security features active
  • TLS configured

    • Transport and HTTP layers
  • Minimal privileges

    • Users have only required permissions
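
On the configuration side, the first two items map to elasticsearch.yml settings like these (certificate paths are placeholders):

    xpack.security.enabled: true

    # TLS on the transport layer (node-to-node traffic)
    xpack.security.transport.ssl.enabled: true
    xpack.security.transport.ssl.keystore.path: certs/elastic-certificates.p12
    xpack.security.transport.ssl.truststore.path: certs/elastic-certificates.p12

    # TLS on the HTTP layer (client traffic)
    xpack.security.http.ssl.enabled: true
    xpack.security.http.ssl.keystore.path: certs/http.p12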

Circuit Breakers

  • Circuit breakers configured

    PUT /_cluster/settings
    {
      "persistent": {
        "indices.breaker.total.limit": "70%"
      }
    }
    
  • Real memory tracking enabled (the default in 7.x+)

    "indices.breaker.total.use_real_memory": true
    

Performance Verification

Run These Checks

# Cluster health
GET /_cluster/health?pretty

# Node stats
GET /_cat/nodes?v&h=name,heap.percent,cpu,load_1m,disk.used_percent

# Thread pools
GET /_cat/thread_pool?v&h=node_name,name,active,queue,rejected

# Hot threads
GET /_nodes/hot_threads

# Pending tasks
GET /_cluster/pending_tasks

Healthy Cluster Indicators

Metric               Healthy    Investigate
------               -------    -----------
Cluster status       Green      Yellow / Red
Heap usage           < 75%      > 85%
CPU usage            < 70%      > 85%
Disk usage           < 80%      > 85%
GC time              < 5%       > 10%
Thread rejections    0          > 0
Pending tasks        0          > 10