Use this checklist to verify your Elasticsearch cluster is configured for optimal performance. Each item includes the rationale and how to verify compliance.
Hardware and Infrastructure
Storage
Using SSDs for data nodes
- HDDs significantly limit performance
- Verify: Check disk type on each node
Dedicated storage for Elasticsearch
- Avoid shared storage or network filesystems
- Exception: Frozen tier can use object storage
Adequate disk space with headroom
- Target: < 80% utilization
- Check:
GET /_cat/allocation?v
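If nodes are creeping past the 80% target, the disk-based shard allocation watermarks can be tightened so Elasticsearch stops placing shards on full nodes earlier. A minimal sketch using dynamic cluster settings; the percentages are illustrative, not the defaults (low 85%, high 90%, flood_stage 95%):

PUT /_cluster/settings
{
  "persistent": {
    "cluster.routing.allocation.disk.watermark.low": "80%",
    "cluster.routing.allocation.disk.watermark.high": "85%",
    "cluster.routing.allocation.disk.watermark.flood_stage": "95%"
  }
}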
Memory
- Heap size is 50% of RAM (max 31 GB)
- Remaining memory for filesystem cache
- Check:
GET /_nodes/stats/jvm
Important: Heap should be about half of RAM and stay below the ~32 GB compressed-oops threshold, which is why the practical cap is around 31 GB.
Min heap equals max heap (-Xms = -Xmx)
- Prevents heap resizing pauses
- Check:
GET /_nodes/jvm
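Set -Xms and -Xmx to the same value in jvm.options (or a file under jvm.options.d/), then confirm that the initial and max heap the nodes actually report are identical. A quick check; the filter_path simply trims the response to the relevant fields:

# heap_init_in_bytes and heap_max_in_bytes should match on every node
GET /_nodes/jvm?filter_path=nodes.*.name,nodes.*.jvm.mem.heap_init_in_bytes,nodes.*.jvm.mem.heap_max_in_bytes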
Memory lock enabled
- Prevents swapping
bootstrap.memory_lock: true
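Because the memory lock can be silently refused by the operating system, verify it actually took effect after restart; each node should report mlockall as true:

# false here means the lock was denied (check memlock ulimits / systemd LimitMEMLOCK)
GET /_nodes?filter_path=**.mlockall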
CPU
- Adequate cores for workload
- Minimum: 8 cores per data node
- Check:
GET /_cat/nodes?v&h=name,cpu,load_1m
Network
Low latency between nodes (< 1ms)
- Same availability zone preferred
- Test:
ping between nodes
Adequate bandwidth
- 10 Gbps recommended for large clusters
Cluster Configuration
Node Roles
Dedicated master nodes (for clusters > 5 data nodes)
- Minimum 3 master-eligible nodes
node.roles: [master]
Appropriate node roles assigned
- Hot/warm/cold tiers configured
- Coordinating nodes for heavy query loads
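To see how roles ended up distributed, the cat nodes API lists the abbreviated role letters per node. A sketch; in the output, h/w/c indicate hot/warm/cold data tiers, m marks master-eligible nodes, and * in the master column marks the elected master:

GET /_cat/nodes?v&h=name,node.role,master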
Shard Configuration
Shard size between 10-50 GB
- Check:
GET /_cat/shards?v&h=index,shard,store&s=store:desc
Total shards manageable
- < 1000 shards per node (the default cluster.max_shards_per_node limit)
- Check:
GET /_cluster/stats?filter_path=indices.shards.total
Appropriate replica count
- Usually 1 for production
- Check:
GET /_cat/indices?v&h=index,pri,rep
Discovery and Recovery
Discovery properly configured
discovery.seed_hosts: [...]
Recovery settings tuned
indices.recovery.max_bytes_per_sec
cluster.routing.allocation.node_concurrent_recoveries
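A hedged example of applying the two recovery settings above as dynamic cluster settings; 100mb and 4 are placeholder values to size against your hardware and network, not universal recommendations:

PUT /_cluster/settings
{
  "persistent": {
    "indices.recovery.max_bytes_per_sec": "100mb",
    "cluster.routing.allocation.node_concurrent_recoveries": 4
  }
}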
Index Settings
Refresh Interval
- Appropriate refresh interval
- Default 1s, increase for write-heavy workloads
- Check:
GET /my-index/_settings?filter_path=*.settings.index.refresh_interval
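For a write-heavy index, relaxing the refresh interval from the 1s default reduces segment churn at the cost of search visibility lag. A sketch; my-index and 30s are placeholders:

PUT /my-index/_settings
{ "index.refresh_interval": "30s" }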
Translog
- Translog durability appropriate
- request for durability, async for performance
- Check:
GET /my-index/_settings?filter_path=*.settings.index.translog
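A sketch of switching an index to async translog durability, which acknowledges writes before the translog is fsynced (it is then synced on an interval, 5s by default), trading a small window of potential data loss for indexing throughput; my-index is a placeholder:

PUT /my-index/_settings
{ "index.translog.durability": "async" }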
Replicas
- Replicas configured for availability
- At least 1 for production
- 0 during bulk loading, then increase
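A common bulk-load pattern (my-index is a placeholder): drop replicas before the load so documents are indexed once, then restore them afterwards and let the cluster copy the finished primaries:

# before the bulk load
PUT /my-index/_settings
{ "index.number_of_replicas": 0 }

# after the load completes
PUT /my-index/_settings
{ "index.number_of_replicas": 1 }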
Query and Indexing Performance
Query Optimization
Slow query logging enabled
PUT /my-index/_settings
{ "index.search.slowlog.threshold.query.warn": "10s" }
Using filters instead of queries where appropriate
- Filter clauses skip scoring and are cacheable; scoring (query) clauses are recomputed on every search
No deep pagination (use search_after)
Query timeouts configured
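A sketch of a search that keeps exact-match and range clauses in filter context (cacheable, not scored), sets a per-request timeout, and leaves scoring to the full-text clause; the index name, fields, and values are illustrative:

GET /my-index/_search
{
  "timeout": "10s",
  "query": {
    "bool": {
      "must": [
        { "match": { "message": "error" } }
      ],
      "filter": [
        { "term": { "status": "500" } },
        { "range": { "@timestamp": { "gte": "now-1h" } } }
      ]
    }
  }
}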
Indexing Optimization
Bulk API used for high-volume indexing
- Target 5-15 MB per bulk request
Indexing client has retry logic
- Handle 429 (too many requests) gracefully
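A minimal _bulk request in the newline-delimited format; in practice the client should batch documents until the payload is roughly 5-15 MB, inspect the errors flag in the response, and back off and retry on HTTP 429. The index and fields are placeholders:

POST /_bulk
{ "index": { "_index": "my-index" } }
{ "@timestamp": "2024-01-01T00:00:00Z", "message": "first event" }
{ "index": { "_index": "my-index" } }
{ "@timestamp": "2024-01-01T00:00:01Z", "message": "second event" }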
Monitoring and Alerting
Essential Monitoring
Cluster health monitored
- Alert on yellow/red status
Node metrics collected
- CPU, memory, disk, network
JVM metrics tracked
- Heap usage, GC frequency/duration
Thread pool metrics
- Queue sizes, rejections
Alert Thresholds
- Heap usage alert at 75% (warning), 85% (critical)
- Disk usage alert at 75% (warning), 80% (critical)
- CPU usage alert at 80% (warning), 90% (critical)
- Thread pool rejections alert on any
Data Management
Index Lifecycle Management
ILM policy configured
- Rollover, shrink, delete phases
- Check:
GET _ilm/policy
Data retention policy
- Delete old data automatically
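A hedged sketch of an ILM policy with hot rollover, a warm shrink/force-merge, and automatic deletion; the policy name, sizes, and ages are placeholders to adapt to your retention requirements (max_primary_shard_size requires 7.13+; older versions use max_size):

PUT /_ilm/policy/logs-policy
{
  "policy": {
    "phases": {
      "hot": {
        "actions": {
          "rollover": { "max_primary_shard_size": "50gb", "max_age": "7d" }
        }
      },
      "warm": {
        "min_age": "7d",
        "actions": {
          "shrink": { "number_of_shards": 1 },
          "forcemerge": { "max_num_segments": 1 }
        }
      },
      "delete": {
        "min_age": "30d",
        "actions": { "delete": {} }
      }
    }
  }
}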
Backup and Recovery
Snapshot repository configured
- Check:
GET /_snapshot
Regular snapshots scheduled
- Check:
GET /_slm/policy
Restore tested
- Verify backups actually work
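A sketch of registering a shared-filesystem repository and scheduling nightly snapshots with SLM; the repository name, path, schedule, and retention values are placeholders, an fs repository also requires path.repo in elasticsearch.yml, and cloud deployments typically use the s3, gcs, or azure repository types instead:

PUT /_snapshot/my_backup
{
  "type": "fs",
  "settings": { "location": "/mount/backups/my_backup" }
}

PUT /_slm/policy/nightly-snapshots
{
  "schedule": "0 30 1 * * ?",
  "name": "<nightly-snap-{now/d}>",
  "repository": "my_backup",
  "retention": { "expire_after": "30d", "min_count": 5, "max_count": 50 }
}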
Security
Authentication enabled
- Check: Security features active
TLS configured
- Transport and HTTP layers
Minimal privileges
- Users have only required permissions
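A hedged example of a minimally privileged, read-only role scoped to one index pattern, created via the security API; the role name, index pattern, and privileges are illustrative:

POST /_security/role/logs_reader
{
  "cluster": ["monitor"],
  "indices": [
    { "names": ["logs-*"], "privileges": ["read", "view_index_metadata"] }
  ]
}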
Circuit Breakers
Circuit breakers configured
PUT /_cluster/settings
{ "persistent": { "indices.breaker.total.limit": "70%" } }
Real memory tracking enabled (7.x+)
indices.breaker.total.use_real_memory: true (static setting, set in elasticsearch.yml)
Performance Verification
Run These Checks
# Cluster health
GET /_cluster/health?pretty
# Node stats
GET /_cat/nodes?v&h=name,heap.percent,cpu,load_1m,disk.used_percent
# Thread pools
GET /_cat/thread_pool?v&h=node_name,name,active,queue,rejected
# Hot threads
GET /_nodes/hot_threads
# Pending tasks
GET /_cluster/pending_tasks
Healthy Cluster Indicators
| Metric | Healthy | Investigate |
|---|---|---|
| Cluster status | Green | Yellow/Red |
| Heap usage | < 75% | > 85% |
| CPU usage | < 70% | > 85% |
| Disk usage | < 80% | > 85% |
| GC time | < 5% | > 10% |
| Thread rejections | 0 | > 0 |
| Pending tasks | 0 | > 10 |