This checklist provides Site Reliability Engineering (SRE) best practices for operating Elasticsearch clusters in production environments.

## Pre-Production Checklist

### Infrastructure
- Dedicated master nodes (minimum 3 for HA)
- Separate node roles (master, data, coordinating, ingest)
- SSDs for data nodes (NVMe preferred)
- Network redundancy between nodes
- Load balancer for client connections
- DNS/service discovery configured

### Configuration
- Cluster name set and unique per environment
- Node names follow naming convention
- Heap size no more than 50% of RAM, and below ~31 GB to keep compressed object pointers
- Memory lock enabled (`bootstrap.memory_lock: true`)
- Swap disabled or minimized
- File descriptor limits increased (65536+)
- Virtual memory limits appropriate (`vm.max_map_count` >= 262144); see the host-level sketch below
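
Most of the items above map to a handful of host-level settings. A minimal sketch, assuming a systemd-managed package install with the standard `/etc/elasticsearch` layout; the 16 GB heap is only an example, and `bootstrap.memory_lock: true` still needs to be set in `elasticsearch.yml`:

```bash
# Heap: fixed size, no more than 50% of RAM and below ~31 GB (example value)
cat <<'EOF' | sudo tee /etc/elasticsearch/jvm.options.d/heap.options
-Xms16g
-Xmx16g
EOF

# Kernel: mmap count required by Elasticsearch, keep swapping to a minimum
# (persist these under /etc/sysctl.d/ so they survive reboots)
sudo sysctl -w vm.max_map_count=262144
sudo sysctl -w vm.swappiness=1

# systemd limits for memory lock and file descriptors
sudo mkdir -p /etc/systemd/system/elasticsearch.service.d
cat <<'EOF' | sudo tee /etc/systemd/system/elasticsearch.service.d/override.conf
[Service]
LimitMEMLOCK=infinity
LimitNOFILE=65536
EOF
sudo systemctl daemon-reload
```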

### Security
- Authentication enabled
- TLS on transport and HTTP layers
- Role-based access control configured
- Audit logging enabled
- Network segmentation (management vs. data traffic)
- API keys for application access
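
For the API-key item, keys can be scoped per application instead of sharing a superuser login. A sketch using the create API key endpoint; the key name, index pattern, privileges, and the curl auth/TLS flags are all illustrative and depend on your setup:

```bash
# Create a least-privilege API key for an ingest application
# ("app-logs-*" and the privilege list are examples; adjust to your indices)
curl -s -u elastic -X POST "https://localhost:9200/_security/api_key" \
  -H 'Content-Type: application/json' -d'
{
  "name": "app-ingest-key",
  "expiration": "90d",
  "role_descriptors": {
    "app_writer": {
      "indices": [
        { "names": ["app-logs-*"], "privileges": ["create_doc", "auto_configure"] }
      ]
    }
  }
}'
```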

### Backup and Recovery
- Snapshot repository configured
- Automated snapshots scheduled
- Snapshot retention policy
- Restore process tested
- RTO/RPO documented and tested
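
A sketch of the repository and scheduling items, assuming a shared-filesystem repository mounted at `/mnt/es-snapshots` (the path must be listed in `path.repo` on every node) and SLM for scheduling plus retention; cloud repositories use a different repository type and settings, and the policy values below are examples:

```bash
# Register the repository (path must appear in path.repo on all nodes)
curl -s -X PUT "localhost:9200/_snapshot/nightly_backups" \
  -H 'Content-Type: application/json' \
  -d '{ "type": "fs", "settings": { "location": "/mnt/es-snapshots" } }'

# Nightly SLM policy with a 30-day retention window
curl -s -X PUT "localhost:9200/_slm/policy/nightly-snapshots" \
  -H 'Content-Type: application/json' -d'
{
  "schedule": "0 30 1 * * ?",
  "name": "<nightly-{now/d}>",
  "repository": "nightly_backups",
  "retention": { "expire_after": "30d", "min_count": 5, "max_count": 50 }
}'
```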

## Monitoring Checklist

### Cluster Health Metrics
- Cluster status (green/yellow/red)
- Unassigned shards count
- Pending tasks count
- Active shards percentage
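
All four metrics come straight from the cluster health and cat APIs; a few ad-hoc checks (an unauthenticated `localhost:9200` endpoint is assumed here and in the examples that follow):

```bash
# status, unassigned_shards, number_of_pending_tasks, active_shards_percent_as_number
curl -s "localhost:9200/_cluster/health?pretty"
# one-line summary, plus the pending-tasks queue itself
curl -s "localhost:9200/_cat/health?v"
curl -s "localhost:9200/_cat/pending_tasks?v"
```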

### Node Metrics
- CPU usage per node
- Heap usage per node
- Disk usage per node
- GC frequency and duration
- Thread pool queue sizes and rejections
- Network I/O
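
For spot checks of the node metrics above, the cat and node-stats APIs expose the same data a monitoring agent should be scraping continuously:

```bash
# CPU, heap, load, and disk per node
curl -s "localhost:9200/_cat/nodes?v&h=name,cpu,heap.percent,ram.percent,load_1m,disk.used_percent"
# queue depth and rejections for the hottest thread pools
curl -s "localhost:9200/_cat/thread_pool/write,search?v&h=node_name,name,active,queue,rejected"
# GC counts and durations, OS and filesystem detail
curl -s "localhost:9200/_nodes/stats/jvm,os,fs?pretty"
```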

### Index Metrics
- Indexing rate (docs/sec)
- Search rate (queries/sec)
- Indexing latency
- Search latency (p50, p95, p99)
- Refresh time
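
The stats API exposes these as cumulative counters and total times, so rates and average latencies are computed from deltas by your monitoring system; percentile latencies (p50/p95/p99) have to come from client-side measurement or the slow logs. A quick way to pull the raw counters:

```bash
# Cumulative indexing/search/refresh counters (index_total, index_time_in_millis,
# query_total, query_time_in_millis, refresh totals)
curl -s "localhost:9200/_stats/indexing,search,refresh?pretty"
# Largest indices first, with doc counts and shard layout
curl -s "localhost:9200/_cat/indices?v&h=index,docs.count,pri,rep,store.size&s=store.size:desc"
```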

### Infrastructure Metrics
- System CPU
- System memory
- Disk I/O
- Network latency between nodes

## Alerting Checklist

### Critical Alerts (Page immediately)
- Cluster status RED
- Node down/unreachable
- Heap usage > 90% for 5+ minutes
- Disk usage > 90%
- All masters unreachable
- Circuit breaker trips

### Warning Alerts (Investigate within hours)
- Cluster status YELLOW for 30+ minutes
- Heap usage > 80%
- Disk usage > 80%
- Thread pool rejections > 0
- GC time > 10% of total time
- High query latency (above SLA)
- Indexing rate drops significantly

### Informational (Review daily/weekly)
- Slow query count
- Shard count growth
- Index growth rate
- Snapshot failures

## Alert Configuration

### Example Alert Thresholds
```yaml
# Prometheus alerting rules example (metric names as exposed by elasticsearch_exporter)
groups:
  - name: elasticsearch
    rules:
      - alert: ElasticsearchClusterRed
        expr: elasticsearch_cluster_health_status{color="red"} == 1
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Elasticsearch cluster is RED"
      - alert: ElasticsearchHeapHigh
        expr: elasticsearch_jvm_memory_used_bytes{area="heap"} / elasticsearch_jvm_memory_max_bytes{area="heap"} > 0.85
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Elasticsearch heap usage above 85%"
      - alert: ElasticsearchDiskHigh
        expr: elasticsearch_filesystem_data_free_bytes / elasticsearch_filesystem_data_size_bytes < 0.15
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Elasticsearch disk usage above 85%"
```

## Capacity Planning

### Regular Reviews
- Weekly: Check growth trends
- Monthly: Capacity projection update
- Quarterly: Capacity planning review

### Key Metrics to Track
- Data volume growth rate
- Query volume growth rate
- Resource utilization trends
- Cost per GB stored
- Cost per query

### Capacity Thresholds
| Resource | Plan Expansion | Execute | Emergency |
|---|---|---|---|
| Disk | 70% | 80% | 85% |
| Heap | 70% | 80% | 85% |
| CPU | 60% | 75% | 85% |
| Shards/node | 500 | 750 | 900 |
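
The disk thresholds above need to stay below Elasticsearch's own disk watermarks (defaults: low 85%, high 90%, flood stage 95%), otherwise the cluster starts relocating shards or blocking writes before you act. If you run with non-default watermarks, set them explicitly so the capacity table and the cluster agree; the endpoint is illustrative:

```bash
# Explicitly pin the disk watermarks (values shown are the cluster defaults)
curl -s -X PUT "localhost:9200/_cluster/settings" \
  -H 'Content-Type: application/json' -d'
{
  "persistent": {
    "cluster.routing.allocation.disk.watermark.low": "85%",
    "cluster.routing.allocation.disk.watermark.high": "90%",
    "cluster.routing.allocation.disk.watermark.flood_stage": "95%"
  }
}'
```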

## Operational Runbooks

### Rolling Restart Procedure
```
# For each node:

# 1. Disable shard allocation
PUT /_cluster/settings
{"transient": {"cluster.routing.allocation.enable": "primaries"}}

# 2. Stop indexing (optional)

# 3. Flush (synced flush on ES 7.x; deprecated in 7.6 and removed in 8.0, use POST /_flush there)
POST /_flush/synced

# 4. Stop the node
systemctl stop elasticsearch

# 5. Perform maintenance

# 6. Start the node
systemctl start elasticsearch

# 7. Wait for the node to join
GET /_cat/nodes?v

# 8. Re-enable allocation
PUT /_cluster/settings
{"transient": {"cluster.routing.allocation.enable": "all"}}

# 9. Wait for green
GET /_cluster/health?wait_for_status=green

# 10. Repeat for the next node
```
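
Step 9 is the part most worth scripting; a small helper that polls until the cluster reports green before the next node is touched (unauthenticated `localhost:9200` assumed, add auth/TLS flags as needed):

```bash
until curl -s "localhost:9200/_cluster/health?filter_path=status" | grep -q '"green"'; do
  echo "cluster not green yet, waiting..."
  sleep 10
done
```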

### Emergency Response

Cluster RED:

- Check which indices are red: `GET /_cat/indices?v&health=red`
- Check unassigned shards: `GET /_cluster/allocation/explain`
- Check node status: `GET /_cat/nodes?v`
- Review logs for errors
- Attempt recovery: `POST /_cluster/reroute?retry_failed=true`
High Memory Pressure:

- Check heap: `GET /_nodes/stats/jvm`
- Clear caches: `POST /_cache/clear`
- Identify expensive operations: `GET /_tasks?detailed=true`
- Cancel if necessary: `POST /_tasks/{task_id}/_cancel`
- Scale out if the pressure persists

## Change Management

### Pre-Change Checklist
- Change documented and approved
- Rollback plan prepared
- Snapshot taken
- Monitoring dashboard ready
- On-call team notified
- Maintenance window scheduled

### Post-Change Verification
- Cluster health green
- All nodes present
- No unexpected errors in logs
- Performance baseline maintained
- Monitoring alerts clear
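
A quick smoke check covering the first three items, assuming a systemd unit named `elasticsearch` and an unauthenticated local endpoint:

```bash
# health and node count
curl -s "localhost:9200/_cluster/health?pretty"
curl -s "localhost:9200/_cat/nodes?v"
# recent errors in the service log
sudo journalctl -u elasticsearch --since "1 hour ago" | grep -iE "error|exception" | tail -n 20
```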

## Documentation Requirements

### Required Documentation
- Architecture diagram
- Runbooks for common operations
- Incident response procedures
- Escalation contacts
- SLA definitions
- Capacity planning records
- Change history

### Regular Updates
- Review documentation quarterly
- Update after incidents
- Update after architecture changes

## Incident Management

### Severity Definitions
| Severity | Description | Response Time |
|---|---|---|
| SEV1 | Total outage, data loss risk | Immediate |
| SEV2 | Partial outage, degraded performance | < 30 min |
| SEV3 | Minor issues, no user impact | < 4 hours |
| SEV4 | Informational, planning items | Next business day |

### Post-Incident Review
- Timeline documented
- Root cause identified
- Action items created
- Monitoring gaps addressed
- Runbooks updated