This checklist provides Site Reliability Engineering (SRE) best practices for operating Elasticsearch clusters in production environments.

## Pre-Production Checklist

### Infrastructure
- Dedicated master nodes (minimum 3 for HA)
- Separate node roles (master, data, coordinating, ingest)
- SSDs for data nodes (NVMe preferred)
- Network redundancy between nodes
- Load balancer for client connections
- DNS/service discovery configured

### Configuration
- Cluster name set and unique per environment
- Node names follow naming convention
- Heap size no more than 50% of RAM, and below ~31 GB to keep compressed object pointers
- Memory lock enabled (`bootstrap.memory_lock: true`)
- Swap disabled or minimized
- File descriptor limits increased (65536+)
- Virtual memory limits appropriate (`vm.max_map_count` >= 262144); see the host-level sketch below
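
Most of the items above map to a handful of host-level settings. A minimal sketch, assuming a systemd-managed package install with the standard `/etc/elasticsearch` layout; the 16 GB heap is only an example, and `bootstrap.memory_lock: true` still needs to be set in `elasticsearch.yml`:

```bash
# Heap: fixed size, no more than 50% of RAM and below ~31 GB (example value)
cat <<'EOF' | sudo tee /etc/elasticsearch/jvm.options.d/heap.options
-Xms16g
-Xmx16g
EOF

# Kernel: mmap count required by Elasticsearch, keep swapping to a minimum
# (persist these under /etc/sysctl.d/ so they survive reboots)
sudo sysctl -w vm.max_map_count=262144
sudo sysctl -w vm.swappiness=1

# systemd limits for memory lock and file descriptors
sudo mkdir -p /etc/systemd/system/elasticsearch.service.d
cat <<'EOF' | sudo tee /etc/systemd/system/elasticsearch.service.d/override.conf
[Service]
LimitMEMLOCK=infinity
LimitNOFILE=65536
EOF
sudo systemctl daemon-reload
```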

### Security
- Authentication enabled
- TLS on transport and HTTP layers
- Role-based access control configured
- Audit logging enabled
- Network segmentation (management vs. data traffic)
- API keys for application access
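
For the API-key item, keys can be scoped per application instead of sharing a superuser login. A sketch using the create API key endpoint; the key name, index pattern, privileges, and the curl auth/TLS flags are all illustrative and depend on your setup:

```bash
# Create a least-privilege API key for an ingest application
# ("app-logs-*" and the privilege list are examples; adjust to your indices)
curl -s -u elastic -X POST "https://localhost:9200/_security/api_key" \
  -H 'Content-Type: application/json' -d'
{
  "name": "app-ingest-key",
  "expiration": "90d",
  "role_descriptors": {
    "app_writer": {
      "indices": [
        { "names": ["app-logs-*"], "privileges": ["create_doc", "auto_configure"] }
      ]
    }
  }
}'
```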

### Backup and Recovery
- Snapshot repository configured
- Automated snapshots scheduled
- Snapshot retention policy
- Restore process tested
- RTO/RPO documented and tested
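
A sketch of the repository and scheduling items, assuming a shared-filesystem repository mounted at `/mnt/es-snapshots` (the path must be listed in `path.repo` on every node) and SLM for scheduling plus retention; cloud repositories use a different repository type and settings, and the policy values below are examples:

```bash
# Register the repository (path must appear in path.repo on all nodes)
curl -s -X PUT "localhost:9200/_snapshot/nightly_backups" \
  -H 'Content-Type: application/json' \
  -d '{ "type": "fs", "settings": { "location": "/mnt/es-snapshots" } }'

# Nightly SLM policy with a 30-day retention window
curl -s -X PUT "localhost:9200/_slm/policy/nightly-snapshots" \
  -H 'Content-Type: application/json' -d'
{
  "schedule": "0 30 1 * * ?",
  "name": "<nightly-{now/d}>",
  "repository": "nightly_backups",
  "retention": { "expire_after": "30d", "min_count": 5, "max_count": 50 }
}'
```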

## Monitoring Checklist

### Cluster Health Metrics
- Cluster status (green/yellow/red)
- Unassigned shards count
- Pending tasks count
- Active shards percentage
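
All four metrics come straight from the cluster health and cat APIs; a few ad-hoc checks (an unauthenticated `localhost:9200` endpoint is assumed here and in the examples that follow):

```bash
# status, unassigned_shards, number_of_pending_tasks, active_shards_percent_as_number
curl -s "localhost:9200/_cluster/health?pretty"
# one-line summary, plus the pending-tasks queue itself
curl -s "localhost:9200/_cat/health?v"
curl -s "localhost:9200/_cat/pending_tasks?v"
```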

### Node Metrics
- CPU usage per node
- Heap usage per node
- Disk usage per node
- GC frequency and duration
- Thread pool queue sizes and rejections
- Network I/O
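
For spot checks of the node metrics above, the cat and node-stats APIs expose the same data a monitoring agent should be scraping continuously:

```bash
# CPU, heap, load, and disk per node
curl -s "localhost:9200/_cat/nodes?v&h=name,cpu,heap.percent,ram.percent,load_1m,disk.used_percent"
# queue depth and rejections for the hottest thread pools
curl -s "localhost:9200/_cat/thread_pool/write,search?v&h=node_name,name,active,queue,rejected"
# GC counts and durations, OS and filesystem detail
curl -s "localhost:9200/_nodes/stats/jvm,os,fs?pretty"
```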

### Index Metrics
- Indexing rate (docs/sec)
- Search rate (queries/sec)
- Indexing latency
- Search latency (p50, p95, p99)
- Refresh time
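
The stats API exposes these as cumulative counters and total times, so rates and average latencies are computed from deltas by your monitoring system; percentile latencies (p50/p95/p99) have to come from client-side measurement or the slow logs. A quick way to pull the raw counters:

```bash
# Cumulative indexing/search/refresh counters (index_total, index_time_in_millis,
# query_total, query_time_in_millis, refresh totals)
curl -s "localhost:9200/_stats/indexing,search,refresh?pretty"
# Largest indices first, with doc counts and shard layout
curl -s "localhost:9200/_cat/indices?v&h=index,docs.count,pri,rep,store.size&s=store.size:desc"
```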

### Infrastructure Metrics
- System CPU
- System memory
- Disk I/O
- Network latency between nodes

## Alerting Checklist

### Critical Alerts (Page immediately)
- Cluster status RED
- Node down/unreachable
- Heap usage > 90% for 5+ minutes
- Disk usage > 90%
- All masters unreachable
- Circuit breaker trips

### Warning Alerts (Investigate within hours)
- Cluster status YELLOW for 30+ minutes
- Heap usage > 80%
- Disk usage > 80%
- Thread pool rejections > 0
- GC time > 10% of total time
- High query latency (above SLA)
- Indexing rate drops significantly

### Informational (Review daily/weekly)
- Slow query count
- Shard count growth
- Index growth rate
- Snapshot failures

## Alert Configuration

### Example Alert Thresholds
```yaml
# Prometheus alerting rules example (metric names as exposed by elasticsearch_exporter)
groups:
  - name: elasticsearch
    rules:
      - alert: ElasticsearchClusterRed
        expr: elasticsearch_cluster_health_status{color="red"} == 1
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Elasticsearch cluster is RED"
      - alert: ElasticsearchHeapHigh
        expr: elasticsearch_jvm_memory_used_bytes{area="heap"} / elasticsearch_jvm_memory_max_bytes{area="heap"} > 0.85
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Elasticsearch heap usage above 85%"
      - alert: ElasticsearchDiskHigh
        expr: elasticsearch_filesystem_data_free_bytes / elasticsearch_filesystem_data_size_bytes < 0.15
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Elasticsearch disk usage above 85%"
```

## Capacity Planning

### Regular Reviews
- Weekly: Check growth trends
- Monthly: Capacity projection update
- Quarterly: Capacity planning review

### Key Metrics to Track
- Data volume growth rate
- Query volume growth rate
- Resource utilization trends
- Cost per GB stored
- Cost per query

### Capacity Thresholds
| Resource | Plan Expansion | Execute | Emergency |
|---|---|---|---|
| Disk | 70% | 80% | 85% |
| Heap | 70% | 80% | 85% |
| CPU | 60% | 75% | 85% |
| Shards/node | 500 | 750 | 900 |
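
The disk thresholds above need to stay below Elasticsearch's own disk watermarks (defaults: low 85%, high 90%, flood stage 95%), otherwise the cluster starts relocating shards or blocking writes before you act. If you run with non-default watermarks, set them explicitly so the capacity table and the cluster agree; the endpoint is illustrative:

```bash
# Explicitly pin the disk watermarks (values shown are the cluster defaults)
curl -s -X PUT "localhost:9200/_cluster/settings" \
  -H 'Content-Type: application/json' -d'
{
  "persistent": {
    "cluster.routing.allocation.disk.watermark.low": "85%",
    "cluster.routing.allocation.disk.watermark.high": "90%",
    "cluster.routing.allocation.disk.watermark.flood_stage": "95%"
  }
}'
```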

## Operational Runbooks

### Rolling Restart Procedure
```
# For each node:

# 1. Disable shard allocation
PUT /_cluster/settings
{"transient": {"cluster.routing.allocation.enable": "primaries"}}

# 2. Stop indexing (optional)

# 3. Flush (synced flush on ES 7.x; deprecated in 7.6 and removed in 8.0, use POST /_flush there)
POST /_flush/synced

# 4. Stop the node
systemctl stop elasticsearch

# 5. Perform maintenance

# 6. Start the node
systemctl start elasticsearch

# 7. Wait for the node to join
GET /_cat/nodes?v

# 8. Re-enable allocation
PUT /_cluster/settings
{"transient": {"cluster.routing.allocation.enable": "all"}}

# 9. Wait for green
GET /_cluster/health?wait_for_status=green

# 10. Repeat for the next node
```
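
Step 9 is the part most worth scripting; a small helper that polls until the cluster reports green before the next node is touched (unauthenticated `localhost:9200` assumed, add auth/TLS flags as needed):

```bash
until curl -s "localhost:9200/_cluster/health?filter_path=status" | grep -q '"green"'; do
  echo "cluster not green yet, waiting..."
  sleep 10
done
```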

### Emergency Response

Cluster RED:

- Check which indices are red: `GET /_cat/indices?v&health=red`
- Check unassigned shards: `GET /_cluster/allocation/explain`
- Check node status: `GET /_cat/nodes?v`
- Review logs for errors
- Attempt recovery: `POST /_cluster/reroute?retry_failed=true`
High Memory Pressure:

- Check heap: `GET /_nodes/stats/jvm`
- Clear caches: `POST /_cache/clear`
- Identify expensive operations: `GET /_tasks?detailed=true`
- Cancel if necessary: `POST /_tasks/{task_id}/_cancel`
- Scale out if the pressure persists

## Change Management

### Pre-Change Checklist
- Change documented and approved
- Rollback plan prepared
- Snapshot taken
- Monitoring dashboard ready
- On-call team notified
- Maintenance window scheduled

### Post-Change Verification
- Cluster health green
- All nodes present
- No unexpected errors in logs
- Performance baseline maintained
- Monitoring alerts clear
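
A quick smoke check covering the first three items, assuming a systemd unit named `elasticsearch` and an unauthenticated local endpoint:

```bash
# health and node count
curl -s "localhost:9200/_cluster/health?pretty"
curl -s "localhost:9200/_cat/nodes?v"
# recent errors in the service log
sudo journalctl -u elasticsearch --since "1 hour ago" | grep -iE "error|exception" | tail -n 20
```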

## Documentation Requirements

### Required Documentation
- Architecture diagram
- Runbooks for common operations
- Incident response procedures
- Escalation contacts
- SLA definitions
- Capacity planning records
- Change history

### Regular Updates
- Review documentation quarterly
- Update after incidents
- Update after architecture changes

## Incident Management

### Severity Definitions
| Severity | Description | Response Time |
|---|---|---|
| SEV1 | Total outage, data loss risk | Immediate |
| SEV2 | Partial outage, degraded performance | < 30 min |
| SEV3 | Minor issues, no user impact | < 4 hours |
| SEV4 | Informational, planning items | Next business day |

### Post-Incident Review
- Timeline documented
- Root cause identified
- Action items created
- Monitoring gaps addressed
- Runbooks updated