Elasticsearch SRE Checklist for Production

This checklist provides Site Reliability Engineering (SRE) best practices for operating Elasticsearch clusters in production environments.

Pre-Production Checklist

Infrastructure

  • Dedicated master nodes (minimum 3 for HA)
  • Separate node roles (master, data, coordinating, ingest)
  • SSDs for data nodes (NVMe preferred)
  • Network redundancy between nodes
  • Load balancer for client connections
  • DNS/service discovery configured

Configuration

  • Unique cluster name configured
  • Node names follow a naming convention
  • Heap size ~50% of RAM, capped below ~31 GB (preserves compressed object pointers)
  • Memory lock enabled (bootstrap.memory_lock: true)
  • Swap disabled or minimized
  • File descriptor limits raised (65536+)
  • vm.max_map_count raised (262144 minimum)
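
A minimal sketch of the OS-level settings behind this list (values are illustrative; verify against the Elasticsearch reference for your version):

```shell
# Disable swap and raise kernel/process limits
sudo swapoff -a
sudo sysctl -w vm.max_map_count=262144   # required for mmap-based index storage
ulimit -n 65536                          # file descriptors for the ES process

# jvm.options: fixed heap, <= 50% of RAM and below ~31 GB, e.g.:
#   -Xms16g
#   -Xmx16g
```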

Security

  • Authentication enabled
  • TLS on transport and HTTP layers
  • Role-based access control configured
  • Audit logging enabled
  • Network segmentation (management vs. data traffic)
  • API keys for application access
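
For application access, API keys can be created through the security API once authentication is enabled (the key name and index pattern below are hypothetical):

```
POST /_security/api_key
{
  "name": "app-search-key",
  "expiration": "90d",
  "role_descriptors": {
    "app_read": {
      "indices": [{ "names": ["app-*"], "privileges": ["read"] }]
    }
  }
}
```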

Backup and Recovery

  • Snapshot repository configured
  • Automated snapshots scheduled
  • Snapshot retention policy
  • Restore process tested
  • RTO/RPO documented and tested
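
Automated snapshots and retention can be handled together with a snapshot lifecycle management (SLM) policy, available since ES 7.4 (repository name, schedule, and retention values below are illustrative):

```
PUT /_slm/policy/nightly-snapshots
{
  "schedule": "0 30 1 * * ?",
  "name": "<nightly-snap-{now/d}>",
  "repository": "my_backup_repo",
  "config": { "include_global_state": true },
  "retention": { "expire_after": "30d", "min_count": 7, "max_count": 60 }
}
```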

Monitoring Checklist

Cluster Health Metrics

  • Cluster status (green/yellow/red)
  • Unassigned shards count
  • Pending tasks count
  • Active shards percentage
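
All four of these come from the cluster health API; the response fields to watch are:

```
GET /_cluster/health
# Illustrative response excerpt:
# "status": "green",
# "unassigned_shards": 0,
# "number_of_pending_tasks": 0,
# "active_shards_percent_as_number": 100.0
```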

Node Metrics

  • CPU usage per node
  • Heap usage per node
  • Disk usage per node
  • GC frequency and duration
  • Thread pool queue sizes and rejections
  • Network I/O
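
Most of these are visible at a glance through the cat nodes API, using its standard column headers:

```
GET /_cat/nodes?v&h=name,node.role,heap.percent,ram.percent,cpu,disk.used_percent
```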

Index Metrics

  • Indexing rate (docs/sec)
  • Search rate (queries/sec)
  • Indexing latency
  • Search latency (p50, p95, p99)
  • Refresh time

Infrastructure Metrics

  • System CPU
  • System memory
  • Disk I/O
  • Network latency between nodes

Alerting Checklist

Critical Alerts (Page immediately)

  • Cluster status RED
  • Node down/unreachable
  • Heap usage > 90% for 5+ minutes
  • Disk usage > 90%
  • All masters unreachable
  • Circuit breaker trips

Warning Alerts (Investigate within hours)

  • Cluster status YELLOW for 30+ minutes
  • Heap usage > 80%
  • Disk usage > 80%
  • Thread pool rejections > 0
  • GC time > 10% of total time
  • High query latency (above SLA)
  • Indexing rate drops significantly

Informational (Review daily/weekly)

  • Slow query count
  • Shard count growth
  • Index growth rate
  • Snapshot failures

Alert Configuration

Example Alert Thresholds

# Prometheus alerting rules example
# (metric names as exposed by the prometheus-community elasticsearch_exporter)
groups:
  - name: elasticsearch
    rules:
      - alert: ElasticsearchClusterRed
        expr: elasticsearch_cluster_health_status{color="red"} == 1
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Elasticsearch cluster is RED"

      - alert: ElasticsearchHeapHigh
        expr: elasticsearch_jvm_memory_used_bytes{area="heap"} / elasticsearch_jvm_memory_max_bytes{area="heap"} > 0.85
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Elasticsearch heap usage above 85%"

      - alert: ElasticsearchDiskHigh
        expr: elasticsearch_filesystem_data_free_bytes / elasticsearch_filesystem_data_size_bytes < 0.15
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Elasticsearch disk usage above 85%"

Capacity Planning

Regular Reviews

  • Weekly: Check growth trends
  • Monthly: Capacity projection update
  • Quarterly: Capacity planning review

Key Metrics to Track

  • Data volume growth rate
  • Query volume growth rate
  • Resource utilization trends
  • Cost per GB stored
  • Cost per query

Capacity Thresholds

Resource    | Plan Expansion | Execute | Emergency
Disk        | 70%            | 80%     | 85%
Heap        | 70%            | 80%     | 85%
CPU         | 60%            | 75%     | 85%
Shards/node | 500            | 750     | 900
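
The projection behind these thresholds is simple compound growth on the metrics tracked above. A quick sketch (all figures hypothetical):

```python
def months_until_threshold(used_gb: float, capacity_gb: float,
                           monthly_growth: float, threshold: float = 0.70) -> int:
    """Months until utilization crosses `threshold`, assuming
    compound monthly growth of the stored data."""
    months = 0
    while used_gb / capacity_gb < threshold:
        used_gb *= 1 + monthly_growth
        months += 1
    return months

# e.g. 4 TB used on 10 TB capacity, growing 5%/month,
# crosses the 70% "plan expansion" threshold in about a year:
print(months_until_threshold(4000, 10000, 0.05))  # → 12
```

Running this on each cluster's real growth rate turns the weekly trend check into a concrete "order hardware by" date.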

Operational Runbooks

Rolling Restart Procedure

# For each node:
1. Disable shard allocation
   PUT /_cluster/settings {"transient":{"cluster.routing.allocation.enable":"primaries"}}

2. Stop indexing (optional)

3. Flush indices
   POST /_flush
   (ES 7.x only: synced flush, POST /_flush/synced, is deprecated since 7.6 and removed in 8.0)

4. Stop node
   systemctl stop elasticsearch

5. Perform maintenance

6. Start node
   systemctl start elasticsearch

7. Wait for node to join
   GET /_cat/nodes?v

8. Re-enable allocation
   PUT /_cluster/settings {"transient":{"cluster.routing.allocation.enable":"all"}}

9. Wait for green
   GET /_cluster/health?wait_for_status=green

10. Repeat for next node
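
Step 9 can be scripted so the next node is only touched once the cluster is actually green (host and port are assumptions; add TLS/auth flags as your deployment requires):

```shell
# Block until the cluster reports green, retrying on timeout
until curl -s "http://localhost:9200/_cluster/health?wait_for_status=green&timeout=60s" \
    | grep -q '"status":"green"'; do
  echo "cluster not green yet, retrying..."
  sleep 10
done
```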

Emergency Response

Cluster RED:

  1. Check which indices are red: GET /_cat/indices?v&health=red
  2. Check unassigned shards: GET /_cluster/allocation/explain
  3. Check node status: GET /_cat/nodes?v
  4. Review logs for errors
  5. Attempt recovery: POST /_cluster/reroute?retry_failed=true

High Memory Pressure:

  1. Check heap: GET /_nodes/stats/jvm
  2. Clear caches: POST /_cache/clear
  3. Identify expensive operations: GET /_tasks?detailed=true
  4. Cancel if necessary: POST /_tasks/{task_id}/_cancel
  5. Scale if persistent issue

Change Management

Pre-Change Checklist

  • Change documented and approved
  • Rollback plan prepared
  • Snapshot taken
  • Monitoring dashboard ready
  • On-call team notified
  • Maintenance window scheduled

Post-Change Verification

  • Cluster health green
  • All nodes present
  • No unexpected errors in logs
  • Performance baseline maintained
  • Monitoring alerts clear

Documentation Requirements

Required Documentation

  • Architecture diagram
  • Runbooks for common operations
  • Incident response procedures
  • Escalation contacts
  • SLA definitions
  • Capacity planning records
  • Change history

Regular Updates

  • Review documentation quarterly
  • Update after incidents
  • Update after architecture changes

Incident Management

Severity Definitions

Severity | Description                          | Response Time
SEV1     | Total outage, data loss risk         | Immediate
SEV2     | Partial outage, degraded performance | < 30 min
SEV3     | Minor issues, no user impact         | < 4 hours
SEV4     | Informational, planning items        | Next business day

Post-Incident Review

  • Timeline documented
  • Root cause identified
  • Action items created
  • Monitoring gaps addressed
  • Runbooks updated