Checking Elasticsearch Cluster Health: A Comprehensive Guide

Checking Elasticsearch cluster health is crucial for maintaining a robust and efficient search infrastructure. You should perform health checks:

  1. Regularly as part of routine maintenance
  2. When experiencing performance issues
  3. After making configuration changes
  4. Before and after scaling operations
  5. During troubleshooting processes

What Constitutes Good Cluster Health

A healthy Elasticsearch cluster exhibits several key characteristics that work together to ensure reliable operation:

Cluster Status

The most visible health indicator is the cluster status color:

Green Status - Optimal health where:

  • All primary shards are allocated and active
  • All replica shards are allocated and active
  • Data is fully redundant across the cluster
  • Search and indexing operations perform optimally

Yellow Status - Functional but vulnerable:

  • All primary shards are allocated
  • Some replica shards are unallocated
  • Data is accessible but lacks full redundancy
  • Risk of data loss if a node fails
  • Performance may be impacted during node failures

Red Status - Critical issues:

  • One or more primary shards are unallocated
  • Data is incomplete or inaccessible
  • Search queries may return partial results
  • Indexing to affected indices may fail
  • Immediate attention required

Key Health Metrics

Beyond the status color, several metrics indicate cluster health:

Shard Allocation

  • All shards should be assigned to nodes
  • Even distribution across nodes prevents hotspots
  • Appropriate replica configuration for fault tolerance

Node Availability

  • All expected nodes are present and responsive
  • No nodes in the process of leaving or joining
  • Stable cluster membership over time

Resource Utilization

  • CPU usage below 80% on all nodes
  • Heap memory usage below 75% consistently
  • Disk space with at least 15-20% free (staying below the high disk watermark, 90% used by default)
  • JVM garbage collection pauses under 1 second

Performance Indicators

  • Query latency within acceptable thresholds
  • Indexing throughput meeting requirements
  • No persistent thread pool rejections
  • Search queue sizes remain manageable

Cluster Operations

  • No ongoing shard relocations or recoveries (unless planned)
  • No pending tasks accumulating in the cluster state
  • Snapshot and restore operations completing successfully

Standard APIs for Cluster Health Monitoring

Elasticsearch provides several built-in APIs to assess and monitor cluster health:

Cluster Health API

The primary API for checking overall cluster status:

Basic Health Check

GET /_cluster/health

Response includes:

  • status: green, yellow, or red
  • number_of_nodes: Total nodes in cluster
  • active_primary_shards: Count of active primary shards
  • active_shards: Total active shards
  • relocating_shards: Shards currently moving between nodes
  • initializing_shards: Shards being initialized
  • unassigned_shards: Shards not yet allocated

Index-Level Health

GET /_cluster/health?level=indices

Shows health status for each index individually, helping identify which indices have issues.

Shard-Level Health

GET /_cluster/health?level=shards

Provides detailed shard-by-shard health information for deep troubleshooting.

Wait for Status

GET /_cluster/health?wait_for_status=green&timeout=30s

Blocks until cluster reaches specified status or timeout expires, useful for automation and deployment scripts.
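When the server-side `wait_for_status` parameter isn't convenient, the same gate can be implemented client-side. This is a sketch only: `fetch_status` is injected so the example stays self-contained, whereas in practice it would call GET /_cluster/health and read the `status` field:

```python
import time

# Deployment-gate sketch: poll cluster health until it reaches the
# desired status or a timeout expires. `fetch_status` is a stand-in
# for an HTTP call to GET /_cluster/health.

def wait_for_status(fetch_status, desired="green", timeout=30.0, interval=1.0):
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if fetch_status() == desired:
            return True
        time.sleep(interval)
    return False

# Simulated cluster that turns green on the third poll.
responses = iter(["red", "yellow", "green"])
print(wait_for_status(lambda: next(responses), timeout=5.0, interval=0.0))
```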

Cat APIs for Health Monitoring

The Cat APIs provide human-readable output for quick health checks:

Cat Health

GET /_cat/health?v

Compact view of cluster health with timestamp, perfect for scripting and regular monitoring.

Cat Nodes

GET /_cat/nodes?v&h=name,heap.percent,ram.percent,cpu,load_1m,disk.used_percent

Shows resource utilization across all nodes to identify overloaded systems.

Cat Shards

GET /_cat/shards?v&h=index,shard,prirep,state,docs,store,node&s=store:desc

Lists all shards with their status, helping identify unassigned or problematic shards.
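Because cat output is whitespace-delimited, it scripts easily. The sketch below filters a sample of `_cat/shards` output (the lines are illustrative) down to the unassigned shards:

```python
# Parse sample GET /_cat/shards output (columns: index shard prirep
# state node) and keep only unassigned shards. Sample lines are
# illustrative, not from a real cluster.
sample_output = """\
logs-2024  0 p STARTED    node-1
logs-2024  0 r UNASSIGNED
metrics    1 p STARTED    node-2
metrics    1 r UNASSIGNED
"""

unassigned = [line.split()[:3] for line in sample_output.splitlines()
              if "UNASSIGNED" in line]
print(unassigned)
```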

Cat Indices

GET /_cat/indices?v&health=yellow
GET /_cat/indices?v&health=red

The health parameter accepts a single status value, so run one request per status to list only the indices that are not green.

Cluster Stats API

Comprehensive cluster-wide statistics:

GET /_cluster/stats

Returns detailed information about:

  • Node roles and versions
  • Index and shard counts
  • Document counts and storage size
  • JVM versions and memory usage
  • Plugin information

Nodes Stats API

Detailed statistics for each node:

GET /_nodes/stats

Provides extensive metrics including:

  • JVM heap and garbage collection
  • Thread pools and rejections
  • File system and disk I/O
  • HTTP and transport layer stats
  • Indexing and search performance

Specific Metrics

GET /_nodes/stats/jvm,process,fs

Filter to only the metrics you need for faster responses.

Task Management API

Monitor long-running operations:

GET /_tasks?detailed=true&group_by=parents

Shows currently executing tasks like:

  • Ongoing searches
  • Indexing operations
  • Snapshot creation
  • Cluster state updates

Pending Tasks API

Identify cluster state update bottlenecks:

GET /_cluster/pending_tasks

Returns tasks waiting to be processed by the master node, which can indicate cluster state update issues.
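A quick way to act on this API is to count the queued tasks. The sketch below parses a trimmed sample response (values are illustrative); a persistently non-zero count suggests the master node is falling behind:

```python
import json

# Trimmed sample of a GET /_cluster/pending_tasks response; entries
# are illustrative. A growing "tasks" list signals a master-node
# bottleneck in processing cluster state updates.
pending = json.loads("""{
  "tasks": [
    {"insert_order": 101, "priority": "URGENT", "source": "create-index [logs-2024]"},
    {"insert_order": 102, "priority": "HIGH",   "source": "shard-started"}
  ]
}""")

print(len(pending["tasks"]))
```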

Automated Health Monitoring with Pulse

While the standard APIs provide comprehensive data, interpreting them correctly and monitoring them continuously requires significant effort. Pulse continuously monitors your Elasticsearch and OpenSearch clusters with automated health checks that detect issues before they impact your operations. Get real-time visibility into cluster performance, resource utilization, and potential bottlenecks.

Health Assessments

Pulse provides proactive insights for optimal cluster health and performance through automated assessments:

Prevent Problems Before They Happen

Avoid costly downtime and enjoy a seamless user experience with Pulse's proactive issue identification. Rather than reacting to failures, Pulse detects emerging issues like:

  • Increasing heap pressure before OutOfMemory errors
  • Disk space trends before watermark thresholds
  • Shard allocation imbalances before performance degradation
  • Query performance degradation patterns

Customized Health Metrics

Get clear, actionable insights tailored to your specific cluster setup and performance goals. Pulse understands:

  • Your cluster topology and configuration
  • Expected workload patterns
  • Custom index settings and mappings
  • Application-specific performance requirements

Scale with Confidence

Scale smoothly and keep your clusters healthy and up-to-date with daily checks and proactive monitoring. Pulse helps you:

  • Identify capacity constraints before they become critical
  • Understand resource utilization trends
  • Plan scaling operations with data-driven insights
  • Monitor cluster health during and after scaling events

Unlike manual monitoring with standard APIs, Pulse provides continuous automated assessments, intelligent alerting, and historical trend analysis to help you maintain optimal cluster health effortlessly.

Best Practices and Additional Information

  • Set up alerts for status changes, especially transitions to yellow or red
  • Regularly review cluster settings and shard allocation
  • Monitor node performance and resource utilization
  • Keep Elasticsearch and plugins up to date
  • Implement proper backup and recovery strategies

Frequently Asked Questions

Q: How often should I check my Elasticsearch cluster health?
A: It's recommended to set up continuous monitoring with alerts. However, manual checks should be performed at least daily, and more frequently during peak usage periods or after significant changes. Automated monitoring tools like Pulse can provide continuous health assessments without manual intervention.

Q: What does a yellow status mean, and is it a cause for concern?
A: A yellow status indicates that all primary shards are allocated, but some replica shards are not. While not as critical as a red status, it should be investigated promptly to ensure data redundancy and optimal performance. A prolonged yellow status leaves your cluster vulnerable to data loss if a node fails.

Q: Can cluster health impact search performance?
A: Yes, poor cluster health can significantly impact search performance. A red status, in particular, can lead to incomplete search results and increased query times. Even yellow status can affect performance during node failures, as the cluster lacks full redundancy to handle failovers smoothly.

Q: How can I improve my cluster's health from yellow to green?
A: To improve from yellow to green, ensure that there are enough nodes to allocate all replica shards, check for any shard allocation issues using the Allocation Explain API, and verify that there's sufficient disk space on all nodes. Also check cluster settings that might be preventing shard allocation.

Q: What steps should I take if my cluster status turns red?
A: If your cluster status turns red, immediately investigate which primary shards are unallocated using GET /_cat/shards?v&h=index,shard,prirep,state,unassigned.reason, check node status and logs, ensure adequate resources are available, and consider restoring from a backup if data loss has occurred. Use the Allocation Explain API to understand why shards aren't being allocated.

Q: What's the difference between the Cluster Health API and the Cat Health API?
A: The Cluster Health API returns detailed JSON responses with comprehensive metrics, ideal for programmatic monitoring and automation. The Cat Health API provides compact, human-readable output perfect for quick manual checks and shell scripts. Both provide the same underlying health information.

Q: How do I monitor cluster health in production environments?
A: Production clusters should have continuous monitoring through tools that aggregate metrics from the various health APIs. Set up alerts for status changes, resource thresholds, and performance degradation. Many teams use dedicated monitoring solutions that provide automated health checks, trend analysis, and proactive alerting.

Q: What resource metrics should I monitor alongside cluster health?
A: Monitor heap memory usage (should stay below 75%), CPU utilization (target below 80%), disk space (maintain at least 15-20% free), JVM garbage collection pauses (should be under 1 second), thread pool rejections (should be zero or minimal), and query/indexing latency compared to your baselines.

Q: Can a green cluster still have performance issues?
A: Yes, a green status only indicates that all shards are allocated. You can still experience performance issues due to resource constraints, inefficient queries, poor index design, or hardware limitations. Comprehensive monitoring should include performance metrics beyond just shard allocation status.

Q: How do pending tasks affect cluster health?
A: Pending tasks accumulating in the cluster state indicate that the master node is struggling to process updates. This can lead to delays in shard allocation, index creation, and cluster configuration changes. High pending task counts often signal resource constraints on the master node or overly frequent cluster state updates.
