Checking Elasticsearch cluster health is crucial for maintaining a robust and efficient search infrastructure. You should perform health checks:
- Regularly as part of routine maintenance
- When experiencing performance issues
- After making configuration changes
- Before and after scaling operations
- During troubleshooting processes
What Constitutes Good Cluster Health
A healthy Elasticsearch cluster exhibits several key characteristics that work together to ensure reliable operation:
Cluster Status
The most visible health indicator is the cluster status color:
Green Status - Optimal health where:
- All primary shards are allocated and active
- All replica shards are allocated and active
- Data is fully redundant across the cluster
- Search and indexing operations perform optimally
Yellow Status - Functional but vulnerable:
- All primary shards are allocated
- Some replica shards are unallocated
- Data is accessible but lacks full redundancy
- Risk of data loss if a node fails
- Performance may be impacted during node failures
Red Status - Critical issues:
- One or more primary shards are unallocated
- Data is incomplete or inaccessible
- Search queries may return partial results
- Indexing to affected indices may fail
- Immediate attention required
Key Health Metrics
Beyond the status color, several metrics indicate cluster health:
Shard Allocation
- All shards should be assigned to nodes
- Even distribution across nodes prevents hotspots
- Appropriate replica configuration for fault tolerance
Node Availability
- All expected nodes are present and responsive
- No nodes in the process of leaving or joining
- Stable cluster membership over time
Resource Utilization
- CPU usage below 80% on all nodes
- Heap memory usage below 75% consistently
- Disk space with at least 15-20% free, keeping usage below the high disk watermark (90% by default)
- JVM garbage collection pauses under 1 second
Performance Indicators
- Query latency within acceptable thresholds
- Indexing throughput meeting requirements
- No persistent thread pool rejections
- Search queue sizes remain manageable
Cluster Operations
- No ongoing shard relocations or recoveries (unless planned)
- No pending tasks accumulating in the cluster state
- Snapshot and restore operations completing successfully
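These thresholds are straightforward to check programmatically. The sketch below (Python with the requests library; the cluster URL and exact conditions are assumptions to adapt to your environment) polls the Cluster Health API covered in the next section and flags the warning signs listed above:

import requests

ES_URL = "http://localhost:9200"  # assumption: adjust the URL and add auth for your cluster

def cluster_warning_signs():
    """Poll /_cluster/health and return a list of warning messages."""
    health = requests.get(f"{ES_URL}/_cluster/health", timeout=10).json()
    warnings = []
    if health["status"] != "green":
        warnings.append(f"cluster status is {health['status']}")
    if health["unassigned_shards"] > 0:
        warnings.append(f"{health['unassigned_shards']} unassigned shards")
    if health["relocating_shards"] > 0:
        warnings.append(f"{health['relocating_shards']} shards relocating (check whether this is planned)")
    if health["number_of_pending_tasks"] > 0:
        warnings.append(f"{health['number_of_pending_tasks']} pending cluster state tasks")
    return warnings

for message in cluster_warning_signs():
    print("WARNING:", message)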
Standard APIs for Cluster Health Monitoring
Elasticsearch provides several built-in APIs to assess and monitor cluster health:
Cluster Health API
The primary API for checking overall cluster status:
Basic Health Check
GET /_cluster/health
Response includes:
- status: green, yellow, or red
- number_of_nodes: Total nodes in the cluster
- active_primary_shards: Count of active primary shards
- active_shards: Total active shards
- relocating_shards: Shards currently moving between nodes
- initializing_shards: Shards being initialized
- unassigned_shards: Shards not yet allocated
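An abridged, representative response (cluster name and counts are illustrative):

{
  "cluster_name": "my-cluster",
  "status": "yellow",
  "timed_out": false,
  "number_of_nodes": 3,
  "number_of_data_nodes": 3,
  "active_primary_shards": 10,
  "active_shards": 15,
  "relocating_shards": 0,
  "initializing_shards": 0,
  "unassigned_shards": 5,
  "number_of_pending_tasks": 0,
  "active_shards_percent_as_number": 75.0
}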
Index-Level Health
GET /_cluster/health?level=indices
Shows health status for each index individually, helping identify which indices have issues.
Shard-Level Health
GET /_cluster/health?level=shards
Provides detailed shard-by-shard health information for deep troubleshooting.
Wait for Status
GET /_cluster/health?wait_for_status=green&timeout=30s
Blocks until cluster reaches specified status or timeout expires, useful for automation and deployment scripts.
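A deployment script can gate on this call. Here is a minimal Python sketch (the cluster URL is an assumption; the API sets timed_out to true in the response if the status is not reached in time):

import sys
import requests

ES_URL = "http://localhost:9200"  # assumption: adjust for your environment

# Wait up to 30 seconds for the cluster to reach green before deploying.
health = requests.get(
    f"{ES_URL}/_cluster/health",
    params={"wait_for_status": "green", "timeout": "30s"},
).json()
if health.get("timed_out"):
    sys.exit(f"Cluster did not reach green in time (status: {health['status']})")
print("Cluster is green; proceeding with deployment.")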
Cat APIs for Health Monitoring
The Cat APIs provide human-readable output for quick health checks:
Cat Health
GET /_cat/health?v
Compact view of cluster health with timestamp, perfect for scripting and regular monitoring.
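Typical output looks like this (values illustrative):

epoch      timestamp cluster    status node.total node.data shards pri relo init unassign pending_tasks max_task_wait_time active_shards_percent
1700000000 12:00:00  my-cluster yellow          3         3     15  10    0    0        5             0                  -                 75.0%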
Cat Nodes
GET /_cat/nodes?v&h=name,heap.percent,ram.percent,cpu,load_1m,disk.used_percent
Shows resource utilization across all nodes to identify overloaded systems.
Cat Shards
GET /_cat/shards?v&h=index,shard,prirep,state,docs,store,node&s=store:desc
Lists all shards with their status, helping identify unassigned or problematic shards.
Cat Indices
GET /_cat/indices?v&health=yellow
GET /_cat/indices?v&health=red
The health parameter accepts a single status, so run one request per status to list only indices that are not green.
Cluster Stats API
Comprehensive cluster-wide statistics:
GET /_cluster/stats
Returns detailed information about:
- Node roles and versions
- Index and shard counts
- Document counts and storage size
- JVM versions and memory usage
- Plugin information
Nodes Stats API
Detailed statistics for each node:
GET /_nodes/stats
Provides extensive metrics including:
- JVM heap and garbage collection
- Thread pools and rejections
- File system and disk I/O
- HTTP and transport layer stats
- Indexing and search performance
Specific Metrics
GET /_nodes/stats/jvm,process,fs
Filter to only the metrics you need for faster responses.
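For example, a quick heap check against the JVM-only endpoint (the URL and the 75% guideline from earlier are assumptions to tune):

import requests

ES_URL = "http://localhost:9200"  # assumption: adjust for your cluster

# Fetch only JVM stats and flag nodes above the 75% heap guideline.
stats = requests.get(f"{ES_URL}/_nodes/stats/jvm", timeout=10).json()
for node in stats["nodes"].values():
    heap_pct = node["jvm"]["mem"]["heap_used_percent"]
    if heap_pct > 75:
        print(f"WARNING: node {node['name']} heap usage at {heap_pct}%")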
Task Management API
Monitor long-running operations:
GET /_tasks?detailed=true&group_by=parents
Shows currently executing tasks like:
- Ongoing searches
- Indexing operations
- Snapshot creation
- Cluster state updates
Pending Tasks API
Identify cluster state update bottlenecks:
GET /_cluster/pending_tasks
Returns tasks waiting to be processed by the master node, which can indicate cluster state update issues.
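A simple check might alert when tasks linger in the queue (a sketch; the URL and the alert condition are assumptions):

import requests

ES_URL = "http://localhost:9200"  # assumption: adjust for your cluster

tasks = requests.get(f"{ES_URL}/_cluster/pending_tasks", timeout=10).json()["tasks"]
if tasks:
    oldest_ms = max(task["time_in_queue_millis"] for task in tasks)
    print(f"WARNING: {len(tasks)} pending tasks; oldest queued for {oldest_ms}ms")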
Automated Health Monitoring with Pulse
While the standard APIs provide comprehensive data, interpreting them correctly and monitoring them continuously requires significant effort. Pulse continuously monitors your Elasticsearch and OpenSearch clusters with automated health checks that detect issues before they impact your operations. Get real-time visibility into cluster performance, resource utilization, and potential bottlenecks.
Health Assessments
Pulse provides proactive insights for optimal cluster health and performance through automated assessments:
Prevent Problems Before They Happen
Avoid costly downtime and enjoy a seamless user experience with Pulse's proactive issue identification. Rather than reacting to failures, Pulse detects emerging issues like:
- Increasing heap pressure before OutOfMemory errors
- Disk usage trends before watermark thresholds are breached
- Shard allocation imbalances before performance degradation
- Query performance degradation patterns
Customized Health Metrics
Get clear, actionable insights tailored to your specific cluster setup and performance goals. Pulse understands:
- Your cluster topology and configuration
- Expected workload patterns
- Custom index settings and mappings
- Application-specific performance requirements
Scale with Confidence
Scale smoothly and keep your clusters healthy and up-to-date with daily checks and proactive monitoring. Pulse helps you:
- Identify capacity constraints before they become critical
- Understand resource utilization trends
- Plan scaling operations with data-driven insights
- Monitor cluster health during and after scaling events
Unlike manual monitoring with standard APIs, Pulse provides continuous automated assessments, intelligent alerting, and historical trend analysis to help you maintain optimal cluster health effortlessly.
Best Practices and Additional Information
- Set up alerts for status changes, especially transitions to yellow or red
- Regularly review cluster settings and shard allocation
- Monitor node performance and resource utilization
- Keep Elasticsearch and plugins up to date
- Implement proper backup and recovery strategies
Frequently Asked Questions
Q: How often should I check my Elasticsearch cluster health?
A: It's recommended to set up continuous monitoring with alerts. However, manual checks should be performed at least daily, and more frequently during peak usage periods or after significant changes. Automated monitoring tools like Pulse can provide continuous health assessments without manual intervention.
Q: What does a yellow status mean, and is it a cause for concern?
A: A yellow status indicates that all primary shards are allocated, but some replica shards are not. While not as critical as a red status, it should be investigated promptly to ensure data redundancy and optimal performance. A prolonged yellow status leaves your cluster vulnerable to data loss if a node fails.
Q: Can cluster health impact search performance?
A: Yes, poor cluster health can significantly impact search performance. A red status, in particular, can lead to incomplete search results and increased query times. Even yellow status can affect performance during node failures, as the cluster lacks full redundancy to handle failovers smoothly.
Q: How can I improve my cluster's health from yellow to green?
A: To improve from yellow to green, ensure that there are enough nodes to allocate all replica shards, check for any shard allocation issues using the Allocation Explain API, and verify that there's sufficient disk space on all nodes. Also check cluster settings that might be preventing shard allocation.
Q: What steps should I take if my cluster status turns red?
A: If your cluster status turns red, immediately investigate which primary shards are unallocated using GET /_cat/shards?v&h=index,shard,prirep,state,unassigned.reason, check node status and logs, ensure adequate resources are available, and consider restoring from a backup if data loss has occurred. Use the Allocation Explain API to understand why shards aren't being allocated.
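For reference, the Allocation Explain API can be pointed at a specific shard (the index name below is illustrative):

GET /_cluster/allocation/explain
{
  "index": "my-index",
  "shard": 0,
  "primary": true
}

Called without a request body, it explains the first unassigned shard it finds.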
Q: What's the difference between the Cluster Health API and the Cat Health API?
A: The Cluster Health API returns detailed JSON responses with comprehensive metrics, ideal for programmatic monitoring and automation. The Cat Health API provides compact, human-readable output perfect for quick manual checks and shell scripts. Both provide the same underlying health information.
Q: How do I monitor cluster health in production environments?
A: Production clusters should have continuous monitoring through tools that aggregate metrics from the various health APIs. Set up alerts for status changes, resource thresholds, and performance degradation. Many teams use dedicated monitoring solutions that provide automated health checks, trend analysis, and proactive alerting.
Q: What resource metrics should I monitor alongside cluster health?
A: Monitor heap memory usage (should stay below 75%), CPU utilization (target below 80%), disk space (maintain at least 15-20% free), JVM garbage collection pauses (should be under 1 second), thread pool rejections (should be zero or minimal), and query/indexing latency compared to your baselines.
Q: Can a green cluster still have performance issues?
A: Yes, a green status only indicates that all shards are allocated. You can still experience performance issues due to resource constraints, inefficient queries, poor index design, or hardware limitations. Comprehensive monitoring should include performance metrics beyond just shard allocation status.
Q: How do pending tasks affect cluster health?
A: Pending tasks accumulating in the cluster state indicate that the master node is struggling to process updates. This can lead to delays in shard allocation, index creation, and cluster configuration changes. High pending task counts often signal resource constraints on the master node or overly frequent cluster state updates.