Elasticsearch cluster status is red (unassigned primary shards)

Brief Explanation

A red cluster status means that at least one primary shard in the cluster is unassigned. This is a critical condition that requires immediate attention, because the data in those shards cannot be searched or indexed until they are assigned again.

Impact

This error has a significant impact on the Elasticsearch cluster:

Data unavailability: The indices with unassigned primary shards are partially or completely inaccessible.
Search and indexing operations: Indexing requests routed to the affected shards fail, and searches that touch them either fail or, by default, return partial results that omit those shards.
Cluster health: The overall health of the cluster is compromised, potentially affecting other operations and services relying on Elasticsearch.

Common Causes

Node failure: One or more nodes in the cluster have gone offline.
Disk space issues: Nodes have run out of disk space or exceeded the disk watermarks, so shards cannot be allocated to them.
Network problems: Network partitions or connectivity issues between nodes.
Configuration errors: Incorrect settings in elasticsearch.yml or other configuration files.
Hardware failures: Disk or other hardware component failures.

Troubleshooting and Resolution Steps

Check cluster health:
```
GET /_cluster/health
```
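As a rough illustration of how to read the response, the sketch below turns a parsed `_cluster/health` body into a one-line summary. The function name `summarize_health` and the sample values are made up for this example; the field names (`status`, `unassigned_shards`, `active_shards_percent_as_number`) are the ones the health API returns.

```python
def summarize_health(health: dict) -> str:
    """Turn a parsed /_cluster/health response into a one-line summary."""
    status = health["status"]
    unassigned = health.get("unassigned_shards", 0)
    active_pct = health.get("active_shards_percent_as_number", 0.0)
    if status == "red":
        verdict = "at least one primary shard is unassigned"
    elif status == "yellow":
        verdict = "all primaries assigned, some replicas are not"
    else:
        verdict = "all shards assigned"
    return f"{status}: {verdict} ({unassigned} unassigned, {active_pct:.1f}% active)"


# Sample payload shaped like a real response (values are invented).
sample = {
    "cluster_name": "demo",
    "status": "red",
    "number_of_nodes": 2,
    "unassigned_shards": 4,
    "active_shards_percent_as_number": 80.0,
}
print(summarize_health(sample))
```

A red status together with a non-zero `unassigned_shards` count is the signal to continue with the steps below.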

Identify unassigned shards:

```
GET /_cat/shards?v&h=index,shard,prirep,state,unassigned.reason
```
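On a large cluster the shard listing is long, so it can help to filter it. The sketch below assumes the same request was made with `format=json` added, which returns one object per shard with string values. It keeps only the unassigned primaries and builds a request body for the cluster allocation explain API (`GET /_cluster/allocation/explain`), which reports why a given shard cannot be allocated. The helper names and sample rows are hypothetical.

```python
def unassigned_primaries(cat_shards_rows):
    """Keep only rows for primary shards that are currently unassigned."""
    return [
        row for row in cat_shards_rows
        if row.get("prirep") == "p" and row.get("state") == "UNASSIGNED"
    ]


def explain_request_body(row):
    """Build the JSON body for GET /_cluster/allocation/explain for one shard."""
    return {"index": row["index"], "shard": int(row["shard"]), "primary": True}


# Invented rows in the shape returned by /_cat/shards?format=json
sample_rows = [
    {"index": "logs-1", "shard": "0", "prirep": "p", "state": "UNASSIGNED",
     "unassigned.reason": "NODE_LEFT"},
    {"index": "logs-1", "shard": "0", "prirep": "r", "state": "UNASSIGNED",
     "unassigned.reason": "NODE_LEFT"},
    {"index": "logs-1", "shard": "1", "prirep": "p", "state": "STARTED",
     "unassigned.reason": None},
]

for row in unassigned_primaries(sample_rows):
    print(row["index"], row["shard"], row["unassigned.reason"], explain_request_body(row))
```

Only the primaries matter for the red status; unassigned replicas alone would leave the cluster yellow.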

Investigate node status:
```
GET /_cat/nodes?v
```
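If you keep a list of the nodes that should be in the cluster, comparing it with the `_cat/nodes` output makes a missing node obvious. This is a minimal sketch, assuming the request was made with `format=json` so that each row carries a `name` field; the expected node names and the helper `missing_nodes` are invented for illustration.

```python
def missing_nodes(expected_names, cat_nodes_rows):
    """Return the expected node names that do not appear in /_cat/nodes."""
    present = {row["name"] for row in cat_nodes_rows}
    return sorted(set(expected_names) - present)


# Invented inventory and response rows.
expected = ["es-data-1", "es-data-2", "es-data-3"]
rows = [
    {"name": "es-data-1", "node.role": "dm", "master": "*"},
    {"name": "es-data-3", "node.role": "dm", "master": "-"},
]
print(missing_nodes(expected, rows))
```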

Check disk space on all nodes:
```
GET /_cat/allocation?v
```
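To read the allocation output against the disk watermarks, a sketch like the following can help. It assumes the request was made with `format=json` and that the cluster uses the default watermark values (low 85%, high 90%, flood stage 95%); if `cluster.routing.allocation.disk.watermark.*` has been changed on your cluster, adjust the constants. The helper names and sample rows are hypothetical.

```python
# Default disk watermarks, in percent of disk used (assumed unchanged).
LOW, HIGH, FLOOD_STAGE = 85.0, 90.0, 95.0


def watermark_level(percent_used):
    """Name the highest default watermark that the usage has reached, if any."""
    if percent_used >= FLOOD_STAGE:
        return "flood_stage"
    if percent_used >= HIGH:
        return "high"
    if percent_used >= LOW:
        return "low"
    return None


def disk_pressure(cat_allocation_rows):
    """Map node name -> watermark level for nodes at or above a watermark."""
    findings = {}
    for row in cat_allocation_rows:
        pct = row.get("disk.percent")
        if pct is None:  # the UNASSIGNED summary row carries no disk stats
            continue
        level = watermark_level(float(pct))
        if level:
            findings[row["node"]] = level
    return findings


# Invented rows in the shape returned by /_cat/allocation?format=json
alloc_rows = [
    {"node": "node-1", "shards": "12", "disk.percent": "96"},
    {"node": "node-2", "shards": "30", "disk.percent": "40"},
    {"node": "node-3", "shards": "25", "disk.percent": "87"},
    {"node": "UNASSIGNED", "shards": "4", "disk.percent": None},
]
print(disk_pressure(alloc_rows))
```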

Review Elasticsearch logs for error messages.
If nodes are offline, investigate and restart them if necessary.
If disk space is the issue, free up space or add more storage.
For network issues, check network connectivity between nodes.
Review and correct any configuration errors in elasticsearch.yml.
If hardware failure is suspected, replace faulty components.
Once the underlying issues are resolved, ask the cluster to retry shards whose allocation failed too many times:
```
POST /_cluster/reroute?retry_failed=true
```
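If you script this step, the retry can be followed by a health check that waits for the status to improve. The sketch below only builds the request and the URL with the standard library; it does not send anything. The base URL `http://localhost:9200`, the absence of authentication, and the helper names are assumptions. `wait_for_status` and `timeout` are parameters of the cluster health API.

```python
import urllib.request

BASE_URL = "http://localhost:9200"  # assumption: local, unsecured cluster


def retry_failed_request(base_url=BASE_URL):
    """Build (but do not send) POST /_cluster/reroute?retry_failed=true."""
    url = f"{base_url}/_cluster/reroute?retry_failed=true"
    return urllib.request.Request(url, method="POST")


def wait_for_yellow_url(base_url=BASE_URL, timeout="30s"):
    """URL for a health call that blocks until yellow or better, or times out."""
    return f"{base_url}/_cluster/health?wait_for_status=yellow&timeout={timeout}"


req = retry_failed_request()
print(req.get_method(), req.full_url)
print("GET", wait_for_yellow_url())
# To actually send: urllib.request.urlopen(req), then open the health URL.
```

Yellow is the target here because it means every primary is assigned again, even if replicas are still recovering.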

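If shards are still unassigned, check that shard allocation has not been disabled (for example, left off after a rolling restart) and re-enable it. This does not force allocation; it only lifts an allocation restriction: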

```
PUT /_cluster/settings
{
  "transient": {
    "cluster.routing.allocation.enable": "all"
  }
}
```
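Note that `cluster.routing.allocation.enable` only permits allocation. Forcing a primary onto a node is a separate, last-resort operation done through the reroute API with the `allocate_stale_primary` command (or `allocate_empty_primary`, which starts the shard empty). Both require `accept_data_loss` to be set to `true`. The sketch below only builds the request body for `POST /_cluster/reroute`; the index, shard, and node values and the helper name are placeholders, not values from this article.

```python
import json


def stale_primary_command(index, shard, node):
    """Body for POST /_cluster/reroute that promotes a stale shard copy.

    Writes that never reached this copy are lost, hence accept_data_loss.
    """
    return {
        "commands": [
            {
                "allocate_stale_primary": {
                    "index": index,
                    "shard": shard,
                    "node": node,
                    "accept_data_loss": True,
                }
            }
        ]
    }


# Placeholder values; take the real ones from the unassigned shard listing.
body = stale_primary_command("logs-1", 0, "es-data-1")
print(json.dumps(body, indent=2))
```

Prefer restoring from a snapshot or bringing the original node back whenever either is possible.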

Best Practices

Implement proper monitoring and alerting for cluster health.
Regularly check and manage disk space across all nodes.
Use multiple master-eligible nodes to improve cluster stability.
Implement a robust backup strategy to prevent data loss.
Keep Elasticsearch and its dependencies up to date.
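The advice about master-eligible nodes can be checked from the `_cat/nodes` output. In the `node.role` column, the letter `m` marks a master-eligible node. This is a minimal sketch, assuming `format=json` output; the threshold of three is a common recommendation for tolerating the loss of one master-eligible node, and the helper names and sample rows are invented.

```python
def master_eligible(cat_nodes_rows):
    """Names of nodes whose node.role string contains 'm' (master-eligible)."""
    return [row["name"] for row in cat_nodes_rows if "m" in row.get("node.role", "")]


def has_enough_masters(cat_nodes_rows, minimum=3):
    """True if at least `minimum` master-eligible nodes are present."""
    return len(master_eligible(cat_nodes_rows)) >= minimum


# Invented rows in the shape returned by /_cat/nodes?format=json
node_rows = [
    {"name": "es-master-1", "node.role": "m"},
    {"name": "es-master-2", "node.role": "m"},
    {"name": "es-data-1", "node.role": "di"},
]
print(master_eligible(node_rows), has_enough_masters(node_rows))
```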

Frequently Asked Questions

Q: How long does it take for Elasticsearch to recover from a red status?
A: Recovery time varies depending on the cause and the amount of data. It can range from minutes to hours. Addressing the root cause promptly is crucial for faster recovery.

Q: Can I still query Elasticsearch when the cluster status is red?
A: You can query indices whose primary shards are all assigned. Searches that touch an unassigned shard either fail or return partial results that omit that shard's data, so it's best to resolve the red status before performing critical operations.

Q: Will I lose data if my Elasticsearch cluster status is red?
A: Not necessarily. A red status indicates unavailability, not data loss. However, if the cause is hardware failure or corruption, data loss is possible. This underscores the importance of regular backups.

Q: How can I prevent my Elasticsearch cluster from going into a red status?
A: Implement proactive monitoring, ensure adequate resources (especially disk space), use multiple master-eligible nodes, and follow Elasticsearch best practices for configuration and maintenance.

Q: Is it safe to force shard allocation when the cluster is in red status?
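A: Treat forced allocation as a last resort. The reroute API's allocate_stale_primary command can lose any writes that the chosen shard copy never received, and allocate_empty_primary discards the shard's data entirely; both require you to set accept_data_loss explicitly. Understand the root cause first, and prefer recovering the original node or restoring from a snapshot where possible.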