When disk usage on an Elasticsearch node exceeds the configured flood-stage watermark, Elasticsearch automatically blocks write operations to indices on that node so the disk cannot fill up completely. This guide explains how to recover from read-only blocks and prevent future occurrences.
Understanding Disk Watermarks
Default Watermark Thresholds
| Watermark | Default | Effect |
|---|---|---|
| Low | 85% | No new shards are allocated to the node (does not affect primaries of newly created indices) |
| High | 90% | Elasticsearch attempts to relocate shards away from the node |
| Flood Stage | 95% | A read_only_allow_delete block is applied to every index with a shard on the node |
Error Messages
When flood stage is reached:
ClusterBlockException: index [my-index] blocked by: [TOO_MANY_REQUESTS/12/disk usage exceeded flood-stage watermark...];
Or:
index read-only / allow delete (api)
Diagnosing the Issue
Check Disk Usage
GET /_cat/allocation?v&h=node,disk.percent,disk.used,disk.avail,disk.total
Check Index Block Status
GET /_settings?filter_path=*.settings.index.blocks
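To list only the indices that currently carry the block, you can filter the settings response. A minimal sketch, assuming jq is installed and the cluster is reachable on localhost:9200:
# Print the names of indices with an active read_only_allow_delete block
curl -s "localhost:9200/_all/_settings?filter_path=*.settings.index.blocks.read_only_allow_delete" \
  | jq -r 'to_entries[] | select(.value.settings.index.blocks.read_only_allow_delete == "true") | .key'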
Check Cluster Settings
GET /_cluster/settings?include_defaults=true&filter_path=*.cluster.routing.allocation.disk*
Immediate Recovery Steps
Step 1: Remove the Read-Only Block
On Elasticsearch 7.4 and later the block is released automatically once disk usage falls back below the high watermark, but you can also remove it immediately by hand. If disk usage is still above the flood stage, the block will simply be reapplied, so follow this with Step 2:
PUT /_all/_settings
{
  "index.blocks.read_only_allow_delete": null
}
For specific indices:
PUT /my-index/_settings
{
  "index.blocks.read_only_allow_delete": null
}
Step 2: Free Up Disk Space
Options to quickly free space:
Delete old indices (the sketch after these options shows how to find the largest candidates; on Elasticsearch 8.x, wildcard deletes also require action.destructive_requires_name to be set to false):
DELETE /old-index-2023-01-*
Expunge deleted documents (force merge is I/O-intensive and can temporarily consume extra disk space, so use it cautiously on a nearly full disk):
POST /my-index/_forcemerge?only_expunge_deletes=true
Temporarily reduce the replica count (this lowers resilience; restore it once space is available):
PUT /my-index/_settings
{
  "index.number_of_replicas": 0
}
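To choose deletion candidates, it helps to see which indices use the most space. A sketch using the _cat API, with the host assumed to be localhost:9200:
# List the ten largest indices with their creation dates
curl -s "localhost:9200/_cat/indices?v&h=index,store.size,creation.date.string&s=store.size:desc" | head -n 11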
Step 3: Temporarily Adjust Watermarks
If you can't immediately free space:
PUT /_cluster/settings
{
  "transient": {
    "cluster.routing.allocation.disk.watermark.low": "90%",
    "cluster.routing.allocation.disk.watermark.high": "95%",
    "cluster.routing.allocation.disk.watermark.flood_stage": "98%"
  }
}
Warning: this is a stopgap, not a fix. Nodes running nearly full risk hard failures, so plan to add storage or delete data. Note also that transient cluster settings have been deprecated since 7.16; for anything longer-lived, use persistent settings.
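Once enough space has been freed, reset the overrides to null so the defaults (or your persistent values) apply again:
# Reset the temporary watermark overrides
curl -X PUT "localhost:9200/_cluster/settings" -H 'Content-Type: application/json' -d'
{
  "transient": {
    "cluster.routing.allocation.disk.watermark.low": null,
    "cluster.routing.allocation.disk.watermark.high": null,
    "cluster.routing.allocation.disk.watermark.flood_stage": null
  }
}
'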
Step 4: Verify Recovery
GET /_cluster/health
GET /_cat/allocation?v
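To confirm writes are accepted again, index a throwaway document; the index name here is illustrative, and a 201 Created response confirms the block is gone:
# Index a test document; expect HTTP 201
curl -X POST "localhost:9200/my-index/_doc" -H 'Content-Type: application/json' -d'
{"message": "write check after watermark recovery"}
'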
Long-Term Solutions
Solution 1: Add More Disk Space
The most sustainable solution:
- Expand existing volumes
- Add new data nodes
- Use larger instance types (cloud)
Solution 2: Implement Index Lifecycle Management
PUT _ilm/policy/disk_management
{
  "policy": {
    "phases": {
      "hot": {
        "actions": {
          "rollover": {
            "max_size": "50gb",
            "max_age": "7d"
          }
        }
      },
      "warm": {
        "min_age": "30d",
        "actions": {
          "shrink": {
            "number_of_shards": 1
          },
          "forcemerge": {
            "max_num_segments": 1
          }
        }
      },
      "delete": {
        "min_age": "90d",
        "actions": {
          "delete": {}
        }
      }
    }
  }
}
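The policy only takes effect once it is attached to indices, and rollover needs a data stream or a write alias. A minimal sketch using an index template plus a bootstrap index; the template, alias, and pattern names are illustrative:
# Attach the ILM policy to new indices matching logs-*
curl -X PUT "localhost:9200/_index_template/logs_template" -H 'Content-Type: application/json' -d'
{
  "index_patterns": ["logs-*"],
  "template": {
    "settings": {
      "index.lifecycle.name": "disk_management",
      "index.lifecycle.rollover_alias": "logs"
    }
  }
}
'

# Bootstrap the first index with the write alias that rollover will advance
curl -X PUT "localhost:9200/logs-000001" -H 'Content-Type: application/json' -d'
{
  "aliases": { "logs": { "is_write_index": true } }
}
'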
Solution 3: Configure Proper Watermarks
The percentage values below are the defaults; set them explicitly if they have been overridden, and tune the frozen-tier thresholds if your cluster has frozen nodes:
PUT /_cluster/settings
{
  "persistent": {
    "cluster.routing.allocation.disk.watermark.low": "85%",
    "cluster.routing.allocation.disk.watermark.high": "90%",
    "cluster.routing.allocation.disk.watermark.flood_stage": "95%",
    "cluster.routing.allocation.disk.watermark.flood_stage.frozen": "97%",
    "cluster.routing.allocation.disk.watermark.flood_stage.frozen.max_headroom": "20GB"
  }
}
Solution 4: Use Absolute Values for Large Disks
On very large disks, percentage watermarks can leave hundreds of gigabytes idle. Absolute values specify the minimum free space that must remain (so the low watermark gets the largest value), and you cannot mix percentages and byte values across the three settings:
PUT /_cluster/settings
{
  "persistent": {
    "cluster.routing.allocation.disk.watermark.low": "500gb",
    "cluster.routing.allocation.disk.watermark.high": "200gb",
    "cluster.routing.allocation.disk.watermark.flood_stage": "100gb"
  }
}
Solution 5: Automatic Read-Only Removal (ES 7.4+)
Since Elasticsearch 7.4 this requires no configuration: the read_only_allow_delete block is removed automatically as soon as disk usage drops back below the high watermark. On earlier versions the block persists until it is removed manually, as shown in Step 1.
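To check which behavior your cluster has, read the version from the root endpoint (jq assumed for brevity):
# 7.4.0 or later removes the flood-stage block automatically
curl -s "localhost:9200/" | jq -r '.version.number'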
Preventing Future Issues
Set Up Monitoring
Alert when disk usage approaches the watermark thresholds (a minimal polling sketch follows the list below):
GET /_cat/allocation?v&h=node,disk.percent
Suggested alert thresholds, kept safely below the default watermarks:
- Warning: 75%
- Critical: 80%
- Emergency: 85%
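A minimal polling sketch for the warning tier, assuming bash and curl against localhost:9200; wire the echo into whatever alerting channel you use:
#!/bin/bash
# Warn when any node reaches the warning threshold
THRESHOLD=75

curl -s "localhost:9200/_cat/allocation?h=node,disk.percent" | while read -r node pct; do
  # Skip the UNASSIGNED row and nodes that report no disk stats
  [ -z "$pct" ] && continue
  if [ "$pct" -ge "$THRESHOLD" ]; then
    echo "WARNING: node $node is at ${pct}% disk usage"
  fi
done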
Implement Data Retention
- Use ILM for automatic index deletion
- Review retention policies regularly
- Archive old data to cheaper storage (see the snapshot sketch below)
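A sketch of the archiving step, assuming a snapshot repository named archive has already been registered (repository setup is not shown); once the snapshot succeeds, the local indices can be deleted to reclaim disk:
# Snapshot old indices into the pre-registered "archive" repository
curl -X PUT "localhost:9200/_snapshot/archive/logs-2023-01?wait_for_completion=false" \
  -H 'Content-Type: application/json' -d'
{
  "indices": "old-index-2023-01-*",
  "include_global_state": false
}
'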
Capacity Planning
- Track disk growth rate (the cron sketch after this list is one lightweight way)
- Plan for data growth + headroom
- Schedule regular capacity reviews
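Even a daily cron job that records disk usage gives you a trend line to plan against. A minimal sketch; the log path is illustrative:
# Append a timestamped disk-usage snapshot (run daily via cron)
echo "$(date -u +%Y-%m-%dT%H:%M:%SZ) $(curl -s 'localhost:9200/_cat/allocation?h=node,disk.percent' | tr '\n' ' ')" \
  >> /var/log/es-disk-trend.log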
Quick Recovery Script
#!/bin/bash
# recover-from-readonly.sh
# Emergency recovery from a disk-watermark read-only block.
# Override ES_HOST in the environment if the cluster is not local.
ES_HOST="${ES_HOST:-localhost:9200}"

# Remove read-only blocks from all indices
curl -X PUT "$ES_HOST/_all/_settings" -H 'Content-Type: application/json' -d'
{
  "index.blocks.read_only_allow_delete": null
}
'

# Check cluster health
curl -X GET "$ES_HOST/_cluster/health?pretty"

# Check disk allocation
curl -X GET "$ES_HOST/_cat/allocation?v"
Understanding Watermark Behavior
Allocation Decisions
- Below low watermark: Normal operation
- Between low and high: New shards not allocated to node
- Between high and flood: Shards relocated away
- Above flood stage: Indices with shards on the node become read-only
Disk Usage Check Interval
Elasticsearch polls each node's disk usage at a regular interval (default 30s), so blocks are applied and released with a short delay. The interval is controlled by cluster.info.update.interval:
PUT /_cluster/settings
{
  "persistent": {
    "cluster.info.update.interval": "30s"
  }
}
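To see the effective value, including the default when it has never been set explicitly:
# Show the current disk-usage polling interval
curl -s "localhost:9200/_cluster/settings?include_defaults=true&filter_path=*.cluster.info.update.interval"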
Single-Node Clusters
Single-data-node clusters behave differently: older 7.x releases ignored the disk watermarks on them by default, and the setting below opts in to enforcing them. From 8.0 onward watermarks are always enforced and this setting is deprecated:
PUT /_cluster/settings
{
  "persistent": {
    "cluster.routing.allocation.disk.watermark.enable_for_single_data_node": true
  }
}
Troubleshooting Checklist
When encountering read-only errors:
- Check current disk usage: GET /_cat/allocation?v
- Identify which indices are blocked
- Remove read-only block
- Free up disk space or add capacity
- Verify writes are working
- Adjust watermarks if needed
- Set up monitoring to prevent recurrence
- Review ILM policies