Elasticsearch High IOWait Fix

High I/O wait (iowait) occurs when the CPU is idle but waiting for disk operations to complete. In Elasticsearch, this leads to degraded performance, slow queries, and potential cluster instability. This guide provides solutions to reduce iowait.

Understanding IOWait

What is IOWait?

IOWait represents the percentage of time the CPU spends waiting for I/O operations. High iowait indicates:

  • The disk is the bottleneck
  • The CPU could do more work if the disk were faster
  • Operations are queuing for disk access

Healthy vs. Problematic Levels

IOWait %    Status      Action
< 5%        Healthy     Normal operation
5-20%       Warning     Monitor and investigate
> 20%       Critical    Immediate action needed

Diagnosing High IOWait

Check Current IOWait

# Quick check
top
# Look for "%wa" in the CPU line

# Detailed view
vmstat 1 10
# Check "wa" column

# Per-CPU breakdown
mpstat -P ALL 1 5

Identify I/O-Heavy Processes

# Show I/O by process
iotop -o

# Show disk utilization
iostat -x 1 5

Elasticsearch-Specific Checks

GET /_nodes/stats/fs
GET /_cat/thread_pool?v&h=node_name,name,active,queue&s=queue:desc
GET /_nodes/hot_threads?type=wait

Causes and Fixes

Fix 1: Upgrade to SSDs

The most impactful change for high iowait:

Before (HDD):

  • Random I/O: ~100-200 IOPS
  • Latency: 5-15ms

After (SSD):

  • Random I/O: 10,000-100,000+ IOPS
  • Latency: <1ms

# Verify disk type
cat /sys/block/sda/queue/rotational
# 1 = HDD, 0 = SSD

Fix 2: Reduce Merge Activity

Segment merging causes significant I/O:

PUT /my-index/_settings
{
  "index.merge.scheduler.max_thread_count": 1,
  "index.merge.policy.max_merged_segment": "5gb",
  "index.merge.policy.segments_per_tier": 10
}

Fix 3: Increase Refresh Interval

Refreshing less often reduces how frequently new segments are written (and later merged):

PUT /my-index/_settings
{
  "index.refresh_interval": "30s"
}

For bulk indexing, disable refresh temporarily:

PUT /my-index/_settings
{
  "index.refresh_interval": "-1"
}

// After bulk indexing
PUT /my-index/_settings
{
  "index.refresh_interval": "1s"
}

Fix 4: Optimize Translog Settings

Async translog durability cuts fsync I/O, at the cost of losing up to one sync_interval of acknowledged writes if a node crashes:
PUT /my-index/_settings
{
  "index.translog.durability": "async",
  "index.translog.sync_interval": "30s",
  "index.translog.flush_threshold_size": "1gb"
}

Fix 5: Ensure Adequate Filesystem Cache

High cache miss rate causes read I/O:

# Check cache usage
free -h

# Look at "buff/cache" column

Solution: Keep the Elasticsearch heap at or below 50% of RAM so the rest stays available to the OS filesystem cache.
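For example, on a 64 GB data node you might set a 31 GB heap and leave the remainder for the filesystem cache. A minimal sketch, assuming a package install with the jvm.options.d mechanism (Elasticsearch 7.7+); the value is an example, not a recommendation for every node:

# /etc/elasticsearch/jvm.options.d/heap.options
# 31g heap on a 64 GB node leaves roughly half the RAM for the OS page cache
-Xms31g
-Xmx31g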

Fix 6: Limit Recovery Bandwidth

During node recovery:

PUT /_cluster/settings
{
  "persistent": {
    "indices.recovery.max_bytes_per_sec": "40mb"
  }
}

Fix 7: Schedule Force Merges

Instead of continuous merging, schedule during off-peak:

POST /my-index/_forcemerge?max_num_segments=1

Run this during maintenance windows, not during peak traffic, and ideally only on indices that no longer receive writes.
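One way to keep this out of peak hours is a scheduled job that calls the API overnight. A minimal sketch, assuming Elasticsearch listens on localhost:9200 without authentication (add credentials and TLS options as needed); the cron file path and schedule are placeholders:

# /etc/cron.d/es-forcemerge - force merge every Sunday at 02:00
0 2 * * 0 root curl -s -X POST "http://localhost:9200/my-index/_forcemerge?max_num_segments=1"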

Fix 8: Use Multiple Data Paths

Distribute I/O across disks:

# elasticsearch.yml
path.data:
  - /mnt/disk1/elasticsearch
  - /mnt/disk2/elasticsearch

Fix 9: Disable Swap

Swap causes massive I/O issues:

# Disable swap
swapoff -a

# Or set swappiness very low
echo 1 > /proc/sys/vm/swappiness

Enable memory locking in Elasticsearch:

# elasticsearch.yml
bootstrap.memory_lock: true
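On systemd-managed installs, bootstrap.memory_lock only takes effect if the service is allowed to lock memory. A minimal sketch of the usual override (the path assumes a package install):

# /etc/systemd/system/elasticsearch.service.d/override.conf
[Service]
LimitMEMLOCK=infinity

After adding the override, run systemctl daemon-reload, restart Elasticsearch, and confirm that mlockall is true via GET /_nodes?filter_path=**.mlockall.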

Fix 10: Filesystem Optimization

# Mount with optimized options
# /etc/fstab
/dev/sda1 /data ext4 defaults,noatime,nodiratime 0 0

# For SSDs, ensure TRIM is enabled
fstrim -v /data
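fstrim -v is a one-off pass; most systemd-based distributions also ship a periodic TRIM timer you can enable instead of trimming manually (availability varies by distribution):

# Enable weekly TRIM on systemd-based systems
systemctl enable --now fstrim.timer

# Confirm the timer is scheduled
systemctl list-timers fstrim.timer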

Index-Level Optimization

Write-Heavy Indices

PUT /logs-write-heavy
{
  "settings": {
    "index.refresh_interval": "60s",
    "index.translog.durability": "async",
    "index.translog.sync_interval": "60s",
    "index.merge.scheduler.max_thread_count": 1,
    "index.number_of_replicas": 0
  }
}

Note: Increase replicas after bulk loading.
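For example, once the bulk load finishes, restore the replica count (replica recovery itself generates I/O, so do this off-peak where possible):

PUT /logs-write-heavy/_settings
{
  "index.number_of_replicas": 1
}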

Read-Heavy Indices

PUT /search-index
{
  "settings": {
    "index.store.type": "hybridfs",
    "index.queries.cache.enabled": true
  }
}

Monitoring IOWait

Set Up Alerts

Alert when:

  • IOWait > 15% for 5+ minutes
  • Disk utilization > 80%
  • I/O latency > 20ms average
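If you don't have a monitoring stack in place yet, even a simple threshold check covers the first condition. A minimal sketch; the script name, threshold, and alert hook are placeholders for your own tooling:

#!/bin/bash
# check_iowait.sh - warn when iowait exceeds a threshold
THRESHOLD=15
# "wa" is the 16th field of vmstat's last sample line
WA=$(vmstat 1 2 | tail -1 | awk '{print $16}')
if [ "$WA" -gt "$THRESHOLD" ]; then
  echo "$(date): iowait at ${WA}% exceeds ${THRESHOLD}%"  # replace with your alerting hook
fi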

Continuous Monitoring

# Log iowait every minute
while true; do
  echo "$(date): $(vmstat 1 2 | tail -1 | awk '{print $16}')" >> /var/log/iowait.log
  sleep 60
done

Elasticsearch Monitoring

GET /_cat/nodes?v&h=name,disk.used_percent,load_1m,cpu

Prevention Checklist

  • SSDs in use (NVMe preferred)
  • No swap or swap disabled
  • Heap ≤ 50% of RAM
  • Filesystem mounted with noatime
  • Merge threads limited
  • Refresh interval appropriate
  • Recovery bandwidth limited
  • Monitoring and alerting configured

Quick Fixes During High IOWait

If experiencing high iowait right now:

// 1. Reduce indexing rate (inform clients)

// 2. Disable refresh temporarily
PUT /*/_settings
{
  "index.refresh_interval": "-1"
}

// 3. Stop non-essential recoveries
PUT /_cluster/settings
{
  "transient": {
    "cluster.routing.allocation.enable": "none"
  }
}

// 4. After stabilization, re-enable
PUT /*/_settings
{
  "index.refresh_interval": "30s"
}

PUT /_cluster/settings
{
  "transient": {
    "cluster.routing.allocation.enable": "all"
  }
}