Elasticsearch Disk I/O Bottleneck Troubleshooting

Disk I/O bottlenecks significantly impact Elasticsearch performance, causing slow indexing, delayed searches, and overall cluster degradation. This guide helps you identify and resolve disk-related performance issues.

Understanding Elasticsearch Disk I/O

I/O-Intensive Operations

Elasticsearch performs frequent disk operations:

  • Indexing: Writing new documents and segments
  • Merging: Combining segments
  • Searching: Reading segments (when not cached)
  • Recovery: Copying shard data between nodes
  • Translog: Writing transaction logs
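
A rough per-index view of how much work these operations are doing is available from the index stats API (my-index is a placeholder):

GET /my-index/_stats/indexing,merge,refresh,flush

The response includes counters such as indexing.index_time_in_millis, merges.total_size_in_bytes, refresh.total, and flush.total, which help attribute disk activity to a specific operation type.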

Optimal Storage Requirements

Storage Type | Performance | Recommendation
NVMe SSD | Excellent | Best choice for all workloads
SATA SSD | Good | Suitable for most workloads
HDD | Poor | Only for cold/frozen tiers
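
To confirm what kind of device actually backs a data node, lsblk exposes a rotational flag per disk (the device list depends on the host):

# ROTA = 1 means a rotational (spinning) disk, 0 means SSD
lsblk -d -o NAME,ROTA,TYPE,SIZE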

Identifying Disk I/O Bottlenecks

System-Level Monitoring

# Check disk utilization
iostat -x 1 10

# Key metrics to watch:
# - %util: percentage of time the device is busy (>80% is concerning)
# - await (r_await/w_await): average time in ms an I/O request waits and is serviced
# - r/s, w/s: read and write operations per second (IOPS)

Node Statistics

GET /_nodes/stats/fs

Look for:

  • total.available_in_bytes: Free disk space
  • io_stats.total.operations: I/O operation count
  • io_stats.total.read_kilobytes / write_kilobytes: Throughput
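
The full fs payload is verbose; filter_path (a standard response-filtering option) narrows it to the fields above. Note that io_stats is only reported on Linux. The exact field list here is just an example:

GET /_nodes/stats/fs?filter_path=nodes.*.name,nodes.*.fs.total,nodes.*.fs.io_stats.total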

Thread Pool Queue

GET /_cat/thread_pool?v&h=node_name,name,active,queue,rejected&s=queue:desc

High generic or write queues may indicate I/O waits.

Hot Threads Analysis

GET /_nodes/hot_threads?type=wait

Wait-type hot threads can reveal I/O blocking.

Common Causes and Solutions

Cause 1: Using HDDs Instead of SSDs

Symptoms:

  • High disk utilization (>80%)
  • Long await times
  • Slow indexing despite low CPU/memory

Solution: Migrate to SSDs. This single change often has the biggest impact:

  • 10-100x improvement in random I/O
  • Significantly lower latency
  • Better handling of concurrent operations
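
Before and after a migration, a quick latency spot check on the data directory helps quantify the difference; ioping may need to be installed separately, and the path below is an assumption:

# ten small random-read probes against the Elasticsearch data directory
ioping -c 10 /var/lib/elasticsearch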

Cause 2: Merge Operations Overwhelming Disk

Symptoms:

  • High I/O during and after bulk indexing
  • Spikes in disk utilization
  • Hot threads showing merge operations

Diagnosis:

GET /_nodes/stats/indices/merges
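
To check whether merges are running on a specific index right now, the same statistics can be narrowed down (index name and filter are illustrative):

GET /my-index/_stats/merge?filter_path=indices.*.primaries.merges.current,indices.*.primaries.merges.current_size_in_bytes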

Solutions:

Limit the merge thread count; for HDDs in particular, a single merge thread is recommended:

PUT /my-index/_settings
{
  "index.merge.scheduler.max_thread_count": 1
}

On SSDs the default (derived from the CPU count) is usually sufficient, so only lower it if merges are saturating the disk.

Cause 3: Too-Frequent Refreshes

Symptoms:

  • Constant I/O even with low indexing rate
  • Many small segments

Diagnosis:

GET /_cat/segments/my-index?v

Solution: Increase refresh interval for write-heavy indices:

PUT /my-index/_settings
{
  "index.refresh_interval": "30s"
}
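
For one-off bulk loads, automatic refresh can be disabled entirely by setting the interval to -1:

PUT /my-index/_settings
{
  "index.refresh_interval": "-1"
}

Once the load completes, set index.refresh_interval back to a finite value (for example 30s) so new documents become searchable again.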

Cause 4: Translog Flush Too Frequent

Symptoms:

  • High disk writes
  • Sync operations in I/O stats

Diagnosis:

GET /my-index/_settings?filter_path=*.settings.index.translog

Solution: Adjust translog settings for write-heavy workloads:

PUT /my-index/_settings
{
  "index.translog.durability": "async",
  "index.translog.sync_interval": "30s"
}

Warning: with async durability, acknowledged writes from the last sync_interval can be lost if a node crashes.

Cause 5: Insufficient Filesystem Cache

Symptoms:

  • High read I/O on hot data
  • Cache misses during searches

Solution:

  • Ensure heap is ≤50% of RAM
  • Leave remaining memory for OS filesystem cache
  • If heap is 16 GB, ensure at least 16 GB free for cache
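
A quick way to compare heap size with total RAM per node (column names follow the _cat/nodes API):

GET /_cat/nodes?v&h=name,heap.max,ram.max,heap.percent,ram.percent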

Cause 6: Recovery Operations

Symptoms:

  • High I/O during node restarts or shard allocation
  • Cluster recovery taking too long

Solution: Limit recovery bandwidth:

PUT /_cluster/settings
{
  "persistent": {
    "indices.recovery.max_bytes_per_sec": "50mb"
  }
}
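
While tuning this limit, active recoveries and their progress can be watched with:

GET /_cat/recovery?v&active_only=true&h=index,shard,stage,source_node,target_node,bytes_percent,time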

Cause 7: Small Disk or RAID Misconfiguration

Symptoms:

  • I/O bottleneck despite using SSDs
  • Lower than expected throughput

Solutions:

  • Use RAID 0 or no RAID (Elasticsearch handles replication)
  • Ensure proper disk alignment
  • Use multiple data paths for parallel I/O (note: multiple data paths are deprecated in recent Elasticsearch versions, so prefer striping at the OS/RAID level where possible):
# elasticsearch.yml
path.data:
  - /mnt/disk1/elasticsearch
  - /mnt/disk2/elasticsearch

Optimizing Disk I/O

Index Settings for I/O Optimization

PUT /optimized-index
{
  "settings": {
    "index.refresh_interval": "30s",
    "index.translog.durability": "async",
    "index.translog.sync_interval": "30s",
    "index.merge.scheduler.max_thread_count": 2,
    "index.codec": "best_compression"
  }
}

The best_compression codec trades extra CPU at index and merge time for smaller segments on disk.

Cluster Settings

PUT /_cluster/settings
{
  "persistent": {
    "indices.recovery.max_bytes_per_sec": "100mb",
    "indices.recovery.max_concurrent_file_chunks": 2
  }
}
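
The effective values, including defaults, can be verified with (filter_path here is just a convenience):

GET /_cluster/settings?include_defaults=true&filter_path=*.indices.recovery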

Storage Best Practices

  1. Use SSDs (NVMe preferred)
  2. Avoid shared storage (NFS, network drives)
  3. Enable TRIM for SSDs (see the commands after this list)
  4. Use XFS or ext4 filesystems
  5. Disable atime in mount options:
# /etc/fstab
/dev/sda1 /data ext4 defaults,noatime 0 0
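
For the TRIM item above, periodic trimming is a common alternative to mounting with the continuous discard option; fstrim ships with util-linux, and the mount point and systemd timer are assumptions about the host setup:

# one-off trim of the data mount
sudo fstrim -v /data

# enable weekly trims on systemd-based distributions
sudo systemctl enable --now fstrim.timer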

Monitoring I/O Performance

Key Metrics to Track

Metric | Warning | Critical
Disk utilization | > 70% | > 85%
I/O await | > 20 ms | > 50 ms
Read/write latency | > 10 ms | > 50 ms
IOPS saturation | > 80% | > 95%

Elasticsearch Monitoring

GET /_cat/nodes?v&h=name,disk.used_percent,disk.avail,disk.total
GET /_nodes/stats/fs

System Monitoring Commands

# Real-time I/O monitoring
iotop -o

# Disk performance test
fio --name=randread --ioengine=libaio --iodepth=32 --rw=randread --bs=4k --direct=1 --size=1G --numjobs=4 --runtime=60

Troubleshooting Checklist

  • Storage type verified (SSD recommended)
  • Disk utilization < 80%
  • Heap ≤ 50% of RAM (leaving cache space)
  • Refresh interval appropriate for workload
  • Merge threads limited (especially for HDD)
  • Recovery bandwidth limited
  • No swap, or vm.swappiness = 1 (see the commands after this checklist)
  • Filesystem mounted with noatime
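
For the swap item in the checklist, the usual combination is disabling swap (or lowering swappiness) at the OS level and locking the heap with bootstrap.memory_lock; the sysctl file path is an assumption and may vary by distribution:

# disable swap immediately
sudo swapoff -a

# or keep swap but tell the kernel to avoid it
sudo sysctl -w vm.swappiness=1
echo 'vm.swappiness=1' | sudo tee /etc/sysctl.d/99-elasticsearch.conf

# elasticsearch.yml
bootstrap.memory_lock: true

Note that bootstrap.memory_lock also requires the memlock ulimit to be raised for the Elasticsearch user.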