Shard imbalance occurs when shards are unevenly distributed across nodes, leading to some nodes being overloaded while others remain underutilized. This imbalance degrades overall cluster performance and can cause cascading failures.
Identifying Shard Imbalance
Check Shard Distribution
GET /_cat/allocation?v&s=shards:desc
This shows:
- Number of shards per node
- Disk space used per node
- Disk used by index data (disk.indices) and overall disk usage percentage (disk.percent)
Compare Node Resources
GET /_cat/nodes?v&h=name,cpu,heap.percent,disk.used_percent,shards
Look for nodes with significantly higher values.
Check Shard Sizes
GET /_cat/shards?v&s=store:desc
Identify oversized shards or uneven shard sizes.
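To make outliers easier to spot, you can trim the output to just the relevant columns (all standard _cat/shards column names):
GET /_cat/shards?v&h=index,shard,prirep,store,node&s=store:desc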
Types of Shard Imbalance
1. Count Imbalance
Some nodes have significantly more shards than others.
Cause: Allocation settings, node attributes, or failed rebalancing.
Solution:
// Trigger rebalancing
PUT /_cluster/settings
{
"transient": {
"cluster.routing.rebalance.enable": "all",
"cluster.routing.allocation.cluster_concurrent_rebalance": 5
}
}
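Rebalancing runs in the background; a quick way to watch its progress is the relocating-shard count in _cat/health, or the list of active recoveries (shard relocations show up there too):
GET /_cat/health?v&h=status,relo,init,unassign
GET /_cat/recovery?v&active_only=true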
2. Size Imbalance
Some nodes store much more data despite similar shard counts.
Cause: Uneven shard sizes, hot indices routed to specific nodes.
Solution: Enable disk-based shard allocation so the allocator respects disk watermarks:
PUT /_cluster/settings
{
  "persistent": {
    "cluster.routing.allocation.disk.threshold_enabled": true,
    "cluster.routing.allocation.disk.watermark.low": "85%",
    "cluster.routing.allocation.disk.watermark.high": "90%"
  }
}
On Elasticsearch 8.6+, the balancer additionally weighs actual disk usage through cluster.routing.allocation.balance.disk_usage; its default is byte-scaled, so leave it alone unless you have a specific reason to change it.
3. Hot Spot Imbalance
Certain shards receive disproportionate traffic.
Cause: Poor routing, time-based indices with uneven access patterns.
Solution: Implement proper index routing or use rollover indices.
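As a sketch of the rollover approach: bootstrap a write alias, then roll over on size or age so no single index (and its shards) stays hot indefinitely. The index and alias names here are hypothetical:
PUT /logs-000001
{
  "aliases": {
    "logs": { "is_write_index": true }
  }
}

POST /logs/_rollover
{
  "conditions": {
    "max_age": "7d",
    "max_primary_shard_size": "40gb"
  }
}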
Root Causes and Fixes
Cause 1: Allocation Filtering
Nodes may be excluded from allocation:
GET /_cluster/settings?include_defaults=true&filter_path=*.cluster.routing.allocation.*
Fix: Review and adjust allocation filters, clearing each one in the same section (transient or persistent) where it was originally set:
PUT /_cluster/settings
{
"transient": {
"cluster.routing.allocation.exclude._name": null
}
}
Cause 2: Awareness Attributes
Zone or rack awareness may cause imbalance:
Fix: Give each zone equal node capacity, and configure forced awareness so shards are spread evenly across the zones:
PUT /_cluster/settings
{
"persistent": {
"cluster.routing.allocation.awareness.attributes": "zone",
"cluster.routing.allocation.awareness.force.zone.values": "zone1,zone2,zone3"
}
}
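Forced awareness only balances well when zones are symmetric. Listing node attributes makes it easy to verify that each zone contains the same number of comparable nodes:
GET /_cat/nodeattrs?v&h=node,attr,value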
Cause 3: Index-Level Settings
Some indices may have routing requirements:
GET /my-index/_settings?filter_path=*.settings.index.routing
Fix: Remove or adjust index-level allocation settings:
PUT /my-index/_settings
{
"index.routing.allocation.require._name": null
}
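The same filter works across all indices at once, which helps when you don't know which index carries the requirement:
GET /_all/_settings?filter_path=*.settings.index.routing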
Cause 4: Insufficient Rebalancing
The cluster may not be rebalancing aggressively enough:
Fix: Tune the balancing weights. Note that the defaults are threshold 1.0, shard 0.45, and index 0.55, so re-applying those values changes nothing; raising the shard weight relative to the index weight (the values below are illustrative) makes per-node shard counts even out faster:
PUT /_cluster/settings
{
  "persistent": {
    "cluster.routing.allocation.balance.threshold": 1.0,
    "cluster.routing.allocation.balance.shard": 0.60,
    "cluster.routing.allocation.balance.index": 0.40
  }
}
Cause 5: Primary Shard Concentration
Primary shards concentrated on fewer nodes:
Fix: Bias the balancer toward even primary distribution. Note this setting exists in OpenSearch 2.x; stock Elasticsearch has no direct equivalent and instead factors indexing write load into balancing from 8.6 onward:
PUT /_cluster/settings
{
"persistent": {
"cluster.routing.allocation.balance.prefer_primary": true
}
}
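To eyeball the current primary distribution, sort shards by node and scan the prirep column (p = primary, r = replica):
GET /_cat/shards?v&h=node,prirep,index,shard&s=node,prirep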
Manual Rebalancing
Move Specific Shards
For critical imbalances, manually move shards:
POST /_cluster/reroute
{
"commands": [
{
"move": {
"index": "my-index",
"shard": 0,
"from_node": "overloaded-node",
"to_node": "underutilized-node"
}
}
]
}
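Before applying a move, you can validate it with dry_run, optionally adding explain to see the reasoning behind each decision; the command body stays the same:
POST /_cluster/reroute?dry_run=true&explain=true
{
  "commands": [
    {
      "move": {
        "index": "my-index",
        "shard": 0,
        "from_node": "overloaded-node",
        "to_node": "underutilized-node"
      }
    }
  ]
}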
Explain Allocation Decisions
Understand why a shard is on a particular node:
GET /_cluster/allocation/explain
{
"index": "my-index",
"shard": 0,
"primary": true
}
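Called without a body, the API explains the first unassigned shard it finds (and returns an error if every shard is assigned), which is useful when you don't know which shard is stuck:
GET /_cluster/allocation/explain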
Preventing Future Imbalance
1. Use Consistent Shard Sizing
Target 10-50 GB per shard with ILM:
PUT _ilm/policy/balanced_shards
{
"policy": {
"phases": {
"hot": {
"actions": {
"rollover": {
"max_primary_shard_size": "40gb",
"max_age": "7d"
}
}
}
}
}
}
2. Configure Index Templates
Set appropriate shard counts based on data volume:
PUT _index_template/balanced_template
{
"index_patterns": ["logs-*"],
"template": {
"settings": {
"number_of_shards": 3,
"number_of_replicas": 1
}
}
}
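Note that the ILM policy from step 1 only takes effect if the template references it. Assuming alias-based rollover with a write alias named logs (a hypothetical name), the template would also set the lifecycle settings:
PUT _index_template/balanced_template
{
  "index_patterns": ["logs-*"],
  "template": {
    "settings": {
      "number_of_shards": 3,
      "number_of_replicas": 1,
      "index.lifecycle.name": "balanced_shards",
      "index.lifecycle.rollover_alias": "logs"
    }
  }
}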
3. Monitor Regularly
Set up alerts for imbalance:
- Maximum variance in shard count > 20%
- Maximum variance in disk usage > 30%
- Any node consistently at higher CPU/memory
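The raw numbers behind these alerts all come from the cat APIs, for example:
GET /_cat/allocation?v&h=node,shards,disk.percent&s=shards:desc
GET /_cat/nodes?v&h=name,cpu,heap.percent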
Metrics to Track
| Metric | Healthy Range | Action Threshold |
|---|---|---|
| Shard count variance | < 10% | > 20% |
| Disk usage variance | < 15% | > 30% |
| CPU usage variance | < 20% | > 40% |
| Query latency variance | < 25% | > 50% |