Elasticsearch Too Many Shards Performance Issues

Having too many shards in an Elasticsearch cluster (oversharding) causes significant performance degradation and resource waste. This guide helps you identify oversharding problems and implement effective solutions.

Understanding the Problem

Why Too Many Shards Hurts Performance

  1. Memory overhead: Each shard consumes heap memory for metadata, segments, and caches
  2. Coordination cost: Queries must coordinate across all shards
  3. Cluster state size: More shards = larger cluster state to manage
  4. Recovery time: Many shards slow down node recovery
  5. File handles: Each shard opens multiple file handles

How Many Shards Is Too Many?

There's no absolute number, but consider these guidelines:

  • Per node: Avoid exceeding 1,000 shards per node
  • Per GB heap: The old "20 shards per GB heap" rule is deprecated but still directionally useful
  • Per index: Question if you need more than a few shards per index
  • Shard size: Shards under 1 GB are often too small
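Taken together, these guidelines lend themselves to a quick scripted check. A minimal sketch in Python (the function name and thresholds mirror the guidelines above; this is a heuristic, not an official Elasticsearch rule):

```python
def oversharding_warnings(total_shards, node_count, avg_shard_size_gb):
    """Flag common oversharding signals based on the guidelines above."""
    warnings = []
    if node_count > 0 and total_shards / node_count > 1000:
        warnings.append("over 1,000 shards per node")
    if avg_shard_size_gb < 1:
        warnings.append("average shard size under 1 GB")
    return warnings
```

For example, a 3-node cluster holding 5,000 shards averaging 0.5 GB each triggers both warnings.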

Identifying Oversharding

Check Total Shard Count

GET /_cluster/stats?filter_path=indices.shards.total
# Count lines from the cat API (shell):
curl -s "localhost:9200/_cat/shards" | wc -l

Shards Per Node

GET /_cat/allocation?v

Average Shard Size

# Get shard sizes
GET /_cat/shards?v&h=index,shard,prirep,store&s=store:asc

If many shards are under 1 GB, you likely have oversharding.
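When eyeballing the output is impractical, the store column can be parsed programmatically. A minimal sketch in Python, assuming lines in the index,shard,prirep,store column order requested above (the helper names are illustrative):

```python
import re

_UNITS = {"b": 1, "kb": 1024, "mb": 1024**2, "gb": 1024**3, "tb": 1024**4}

def store_to_bytes(value):
    """Convert a _cat/shards 'store' value like '750.4mb' to bytes."""
    m = re.fullmatch(r"([\d.]+)([a-z]+)", value.strip())
    if not m:
        raise ValueError(f"unrecognized store value: {value!r}")
    number, unit = m.groups()
    return float(number) * _UNITS[unit]

def small_shards(cat_shards_lines, limit_gb=1):
    """Return (index, shard) pairs whose store size is under limit_gb."""
    out = []
    for line in cat_shards_lines:
        index, shard, prirep, store = line.split()
        if store_to_bytes(store) < limit_gb * 1024**3:
            out.append((index, shard))
    return out
```

Feeding it the header-less output of the _cat/shards call above yields the undersized shards directly.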

Indices with Many Small Shards

GET /_cat/indices?v&h=index,pri,rep,store.size,pri.store.size&s=pri:desc

Symptoms of Oversharding

Performance Symptoms

  • Slow search queries despite low data volume
  • High memory usage with low data volume
  • Long cluster recovery times
  • Master node CPU spikes during state updates

Observable Metrics

Metric               Healthy              Oversharded
Shards per node      < 500                > 1,000
Average shard size   10-50 GB             < 1 GB
Cluster state size   < 100 MB             > 500 MB
Heap per shard       Sufficient headroom  High pressure

Solutions

Solution 1: Use Index Lifecycle Management (ILM)

Configure ILM to control shard growth:

PUT _ilm/policy/controlled_shards
{
  "policy": {
    "phases": {
      "hot": {
        "actions": {
          "rollover": {
            "max_primary_shard_size": "40gb",
            "max_age": "7d"
          }
        }
      },
      "warm": {
        "min_age": "30d",
        "actions": {
          "shrink": {
            "number_of_shards": 1
          }
        }
      },
      "delete": {
        "min_age": "90d",
        "actions": {
          "delete": {}
        }
      }
    }
  }
}

Solution 2: Shrink Existing Indices

Reduce shard count on existing indices:

// Step 1: Block writes and relocate to single node
PUT /my-index/_settings
{
  "settings": {
    "index.blocks.write": true,
    "index.routing.allocation.require._name": "shrink-node"
  }
}

// Step 2: Wait for relocation
GET /_cat/recovery/my-index

// Step 3: Shrink
POST /my-index/_shrink/my-index-shrunk
{
  "settings": {
    "index.number_of_shards": 1,
    "index.number_of_replicas": 1,
    "index.blocks.write": null,
    "index.routing.allocation.require._name": null
  }
}

Solution 3: Reindex with Fewer Shards

For major restructuring:

// Create new index with fewer shards
PUT /my-index-reindexed
{
  "settings": {
    "number_of_shards": 3,  // Instead of 30
    "number_of_replicas": 1
  }
}

// Reindex data
POST /_reindex
{
  "source": {
    "index": "my-index"
  },
  "dest": {
    "index": "my-index-reindexed"
  }
}

Solution 4: Delete Unnecessary Indices

Remove old, unused indices:

GET /_cat/indices?v&h=index,docs.count,store.size&s=docs.count:asc

DELETE /old-unused-index-*

Note that recent Elasticsearch versions block wildcard deletes by default (action.destructive_requires_name defaults to true); if the delete is rejected, list the indices explicitly.

Solution 5: Use Index Templates with Proper Defaults

Prevent future oversharding:

PUT _index_template/optimized_template
{
  "index_patterns": ["logs-*"],
  "template": {
    "settings": {
      "number_of_shards": 1,
      "number_of_replicas": 1
    }
  },
  "priority": 200
}

Solution 6: Use Data Streams

For time-series data, use data streams with ILM:

PUT _index_template/logs_template
{
  "index_patterns": ["logs-*"],
  "data_stream": {},
  "template": {
    "settings": {
      "number_of_shards": 1
    }
  }
}

PUT _data_stream/logs-app

Calculating Ideal Shard Count

Formula for Time-Series Data

Daily data size / Target shard size = Shards per day

Example:
100 GB/day / 40 GB target = 2-3 shards per day

Formula for Static Data

Total data size / Target shard size = Total primary shards

Example:
500 GB total / 40 GB target = ~13 primary shards
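Both formulas can be scripted directly. A minimal sketch in Python (the 40 GB default target is the example value used above):

```python
import math

def shards_per_day(daily_gb, target_shard_gb=40):
    """Shard count for each daily index in a time-series setup."""
    return max(1, math.ceil(daily_gb / target_shard_gb))

def total_primary_shards(total_gb, target_shard_gb=40):
    """Primary shard count for a static dataset."""
    return max(1, math.ceil(total_gb / target_shard_gb))
```

shards_per_day(100) returns 3 and total_primary_shards(500) returns 13, matching the worked examples above.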

Monitoring Shard Health

Regular Checks

# Total shards
GET /_cluster/health?filter_path=active_shards

# Shards per node
GET /_cat/allocation?v

# Small shards (potential oversharding)
curl -s "localhost:9200/_cat/shards?v&h=index,shard,store&s=store:asc" | head -50

Alert Conditions

  • Total shard count > (node_count * 500)
  • Average shard size < 5 GB
  • Cluster state size growing rapidly
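These conditions are straightforward to encode in an alerting script. A minimal sketch in Python (the 20% growth-between-samples threshold for cluster state is an illustrative choice, not an Elasticsearch default):

```python
def shard_alerts(total_shards, node_count, avg_shard_size_gb,
                 state_mb_now, state_mb_prev):
    """Return which of the alert conditions above are triggered."""
    alerts = []
    if total_shards > node_count * 500:
        alerts.append("total shard count exceeds node_count * 500")
    if avg_shard_size_gb < 5:
        alerts.append("average shard size below 5 GB")
    if state_mb_prev > 0 and state_mb_now / state_mb_prev > 1.2:
        alerts.append("cluster state growing rapidly")
    return alerts
```

Run it on each monitoring interval and page when the returned list is non-empty.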

Prevention Checklist

  • Index templates specify appropriate shard counts
  • ILM policies control shard growth with rollover
  • Data streams used for time-series data
  • Regular review of shard distribution
  • Deletion policies for old indices
  • Target shard size: 10-50 GB

Automating Detection and Resolution with Pulse

Oversharding problems often develop gradually, making them easy to miss until performance visibly degrades. Pulse automatically detects oversharding through continuous monitoring of shard counts, shard sizes, cluster state size, and per-node resource utilization. When oversharding is identified, whether from excessive shard-per-node ratios, many undersized shards, or rapidly growing cluster state, Pulse's agentic SRE technology performs automated root-cause analysis: it determines which indices contribute most to shard proliferation, whether index templates are misconfigured, and whether ILM policies are missing or ineffective.

Pulse can suggest specific remediation actions, such as shrinking oversharded indices, adjusting index template shard counts, configuring ILM rollover policies, or identifying unused indices that are safe to delete, and in some cases apply these optimizations automatically. Connecting your cluster to Pulse's proactive monitoring helps prevent oversharding before it increases query latency and recovery times or causes heap memory pressure across your nodes.
