Elasticsearch Aggregation Performance Tuning

Aggregations in Elasticsearch can be resource-intensive, especially with large datasets. This guide provides techniques to optimize aggregation performance while maintaining accuracy.

Understanding Aggregation Costs

Resource Consumption

Aggregations consume:

Heap memory: Collecting and storing buckets
CPU: Computing statistics
Network: Transferring results between nodes

Aggregation Types by Cost

Type	Memory Cost	CPU Cost	Notes
Value count	Low	Low	Most efficient
Sum/Avg/Min/Max	Low	Low	Simple statistics
Cardinality	Medium	Low	HyperLogLog approximation
Terms	High	Medium	Depends on cardinality
Date histogram	Medium	Medium	Depends on interval
Nested	High	High	Per nested document

Optimization Techniques

1. Reduce Bucket Count

Before (expensive):

{
  "aggs": {
    "categories": {
      "terms": {
        "field": "category.keyword",
        "size": 10000
      }
    }
  }
}

After (optimized):

{
  "aggs": {
    "categories": {
      "terms": {
        "field": "category.keyword",
        "size": 100
      }
    }
  }
}

2. Use shard_size Appropriately

{
  "aggs": {
    "categories": {
      "terms": {
        "field": "category.keyword",
        "size": 10,
        "shard_size": 50  // Collect more per shard for accuracy
      }
    }
  }
}

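When shard_size is omitted, Elasticsearch defaults to size * 1.5 + 10 (25 for a size of 10). Set it higher, as in the example above, only when doc_count_error_upper_bound in the response shows that the counts are not accurate enough; each increase costs more memory per shard.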
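As a quick check of the arithmetic, the formula can be written as a small helper. The function name default_shard_size is hypothetical and not part of any Elasticsearch client; the truncation to an integer is an assumption about how the fractional part is handled.

```python
def default_shard_size(size: int) -> int:
    """Return size * 1.5 + 10, truncated to an integer."""
    if size < 1:
        raise ValueError("size must be a positive integer")
    return int(size * 1.5 + 10)


for size in (10, 100, 1000):
    print(f"size={size:>4} -> shard_size={default_shard_size(size)}")
```

A shard_size below this value trades accuracy for memory; a value above it does the opposite.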

3. Use Composite Aggregation for High Cardinality

Instead of large terms aggregation:

{
  "size": 0,
  "aggs": {
    "my_buckets": {
      "composite": {
        "size": 1000,
        "sources": [
          {"category": {"terms": {"field": "category.keyword"}}},
          {"region": {"terms": {"field": "region.keyword"}}}
        ]
      }
    }
  }
}

To fetch the next page, pass the after_key from the previous response as after:
{
  "size": 0,
  "aggs": {
    "my_buckets": {
      "composite": {
        "size": 1000,
        "after": {"category": "last_category", "region": "last_region"},
        "sources": [...]
      }
    }
  }
}
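The pagination loop can be sketched in Python as follows. This is a minimal sketch, not a client-library API: the search argument is an assumed callable that sends a request body to the cluster and returns the parsed JSON response, and fake_search is an in-memory stand-in used only so the example runs without a cluster.

```python
from typing import Any, Callable, Dict, Iterator, List

SearchFn = Callable[[Dict[str, Any]], Dict[str, Any]]

SOURCES = [
    {"category": {"terms": {"field": "category.keyword"}}},
    {"region": {"terms": {"field": "region.keyword"}}},
]


def iter_composite_buckets(
    search: SearchFn,
    sources: List[Dict[str, Any]],
    agg_name: str = "my_buckets",
    page_size: int = 1000,
) -> Iterator[Dict[str, Any]]:
    """Yield every bucket of a composite aggregation, one page at a time."""
    after = None
    while True:
        composite: Dict[str, Any] = {"size": page_size, "sources": sources}
        if after is not None:
            composite["after"] = after
        body = {"size": 0, "aggs": {agg_name: {"composite": composite}}}
        result = search(body)["aggregations"][agg_name]
        yield from result["buckets"]
        # Stop when the page is empty or no after_key is returned.
        after = result.get("after_key")
        if after is None or not result["buckets"]:
            break


def fake_search(body: Dict[str, Any]) -> Dict[str, Any]:
    """In-memory stand-in for a real search call (illustration only)."""
    keys = [{"category": c, "region": r} for c in ("a", "b") for r in ("eu", "us")]
    composite = body["aggs"]["my_buckets"]["composite"]
    start = keys.index(composite["after"]) + 1 if "after" in composite else 0
    page = keys[start:start + composite["size"]]
    result: Dict[str, Any] = {"buckets": [{"key": k, "doc_count": 1} for k in page]}
    if page:
        result["after_key"] = page[-1]
    return {"aggregations": {"my_buckets": result}}


for bucket in iter_composite_buckets(fake_search, SOURCES, page_size=3):
    print(bucket["key"], bucket["doc_count"])
```

Because the generator holds only one page at a time, memory use on the client stays bounded by page_size regardless of how many buckets exist.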

4. Filter Before Aggregating

{
  "query": {
    "bool": {
      "filter": [
        {"range": {"date": {"gte": "now-7d"}}},
        {"term": {"status": "active"}}
      ]
    }
  },
  "aggs": {
    "categories": {
      "terms": {"field": "category.keyword", "size": 10}
    }
  }
}

5. Use Sampler for Large Datasets

{
  "aggs": {
    "sample": {
      "sampler": {
        "shard_size": 1000
      },
      "aggs": {
        "categories": {
          "terms": {"field": "category.keyword"}
        }
      }
    }
  }
}

6. Avoid Aggregations on Text Fields

Problem (fails unless fielddata is enabled on the field, and fielddata is loaded into heap):

{
  "aggs": {
    "keywords": {
      "terms": {"field": "description"}  // Text field!
    }
  }
}

Solution (use keyword):

{
  "aggs": {
    "keywords": {
      "terms": {"field": "description.keyword"}
    }
  }
}

7. Use execution_hint for Terms

{
  "aggs": {
    "categories": {
      "terms": {
        "field": "category.keyword",
        "execution_hint": "map"  // Worth trying when the query matches very few documents
        // "global_ordinals" is the default for keyword fields
      }
    }
  }
}

8. Minimize Nested Aggregations

Expensive:

{
  "aggs": {
    "level1": {
      "terms": {"field": "a"},
      "aggs": {
        "level2": {
          "terms": {"field": "b"},
          "aggs": {
            "level3": {
              "terms": {"field": "c"}
            }
          }
        }
      }
    }
  }
}

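Better: flatten where possible or limit depth. Each level multiplies the bucket count: with the default size of 10, the three levels above can produce up to 10 * 10 * 10 = 1,000 leaf buckets. When deep nesting is unavoidable, setting collect_mode to breadth_first on the outer terms aggregation defers sub-aggregation work until the top buckets are known.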
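To see how quickly depth adds up, the worst-case bucket total across all levels can be estimated with a few lines of Python. This is a rough sketch: worst_case_buckets is a hypothetical helper, and it counts buckets in the final response only, ignoring the extra candidates each shard collects via shard_size.

```python
from typing import Sequence


def worst_case_buckets(sizes: Sequence[int]) -> int:
    """Upper bound on buckets across all levels of nested terms aggregations.

    sizes[i] is the size of the terms aggregation at depth i.
    """
    total = 0
    per_level = 1
    for size in sizes:
        per_level *= size   # buckets at this depth
        total += per_level  # running total over all depths
    return total


print(worst_case_buckets([10, 10, 10]))    # three levels, default size
print(worst_case_buckets([100, 100, 100]))  # same depth, size 100
```

Raising size from 10 to 100 at every level multiplies the leaf-bucket bound by 1,000, which is why depth and size should be tuned together.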

9. Use Filters Aggregation Efficiently

{
  "aggs": {
    "categories": {
      "filters": {
        "filters": {
          "active": {"term": {"status": "active"}},
          "pending": {"term": {"status": "pending"}}
        }
      }
    }
  }
}

10. Pre-Aggregate with Transforms

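For aggregations that run repeatedly over the same data, a transform can materialize the results into a summary index that is much cheaper to query: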

PUT _transform/daily_sales
{
  "source": {"index": "sales"},
  "dest": {"index": "daily_sales_summary"},
  "pivot": {
    "group_by": {
      "date": {"date_histogram": {"field": "timestamp", "calendar_interval": "day"}},
      "product": {"terms": {"field": "product_id"}}
    },
    "aggregations": {
      "total_sales": {"sum": {"field": "amount"}},
      "count": {"value_count": {"field": "amount"}}
    }
  }
}

POST _transform/daily_sales/_start

Memory Management

Configure Circuit Breakers

PUT /_cluster/settings
{
  "persistent": {
    "indices.breaker.request.limit": "40%"
  }
}

Monitor Aggregation Memory

GET /_nodes/stats/indices/fielddata
GET /_nodes/stats/breaker
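The breaker statistics are easier to act on once reduced to a usage fraction per node. The sketch below assumes the response has already been fetched and parsed into a dictionary; request_breaker_usage is a hypothetical helper, and SAMPLE_STATS is a hand-written, heavily trimmed response used only for illustration.

```python
from typing import Any, Dict

# Trimmed, made-up example of a GET /_nodes/stats/breaker response.
SAMPLE_STATS = {
    "nodes": {
        "abc123": {
            "name": "node-1",
            "breakers": {
                "request": {
                    "limit_size_in_bytes": 1024,
                    "estimated_size_in_bytes": 256,
                    "tripped": 0,
                }
            },
        },
        "def456": {
            "name": "node-2",
            "breakers": {
                "request": {
                    "limit_size_in_bytes": 1024,
                    "estimated_size_in_bytes": 0,
                    "tripped": 2,
                }
            },
        },
    }
}


def request_breaker_usage(stats: Dict[str, Any]) -> Dict[str, float]:
    """Map node name -> fraction of the request breaker limit in use."""
    usage = {}
    for node in stats["nodes"].values():
        breaker = node["breakers"]["request"]
        limit = breaker["limit_size_in_bytes"]
        used = breaker["estimated_size_in_bytes"]
        usage[node["name"]] = used / limit if limit else 0.0
    return usage


for name, fraction in request_breaker_usage(SAMPLE_STATS).items():
    print(f"{name}: {fraction:.0%} of request breaker limit")
```

A non-zero tripped counter on a node is a stronger signal than the instantaneous usage, since it means at least one request was already rejected.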

Clear Fielddata Cache

POST /_cache/clear?fielddata=true

Cardinality Optimization

Use Approximate Cardinality

{
  "aggs": {
    "unique_users": {
      "cardinality": {
        "field": "user_id",
        "precision_threshold": 1000  // Lower values use less memory but lose accuracy sooner (default 3000)
      }
    }
  }
}
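The memory side of this trade-off can be estimated up front. The Elasticsearch documentation gives roughly precision_threshold * 8 bytes per cardinality sketch, with values above 40000 treated as 40000. The helper below is a hypothetical back-of-the-envelope calculator; the buckets parameter assumes one sketch per parent bucket when the cardinality aggregation is nested under another aggregation.

```python
MAX_PRECISION_THRESHOLD = 40000


def hll_memory_bytes(precision_threshold: int, buckets: int = 1) -> int:
    """Approximate memory used by cardinality sketches, in bytes."""
    if precision_threshold < 0 or buckets < 0:
        raise ValueError("arguments must be non-negative")
    effective = min(precision_threshold, MAX_PRECISION_THRESHOLD)
    return effective * 8 * buckets


print(hll_memory_bytes(1000))               # single top-level aggregation
print(hll_memory_bytes(3000, buckets=100))  # under a 100-bucket terms aggregation
```

The per-sketch cost is small; it becomes significant mainly when the aggregation sits under many parent buckets.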

Pre-Compute Cardinality

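Index a precomputed hash so the cardinality aggregation does not have to hash values at query time. The murmur3 field type requires the mapper-murmur3 plugin and mainly pays off on large, high-cardinality string fields: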

PUT /my_index/_mapping
{
  "properties": {
    "user_id_hash": {
      "type": "murmur3"
    }
  }
}

Date Histogram Optimization

Use Fixed Intervals When Possible

{
  "aggs": {
    "over_time": {
      "date_histogram": {
        "field": "timestamp",
        "fixed_interval": "1h"  // Constant-length buckets; no calendar-aware rounding needed
      }
    }
  }
}
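A fixed interval is a constant number of milliseconds, so placing a timestamp into its bucket is conceptually a single modulo operation, whereas a calendar interval has to account for varying month lengths and daylight-saving shifts. The sketch below illustrates the idea only; it assumes UTC with no offset and is not Elasticsearch's actual implementation.

```python
HOUR_MS = 60 * 60 * 1000


def fixed_interval_key(timestamp_ms: int, interval_ms: int = HOUR_MS) -> int:
    """Round an epoch-millisecond timestamp down to the start of its bucket."""
    if interval_ms <= 0:
        raise ValueError("interval_ms must be positive")
    # Python's % is non-negative for a positive divisor, so this also
    # rounds down correctly for timestamps before the epoch.
    return timestamp_ms - (timestamp_ms % interval_ms)


for ts in (0, 3_599_999, 3_700_000):
    print(ts, "->", fixed_interval_key(ts))
```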

Use Auto Date Histogram

{
  "aggs": {
    "over_time": {
      "auto_date_histogram": {
        "field": "timestamp",
        "buckets": 10  // Target count; Elasticsearch picks an interval that returns at most this many buckets
      }
    }
  }
}

Performance Testing

Profile Aggregations

{
  "profile": true,
  "aggs": {
    "my_agg": {...}
  }
}
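The profile output is reported per shard and can be verbose, so it helps to total the time per aggregation. The sketch below assumes a parsed response in which each entry under profile.shards[].aggregations carries description, time_in_nanos, and optional children; aggregation_times_ms is a hypothetical helper and SAMPLE_PROFILE is a made-up, trimmed response.

```python
from collections import defaultdict
from typing import Any, Dict, List

# Trimmed, made-up example of a profiled search response.
SAMPLE_PROFILE = {
    "profile": {
        "shards": [
            {"aggregations": [
                {"description": "my_agg", "time_in_nanos": 2_000_000,
                 "children": [
                     {"description": "child_agg", "time_in_nanos": 1_000_000}
                 ]},
            ]},
            {"aggregations": [
                {"description": "my_agg", "time_in_nanos": 3_000_000},
            ]},
        ]
    }
}


def aggregation_times_ms(response: Dict[str, Any]) -> Dict[str, float]:
    """Sum profiled time per aggregation name across all shards, in ms."""
    totals: Dict[str, float] = defaultdict(float)

    def visit(nodes: List[Dict[str, Any]]) -> None:
        for node in nodes:
            totals[node["description"]] += node["time_in_nanos"] / 1e6
            visit(node.get("children", []))

    for shard in response["profile"]["shards"]:
        visit(shard.get("aggregations", []))
    return dict(totals)


for name, ms in sorted(aggregation_times_ms(SAMPLE_PROFILE).items()):
    print(f"{name}: {ms:.1f} ms")
```

Summing across shards shows total work rather than wall-clock latency, since shards are searched in parallel.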

Benchmark Changes

# Before optimization
time curl -s -H 'Content-Type: application/json' "localhost:9200/my-index/_search" -d'{"aggs":{...}}'

# After optimization
time curl -s -H 'Content-Type: application/json' "localhost:9200/my-index/_search" -d'{"aggs":{...}}'
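A single timed run is noisy, so a comparison is more trustworthy when it takes the median of several runs and reads the server-side took value (milliseconds) from each response instead of wall-clock time. The sketch below assumes a search callable that sends a body and returns the parsed response; median_took_ms and make_fake_search are hypothetical names, and the fake exists only so the example runs without a cluster.

```python
import statistics
from typing import Any, Callable, Dict, Iterable

SearchFn = Callable[[Dict[str, Any]], Dict[str, Any]]


def median_took_ms(search: SearchFn, body: Dict[str, Any],
                   runs: int = 5, warmup: int = 1) -> float:
    """Median of the server-reported took value over several runs."""
    for _ in range(warmup):
        search(body)  # discard warm-up runs (cold caches, JIT)
    return statistics.median(search(body)["took"] for _ in range(runs))


def make_fake_search(took_values: Iterable[int]) -> SearchFn:
    """Stand-in that replays canned took values (illustration only)."""
    values = iter(took_values)
    return lambda body: {"took": next(values)}


before = median_took_ms(make_fake_search([100, 12, 10, 11, 50, 9]), {"size": 0})
after = median_took_ms(make_fake_search([80, 6, 5, 7, 5, 6]), {"size": 0})
print(f"before: {before} ms, after: {after} ms")
```

When benchmarking against a real cluster, repeated identical requests may be answered from the shard request cache; adding request_cache=false to the search URL keeps the runs comparable.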

Aggregation Checklist

Bucket count minimized (size parameter)
Filter context used to reduce document scope
Keyword fields used (not text)
Composite aggregation for high-cardinality pagination
Sampler used for large datasets when approximate is acceptable
Nested aggregation depth minimized
Transform used for repeated aggregations
Circuit breakers configured appropriately