Aggregations in Elasticsearch can be resource-intensive, especially with large datasets. This guide provides techniques to optimize aggregation performance while maintaining accuracy.
Understanding Aggregation Costs
Resource Consumption
Aggregations consume:
- Heap memory: Collecting and storing buckets
- CPU: Computing statistics
- Network: Transferring results between nodes
Aggregation Types by Cost
| Type | Memory Cost | CPU Cost | Notes |
|---|---|---|---|
| Value count | Low | Low | Most efficient |
| Sum/Avg/Min/Max | Low | Low | Simple statistics |
| Cardinality | Medium | Low | HyperLogLog++ approximation |
| Terms | High | Medium | Depends on cardinality |
| Date histogram | Medium | Medium | Depends on interval |
| Nested | High | High | Per nested document |
Optimization Techniques
1. Reduce Bucket Count
Before (expensive):
{
"aggs": {
"categories": {
"terms": {
"field": "category.keyword",
"size": 10000
}
}
}
}
After (optimized):
{
"aggs": {
"categories": {
"terms": {
"field": "category.keyword",
"size": 100
}
}
}
}
2. Use shard_size Appropriately
{
"aggs": {
"categories": {
"terms": {
"field": "category.keyword",
"size": 10,
"shard_size": 50 // Collect more per shard for accuracy
}
}
}
}
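Rule of thumb: shard_size = size * 1.5 + 10. This is also the value Elasticsearch uses by default when shard_size is not set, so set it explicitly only when you need more accuracy than the default gives.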
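As a worked example of the formula above, the following minimal Python sketch computes the per-shard candidate count for a few values of size. The helper name `default_shard_size` is hypothetical and exists only for illustration.

```python
def default_shard_size(size: int) -> int:
    """Per-shard candidate count from the formula size * 1.5 + 10."""
    return int(size * 1.5 + 10)


for size in (10, 100, 1000):
    # Prints 10 -> 25, 100 -> 160, 1000 -> 1510
    print(size, "->", default_shard_size(size))
```

For size 10 the formula gives 25, so the shard_size of 50 in the request above asks each shard for twice as many candidates as the formula would.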
3. Use Composite Aggregation for High Cardinality
Instead of a single large terms aggregation, page through the buckets with composite:
{
"size": 0,
"aggs": {
"my_buckets": {
"composite": {
"size": 1000,
"sources": [
{"category": {"terms": {"field": "category.keyword"}}},
{"region": {"terms": {"field": "region.keyword"}}}
]
}
}
}
}
// Paginate: pass the after_key from the previous response as after
{
"size": 0,
"aggs": {
"my_buckets": {
"composite": {
"size": 1000,
"after": {"category": "last_category", "region": "last_region"},
"sources": [...]
}
}
}
}
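The pagination loop can be sketched in Python as follows. This is a minimal sketch, not a client implementation: `iter_composite_buckets` and `fake_search` are hypothetical names, and `search` stands for any callable that sends a request body to the `_search` endpoint and returns the parsed response. The in-memory stub is there only so the example runs without a cluster.

```python
from typing import Callable, Iterator, List


def iter_composite_buckets(
    search: Callable[[dict], dict],
    sources: List[dict],
    agg_name: str = "my_buckets",
    page_size: int = 1000,
) -> Iterator[dict]:
    """Yield every composite bucket, following after_key page by page."""
    composite = {"size": page_size, "sources": sources}
    while True:
        body = {"size": 0, "aggs": {agg_name: {"composite": composite}}}
        result = search(body)["aggregations"][agg_name]
        yield from result["buckets"]
        after_key = result.get("after_key")
        if not result["buckets"] or after_key is None:
            break  # an empty page means every bucket has been seen
        composite = {**composite, "after": after_key}


def fake_search(body: dict) -> dict:
    """Hypothetical stand-in for a real client call; serves three buckets."""
    keys = [{"category": c} for c in ("books", "games", "music")]
    comp = body["aggs"]["my_buckets"]["composite"]
    start = keys.index(comp["after"]) + 1 if "after" in comp else 0
    page = keys[start:start + comp["size"]]
    agg = {"buckets": [{"key": k, "doc_count": 1} for k in page]}
    if page:
        agg["after_key"] = page[-1]
    return {"aggregations": {"my_buckets": agg}}


sources = [{"category": {"terms": {"field": "category.keyword"}}}]
for bucket in iter_composite_buckets(fake_search, sources, page_size=2):
    print(bucket["key"], bucket["doc_count"])
```

Only one page of buckets is held in memory at a time, which is the point of using composite over a terms aggregation with a very large size.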
4. Filter Before Aggregating
{
"query": {
"bool": {
"filter": [
{"range": {"date": {"gte": "now-7d"}}},
{"term": {"status": "active"}}
]
}
},
"aggs": {
"categories": {
"terms": {"field": "category.keyword", "size": 10}
}
}
}
5. Use Sampler for Large Datasets
{
"aggs": {
"sample": {
"sampler": {
"shard_size": 1000
},
"aggs": {
"categories": {
"terms": {"field": "category.keyword"}
}
}
}
}
}
6. Avoid Aggregations on Text Fields
Problem (requires fielddata, which is disabled by default on text fields and is held in heap):
{
"aggs": {
"keywords": {
"terms": {"field": "description"} // Text field!
}
}
}
Solution (use keyword):
{
"aggs": {
"keywords": {
"terms": {"field": "description.keyword"}
}
}
}
7. Use execution_hint for Terms
{
"aggs": {
"categories": {
"terms": {
"field": "category.keyword",
"execution_hint": "map" // Consider only when very few documents match the query
// "global_ordinals" is the default for keyword fields
}
}
}
}
8. Minimize Nested Aggregations
Expensive:
{
"aggs": {
"level1": {
"terms": {"field": "a"},
"aggs": {
"level2": {
"terms": {"field": "b"},
"aggs": {
"level3": {
"terms": {"field": "c"}
}
}
}
}
}
}
}
Better: flatten where possible (for example, one composite aggregation with a, b, and c as sources) or limit depth.
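To see why depth is costly, note that bucket counts multiply at each level. The rough Python sketch below computes the worst-case number of returned buckets per level, assuming each terms aggregation uses the default size of 10 as in the three-level example above. The helper name `worst_case_buckets` is hypothetical.

```python
from math import prod
from typing import List


def worst_case_buckets(sizes: List[int]) -> List[int]:
    """Upper bound on returned buckets at each level of nested terms aggs."""
    return [prod(sizes[:depth]) for depth in range(1, len(sizes) + 1)]


levels = worst_case_buckets([10, 10, 10])
print(levels)       # [10, 100, 1000]
print(sum(levels))  # 1110 buckets across all three levels
```

The work done on each shard is larger still, because every level collects shard_size candidates rather than size.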
9. Use Filters Aggregation Efficiently
{
"aggs": {
"categories": {
"filters": {
"filters": {
"active": {"term": {"status": "active"}},
"pending": {"term": {"status": "pending"}}
}
}
}
}
}
10. Pre-Aggregate with Transforms
For repeated aggregations, use transforms:
PUT _transform/daily_sales
{
"source": {"index": "sales"},
"dest": {"index": "daily_sales_summary"},
"pivot": {
"group_by": {
"date": {"date_histogram": {"field": "timestamp", "calendar_interval": "day"}},
"product": {"terms": {"field": "product_id"}}
},
"aggregations": {
"total_sales": {"sum": {"field": "amount"}},
"count": {"value_count": {"field": "amount"}}
}
}
}
Memory Management
Configure Circuit Breakers
PUT /_cluster/settings
{
"persistent": {
"indices.breaker.request.limit": "40%"
}
}
Monitor Aggregation Memory
GET /_nodes/stats/indices/fielddata
GET /_nodes/stats/breaker
Clear Fielddata Cache
POST /_cache/clear?fielddata=true
Cardinality Optimization
Use Approximate Cardinality
{
"aggs": {
"unique_users": {
"cardinality": {
"field": "user_id",
"precision_threshold": 1000 // Lower values use less memory but are less accurate (default 3000)
}
}
}
}
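The memory side of this trade-off can be estimated with simple arithmetic. The sketch below assumes the figure given in the Elasticsearch documentation of roughly precision_threshold * 8 bytes per cardinality aggregation, with values above 40000 treated as 40000. The helper name `cardinality_memory_bytes` is hypothetical.

```python
MAX_PRECISION_THRESHOLD = 40000  # higher values behave like 40000


def cardinality_memory_bytes(precision_threshold: int) -> int:
    """Approximate memory for one cardinality aggregation instance."""
    effective = min(precision_threshold, MAX_PRECISION_THRESHOLD)
    return effective * 8


for threshold in (1000, 3000, 40000):
    # Prints 8000, 24000, and 320000 bytes respectively
    print(threshold, cardinality_memory_bytes(threshold), "bytes")
```

When a cardinality aggregation sits under a terms or date histogram aggregation, this cost applies once per parent bucket, so it grows with the bucket count.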
Pre-Compute Cardinality
Index a murmur3 hash field so that hashes are computed at index time instead of at query time (requires the mapper-murmur3 plugin):
PUT /my_index/_mapping
{
"properties": {
"user_id_hash": {
"type": "murmur3"
}
}
}
Date Histogram Optimization
Use Fixed Intervals When Possible
{
"aggs": {
"over_time": {
"date_histogram": {
"field": "timestamp",
"fixed_interval": "1h" // Fixed-length buckets; use calendar_interval only when buckets must follow calendar days, months, or DST shifts
}
}
}
}
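Since date histogram cost depends on the interval, it helps to estimate the bucket count before running a query. The Python sketch below does that arithmetic for a fixed interval. The helper name `approx_bucket_count` is hypothetical, and the result is approximate because a range that does not start on a bucket boundary can produce one extra bucket.

```python
import math
from datetime import timedelta


def approx_bucket_count(span: timedelta, interval: timedelta) -> int:
    """Approximate number of fixed-interval buckets covering a time span."""
    return math.ceil(span / interval)


print(approx_bucket_count(timedelta(days=7), timedelta(hours=1)))      # 168
print(approx_bucket_count(timedelta(days=365), timedelta(minutes=1)))  # 525600
```

A week at one-hour resolution is cheap, while a year at one-minute resolution produces over half a million buckets and is a candidate for a coarser interval or a narrower time filter.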
Use Auto Date Histogram
{
"aggs": {
"over_time": {
"auto_date_histogram": {
"field": "timestamp",
"buckets": 10 // Target bucket count; the interval is chosen to return at most this many
}
}
}
}
Performance Testing
Profile Aggregations
{
"profile": true,
"aggs": {
"my_agg": {...}
}
}
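Profile output is verbose, so it can help to reduce it to one timing per aggregation. The Python sketch below assumes the response shape described for the Profile API (`profile.shards[].aggregations[]` entries carrying `description`, `time_in_nanos`, and optional `children`). The helper name `aggregation_times_ms` is hypothetical, and the sample response is hand-written for illustration, not real cluster output.

```python
from collections import defaultdict


def aggregation_times_ms(response: dict) -> dict:
    """Sum time_in_nanos per aggregation name across all shards, in ms."""
    totals = defaultdict(int)

    def walk(node: dict) -> None:
        totals[node["description"]] += node["time_in_nanos"]
        for child in node.get("children", []):
            walk(child)

    for shard in response.get("profile", {}).get("shards", []):
        for agg in shard.get("aggregations", []):
            walk(agg)
    return {name: nanos / 1e6 for name, nanos in totals.items()}


# Hand-written sample with two shards (assumed shape, illustrative numbers).
sample = {"profile": {"shards": [
    {"aggregations": [{
        "description": "categories", "time_in_nanos": 2_000_000,
        "children": [{"description": "avg_price", "time_in_nanos": 500_000}],
    }]},
    {"aggregations": [{"description": "categories", "time_in_nanos": 1_000_000}]},
]}}

for name, ms in sorted(aggregation_times_ms(sample).items(), key=lambda kv: -kv[1]):
    print(f"{name}: {ms:.1f} ms")
```

Sorting by total time points at the aggregation worth optimizing first.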
Benchmark Changes
# Before optimization
time curl -s -H 'Content-Type: application/json' "localhost:9200/my-index/_search" -d'{"aggs":{...}}'
# After optimization
time curl -s -H 'Content-Type: application/json' "localhost:9200/my-index/_search" -d'{"aggs":{...}}'
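A single timed request is noisy, so a comparison is more trustworthy when it uses the median of several runs. The Python sketch below reads the `took` field (milliseconds, reported in every search response) rather than wall-clock time, which excludes network and client overhead. `median_took_ms` is a hypothetical helper, and `run_search` stands for any callable that executes the query and returns the parsed response. Canned responses are used here so the example runs without a cluster.

```python
import statistics
from typing import Callable


def median_took_ms(run_search: Callable[[], dict], runs: int = 5) -> float:
    """Run the search several times and return the median 'took' in ms."""
    return statistics.median(run_search()["took"] for _ in range(runs))


# Canned responses standing in for real search calls (illustrative numbers).
canned = iter([{"took": 120}, {"took": 95}, {"took": 101}])
print(median_took_ms(lambda: next(canned), runs=3))  # 101
```

Repeated identical requests may be served from cache, so compare runs taken under the same caching conditions before and after a change.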
Aggregation Checklist
- Bucket count minimized (size parameter)
- Using filter context to reduce document scope
- Keyword fields used (not text)
- Composite aggregation for high-cardinality pagination
- Sampler used for large datasets when approximate is acceptable
- Nested aggregation depth minimized
- Transform used for repeated aggregations
- Circuit breakers configured appropriately