Effective capacity planning ensures your Elasticsearch cluster meets performance requirements while optimizing costs. This guide covers the complete capacity planning process.
Capacity Planning Process
1. Gather Requirements
Data Requirements:
- Current data volume
- Data growth rate
- Retention requirements
- Data model complexity
Performance Requirements:
- Search latency targets (p95, p99)
- Indexing throughput
- Query concurrency
- Availability requirements (SLA)
Operational Requirements:
- Backup/restore time
- Maintenance windows
- Disaster recovery
2. Analyze Workload
Indexing Profile:
```
GET /_nodes/stats/indices/indexing
```
- Documents per second
- Bulk request patterns
- Peak vs. average rates
Search Profile:
```
GET /_nodes/stats/indices/search
```
- Queries per second
- Query complexity
- Response size
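You can derive these rates from two snapshots of the node stats API taken a known interval apart. The sketch below assumes you have already fetched and parsed both responses into Python dicts, for example from `GET /_nodes/stats/indices/indexing,search`. The helper names are hypothetical. The code sums the cumulative `index_total` and `query_total` counters across nodes. These node-level counters include writes to replica shards and count shard-level query executions, so treat the results as cluster work rates rather than client request rates.

```python
from typing import Any, Dict


def sum_counter(stats: Dict[str, Any], section: str, field: str) -> int:
    """Sum one cumulative counter (e.g. indexing.index_total) over every node."""
    return sum(node["indices"][section][field] for node in stats["nodes"].values())


def workload_rates(
    before: Dict[str, Any], after: Dict[str, Any], interval_s: float
) -> Dict[str, float]:
    """Per-second indexing and shard-query rates between two _nodes/stats snapshots."""
    docs = sum_counter(after, "indexing", "index_total") - sum_counter(
        before, "indexing", "index_total"
    )
    queries = sum_counter(after, "search", "query_total") - sum_counter(
        before, "search", "query_total"
    )
    return {"docs_per_s": docs / interval_s, "shard_queries_per_s": queries / interval_s}


# Two hand-written (hypothetical) snapshots taken 60 s apart.
before = {"nodes": {
    "n1": {"indices": {"indexing": {"index_total": 1000}, "search": {"query_total": 50}}},
    "n2": {"indices": {"indexing": {"index_total": 2000}, "search": {"query_total": 150}}},
}}
after = {"nodes": {
    "n1": {"indices": {"indexing": {"index_total": 4000}, "search": {"query_total": 350}}},
    "n2": {"indices": {"indexing": {"index_total": 5000}, "search": {"query_total": 450}}},
}}
print(workload_rates(before, after, 60.0))
# → {'docs_per_s': 100.0, 'shard_queries_per_s': 10.0}
```

Counters reset when a node restarts, so discard any interval in which node membership changed. To see peak versus average rates, sample repeatedly and keep the whole series.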
3. Calculate Resources
See Elasticsearch Sizing Calculator for detailed formulas.
Resource Planning
Storage Planning
Calculate Total Storage
Total Storage = Daily Volume × Retention × (1 + Replicas) × Overhead
Overhead factors:
- Lucene overhead: 10-20%
- Translog: 5%
- Headroom: 20-30%
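The following is a worked version of the storage formula. The sketch assumes the overhead factors add together into one multiplier (1 + Lucene + translog + headroom). It uses hypothetical midpoint defaults of 15%, 5%, and 25%. Compounding the factors instead would give a slightly larger result.

```python
def total_storage_gb(
    daily_gb: float,
    retention_days: int,
    replicas: int,
    lucene: float = 0.15,    # assumed midpoint of the 10-20% range
    translog: float = 0.05,
    headroom: float = 0.25,  # assumed midpoint of the 20-30% range
) -> float:
    """Total Storage = Daily Volume x Retention x (1 + Replicas) x Overhead."""
    overhead = 1 + lucene + translog + headroom  # additive combination (assumption)
    return daily_gb * retention_days * (1 + replicas) * overhead


# 100 GB/day kept for 30 days with one replica:
print(round(total_storage_gb(100, 30, 1)))  # → 8700
```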
Storage Type Selection
| Storage Type | Use Case | Cost |
|---|---|---|
| NVMe SSD | Hot tier, high-performance | $$$ |
| SSD | Warm tier, general use | $$ |
| HDD | Cold tier, archival | $ |
Memory Planning
Heap Sizing
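Rule: Give the heap no more than about half of the node's RAM. Keep it below the ~32 GB threshold at which the JVM stops using compressed object pointers; 31 GB is a common safe cap.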
Heap = min(Available RAM × 0.5, 31 GB)
Filesystem Cache
Filesystem Cache = Available RAM - Heap - OS Overhead
Target: At least equal to the amount of hot data held on that node, for optimal search performance.
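The heap and cache formulas translate directly into code. This sketch assumes a hypothetical 2 GB OS overhead. It also assumes the cache target applies per node, against that node's share of hot data.

```python
HEAP_CAP_GB = 31.0  # stays under the ~32 GB compressed-oops threshold


def heap_gb(ram_gb: float) -> float:
    """Heap = min(Available RAM x 0.5, 31 GB)."""
    return min(ram_gb * 0.5, HEAP_CAP_GB)


def fs_cache_gb(ram_gb: float, os_overhead_gb: float = 2.0) -> float:
    """Filesystem Cache = Available RAM - Heap - OS Overhead (2 GB is an assumed default)."""
    return ram_gb - heap_gb(ram_gb) - os_overhead_gb


def cache_covers_hot_data(ram_gb: float, hot_data_per_node_gb: float) -> bool:
    """True if the leftover page cache can hold this node's share of hot data."""
    return fs_cache_gb(ram_gb) >= hot_data_per_node_gb


for ram in (32, 64):
    print(ram, heap_gb(ram), fs_cache_gb(ram))
# → 32 16.0 14.0
# → 64 31.0 31.0
```

Past about 64 GB of RAM, every additional gigabyte goes to the filesystem cache, because the heap stays capped.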
CPU Planning
Factors Affecting CPU
- Query complexity
- Aggregation depth
- Indexing rate
- Background operations (merges, recovery)
Estimation
CPU Cores = (Search QPS × Query CPU Cost) + (Index Rate × Index CPU Cost)
Typical starting point: 8-16 cores per data node.
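The CPU formula needs per-operation costs that you measure for your own queries and documents. The figures below are hypothetical placeholders, expressed in CPU-seconds (core-seconds) per query and per document. The node-count helper also assumes a 70% target CPU utilization, which is an illustrative choice rather than a documented rule.

```python
import math


def cpu_cores(
    search_qps: float,
    query_cpu_s: float,          # CPU-seconds consumed per query (measured)
    index_docs_per_s: float,
    index_cpu_s_per_doc: float,  # CPU-seconds consumed per indexed document (measured)
) -> float:
    """CPU Cores = (Search QPS x Query CPU Cost) + (Index Rate x Index CPU Cost)."""
    return search_qps * query_cpu_s + index_docs_per_s * index_cpu_s_per_doc


def nodes_for_cpu(cores: float, cores_per_node: int = 8, target_util: float = 0.7) -> int:
    """Data nodes needed so busy cores stay at or below target_util on each node."""
    return math.ceil(cores / (cores_per_node * target_util))


# Hypothetical workload: 200 QPS at 20 ms CPU each, 10,000 docs/s at 0.2 ms CPU each.
cores = cpu_cores(200, 0.02, 10_000, 0.0002)
print(round(cores, 2), nodes_for_cpu(cores))  # → 6.0 2
```

Background merges and recoveries are not in the formula. The utilization target is one place to leave room for them.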
Network Planning
Bandwidth Requirements
- Inter-node traffic: Recovery, replication
- Client traffic: Queries, bulk indexing
- Monitoring: Metrics, logs
Latency Requirements
- < 1ms between nodes (same zone)
- < 10ms for cross-zone
- Where possible, a dedicated network for inter-node cluster traffic
Cluster Architecture
Small Cluster (< 100 GB/day)
```
┌─────────────┐  ┌─────────────┐  ┌─────────────┐
│   Node 1    │  │   Node 2    │  │   Node 3    │
│ Master+Data │  │ Master+Data │  │ Master+Data │
│  32GB RAM   │  │  32GB RAM   │  │  32GB RAM   │
│   1TB SSD   │  │   1TB SSD   │  │   1TB SSD   │
└─────────────┘  └─────────────┘  └─────────────┘
```
Medium Cluster (100 GB - 1 TB/day)
```
┌─────────────┐  ┌─────────────┐  ┌─────────────┐
│  Master 1   │  │  Master 2   │  │  Master 3   │
│  8GB heap   │  │  8GB heap   │  │  8GB heap   │
└─────────────┘  └─────────────┘  └─────────────┘
┌─────────────┐  ┌─────────────┐  ┌─────────────┐  ┌─────────────┐
│   Data 1    │  │   Data 2    │  │   Data 3    │  │   Data N    │
│  64GB RAM   │  │  64GB RAM   │  │  64GB RAM   │  │  64GB RAM   │
│   4TB SSD   │  │   4TB SSD   │  │   4TB SSD   │  │   4TB SSD   │
└─────────────┘  └─────────────┘  └─────────────┘  └─────────────┘
```
Large Cluster (> 1 TB/day)
```
┌──────────────┐
│ Coordinating │ × 2-4
└──────────────┘
┌─────────┐ ┌─────────┐ ┌─────────┐
│ Master  │ │ Master  │ │ Master  │
└─────────┘ └─────────┘ └─────────┘
┌─────────┐ ┌─────────┐     ┌─────────┐ ┌─────────┐
│Hot Data │ │Hot Data │ ... │Cold Data│ │Cold Data│
│  NVMe   │ │  NVMe   │     │   HDD   │ │   HDD   │
└─────────┘ └─────────┘     └─────────┘ └─────────┘
 Hot Tier (recent data)      Cold Tier (historical)
```
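The daily-volume bands above can be turned into a first-pass chooser. This sketch assumes 1 TB = 1000 GB for the band boundaries and a hypothetical minimum of three data nodes. `data_nodes_needed` simply divides the total from the storage formula by the usable disk per node. It ignores shard-level placement.

```python
import math


def architecture_for(daily_gb: float) -> str:
    """Map daily ingest to the small / medium / large layouts described above."""
    if daily_gb < 100:
        return "small"
    if daily_gb <= 1000:  # 1 TB/day, assuming decimal units
        return "medium"
    return "large"


def data_nodes_needed(total_storage_gb: float, disk_per_node_gb: float, min_nodes: int = 3) -> int:
    """Nodes required to hold total_storage_gb, never fewer than min_nodes (assumed floor)."""
    return max(min_nodes, math.ceil(total_storage_gb / disk_per_node_gb))


print(architecture_for(50), architecture_for(500), architecture_for(2000))
# → small medium large
print(data_nodes_needed(30_000, 4000))  # → 8
```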
Growth Planning
Monitoring Growth
Track these metrics over time:
- Total document count
- Index size
- Daily indexing rate
- Query rate
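One simple way to turn tracked index size into a forecast is to fit a least-squares straight line through (day, size) samples and project it forward to a capacity limit. This sketch assumes growth is linear. It will underestimate workloads that grow by a percentage each month.

```python
import math
from typing import List, Tuple


def linear_trend(samples: List[Tuple[float, float]]) -> Tuple[float, float]:
    """Least-squares slope and intercept for (day, size_gb) samples."""
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_x = sum(x for x, _ in samples) / n
    mean_y = sum(y for _, y in samples) / n
    var_x = sum((x - mean_x) ** 2 for x, _ in samples)
    if var_x == 0:
        raise ValueError("samples must span more than one day")
    cov = sum((x - mean_x) * (y - mean_y) for x, y in samples)
    slope = cov / var_x
    return slope, mean_y - slope * mean_x


def days_until_full(samples: List[Tuple[float, float]], capacity_gb: float) -> float:
    """Days after the latest sample until the fitted trend reaches capacity_gb."""
    slope, intercept = linear_trend(samples)
    if slope <= 0:
        return math.inf  # flat or shrinking: no projected exhaustion
    latest_day = max(x for x, _ in samples)
    fitted_now = intercept + slope * latest_day
    return (capacity_gb - fitted_now) / slope


# Index size sampled every 10 days (hypothetical numbers).
history = [(0, 100.0), (10, 200.0), (20, 300.0)]
print(days_until_full(history, 1000.0))  # → 70.0
```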
Capacity Thresholds
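Set alerts on disk usage (and other utilization metrics) at:
- 70%: Plan expansion
- 80%: Execute expansion
- 85%: Urgent action. This is also Elasticsearch's default low disk watermark, at which it starts restricting shard allocation to the node.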
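The threshold ladder can be expressed as a small lookup. This sketch takes used and total disk for a node, for example the `disk.used` and `disk.total` columns of `GET _cat/allocation`, and returns the matching action. The function name and labels are illustrative.

```python
def capacity_action(used_bytes: int, total_bytes: int) -> str:
    """Map utilization to the 70 / 80 / 85 % alert levels listed above."""
    used = used_bytes / total_bytes
    if used >= 0.85:
        return "urgent action"
    if used >= 0.80:
        return "execute expansion"
    if used >= 0.70:
        return "plan expansion"
    return "ok"


for pct in (65, 72, 80, 90):
    print(pct, capacity_action(pct, 100))
```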
Expansion Strategies
Vertical Scaling:
- Increase node resources
- Better for small increases
- Limited by hardware
Horizontal Scaling:
- Add more nodes
- Better for large increases
- Requires rebalancing
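For horizontal scaling, you can get a rough node count by dividing the data you need to hold by the disk you are willing to fill on each node. This sketch assumes shards spread evenly and uses a hypothetical 70% fill target, matching the "plan expansion" level above. In practice, rebalancing is constrained by shard sizes and allocation rules.

```python
import math


def nodes_to_add(current_nodes: int, used_gb: float, disk_per_node_gb: float,
                 target_fraction: float = 0.70) -> int:
    """Extra data nodes needed to bring average disk use down to target_fraction."""
    required = math.ceil(used_gb / (disk_per_node_gb * target_fraction))
    return max(0, required - current_nodes)


# Three 4 TB nodes holding 9 TB (75% full) need one more node to get back under 70%.
print(nodes_to_add(3, 9000, 4000))  # → 1
```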
Cost Optimization
Right-Sizing
- Start conservative and scale up based on observed usage
- Monitor actual usage vs. provisioned
- Remove over-provisioned resources
Tiered Storage
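Implement a hot-warm-cold architecture with an ILM policy. The `allocate` filters below assume each data node is tagged with a custom `data` attribute (for example, `node.attr.data: warm` in `elasticsearch.yml`). The `rollover` action assumes writes go through a rollover alias or a data stream: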
```
PUT _ilm/policy/tiered_policy
{
  "policy": {
    "phases": {
      "hot": {
        "actions": {
          "rollover": {"max_primary_shard_size": "50gb"}
        }
      },
      "warm": {
        "min_age": "7d",
        "actions": {
          "shrink": {"number_of_shards": 1},
          "allocate": {
            "require": {"data": "warm"}
          }
        }
      },
      "cold": {
        "min_age": "30d",
        "actions": {
          "allocate": {
            "require": {"data": "cold"}
          }
        }
      }
    }
  }
}
```
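To size each tier for the policy above, you can split the retention window at the phase boundaries. This sketch makes several assumptions:

- A hypothetical 90-day retention. The policy as written has no delete phase, so retention would have to be enforced separately.
- A 1.45 overhead multiplier.
- `min_age` is treated as days since ingest. Once an index has rolled over, ILM actually measures `min_age` from the rollover time, so real tier boundaries shift by up to one rollover period.

```python
def tier_storage_gb(daily_gb: float, retention_days: int, replicas: int,
                    overhead: float = 1.45, warm_after: int = 7,
                    cold_after: int = 30) -> dict:
    """Split total storage across hot / warm / cold by days spent in each phase."""
    per_day = daily_gb * (1 + replicas) * overhead
    hot_days = min(warm_after, retention_days)
    warm_days = max(0, min(cold_after, retention_days) - warm_after)
    cold_days = max(0, retention_days - cold_after)
    return {"hot": hot_days * per_day, "warm": warm_days * per_day, "cold": cold_days * per_day}


sizes = tier_storage_gb(100, 90, 1)
print({tier: round(gb) for tier, gb in sizes.items()})
# → {'hot': 2030, 'warm': 6670, 'cold': 17400}
```

In this example, only about 2 TB needs NVMe. Most of the data (about 17.4 TB) can sit on the cheapest storage type.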
Reserved Instances
For cloud deployments:
- Use reserved instances for base capacity
- On-demand for peak/burst
- Spot instances for non-critical workloads
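One way to apply this split is to cover the minimum node count you run in every hour with reserved capacity and pay on-demand rates for anything above it. The hourly rates here are hypothetical placeholders, not real cloud prices.

```python
from typing import List


def blended_cost(hourly_nodes: List[int], reserved_rate: float, on_demand_rate: float) -> float:
    """Cost when the always-on baseline is reserved and bursts run on demand."""
    baseline = min(hourly_nodes)  # nodes needed in every hour
    reserved = baseline * len(hourly_nodes) * reserved_rate  # billed every hour
    burst_node_hours = sum(n - baseline for n in hourly_nodes)
    return reserved + burst_node_hours * on_demand_rate


demand = [3, 3, 5, 4]  # node count needed in each of four hours (hypothetical)
print(round(blended_cost(demand, 0.6, 1.0), 2))  # → 10.2
print(round(sum(demand) * 1.0, 2))               # all on demand → 15.0
```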
Capacity Planning Checklist
Initial Planning
- Documented data volume and growth rate
- Defined retention requirements
- Established performance SLAs
- Calculated storage needs
- Determined node count and specs
- Planned cluster architecture
Ongoing Management
- Monitoring dashboards configured
- Capacity alerts set
- Growth projections updated quarterly
- Cost analysis performed regularly
- Performance baselines established