An Elasticsearch node is a single Elasticsearch JVM process that joins a named cluster. A node holds shards of indices, runs queries, coordinates with peers via the transport protocol, and optionally takes on specialized roles like master eligibility, ingest pipelines, machine learning jobs, or transforms. Clusters are formed by nodes that share a cluster.name and can discover each other via discovery.seed_hosts.
How Elasticsearch Nodes Form a Cluster
Each node has a unique name and ID. Nodes communicate over the transport layer (default port 9300) and serve client traffic over HTTP (default port 9200). On startup, a node:
- Reads its node.roles from elasticsearch.yml.
- Discovers other nodes via discovery.seed_hosts.
- Joins the cluster if the master is reachable, or participates in master election if eligible.
- Receives shard allocation decisions from the master and starts serving traffic.
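The settings a node reads during startup can be sketched as a small generator for elasticsearch.yml. This is a minimal illustration, not a complete production config: the cluster name, node name, and host names below are hypothetical placeholders, and the function name is made up for this example. Values are written as JSON, which is also valid YAML flow syntax.

```python
import json


def render_elasticsearch_yml(cluster_name, node_name, roles, seed_hosts):
    """Render a minimal elasticsearch.yml as text.

    Only the four settings discussed above are emitted. Values are
    serialized as JSON, which YAML parsers accept as flow syntax.
    """
    settings = {
        "cluster.name": cluster_name,
        "node.name": node_name,
        "node.roles": list(roles),
        "discovery.seed_hosts": list(seed_hosts),
    }
    return "\n".join(f"{key}: {json.dumps(value)}" for key, value in settings.items())


# Hypothetical data node pointing at three placeholder master hosts.
config = render_elasticsearch_yml(
    cluster_name="logs-prod",
    node_name="es-data-1",
    roles=["data", "ingest"],
    seed_hosts=["es-master-1:9300", "es-master-2:9300", "es-master-3:9300"],
)
print(config)
```

Passing an empty list for roles renders node.roles: [], which is how a coordinating-only node is declared.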
The master node maintains the cluster state (mappings, settings, shard allocations) and replicates it to all other nodes. Master election requires a quorum (a majority) of master-eligible nodes - the cluster needs at least 3 master-eligible nodes to survive a single failure without split-brain.
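The quorum arithmetic can be checked with a few lines of Python. This sketch assumes a plain majority model; Elasticsearch's real voting-configuration management is more involved, and the function names here are illustrative only.

```python
def quorum(master_eligible: int) -> int:
    """Votes needed for a strict majority of master-eligible nodes."""
    if master_eligible < 1:
        raise ValueError("need at least one master-eligible node")
    return master_eligible // 2 + 1


def tolerated_failures(master_eligible: int) -> int:
    """How many master-eligible nodes can fail while a majority remains."""
    return master_eligible - quorum(master_eligible)


for n in (1, 2, 3, 5):
    print(f"{n} master-eligible: quorum={quorum(n)}, tolerated failures={tolerated_failures(n)}")
```

Under this model two master-eligible nodes tolerate no more failures than one, which is why three is the smallest count that survives the loss of a node.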
Elasticsearch Node Roles
A single node can have multiple roles. For small clusters this is fine. For production clusters above ~6 nodes, dedicate roles for stability.
| Role (node.roles value) | Purpose | Typical hardware |
|---|---|---|
| master | Cluster state, shard allocation, mapping updates | Low CPU, modest RAM, fast disk for cluster state |
| data | Holds shards, serves queries, indexes documents | High RAM (heap), SSDs, balanced CPU |
| data_hot / data_warm / data_cold / data_frozen | Tiered data nodes for ILM | Hot: SSD; cold/frozen: cheap disk |
| ingest | Runs ingest pipelines before indexing | Compute-bound, modest RAM |
| ml | Machine learning jobs and inference | High CPU and RAM |
| transform | Continuous transforms | Balanced |
| remote_cluster_client | Cross-cluster search/replication client | Lightweight |
| coordinating only (node.roles: []) | Routes requests, gathers shard results | High CPU, low disk |
The dedicated-master pattern (3 master-eligible nodes that hold no data) is standard for clusters with >10 data nodes. It keeps cluster-state changes fast and prevents data-node failures from destabilizing the master quorum.
Coordinating Behavior
Every node can coordinate. When a client sends a search request to any node, that node becomes the coordinator for the request: it scatters subqueries to the relevant shards, gathers results, merges, and returns. Dedicated coordinating-only nodes (set node.roles: []) absorb this work in front of large clusters so the data nodes can focus on shard work.
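The gather-and-merge step can be illustrated with a toy coordinator. This is a simplified sketch, not Elasticsearch's actual implementation: the per-shard results are hard-coded dictionaries, and coordinate_search is a name invented for the example.

```python
import heapq


def coordinate_search(shard_results, size):
    """Merge per-shard top hits into a global top `size`, highest score first."""
    all_hits = (hit for hits in shard_results.values() for hit in hits)
    return heapq.nlargest(size, all_hits, key=lambda hit: hit["score"])


# Each shard has already returned its own locally ranked hits.
shard_results = {
    "shard-0": [{"id": "a", "score": 2.1}, {"id": "b", "score": 0.7}],
    "shard-1": [{"id": "c", "score": 3.4}],
    "shard-2": [{"id": "d", "score": 1.5}, {"id": "e", "score": 1.2}],
}

top = coordinate_search(shard_results, size=3)
print([hit["id"] for hit in top])  # ['c', 'a', 'd']
```

The merge cost grows with the number of shards and the requested size, which is the work a coordinating-only tier takes off the data nodes.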
JVM Heap and Memory Sizing
| Constraint | Rule |
|---|---|
| Max heap | 30-31 GB (compressed object pointers cutoff on most JVMs) |
| Heap as % of RAM | Up to 50% of physical RAM, leave the rest for the page cache |
| Off-heap | Lucene memory-maps segments; the OS page cache does the heavy lifting |
| Shards per heap GB | Roughly 20 (Elastic guideline) |
A node with 64 GB RAM gets 31 GB heap and leaves ~33 GB for the page cache and OS. Going above 31 GB heap is usually a regression - you lose compressed oops and add GC latency.
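The sizing rule reduces to one line of arithmetic. The sketch below assumes the 31 GB cap from the table above; the function names are hypothetical, and the -Xms/-Xmx flags are the standard JVM heap options, set to the same value.

```python
def recommended_heap_gb(ram_gb: float, max_heap_gb: float = 31.0) -> float:
    """Half of physical RAM, capped to stay under the compressed-oops cutoff."""
    if ram_gb <= 0:
        raise ValueError("ram_gb must be positive")
    return min(ram_gb / 2, max_heap_gb)


def jvm_heap_flags(ram_gb: float) -> list[str]:
    """Matching -Xms/-Xmx flags, rounded down to whole megabytes."""
    heap_mb = int(recommended_heap_gb(ram_gb) * 1024)
    return [f"-Xms{heap_mb}m", f"-Xmx{heap_mb}m"]


for ram in (16, 64, 128):
    print(ram, recommended_heap_gb(ram), jvm_heap_flags(ram))
```

A 128 GB host still gets a 31 GB heap; the remaining memory goes to the page cache rather than to the JVM.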
Common Node Topology Mistakes
- Running 1 or 2 master-eligible nodes. A two-node setup will split-brain or stall on master failure; the minimum stable count is 3.
- Mixing data and master roles on small heap. Data work (queries, merges) starves the master thread of CPU; cluster state updates lag.
- Setting heap above 31 GB and losing compressed oops, which silently makes everything slower.
- Allocating all RAM to heap, leaving no page cache. Search latency on cold segments tanks.
- No bootstrap.memory_lock: true on Linux production hosts - the OS swaps the heap and pauses go through the roof.
- Skipping cluster.routing.allocation.awareness in multi-AZ deployments. A zone outage takes down all replicas of some shards.
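The multi-AZ risk in the last item can be made concrete with a small check that flags shards whose copies all live in one zone. This is an illustrative model only: the node names, zone labels, and shard layout are made up, and the function does not talk to a cluster.

```python
def shards_confined_to_one_zone(shard_copies, node_zone):
    """Return shards whose copies (primary plus replicas) all sit in a single zone."""
    at_risk = []
    for shard, nodes in shard_copies.items():
        zones = {node_zone[node] for node in nodes}
        if len(zones) == 1:
            at_risk.append(shard)
    return sorted(at_risk)


# Hypothetical layout: two nodes in az-a, one in az-b.
node_zone = {"es-1": "az-a", "es-2": "az-a", "es-3": "az-b"}
shard_copies = {
    "logs[0]": ["es-1", "es-2"],  # both copies in az-a
    "logs[1]": ["es-1", "es-3"],  # spread across zones
}

print(shards_confined_to_one_zone(shard_copies, node_zone))  # ['logs[0]']
```

With allocation awareness configured, the allocator tries to avoid the logs[0] layout by spreading copies of a shard across zones.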
Operating Nodes in Production
Watch:
- JVM heap usage (_nodes/stats/jvm) - sustained >75% with frequent GC means undersized heap or a hot query.
- CPU and load average per node - imbalance is usually shard-routing or hot-key driven.
- Disk usage and watermarks - Elasticsearch stops allocating new shards to a node at cluster.routing.allocation.disk.watermark.low (default 85%), starts moving shards off it at cluster.routing.allocation.disk.watermark.high (default 90%), and marks indices with a shard on that node read-only at the flood-stage watermark (default 95%).
- Pending tasks (_cluster/pending_tasks) on the master - persistent queue means cluster-state churn.
- Hot threads (_nodes/hot_threads) when latency spikes.
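Two of these checks are easy to sketch offline. The example below assumes Elasticsearch's default disk watermarks (low 85%, high 90%, flood stage 95%) and a dictionary shaped like the _nodes/stats/jvm response, where each node reports jvm.mem.heap_used_percent. The sample payload, node names, and function names are invented for illustration.

```python
LOW, HIGH, FLOOD = 85.0, 90.0, 95.0  # default disk watermarks, in percent used


def disk_watermark_stage(used_percent, low=LOW, high=HIGH, flood=FLOOD):
    """Classify disk usage against the low / high / flood-stage watermarks."""
    if used_percent >= flood:
        return "flood_stage"  # indices with shards here become read-only
    if used_percent >= high:
        return "high"         # shards start relocating away
    if used_percent >= low:
        return "low"          # no new shards are allocated here
    return "ok"


def nodes_over_heap(stats, threshold=75):
    """Names of nodes whose heap usage exceeds `threshold` percent."""
    return sorted(
        node["name"]
        for node in stats["nodes"].values()
        if node["jvm"]["mem"]["heap_used_percent"] > threshold
    )


# Hand-written sample in the shape of a _nodes/stats/jvm response.
sample_stats = {
    "nodes": {
        "id-1": {"name": "es-data-1", "jvm": {"mem": {"heap_used_percent": 82}}},
        "id-2": {"name": "es-data-2", "jvm": {"mem": {"heap_used_percent": 54}}},
    }
}

print(nodes_over_heap(sample_stats))  # ['es-data-1']
print([disk_watermark_stage(p) for p in (70, 86, 91, 96)])
```

A single reading above the threshold is not an alert on its own; the point made above is about sustained usage combined with frequent GC.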
Pulse monitors all of these per node and across the cluster, with automated thresholds calibrated to Elastic's published guidance. When a node starts heading toward GC trouble or disk-watermark eviction, Pulse's agentic root-cause analysis identifies the actual driver (a single hot shard, a runaway query, a leaking aggregation cache) instead of just paging on heap percentage. Connecting your cluster to proactive Pulse monitoring catches these patterns before they cascade.
Frequently Asked Questions
Q: How many Elasticsearch nodes do I need in production?
A: Minimum 3 master-eligible nodes for HA (so quorum survives a single failure). Data node count depends on data volume and query load - start with 3 data nodes and scale horizontally. Below 3 nodes you have either no HA (single node) or split-brain risk (two nodes).
Q: What is the difference between a master node and a data node in Elasticsearch?
A: A master node manages the cluster state (mappings, shard allocation, node membership). A data node holds shards and serves search/index traffic. A single node can do both, but dedicated master nodes (no data role) are the production-recommended pattern for clusters >6 nodes.
Q: Can I run multiple Elasticsearch nodes on one machine?
A: Technically yes - distinct path.data and ports - but production deployments use one node per machine for fault isolation and to avoid memory/IO contention. Running multiple nodes on a host also subverts shard-allocation awareness, which assumes one node = one fault domain.
Q: What happens when an Elasticsearch node fails?
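A: The master detects the loss through follower checks: a disconnected node is noticed immediately, while an unresponsive one is removed after cluster.fault_detection.follower_check.retry_count consecutive failed checks (default 3), each of which times out after cluster.fault_detection.follower_check.timeout (default 10s). The master then marks the node's shards as unassigned and promotes replicas of its primaries to primaries. Unreplicated shards (number_of_replicas: 0) become unavailable until the node returns. If the failed node was the master, the remaining master-eligible nodes elect a new one.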
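The promotion logic can be modeled in a few lines. This is a deliberately simplified sketch: real Elasticsearch promotes an in-sync replica chosen by the master and then rebuilds missing replicas, while this toy version just takes the first surviving replica. The shard layout and function name are hypothetical.

```python
def fail_node(shards, failed_node):
    """Simulate losing a node.

    `shards` maps a shard name to {"primary": node, "replicas": [nodes]}.
    Returns (new_state, unavailable_shards).
    """
    new_state, unavailable = {}, []
    for shard, copies in shards.items():
        primary = copies["primary"]
        replicas = [node for node in copies["replicas"] if node != failed_node]
        if primary == failed_node:
            if replicas:
                # Promote the first surviving replica to primary.
                primary, replicas = replicas[0], replicas[1:]
            else:
                # No other copy exists: the shard is unavailable.
                primary = None
                unavailable.append(shard)
        new_state[shard] = {"primary": primary, "replicas": replicas}
    return new_state, sorted(unavailable)


shards = {
    "logs[0]": {"primary": "es-1", "replicas": ["es-2"]},
    "logs[1]": {"primary": "es-2", "replicas": ["es-3"]},
    "tmp[0]": {"primary": "es-1", "replicas": []},  # number_of_replicas: 0
}

state, lost = fail_node(shards, "es-1")
print(state["logs[0]"])  # {'primary': 'es-2', 'replicas': []}
print(lost)              # ['tmp[0]']
```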
Q: How much heap should an Elasticsearch node have?
A: Up to 50% of physical RAM, capped at 30-31 GB to keep compressed object pointers. The rest of RAM should go to the OS page cache for Lucene's memory-mapped segments. A 64 GB host gets 31 GB heap and ~33 GB page cache.
Q: What are Elasticsearch node tiers (hot, warm, cold, frozen)?
A: Node tiers attach a role (data_hot, data_warm, data_cold, data_frozen) that ILM uses to migrate indices through their lifecycle. Hot nodes run on SSDs with high heap for active writes; cold/frozen nodes use cheap disk and serve rarely-queried indices, often via searchable snapshots.
Related Reading
- What is Elasticsearch Index: shards live on nodes
- What is Elasticsearch Mapping: cluster state held by master
- What is Elasticsearch Refresh Interval: per-shard refresh on data nodes
- What is Elasticsearch Query Cache: per-node cache behavior
- Elasticsearch Pros and Cons: system-level trade-offs