Running Elasticsearch on Kubernetes through the ECK operator simplifies deployments but introduces its own failure modes. Most production incidents trace back to a small set of recurring problems.
Pod Scheduling Failures
The most common reason an Elasticsearch pod stays in Pending is that the scheduler cannot find a node satisfying the pod's resource requests. ECK lets you specify CPU and memory requests in the podTemplate section, and teams frequently copy example manifests with requests like 4Gi memory and 2 CPU without checking whether their nodes can accommodate those values after accounting for system pods, DaemonSets, and other workloads.
Check scheduling failures with kubectl describe pod <pod-name> and look at the Events section. The messages are specific: Insufficient memory, Insufficient cpu, or 0/N nodes are available: N node(s) didn't match Pod's node affinity/selector. If you are using node affinity or topology spread constraints to pin Elasticsearch pods to specific node pools, verify that the labels on your nodes match the selectors in your spec. A typo in a label key silently prevents scheduling.
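As a sketch, a nodeSet with right-sized requests and an explicit node selector might look like the following. The `nodepool: elasticsearch` label is purely illustrative, not an ECK default:

```yaml
nodeSets:
- name: default
  count: 3
  podTemplate:
    spec:
      nodeSelector:
        nodepool: elasticsearch   # illustrative label; must match your actual node labels
      containers:
      - name: elasticsearch
        resources:
          requests:
            memory: 4Gi
            cpu: 2
          limits:
            memory: 4Gi
```

Compare the selector against the output of kubectl get nodes --show-labels before applying; as noted above, a mismatched label key fails silently.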
Local PersistentVolumes add another wrinkle. During rolling upgrades, ECK deletes and recreates pods one at a time. If a pod was bound to a local PV on a specific node and that node is full or cordoned, the new pod cannot be scheduled because the PVC is already bound to the original node's PV. This is a Kubernetes storage limitation, not an ECK bug. Either use network-attached storage (EBS, Persistent Disk, Azure Disk) or accept that local PV setups require careful capacity planning per node.
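A minimal volumeClaimTemplates sketch using a network-attached StorageClass; the class name gp3-encrypted is a placeholder for whatever your cluster actually defines:

```yaml
nodeSets:
- name: default
  count: 3
  volumeClaimTemplates:
  - metadata:
      name: elasticsearch-data        # the claim name ECK expects for the data volume
    spec:
      accessModes:
      - ReadWriteOnce
      resources:
        requests:
          storage: 500Gi
      storageClassName: gp3-encrypted # placeholder; any network-attached class avoids the local-PV rebinding problem
```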
OOMKilled and JVM Heap Misconfiguration
An Elasticsearch pod that repeatedly gets OOMKilled almost always has a mismatch between container memory limits and JVM heap size. Elasticsearch 7.11+ auto-sizes heap to roughly 50% of available memory, but "available memory" means the container's memory limit as seen by the JVM's ergonomics. If you set resources.limits.memory: 4Gi without explicitly configuring heap, Elasticsearch allocates about 2GB to heap and expects the remaining 2GB for off-heap usage - Lucene segment caches, network buffers, and direct byte buffers.
The problem arises when you set resource requests without setting limits. The JVM may detect the full node memory instead of the cgroup limit, allocate a massive heap, and then get killed when actual usage exceeds the cgroup boundary. Always set both requests and limits to the same value for memory:
podTemplate:
  spec:
    containers:
    - name: elasticsearch
      resources:
        requests:
          memory: 4Gi
        limits:
          memory: 4Gi
      env:
      - name: ES_JAVA_OPTS
        value: "-Xms2g -Xmx2g"
If you override ES_JAVA_OPTS, keep heap at or below 50% of the container memory limit. Going higher starves Lucene's page cache and off-heap structures, which paradoxically worsens performance and can trigger the OOM killer via native memory allocation. The ECK operator itself can also get OOMKilled on large clusters - bump its memory limit to 512Mi or 1Gi if you see it restarting.
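If the operator itself is restarting with OOMKilled, raising its limit is a small change to the elastic-operator StatefulSet in the elastic-system namespace. This excerpt assumes the default install's container name (manager), and the 1Gi figure is a suggestion rather than official sizing:

```yaml
# Excerpt of the elastic-operator StatefulSet; only the resources block changes.
containers:
- name: manager
  resources:
    requests:
      memory: 1Gi
    limits:
      memory: 1Gi   # raise from the default if the operator restarts on large clusters
```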
PVC Sizing and Resizing
ECK creates PersistentVolumeClaims based on the volumeClaimTemplates in your Elasticsearch spec. Once created, a PVC's storage request can never shrink, and it can only grow if the StorageClass sets allowVolumeExpansion: true. If you initially provisioned 100Gi and need 500Gi, you cannot assume that editing the Elasticsearch resource will resize the PVC in place.
The supported workaround is to create a new nodeSet with the desired storage size and remove the old one. ECK orchestrates the migration by adding new nodes, waiting for shard relocation, then removing old nodes. This requires enough cluster capacity to run both old and new nodes simultaneously. Plan PVC sizes generously from the start.
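The nodeSet migration can be sketched like this: ECK treats the differently named nodeSet as new, brings it up at the larger size, waits for shard relocation, and drains the old one once it disappears from the spec. The names data-v1 and data-v2 are illustrative:

```yaml
nodeSets:
- name: data-v2               # new nodeSet with larger disks
  count: 3
  volumeClaimTemplates:
  - metadata:
      name: elasticsearch-data
    spec:
      accessModes:
      - ReadWriteOnce
      resources:
        requests:
          storage: 500Gi
# Remove the old 100Gi nodeSet (e.g. data-v1) in the same apply;
# ECK only tears it down after the replacement nodes have joined
# and shards have relocated.
```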
Some cloud providers (GKE, and EKS with gp3) support online volume expansion. Verify that your StorageClass has allowVolumeExpansion: true. Recent ECK releases detect an increased storage request in volumeClaimTemplates and resize the existing PVCs for you; on older releases you must patch each PVC's spec.resources.requests.storage by hand.
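For reference, an expansion-capable StorageClass looks like this (gp3 on the EBS CSI driver shown; the name is illustrative and parameters vary by provisioner):

```yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: gp3-expandable        # illustrative name
provisioner: ebs.csi.aws.com
parameters:
  type: gp3
allowVolumeExpansion: true    # without this, storage-request increases are rejected
```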
Certificate Management and TLS Errors
ECK generates self-signed CA certificates for both the HTTP and transport layers. These certificates have a default validity period and the operator rotates them automatically before expiry. However, client applications that connect to Elasticsearch must trust ECK's CA. The CA certificate is stored in a secret named <cluster-name>-es-http-certs-public.
Extract it with:
kubectl get secret my-cluster-es-http-certs-public \
  -o jsonpath='{.data.ca\.crt}' | base64 -d > ca.crt
If you use a custom HTTP certificate, reference the secret holding it through the http.tls.certificate setting in the Elasticsearch resource. Transport-layer certificates are managed separately and are generally not something you should override unless you have a specific compliance requirement. Expired or misconfigured transport certificates cause nodes to reject each other - the symptom is nodes that start but never join the cluster, with SSL handshake errors in the logs.
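A sketch of pointing ECK at a custom HTTP certificate; the secret must contain tls.crt and tls.key (plus ca.crt if the chain is not self-contained), and the cluster name and version here are illustrative:

```yaml
apiVersion: elasticsearch.k8s.elastic.co/v1
kind: Elasticsearch
metadata:
  name: my-cluster
spec:
  version: 8.13.0                      # illustrative version
  http:
    tls:
      certificate:
        secretName: my-cluster-http-cert   # your secret with tls.crt and tls.key
```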
Rolling Upgrades and CrashLoopBackOff
ECK performs rolling upgrades by cycling through pods one at a time: it disables shard allocation, stops the old pod, starts the new one, waits for it to join the cluster, re-enables allocation, and waits for green health before moving to the next node. This process stalls if any pod enters CrashLoopBackOff.
Common causes include: incompatible Elasticsearch configuration changes that prevent startup, plugin version mismatches after an upgrade, corrupted data on the PersistentVolume (rare, usually after unclean shutdown), and insufficient memory leading to immediate OOM on startup. Check kubectl logs <pod-name> --previous to see the last output before the crash.
The readiness probe also matters here. ECK configures a readiness probe that checks HTTP responsiveness with a default 3-second timeout. Under heavy load during recovery - particularly when a node is replaying its transaction log - the probe can time out, causing Kubernetes to mark the pod as not ready and potentially restart it. On Elasticsearch 8.2+, ECK uses a socket-based readiness probe on the dedicated readiness port, which is not affected by cluster load. On older versions, you may need to increase failureThreshold or timeoutSeconds in the pod template to give nodes enough time to recover large transaction logs before Kubernetes loses patience.
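On pre-8.2 clusters, the probe can be loosened in the podTemplate. ECK's readiness script also reads a READINESS_PROBE_TIMEOUT environment variable, so both values are raised together here. The exact numbers are starting points rather than recommendations, and the script path is the one ECK mounts into the container:

```yaml
podTemplate:
  spec:
    containers:
    - name: elasticsearch
      env:
      - name: READINESS_PROBE_TIMEOUT
        value: "10"          # seconds; consumed by ECK's readiness script
      readinessProbe:
        exec:
          command:
          - bash
          - -c
          - /mnt/elastic-internal/scripts/readiness-probe-script.sh
        failureThreshold: 5
        initialDelaySeconds: 10
        periodSeconds: 12
        successThreshold: 1
        timeoutSeconds: 12   # keep this above READINESS_PROBE_TIMEOUT
```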