Elasticsearch HTTP 500 Internal Server Error

An HTTP 500 from Elasticsearch signals an unhandled exception on the server side. Unlike 4xx errors that indicate client mistakes, a 500 means something broke inside Elasticsearch itself. These errors often point to data corruption, bugs, or environmental problems that will not resolve on their own.

What Triggers a 500

Several categories of failure produce 500 responses. Corrupt Lucene segments are the most damaging - when the underlying index files fail checksum validation or contain inconsistent data, any search or indexing operation that touches the affected shard throws a CorruptIndexException. This typically happens after unclean shutdowns, disk failures, or hardware-level data corruption.

Out-of-memory conditions during request processing can also trigger 500s. Elasticsearch's circuit breakers reject requests before memory is exhausted, but they do not cover every allocation path. If a request slips past the breakers and causes an OOM during execution, the error surfaces as a 500.
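Each node's breaker limits and current memory estimates can be inspected with the node stats API; estimates running close to the limit suggest requests are operating near the uncovered allocation paths described above. A read-only check:

```
GET /_nodes/stats/breaker
```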

Plugin failures are another source. A plugin that throws an uncaught exception during query parsing, custom scoring, or ingest pipeline processing will produce a 500 with the plugin's exception in the stack trace. Mapping conflicts can also cause 500s. If a ClassCastException appears in the response, it often means a field was indexed with one type but queried with operations that assume another. This is more common in clusters where dynamic mapping produced unexpected field types.

Reading the Stack Trace

Elasticsearch returns a structured error object in the JSON response body when a 500 occurs:

{
  "error": {
    "root_cause": [{
      "type": "corrupt_index_exception",
      "reason": "checksum failed (hardware problem?): expected=abc123 actual=def456"
    }],
    "type": "search_phase_execution_exception",
    "reason": "all shards failed",
    "caused_by": {
      "type": "corrupt_index_exception",
      "reason": "checksum failed (hardware problem?): expected=abc123 actual=def456"
    }
  },
  "status": 500
}

The root_cause array gives you the original exception. The outer type field tells you which phase failed. When you see search_phase_execution_exception, the failure happened during query execution. The caused_by chain can be several levels deep - follow it to the bottom to find the actual problem.
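Walking the caused_by chain can be automated. A minimal sketch in Python, assuming the error JSON has already been parsed into a dict shaped like the example above:

```python
def root_cause(error: dict) -> dict:
    """Walk the caused_by chain to its deepest entry."""
    node = error
    while "caused_by" in node:
        node = node["caused_by"]
    return node

response = {
    "error": {
        "type": "search_phase_execution_exception",
        "reason": "all shards failed",
        "caused_by": {
            "type": "corrupt_index_exception",
            "reason": "checksum failed (hardware problem?): expected=abc123 actual=def456",
        },
    },
    "status": 500,
}

deepest = root_cause(response["error"])
print(deepest["type"])  # corrupt_index_exception
```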

The response body alone is often not enough. By default, Elasticsearch omits full stack traces from HTTP responses and returns only the exception type and reason. The full exception chain, including line numbers and nested causes, is written to elasticsearch.log on the node that handled the request. Search the log for the same timestamp or exception type to find the complete trace.
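Alternatively, the error_trace common option asks Elasticsearch to include stack traces in the HTTP response itself, which is useful when log access is inconvenient (index name here is a placeholder):

```
GET /my-index/_search?error_trace=true
```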

Common Root Causes

Corrupt Lucene segments are the most frequent root cause of persistent 500s. The CorruptIndexException message typically includes "checksum failed" or "file mismatch," indicating that segment files on disk do not match their expected state. Power outages, kernel panics, or faulty storage controllers are common triggers. Running GET /_cat/shards?v&h=index,shard,prirep,state,unassigned.reason reveals UNASSIGNED shards with ALLOCATION_FAILED in the reason field.
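For any shard stuck in ALLOCATION_FAILED, the allocation explain API reports why it cannot be assigned, including the exception from the last allocation attempt (index name and shard number below are placeholders):

```
GET /_cluster/allocation/explain
{
  "index": "my-index",
  "shard": 0,
  "primary": true
}
```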

Incompatible plugin versions produce 500s that appear immediately after a cluster upgrade. If a plugin was compiled against an older Elasticsearch version, it may throw NoSuchMethodError or ClassNotFoundException during request handling. Check GET /_cat/plugins?v to verify all plugins match the cluster version.

ClassCastException errors typically surface during aggregation or script execution. A field mapped as keyword but containing data that a script tries to cast to long will fail this way. These errors are reproducible and tied to specific queries rather than specific shards.

Recovery Approaches

For corrupt shards, the recovery strategy depends on whether replicas exist. If a shard has a healthy in-sync replica, Elasticsearch promotes it automatically once the corrupt primary fails. If only a stale copy remains, you can force-allocate it as the primary with the cluster reroute API:

POST /_cluster/reroute
{
  "commands": [{
    "allocate_stale_primary": {
      "index": "my-index",
      "shard": 0,
      "node": "node-2",
      "accept_data_loss": true
    }
  }]
}

The accept_data_loss: true flag is required because a stale copy may be behind the failed primary. If no replicas exist, restoring from a snapshot is the only option that avoids data loss. As a last resort, the elasticsearch-shard CLI tool (its remove-corrupted-data command) runs Lucene's CheckIndex against a shard directory and truncates the corrupt segments, but this discards the damaged data.
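A snapshot restore of just the affected index might look like the following (repository and snapshot names are placeholders; the index must be closed or deleted before restoring over it):

```
POST /_snapshot/my_repo/my_snapshot/_restore
{
  "indices": "my-index"
}
```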

For plugin-related 500s, upgrade or remove the incompatible plugin. For mapping conflicts, identify the affected field from the exception details, then fix the query or reindex the data with the correct mapping.
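Reindexing into a corrected mapping can be sketched as follows (index names and the field type are placeholders; create the destination index with the correct types before running the reindex):

```
PUT /my-index-fixed
{
  "mappings": {
    "properties": {
      "my_field": { "type": "long" }
    }
  }
}

POST /_reindex
{
  "source": { "index": "my-index" },
  "dest": { "index": "my-index-fixed" }
}
```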

Distinguishing 500 from 503

A 500 and a 503 communicate different problems. A 500 means the server encountered an unexpected condition - a bug, corruption, or environmental failure. It typically will not resolve by retrying the same request. A 503 means the service is temporarily unable to handle the request, usually due to the node starting up, shutting down, or the master node being unavailable.

Thread pool rejections, where Elasticsearch's search or write queues are full, return 429 (Too Many Requests) in current versions, not 503. Older versions mapped EsRejectedExecutionException to 503, so pre-7.x clusters may still produce 503 for overload scenarios. Circuit breaker trips also return 429 in modern Elasticsearch.

The practical distinction: 500s require investigation and manual intervention. Retrying will produce the same failure. 503s and 429s are transient and should be retried with backoff. If your monitoring groups all 5xx errors together, you lose this signal. Track 500s and 503s separately to distinguish corruption from capacity problems.
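A client-side retry policy that respects this distinction can be sketched in Python as follows (the status codes are the ones discussed above; the backoff parameters are illustrative):

```python
import random

# Transient conditions worth retrying: overload (429) and
# temporary unavailability (503). A 500 indicates a server-side
# bug or corruption, so retrying just repeats the failure.
RETRYABLE = {429, 503}

def should_retry(status: int) -> bool:
    """Return True only for transient error statuses."""
    return status in RETRYABLE

def backoff_delay(attempt: int, base: float = 0.5, cap: float = 30.0) -> float:
    """Exponential backoff with full jitter, capped at `cap` seconds."""
    return random.uniform(0, min(cap, base * 2 ** attempt))

for status in (500, 503, 429):
    print(status, "retry" if should_retry(status) else "investigate")
```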
