Common Elasticsearch Transform Errors and Troubleshooting

Elasticsearch transforms run continuous or batch aggregations from a source index (or index pattern) into a destination index. They are useful for building entity-centric summaries - for example, aggregating raw log events into per-user or per-session metrics. When transforms break, they can enter a failed state silently, stop producing updated data, and leave stale results in the destination index.

How Transforms Work and Fail

A transform reads from source indices using composite aggregations, processes the data in pages, and writes the results to a destination index. Continuous transforms detect changes since the last checkpoint and only re-process the modified data. Each completed pass creates a new checkpoint.
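
The paging idea can be illustrated with a toy model. This is only a sketch of how "group, then fetch buckets page by page, resuming after the last key" behaves; it is not how Elasticsearch implements composite aggregations, and the function name composite_pages and the event shape are hypothetical.

```python
from collections import Counter


def composite_pages(events, key, page_size):
    """Yield pages of (group_key, doc_count), resuming after the last key seen."""
    counts = Counter(event[key] for event in events)
    keys = sorted(counts)
    after = None  # plays the role of the "after key" between pages
    while True:
        page = [k for k in keys if after is None or k > after][:page_size]
        if not page:
            return
        yield [(k, counts[k]) for k in page]
        after = page[-1]


if __name__ == "__main__":
    sample = [{"user": "a"}, {"user": "b"}, {"user": "a"}, {"user": "c"}]
    for page in composite_pages(sample, "user", page_size=2):
        print(page)
```

Each yielded page corresponds loosely to one search request; a smaller page size means more requests but less data held per request.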

Transforms can enter several states: started, indexing, stopping, stopped, aborting, and failed. The failed state means the transform encountered an unrecoverable error and halted. The indexing state normally appears briefly during checkpoint processing, but a transform stuck in indexing for an extended period signals a problem - usually a very slow source query or resource pressure.

Checkpoint failures occur when the transform cannot complete a pass. This can happen if a source index is deleted or rolled over mid-checkpoint, if the destination index rejects documents due to mapping conflicts, or if the transform's search request times out.

Diagnosing with _transform Stats

The GET _transform/<transform_id>/_stats endpoint is the primary diagnostic tool. It returns the transform state, the current and last checkpoint details, indexed document counts, search and index times, and - critically - any error messages.

GET _transform/my-transform/_stats

// Key fields to examine (abridged from one entry of the response's "transforms" array):
{
  "state": "failed",
  "checkpointing": {
    "last": { "checkpoint": 42, "timestamp_millis": 1709312400000 },
    "next": { "checkpoint": 43, "position": { ... } }
  },
  "stats": {
    "documents_processed": 1500000,
    "pages_processed": 3000,
    "search_failures": 12,
    "index_failures": 3
  },
  "reason": "Could not bulk index documents: ... mapper_parsing_exception ..."
}

The reason field contains the error message that caused the failure. Common values include mapper_parsing_exception (mapping conflict in the destination), index_not_found_exception (source index deleted), and circuit_breaking_exception (memory pressure). The search_failures and index_failures counters are cumulative totals for the transform rather than per-checkpoint figures, so compare successive readings to see whether failures are still occurring.
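
As a rough aid, the exception names can be matched against the reason text. The sketch below assumes the reason is available as a plain string; the lookup table and the function name likely_causes are hypothetical helpers, and the table covers only the three exceptions named above.

```python
# Hypothetical lookup table: exception name -> likely cause, as described in the text.
KNOWN_CAUSES = {
    "mapper_parsing_exception": "mapping conflict in the destination index",
    "index_not_found_exception": "source index deleted",
    "circuit_breaking_exception": "memory pressure",
}


def likely_causes(reason):
    """Return the likely causes whose exception name appears in the reason text."""
    if not reason:
        return []
    return [cause for name, cause in KNOWN_CAUSES.items() if name in reason]


if __name__ == "__main__":
    print(likely_causes("Could not bulk index documents: ... mapper_parsing_exception ..."))
```

A reason that matches nothing in the table still needs to be read in full; the match is a starting point, not a diagnosis.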

Check the checkpointing block to understand progress. If last.checkpoint has not advanced in hours or days, the transform is stuck. If next.checkpoint exists but next.position never changes, the transform is hung during processing.
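
The staleness check can be sketched in a few lines. This assumes the stats entry has already been parsed into a Python dict shaped like the example above; the function names and the one-hour default threshold are illustrative assumptions, and the age test only makes sense for continuous transforms.

```python
import time


def checkpoint_age_seconds(entry, now_millis=None):
    """Seconds since the last completed checkpoint, or None if none is recorded."""
    last = entry.get("checkpointing", {}).get("last", {})
    timestamp = last.get("timestamp_millis")
    if timestamp is None:
        return None
    if now_millis is None:
        now_millis = int(time.time() * 1000)
    return (now_millis - timestamp) / 1000.0


def looks_stuck(entry, max_age_seconds=3600, now_millis=None):
    """True if the transform is failed or its last checkpoint is older than the threshold."""
    if entry.get("state") == "failed":
        return True
    age = checkpoint_age_seconds(entry, now_millis)
    return age is not None and age > max_age_seconds
```

Pick the threshold relative to the transform's frequency and typical checkpoint duration; a transform that normally takes 40 minutes per checkpoint should not be flagged after one hour of apparent inactivity without further checks.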

Common Error Causes

Mapping conflicts in the destination index are the most frequent cause of transform failures. If the source schema changes (a field is added or changes type) or you alter the aggregation definition, the transform may try to write documents that conflict with the existing destination mapping. For example, a field mapped as long in the destination rejects non-numeric string values, and because an existing field's type cannot be changed in place, the fix usually means recreating the destination index.
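
One way to spot such conflicts ahead of time is to compare the destination mapping with the types the transform is expected to produce. The sketch below assumes the output of GET <destination_index>/_mapping has been parsed into a dict; the function name find_type_conflicts and the expected_types argument are hypothetical, and only top-level fields are inspected.

```python
def find_type_conflicts(mapping_response, index_name, expected_types):
    """Return {field: (actual_type, expected_type)} for top-level fields that differ."""
    properties = (
        mapping_response.get(index_name, {})
        .get("mappings", {})
        .get("properties", {})
    )
    conflicts = {}
    for field, expected in expected_types.items():
        actual = properties.get(field, {}).get("type")
        # Fields missing from the mapping are not conflicts; they have no type yet.
        if actual is not None and actual != expected:
            conflicts[field] = (actual, expected)
    return conflicts
```

An empty result does not guarantee success, since nested objects and dynamic templates are ignored here, but a non-empty one points directly at the fields to fix.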

Source index deletion or rollover during a checkpoint causes index_not_found_exception. This is common when Index Lifecycle Management (ILM) deletes old indices on a schedule that overlaps with transform execution. If the transform's source is an index pattern like logs-* and ILM deletes logs-2024.01 mid-checkpoint, the transform fails because it held a reference to that index.

Insufficient permissions cause transforms to fail at creation or during execution. The account that creates and starts the transform needs the manage_transform cluster privilege (granted by the built-in transform_admin role; transform_user only allows viewing transforms), read and view_index_metadata on the source indices, and read, index, and create_index on the destination index. Missing permissions on the destination index often surface as security_exception in the stats reason field.
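
A custom role covering these needs might look like the sketch below. It assumes the manage_transform cluster privilege plus read and view_index_metadata on the sources and read, index, and create_index on the destination; the function name and the index names in the usage line are placeholders, and a transform with a retention policy would likely need additional destination privileges.

```python
import json


def transform_role_body(source_patterns, dest_index):
    """Build a role definition body for an account that creates and runs a transform."""
    return {
        "cluster": ["manage_transform"],
        "indices": [
            {
                "names": list(source_patterns),
                "privileges": ["read", "view_index_metadata"],
            },
            {
                "names": [dest_index],
                "privileges": ["read", "index", "create_index"],
            },
        ],
    }


if __name__ == "__main__":
    # Placeholder index names for illustration only.
    print(json.dumps(transform_role_body(["logs-*"], "user-summary"), indent=2))
```

The resulting JSON is intended as the body of a PUT _security/role/<role_name> request.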

Resource Settings and Schema Evolution

Two settings control transform resource consumption. max_page_search_size sets the page size for the composite aggregation - the number of buckets fetched per search request. The default is 500, with a range of 10 to 65,536. Larger values mean fewer search requests per checkpoint but higher memory usage per request. If you hit circuit breaker exceptions, lower this value. If checkpoints are slow but the cluster has headroom, raise it.

frequency controls how often a continuous transform checks for source changes. The default is 1m. Setting it lower (like 10s) makes the destination index more up to date but puts more search load on the cluster. Setting it higher (like 15m) reduces load but increases data staleness. The minimum is 1s and the maximum is 1h.

When source data schemas evolve, the transform definition may need updating. The pivot (or latest) configuration, including its group-by fields and aggregations, cannot be changed after the transform is created. To change it, stop the transform, delete and recreate it with the updated definition, optionally delete the destination index to rebuild from scratch, then start the transform again. Other properties, such as frequency, max_page_search_size, and the source query, can be changed with the POST _transform/<transform_id>/_update API without deleting the transform.
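
The limits described in this section can be checked before sending an update. The following is a minimal sketch that builds a body for the _update API; the function names are hypothetical, and the frequency parser handles only the s, m, and h units for simplicity.

```python
import re

_UNIT_SECONDS = {"s": 1, "m": 60, "h": 3600}


def frequency_seconds(value):
    """Convert a frequency such as '10s', '5m', or '1h' to seconds."""
    match = re.fullmatch(r"(\d+)([smh])", value)
    if not match:
        raise ValueError(f"unsupported frequency format: {value!r}")
    return int(match.group(1)) * _UNIT_SECONDS[match.group(2)]


def settings_update_body(frequency=None, max_page_search_size=None):
    """Build an update body, enforcing the documented ranges."""
    body = {}
    if frequency is not None:
        if not 1 <= frequency_seconds(frequency) <= 3600:
            raise ValueError("frequency must be between 1s and 1h")
        body["frequency"] = frequency
    if max_page_search_size is not None:
        if not 10 <= max_page_search_size <= 65536:
            raise ValueError("max_page_search_size must be between 10 and 65,536")
        body["settings"] = {"max_page_search_size": max_page_search_size}
    return body
```

For example, settings_update_body("5m", 1000) yields a body that raises the check interval to five minutes and doubles the default page size.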

Transforms vs Rollup Jobs

Rollup jobs were deprecated in Elasticsearch 8.11 and will be removed in a future version. Transforms are a common replacement for many of their use cases. The key differences: rollup jobs use a special rollup index format that requires the rollup search API for querying, while transforms write to standard indices searchable with normal queries. Transforms support pivot and latest modes and can run continuously against live data. Rollup jobs were limited to predefined aggregation metrics and date histogram groupings.

If you are migrating from rollups, note that transforms produce standard documents rather than the compact rollup format. The destination index will be larger, but you gain full query flexibility and compatibility with dashboards and alerting. For time-series downsampling specifically, Elasticsearch now offers a dedicated downsampling feature as an ILM or data stream lifecycle action, which is the direct successor to rollup for reducing storage of time-series data.

Recovering a Failed Transform

To recover a failed transform, first check GET _transform/<transform_id>/_stats to identify the error, then address the root cause: fix the mapping, restore the missing index, or grant the missing permissions. Next, stop the transform with POST _transform/<transform_id>/_stop?force=true (the force flag is required for transforms in a failed state). If the destination index holds corrupt or partial data from the failed checkpoint, reset the stopped transform with POST _transform/<transform_id>/_reset, which deletes its checkpoints and state, along with the destination index if the transform created it. Finally, restart it with POST _transform/<transform_id>/_start. It resumes from the last successful checkpoint, or starts from scratch if it was reset.

For transforms stuck in indexing rather than failed, a stop request with wait_for_completion=true&timeout=30s blocks until the transform has stopped or the timeout elapses, which gives it a chance to shut down cleanly. If it has not stopped within the timeout, use force=true to terminate it immediately.
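
The recovery sequence for a failed transform can be written down as an ordered list of requests. This sketch only builds the method and path pairs and does not send anything; the function name recovery_requests is hypothetical, and fixing the root cause remains a manual step between the stats check and the stop.

```python
def recovery_requests(transform_id, reset=False):
    """Return the ordered (method, path) pairs for recovering a failed transform."""
    base = f"_transform/{transform_id}"
    steps = [
        ("GET", f"{base}/_stats"),             # identify the error
        ("POST", f"{base}/_stop?force=true"),  # force is needed in the failed state
    ]
    if reset:
        steps.append(("POST", f"{base}/_reset"))  # discard checkpoints and rebuild
    steps.append(("POST", f"{base}/_start"))
    return steps


if __name__ == "__main__":
    for method, path in recovery_requests("my-transform", reset=True):
        print(method, path)
```

Keeping the reset optional reflects the trade-off: skipping it resumes from the last successful checkpoint, while including it rebuilds the destination from scratch.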