Watcher is the built-in alerting framework in Elasticsearch. When watches stop firing, the cause is rarely obvious because the execution pipeline has multiple stages, each with its own failure modes. This article covers the watch execution model, common reasons watches go silent, and the diagnostic tools available to pinpoint the problem.
Watch Execution Model
A watch has four components: trigger, input, condition, and actions. The trigger defines when the watch runs - almost always a schedule (interval, cron, or hourly/daily/weekly). The input gathers data, typically by running a search query against one or more indices. The condition evaluates the input result and decides whether to proceed. If the condition evaluates to true, the actions execute - sending emails, calling webhooks, writing to indices, or logging.
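As a concrete reference for the four components, here is a minimal watch definition. The watch name, index pattern, and field names are illustrative, not prescriptive:

```console
PUT _watcher/watch/log_error_watch
{
  "trigger": {
    "schedule": { "interval": "5m" }
  },
  "input": {
    "search": {
      "request": {
        "indices": [ "logs-*" ],
        "body": {
          "query": { "match": { "level": "error" } }
        }
      }
    }
  },
  "condition": {
    "compare": { "ctx.payload.hits.total": { "gt": 0 } }
  },
  "actions": {
    "log_hits": {
      "logging": { "text": "Found {{ctx.payload.hits.total}} error documents" }
    }
  }
}
```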
Each execution generates a record stored in the .watcher-history-* indices. The record contains the execution time, the input payload, the condition result, and the outcome of each action. This history is the first place to look when a watch appears to have stopped.
Understanding this pipeline matters because a "not triggering" symptom can originate at any stage. The trigger engine might not be scheduling the watch. The input might return empty results due to a changed index pattern. The condition might evaluate to false every time. Or the actions might be throttled or failing silently.
Common Causes of Watches Not Firing
The most frequent cause is that the Watcher service itself is not running. Check with GET _watcher/stats. The response includes a watcher_state field that should read started. If it reads stopped or stopping, start it with POST _watcher/_start. Watcher can stop after a full cluster restart if the startup sequence encounters errors, or if it was manually stopped and never restarted.
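A quick check-and-restart sequence looks like this (responses abbreviated):

```console
GET _watcher/stats
// if the response shows "watcher_state": "stopped", restart it:

POST _watcher/_start
// expected response: { "acknowledged": true }
```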
License expiration is another common cause. Watcher requires a Gold, Platinum, or Enterprise license (or a trial license). When the license expires, Watcher stops executing watches without any obvious error in the cluster health status. Check with GET _license and verify the status is active.
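The license check is a one-liner; the response shown here is trimmed to the relevant fields:

```console
GET _license
// relevant fields in the response:
// "license": { "status": "active", "type": "platinum" }
```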
Problems with the .watcher-history indices can also block execution. If these indices become read-only due to disk watermark breaches, new execution records cannot be written, and Watcher may stall. Verify with GET .watcher-history-*/_settings and check for index.blocks.write or index.blocks.read_only_allow_delete (the block Elasticsearch applies when the flood-stage disk watermark is breached).
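If a flood-stage block is present, free disk space first. Recent Elasticsearch versions remove the block automatically once disk usage drops below the watermark; on older versions you can clear it manually, sketched here:

```console
PUT .watcher-history-*/_settings
{
  "index.blocks.read_only_allow_delete": null
}
```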
The condition evaluating to false is the most common "silent" failure. The watch runs on schedule, the input executes, but the condition never triggers actions. This often happens after a change to the source index mapping, a renamed field, or a shift in data volume that moves values below the threshold.
Diagnosing with Stats and Execution History
The GET _watcher/stats endpoint returns the Watcher state, the count of registered watches, and details about the execution thread pool, including queue_size and max_size. If the queue size is consistently at or near its limit, watches are backing up faster than they can execute.
GET _watcher/stats

// Abbreviated response - key fields:
{
  "watcher_state": "started",
  "watch_count": 42,
  "execution_thread_pool": {
    "queue_size": 1000,
    "max_size": 50
  }
}
To check the history for a specific watch, query the .watcher-history-* indices filtered by watch_id. Look at the state field in each record - possible values include executed, execution_not_needed (the condition evaluated to false), throttled, awaits_execution, and failed. If you see a string of throttled entries, the issue is throttling, not a broken watch.
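A history query along those lines, with the watch id being illustrative:

```console
GET .watcher-history-*/_search
{
  "query": { "term": { "watch_id": "log_error_watch" } },
  "sort": [ { "trigger_event.triggered_time": { "order": "desc" } } ],
  "size": 10
}
```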
For immediate testing, use POST _watcher/watch/<watch_id>/_execute to manually trigger a watch. This bypasses the trigger schedule, runs the full pipeline, and returns the execution result inline. Add "ignore_condition": true to the request body to force action execution regardless of the condition result, and set "action_modes" to force_execute to bypass throttling and acknowledgement state. This helps isolate whether the problem is in the condition logic, the throttle settings, or elsewhere.
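A manual execution request might look like this (watch id illustrative; "record_execution": false keeps the test run out of the history indices):

```console
POST _watcher/watch/log_error_watch/_execute
{
  "ignore_condition": true,
  "action_modes": { "_all": "force_execute" },
  "record_execution": false
}
```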
Throttle Period Blocking Repeated Execution
Watcher has a throttle mechanism that prevents the same action from firing repeatedly within a configured window. The default global throttle period is 5 seconds, controlled by xpack.watcher.execution.default_throttle_period in elasticsearch.yml.
Watch-level throttling is set in the watch definition under throttle_period at the top level. Action-level throttling is set inside each action's configuration block. Action-level throttle periods override the watch-level value for that specific action.
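A sketch showing both levels (watch name, action name, and periods are illustrative): the notify_ops action uses its own 1h throttle, while any other action in this watch would fall back to the watch-level 15m value.

```console
PUT _watcher/watch/cpu_watch
{
  "trigger": { "schedule": { "interval": "5m" } },
  "input": { "simple": { "status": "test" } },
  "condition": { "always": {} },
  "throttle_period": "15m",
  "actions": {
    "notify_ops": {
      "throttle_period": "1h",
      "logging": { "text": "CPU threshold exceeded" }
    }
  }
}
```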
A common mistake is setting a long throttle period (for example, 1h) on a watch that runs every 5 minutes. The first alert fires, then all subsequent triggers within the hour show throttled in the execution history. The watch is running - it just will not repeat the action. Check the throttle_period in the watch definition with GET _watcher/watch/<watch_id> and compare it to the schedule interval.
Watcher also supports acknowledgement-based throttling. After an action fires, you can acknowledge it with PUT _watcher/watch/<watch_id>/_ack/<action_id>. Once acknowledged, the action will not fire again until the condition first returns false and then returns true again. If someone acknowledged a watch and the condition has remained continuously true, the action stays suppressed indefinitely.
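To see whether an action is stuck in the acked state, acknowledge it and then fetch the watch and inspect its status (watch and action names illustrative):

```console
PUT _watcher/watch/cpu_watch/_ack/notify_ops

GET _watcher/watch/cpu_watch
// status.actions.notify_ops.ack.state will be one of:
// awaits_successful_execution, ackable, or acked
```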
Queue and Thread Pool Saturation Under Load
Watcher uses a dedicated thread pool for watch execution, separate from the trigger engine. The trigger engine fires watches according to their schedules and places them into the execution queue. Worker threads pick watches from this queue and run them through the input-condition-actions pipeline.
When the execution queue fills up - because watches take too long to execute, there are too many watches firing simultaneously, or the search input targets slow indices - new watches cannot be queued. The GET _watcher/stats response shows the current queue depth. If it frequently hits the queue_size limit, watches will be delayed or dropped.
To address saturation, reduce the number of concurrently scheduled watches by staggering their cron schedules. Optimize the input search queries so they complete faster - use filters instead of queries where possible, target specific indices rather than wildcards, and reduce the date range. If the cluster can handle it, increase xpack.watcher.thread_pool.size in elasticsearch.yml, though this trades off resources with other cluster operations.
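The thread pool settings are static and belong in elasticsearch.yml; the sizes below are illustrative, not recommendations:

```yaml
# elasticsearch.yml - requires a node restart to take effect
xpack.watcher.thread_pool.size: 20
xpack.watcher.thread_pool.queue_size: 2000
```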
Monitor _watcher/stats over time, not just at a single point. A queue that spikes at the top of each hour when many watches share the same cron schedule is a scheduling problem, not a capacity problem. Spread watch schedules across the interval to flatten the execution load.
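For example, two hourly watches can be offset with Quartz-style cron expressions (fields: seconds, minutes, hours, day-of-month, month, day-of-week):

```console
// instead of both firing at the top of the hour:
"trigger": { "schedule": { "cron": "0 5 * * * ?" } }    // runs at :05
"trigger": { "schedule": { "cron": "0 35 * * * ?" } }   // runs at :35
```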