
Alerts

The following rules are evaluated for each Elasticsearch cluster:

CPU usage threshold

This rule checks for nodes that run at a consistently high CPU load. By default, the condition is met when CPU usage averages 85% or more over the last 5 minutes. The default rule runs on a 1-minute schedule with a re-notify interval of 1 day.
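As a rough illustration of an averaged threshold condition, the sketch below checks whether the mean of a node's CPU samples over the last 5 minutes reaches 85%. The function name, the sample format, and the choice to skip nodes with no recent samples are assumptions made for this example; this is not the monitoring application's actual implementation.

```python
from statistics import mean

DEFAULT_CPU_THRESHOLD_PERCENT = 85.0  # default threshold stated above
WINDOW_SECONDS = 5 * 60               # "averaged over the last 5 minutes"


def cpu_rule_fires(samples, now, threshold=DEFAULT_CPU_THRESHOLD_PERCENT,
                   window=WINDOW_SECONDS):
    """Return True if the average CPU usage inside the window meets the threshold.

    samples: list of (timestamp_in_seconds, cpu_percent) tuples (hypothetical format).
    """
    recent = [value for ts, value in samples if now - window <= ts <= now]
    if not recent:
        # Assumption: with no samples in the window, this rule stays quiet.
        return False
    return mean(recent) >= threshold
```

A single spike does not trigger the condition on its own, because only the window average is compared against the threshold.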

Disk usage threshold

This rule checks for nodes that are nearly at disk capacity. By default, the condition is met when disk usage averages 80% or more over the last 5 minutes. The default rule runs on a 1-minute schedule with a re-notify interval of 1 day.

JVM memory threshold

This rule checks for nodes that use a high amount of JVM memory. By default, the condition is met when JVM memory usage averages 85% or more over the last 5 minutes. The default rule runs on a 1-minute schedule with a re-notify interval of 1 day.
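The CPU, disk, and JVM memory rules share the same defaults: a check every minute and a re-notify interval of 1 day. One plausible reading of that combination is that the condition is evaluated each minute, while a repeat notification for the same alert is suppressed until the interval has elapsed. The sketch below illustrates that reading only; the class name and the alert key format are hypothetical.

```python
from datetime import datetime, timedelta


class RenotifyThrottle:
    """Suppress repeat notifications for the same alert within an interval."""

    def __init__(self, interval):
        self.interval = interval
        self.last_sent = {}  # alert key -> time of the last notification

    def should_notify(self, alert_key, now):
        last = self.last_sent.get(alert_key)
        if last is not None and now - last < self.interval:
            return False  # still inside the re-notify interval
        self.last_sent[alert_key] = now
        return True


# The rule is checked every minute, but notifies at most once per day per alert.
throttle = RenotifyThrottle(timedelta(days=1))
start = datetime(2024, 1, 1)
first = throttle.should_notify("node-1:cpu", start)
second = throttle.should_notify("node-1:cpu", start + timedelta(minutes=1))
third = throttle.should_notify("node-1:cpu", start + timedelta(days=1))
```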

Missing monitoring data

This rule checks for nodes that stop sending monitoring data. By default, the condition is met when a node's monitoring data has been missing for 15 minutes, looking back 1 day. The default rule runs on a 1-minute schedule with a re-notify interval of 6 hours.
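The interaction between the 15-minute limit and the 1-day look-back can be sketched as follows. The sketch assumes that a node last seen longer ago than the look-back period is outside the rule's view and is therefore not reported; the function name and the `last_seen` mapping are hypothetical.

```python
from datetime import datetime, timedelta


def nodes_missing_data(last_seen, now,
                       missing_for=timedelta(minutes=15),
                       look_back=timedelta(days=1)):
    """Return the nodes whose latest monitoring document is too old.

    last_seen: dict of node name -> datetime of its latest monitoring document.
    """
    missing = []
    for node, timestamp in last_seen.items():
        age = now - timestamp
        # Assumption: nodes not seen within the look-back window are ignored.
        if missing_for <= age <= look_back:
            missing.append(node)
    return sorted(missing)
```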

Thread pool rejections (search/write)

This rule checks for Elasticsearch nodes that experience thread pool rejections. By default, the condition is met when a node has 300 or more rejections over the last 5 minutes. The default rule runs on a 1-minute schedule with a re-notify interval of 1 day. Thresholds can be set independently for search and write rejections.
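Because the search and write thresholds are independent, each pool type can be compared against its own limit. The sketch below assumes the rejection counts for the last 5 minutes have already been computed per pool type; the function name and the dictionary shapes are hypothetical.

```python
DEFAULT_THRESHOLDS = {"search": 300, "write": 300}  # defaults stated above


def rejection_alerts(rejections_in_window, thresholds=None):
    """Return the pool types whose rejection count meets their threshold.

    rejections_in_window: dict of pool type -> rejections in the last 5 minutes.
    thresholds: optional per-pool overrides of the default limits.
    """
    limits = {**DEFAULT_THRESHOLDS, **(thresholds or {})}
    return sorted(
        pool for pool, limit in limits.items()
        if rejections_in_window.get(pool, 0) >= limit
    )
```

Lowering only the write limit, for example `rejection_alerts(counts, {"write": 5})`, leaves the search limit at its default.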

CCR read exceptions

This rule checks for read exceptions on any of the replicated Elasticsearch clusters. The condition is met if 1 or more read exceptions are detected in the last hour. The default rule runs on a 1-minute schedule with a re-notify interval of 6 hours.

Large shard size

This rule checks for a large average shard size (across associated primaries) on any of the specified index patterns in an Elasticsearch cluster. The condition is met if an index's average shard size is 55 GB or higher in the last 5 minutes. The default rule matches the index pattern -.* and runs on a 1-minute schedule with a re-notify interval of 12 hours.
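The average shard size check can be sketched as the total primary store size divided by the number of primary shards. The sketch rests on several assumptions that the text above does not confirm: that a leading - in an index pattern means "exclude", so that -.* skips index names starting with a dot; that a pattern list with only exclusions selects every other index; and that 1 GB is counted as 1024^3 bytes. All function names and the `indices` mapping are hypothetical.

```python
from fnmatch import fnmatchcase

GB = 1024 ** 3             # assumption: binary gigabytes
DEFAULT_THRESHOLD_GB = 55  # default threshold stated above


def is_selected(index_name, patterns=("-.*",)):
    """Assumed semantics: a leading '-' excludes indices matching the pattern."""
    includes = [p for p in patterns if not p.startswith("-")] or ["*"]
    excludes = [p[1:] for p in patterns if p.startswith("-")]
    if any(fnmatchcase(index_name, p) for p in excludes):
        return False
    return any(fnmatchcase(index_name, p) for p in includes)


def large_shard_indices(indices, threshold_gb=DEFAULT_THRESHOLD_GB,
                        patterns=("-.*",)):
    """indices: dict of index name -> (primary store bytes, primary shard count)."""
    flagged = []
    for name, (primary_bytes, primary_count) in indices.items():
        if primary_count == 0 or not is_selected(name, patterns):
            continue
        average_gb = primary_bytes / primary_count / GB
        if average_gb >= threshold_gb:
            flagged.append(name)
    return sorted(flagged)
```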

Cluster alerting

These rules check the current status of your Elastic Stack. You can drill down into the metrics to view more information about your cluster and specific nodes, instances, and indices.

An action is triggered if any of the following conditions are met within the last minute:

  • Elasticsearch cluster health status is yellow (missing at least one replica) or red (missing at least one primary).
  • Elasticsearch version mismatch. You have Elasticsearch nodes with different versions in the same cluster.
  • Kibana version mismatch. You have Kibana instances with different versions running against the same Elasticsearch cluster.
  • Logstash version mismatch. You have Logstash nodes with different versions reporting stats to the same monitoring cluster.
  • Elasticsearch nodes changed. You have Elasticsearch nodes that were recently added or removed.
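Two of the conditions above reduce to simple set operations, sketched below. A version mismatch exists when the reported versions are not all identical, and a node change is the difference between two successive node listings. The function names and input shapes are hypothetical, and the sketch does not reflect how the monitoring application collects this data.

```python
def version_mismatch(versions):
    """True if the nodes or instances report more than one distinct version."""
    return len(set(versions)) > 1


def nodes_changed(previous, current):
    """Compare two node listings and report which nodes were added or removed."""
    before, after = set(previous), set(current)
    return {
        "added": sorted(after - before),
        "removed": sorted(before - after),
    }
```

For example, `nodes_changed(["node-1", "node-2"], ["node-2", "node-3"])` reports `node-3` as added and `node-1` as removed.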