The Logstash grok filter parses unstructured text into named fields using a library of pre-compiled regex patterns. Patterns follow the form %{PATTERN:field_name:type}, for example %{IP:client} or %{NUMBER:bytes:int}. Logstash ships hundreds of built-in patterns under vendor/bundle/jruby/*/gems/logstash-patterns-core-*/patterns/, organized by use case (apache, syslog, java, postgresql). Grok is the right tool when a log format is too irregular for dissect and not JSON.
Syntax
filter {
  grok {
    match => { "message" => "%{IP:client} %{WORD:method} %{URIPATHPARAM:request} %{NUMBER:bytes:int} %{NUMBER:duration:float}" }
    patterns_dir => [ "/etc/logstash/patterns" ]
    # Single backslash: config strings are not escape-processed unless config.support_escapes is enabled
    pattern_definitions => { "CUSTOM_ID" => "[A-Z]{3}-\d{6}" }
    break_on_match => true
    keep_empty_captures => false
    tag_on_failure => [ "_grokparsefailure" ]
    timeout_millis => 30000
    ecs_compatibility => "v8"
  }
}
Named captures follow the %{PATTERN:name} form, with optional type coercion via :int or :float. Plain Oniguruma named groups (?<name>regex) are also supported for one-off patterns.
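As a minimal sketch of mixing the two capture styles, the config below combines library patterns (one with type coercion) and an inline Oniguruma named group. The log layout (client IP, status code, request ID) and the field names status and request_id are assumptions made for illustration, not taken from a real format.

```logstash
filter {
  grok {
    # %{IP:client}          library pattern, captured as a string
    # %{NUMBER:status:int}  library pattern, coerced to an integer
    # (?<request_id>...)    one-off Oniguruma named group, no library pattern needed
    match => { "message" => "^%{IP:client} %{NUMBER:status:int} (?<request_id>req-[0-9a-f]{8})$" }
  }
}
```

A line such as 10.0.0.5 200 req-1a2b3c4d should produce client, an integer status, and request_id. The inline group is convenient for a single filter; a pattern reused across filters is usually better defined once via pattern_definitions or a pattern file.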
Parameters
| Name | Type | Required | Default | Description |
|---|---|---|---|---|
| match | hash | no | {} | Map of field name to pattern (or array of patterns). Not enforced as required, but the filter does nothing without it. |
| patterns_dir | array | no | [] | Extra directories of custom pattern files. |
| patterns_files_glob | string | no | "*" | Glob for pattern files inside patterns_dir. |
| pattern_definitions | hash | no | {} | Inline patterns defined directly in the filter config. |
| break_on_match | boolean | no | true | If true, stop at the first matching pattern. Set false to try every pattern in an array. |
| keep_empty_captures | boolean | no | false | Keep optional captures that produced no value. |
| named_captures_only | boolean | no | true | Only capture explicitly named groups. |
| tag_on_failure | array | no | ["_grokparsefailure"] | Tags added when no pattern matches. |
| tag_on_timeout | string | no | "_groktimeout" | Tag added when a match exceeds timeout_millis. |
| timeout_millis | number | no | 30000 | Match timeout in milliseconds, applied per pattern by default; 0 disables it. Protects against catastrophic backtracking. |
| ecs_compatibility | string | no | depends on version | disabled, v1, or v8. ECS-aware built-in patterns place fields under ECS namespaces. |
Examples
Parse an Apache combined access log into ECS-compliant fields:
filter {
  grok {
    match => { "message" => "%{HTTPD_COMBINEDLOG}" }
    ecs_compatibility => "v8"
  }
}
With ecs_compatibility => "v8", fields are emitted under ECS names such as [source][address], [http][request][method], and [url][original].
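Downstream filters then have to use the nested field-reference syntax to read these values. A hedged sketch, assuming the grok filter above ran in ECS mode; the tag name write_request is hypothetical:

```logstash
filter {
  # [http][request][method] is the ECS-mode name; in disabled mode the same value lands in "verb"
  if [http][request][method] in ["POST", "PUT", "DELETE"] {
    mutate { add_tag => [ "write_request" ] }
  }
}
```

The same conditional written against verb would silently never match in ECS mode, which is the kind of mismatch to look for when a dashboard goes empty after switching modes.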
Try multiple patterns to handle a mixed-format input:
filter {
  grok {
    match => {
      "message" => [
        "%{COMMONAPACHELOG}",
        "%{SYSLOGLINE}",
        "%{TIMESTAMP_ISO8601:ts} %{LOGLEVEL:level} %{GREEDYDATA:msg}"
      ]
    }
    break_on_match => true
  }
}
Define a custom pattern inline and reference it:
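filter {
  grok {
    pattern_definitions => {
      # Single backslash: config strings are not escape-processed unless config.support_escapes is enabled
      "ORDER_ID" => "ORD-\d{8}"
    }
    match => { "message" => "order=%{ORDER_ID:order_id} user=%{USERNAME:user}" }
  }
}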
Common Issues
Unmatched events are tagged _grokparsefailure and pass through with the original message untouched. A pipeline with rising _grokparsefailure rates usually means the upstream log format has drifted; route tagged events to a debug index to inspect them.
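One way to route the tagged events, sketched with an Elasticsearch output. The host URL and both index names are placeholders chosen for this example, not values from any real deployment:

```logstash
output {
  if "_grokparsefailure" in [tags] {
    # Unparsed events go to a separate index so they can be inspected in isolation
    elasticsearch {
      hosts => [ "http://localhost:9200" ]
      index => "grok-debug-%{+YYYY.MM.dd}"
    }
  } else {
    elasticsearch {
      hosts => [ "http://localhost:9200" ]
      index => "logs-%{+YYYY.MM.dd}"
    }
  }
}
```

Because the original message is left untouched on failure, the debug index holds the raw lines needed to work out how the format changed.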
Anchor patterns with ^ and $ (or \A/\z) when you expect the whole line to match. Unanchored patterns can match a substring and produce surprising results, especially when break_on_match => false is in effect.
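A small illustration of the anchored form, reusing built-in patterns from the Syntax section. The field layout is an assumption about the log line, not a known format:

```logstash
filter {
  grok {
    # ^ and $ force the pattern to cover the whole line, so a line with leading
    # or trailing extra text fails and is tagged instead of matching a substring.
    match => { "message" => "^%{IP:client} %{WORD:method} %{URIPATHPARAM:request} %{NUMBER:bytes:int}$" }
  }
}
```

Without the anchors, the same pattern could match in the middle of a longer line and quietly drop whatever surrounds the match.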
Patterns with unbounded alternation - (a|b|c|d|e)* followed by another quantifier - cause catastrophic backtracking on near-miss inputs. The timeout_millis parameter exists because a single bad pattern can otherwise stall the entire pipeline worker. If you see _groktimeout tags, simplify the alternation or replace it with dissect plus targeted grok.
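While a slow pattern is being diagnosed, a tighter timeout limits how long a worker can be held up. This is a sketch only: the 2000 ms value and the field names are arbitrary assumptions and should be tuned against real traffic.

```logstash
filter {
  grok {
    match => { "message" => "^%{SYSLOGTIMESTAMP:ts} %{HOSTNAME:host_name} %{GREEDYDATA:msg}$" }
    # Give up on a match attempt after roughly 2 s instead of the 30 s default
    timeout_millis => 2000
    # Tag timed-out events so they can be counted and routed separately
    tag_on_timeout => "_groktimeout"
  }
}
```

A lower timeout only contains the damage; the pattern itself still needs to be simplified.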
ECS mode is a breaking change: built-in patterns like HTTPD_COMBINEDLOG produce different field names in disabled vs v8 mode. Pick one mode per pipeline and stick with it; mixing produces dashboards with half-populated fields.
Performance Notes
Grok patterns are compiled once at pipeline startup, not per event. The cost per event is regex matching against the compiled NFA. Three things dominate runtime:
- Number of patterns in the match array - patterns are tried in order until one matches; with break_on_match => false, every pattern is tried on every event.
- Pattern complexity - unbounded * and + quantifiers with overlapping alternatives are 10-100x slower than anchored fixed-position patterns.
- Failure case - a non-matching pattern usually costs more than a matching one because the engine tries every backtrack path.
For consistent, fixed-position formats, dissect is typically 2-5x faster than grok and avoids regex entirely. A common combination is to split the fixed positions with dissect, then run grok only on the variable-shape field that needs regex.
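The dissect-then-grok combination might look like the following sketch. It assumes a line shaped like "timestamp level component free-text"; the field names ts, level, component, rest, order_id, and duration_ms are all hypothetical.

```logstash
filter {
  # Step 1: split the fixed, space-delimited prefix without regex.
  # The last dissect field receives the remainder of the line.
  dissect {
    mapping => { "message" => "%{ts} %{level} %{component} %{rest}" }
  }
  # Step 2: run grok only on the variable-shape remainder.
  grok {
    match => { "rest" => "order=%{NOTSPACE:order_id} took %{NUMBER:duration_ms:int}ms" }
  }
}
```

Grok now scans a shorter string with a narrower pattern, which reduces both the per-event cost and the room for backtracking.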
Monitoring Logstash Grok Pipelines with Pulse
Pulse is the only tool built specifically for monitoring and optimizing Logstash pipelines. Grok is the single largest source of Logstash CPU consumption in production, and "the pipeline got slow" is usually a grok regression - either a new pattern added with catastrophic backtracking, or upstream log format drift that pushes every event through a failure path. Pulse tracks per-filter CPU cost, _grokparsefailure and _groktimeout rates per pipeline, and correlates spikes with recent pipeline config changes so you find the bad pattern in minutes, not days.
Frequently Asked Questions
Q: Where are Logstash's built-in grok patterns stored?
A: Built-in patterns ship inside the logstash-patterns-core gem at vendor/bundle/jruby/*/gems/logstash-patterns-core-*/patterns/ (organized into ECS and legacy subdirectories from Logstash 7.12 onwards). Custom patterns live in any directory you list in patterns_dir.
Q: How does the Logstash grok filter handle ECS compatibility?
A: The ecs_compatibility parameter accepts disabled, v1, or v8. In v8 mode, built-in patterns produce ECS-compliant field names ([source][address] instead of clientip, [http][request][method] instead of verb). Pick one mode per pipeline; mixing causes inconsistent field naming downstream.
Q: What is the difference between grok and dissect in Logstash?
A: Grok uses regex and handles variable-width or irregular fields; dissect uses fixed delimiters and is 2-5x faster but cannot handle optional fields or regex matching. Use dissect for consistent application logs and grok for irregular text like nginx, syslog, or Java stack traces.
Q: How do I create custom grok patterns?
A: Either inline via pattern_definitions => { "NAME" => "regex" }, or in a separate file under a directory passed via patterns_dir. Pattern files hold one definition per line in the form PATTERN_NAME regex (the name, a space, then the regex), and a pattern can reference other patterns with %{NAME}.
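A sketch of the file-based approach. The file name app and the patterns ORDER_ID and ORDER_REF are invented for this example; the pattern file contents are shown as comments because the file itself is plain text, not pipeline config.

```logstash
# Contents of /etc/logstash/patterns/app (hypothetical file), one definition per line:
#   ORDER_ID ORD-[0-9]{8}
#   ORDER_REF %{ORDER_ID}(?:/%{INT})?
filter {
  grok {
    # Every file in this directory that matches patterns_files_glob is loaded
    patterns_dir => [ "/etc/logstash/patterns" ]
    match => { "message" => "order=%{ORDER_REF:order_ref}" }
  }
}
```

ORDER_REF builds on ORDER_ID, which shows one pattern referencing another. Keeping shared patterns in a file avoids repeating them in every filter that needs them.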
Q: What does the _grokparsefailure tag mean?
A: None of the patterns in the match array matched the source field. The original field is untouched. The most common cause is upstream log format drift; route tagged events to a debug index and add a fallback pattern.
Q: Why does Logstash hang on certain grok patterns?
A: Catastrophic backtracking. Patterns with nested unbounded quantifiers and overlapping alternation can take exponential time on near-miss inputs. The timeout_millis parameter (default 30s) prevents this from stalling the worker indefinitely. Rewriting the pattern with anchoring, atomic groups, or dissect upfront is the long-term fix.
Related Reading
- Logstash Dissect Filter Plugin: faster, regex-free alternative for fixed-position logs.
- Logstash KV Filter Plugin: the right tool for key=value payloads instead of grok.
- Logstash JSON Filter Plugin: when the message is JSON, skip grok entirely.
- Logstash Date Filter Plugin: parse timestamp fields extracted by grok.
- Logstash Pipeline is Blocked Error: grok is the most common cause.