Elasticsearch Regexp Query: Syntax, Parameters, and Examples

The Elasticsearch regexp query matches documents whose indexed terms match a Lucene regular expression. It operates on individual terms in the inverted index, so it is almost always used on keyword (or wildcard) fields. On analyzed text fields it matches single tokens, not the original string. Patterns are anchored implicitly: the regex must match the whole term. Regexp queries are classed as expensive and are rejected when search.allow_expensive_queries is set to false.
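To make the term-level behaviour concrete, here is a minimal Python sketch. It assumes a crude lowercase-and-split tokenizer as a stand-in for a real text analyzer, and it uses Python's re.fullmatch to imitate whole-term matching. Lucene's regex grammar is not the same as Python's, and the function names are hypothetical, so treat this as an illustration rather than a model of Elasticsearch internals.

```python
import re

def analyzed_terms(value):
    """Crude stand-in for a text analyzer: lowercase, split on non-alphanumerics."""
    return [t for t in re.split(r"[^0-9a-z]+", value.lower()) if t]

def keyword_terms(value):
    """A keyword field indexes the whole string as one term."""
    return [value]

def regexp_matches(pattern, terms):
    """A document matches if any single indexed term fully matches the pattern."""
    return any(re.fullmatch(pattern, term) for term in terms)

doc = "Connection timeout"

# On the analyzed field no single token spans both words, so this fails.
print(regexp_matches("connection time.*", analyzed_terms(doc)))  # False
# On the keyword field the whole string is one term, so this succeeds.
print(regexp_matches("Connection time.*", keyword_terms(doc)))   # True
# A pattern that fits inside one token works on the analyzed field.
print(regexp_matches("time.*", analyzed_terms(doc)))             # True
```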

Syntax

GET /index/_search
{
  "query": {
    "regexp": {
      "field_name": {
        "value": "pattern",
        "flags": "ALL",
        "case_insensitive": false,
        "max_determinized_states": 10000,
        "rewrite": "constant_score"
      }
    }
  }
}

Parameters

| Parameter | Description | Required | Default |
|---|---|---|---|
| value | Lucene regular expression. Capped at 1,000 characters by default. | Yes | - |
| flags | Pipe-separated flags: ALL, COMPLEMENT, INTERVAL, INTERSECTION, ANYSTRING, NONE. Controls which optional syntax features are enabled. | No | ALL |
| case_insensitive | ASCII case-insensitive matching. Available since Elasticsearch 7.10. | No | false |
| max_determinized_states | Cap on automaton states; protects against catastrophic regexes. | No | 10000 |
| rewrite | Multi-term rewrite method (constant_score, top_terms_N, etc.). | No | constant_score |
| boost | Score multiplier when used outside constant_score rewrite. | No | 1.0 |
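As a rough sketch of how these parameters fit together, the Python helper below assembles a request body and rejects obviously invalid input before it is sent. The helper name build_regexp_query is hypothetical, the flag list is simply the one given in this article, and the 1,000-character check mirrors the default cap mentioned for value. It only builds a dictionary and does not talk to a cluster.

```python
import json

MAX_REGEX_LENGTH = 1000  # default cap on "value" mentioned above (assumed unchanged)
KNOWN_FLAGS = {"ALL", "COMPLEMENT", "INTERVAL", "INTERSECTION", "ANYSTRING", "NONE"}

def build_regexp_query(field, value, flags="ALL", case_insensitive=False,
                       max_determinized_states=10000):
    """Hypothetical helper: build a regexp search body as a plain dict."""
    if len(value) > MAX_REGEX_LENGTH:
        raise ValueError(f"pattern is {len(value)} characters; cap is {MAX_REGEX_LENGTH}")
    unknown = set(flags.split("|")) - KNOWN_FLAGS
    if unknown:
        raise ValueError(f"unknown flags: {sorted(unknown)}")
    return {
        "query": {
            "regexp": {
                field: {
                    "value": value,
                    "flags": flags,
                    "case_insensitive": case_insensitive,
                    "max_determinized_states": max_determinized_states,
                }
            }
        }
    }

body = build_regexp_query("username", "john[0-9]+")
print(json.dumps(body, indent=2))
```

The resulting JSON can be pasted into a GET /index/_search request or passed to whichever client you use.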

The pattern is anchored: k.*y matches a term that starts with k and ends with y. There is no implicit "contains" semantics - use .*foo.* to emulate that, with care.
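The difference between anchored and unanchored matching is easy to see in Python, where re.search is unanchored and re.fullmatch is anchored at both ends. The regexp query behaves like the latter. This is only an analogy using Python's regex engine and a made-up list of terms, not Lucene itself.

```python
import re

# Made-up terms standing in for a field's term dictionary.
terms = ["key", "keys", "okay", "kentucky"]

# Anchored, like the regexp query: the whole term must match.
anchored = [t for t in terms if re.fullmatch("k.*y", t)]
# Unanchored, which is what many people expect from a regex.
unanchored = [t for t in terms if re.search("k.*y", t)]
# Emulating "contains" under anchoring by wrapping the pattern in .*
contains = [t for t in terms if re.fullmatch(".*k.*y.*", t)]

print(anchored)    # ['key', 'kentucky']
print(unanchored)  # ['key', 'keys', 'okay', 'kentucky']
print(contains)    # ['key', 'keys', 'okay', 'kentucky']
```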

Examples

Match usernames starting with john followed by digits:

GET /users/_search
{
  "query": {
    "regexp": {
      "username": {
        "value": "john[0-9]+"
      }
    }
  }
}

Case-insensitive match:

GET /products/_search
{
  "query": {
    "regexp": {
      "sku.keyword": {
        "value": "ax-[a-z0-9]{4}",
        "case_insensitive": true
      }
    }
  }
}

Contains-style pattern combined with cheap pre-filters:

GET /logs-*/_search
{
  "query": {
    "bool": {
      "filter": [
        { "range":  { "@timestamp": { "gte": "now-1h/m" } } },
        { "term":   { "service":    "checkout" } },
        { "regexp": {
            "message.keyword": {
              "value": ".*timeout=[0-9]+ms.*",
              "max_determinized_states": 20000
            }
          }
        }
      ]
    }
  }
}

Use the INTERVAL flag to enable the <m-n> numeric range operator (it is enabled under the default flags: ALL, but not under flags: NONE unless you list it explicitly):

GET /domains/_search
{
  "query": {
    "regexp": {
      "domain.keyword": {
        "value": "ex<5-9>",
        "flags": "INTERVAL"
      }
    }
  }
}

Performance and Use Notes

A regexp query compiles to a Lucene automaton and then walks the field's term dictionary, accepting every term the automaton accepts. Cost grows with the number of accepted terms and the determinization size of the automaton. Patterns with a fixed prefix (prefix.*) are tractable because the term dictionary's sorted layout lets the search jump to the right slice. Patterns with a leading .* force a scan of every term in the field per shard - the same trap as a leading wildcard.
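The cost difference between a fixed prefix and a leading .* can be sketched with a sorted Python list standing in for the term dictionary. This is a deliberately simplified model: Lucene intersects the automaton with its own term-dictionary structures rather than calling a regex per term, and the terms and function names below are invented for the illustration.

```python
import bisect
import re

def prefix_scan(sorted_terms, prefix, pattern):
    """Seek to the first term >= prefix, then test only terms sharing the prefix."""
    i = bisect.bisect_left(sorted_terms, prefix)
    examined, hits = 0, []
    while i < len(sorted_terms) and sorted_terms[i].startswith(prefix):
        examined += 1
        if re.fullmatch(pattern, sorted_terms[i]):
            hits.append(sorted_terms[i])
        i += 1
    return hits, examined

def full_scan(sorted_terms, pattern):
    """With no fixed prefix there is nowhere to seek: every term is tested."""
    hits = [t for t in sorted_terms if re.fullmatch(pattern, t)]
    return hits, len(sorted_terms)

# 4,000 made-up terms such as "cart-0042".
terms = sorted(f"{svc}-{i:04d}"
               for svc in ("auth", "cart", "checkout", "search")
               for i in range(1000))

hits, examined = prefix_scan(terms, "cart-", "cart-00[0-9]{2}")
print(len(hits), examined)   # 100 1000

hits, examined = full_scan(terms, ".*-00[0-9]{2}")
print(len(hits), examined)   # 400 4000
```

In this toy model the prefixed pattern touches a quarter of the terms; on a real high-cardinality field the gap is typically far larger.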

max_determinized_states (default 10000) caps automaton complexity and protects the cluster from exponentially expensive regexes. Increase it only when you understand the pattern. Setting search.allow_expensive_queries: false blocks regexp queries entirely; this is a sensible production default for public-facing endpoints.
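The exponential growth this cap guards against shows up in a textbook subset construction. The sketch below counts the deterministic states needed for the pattern (a|b)*a(a|b){n}, a classic worst case where a short regex needs 2^(n+1) DFA states. It is a hand-rolled illustration, not Lucene's implementation, and Lucene's exact counts may differ.

```python
def dfa_state_count(n):
    """Count DFA states for (a|b)*a(a|b){n} via subset construction.

    NFA states: 0 loops on a/b and moves to 1 on 'a';
    states 1..n advance on any symbol; state n+1 accepts.
    """
    def step(states, symbol):
        nxt = set()
        for s in states:
            if s == 0:
                nxt.add(0)            # stay in the (a|b)* loop
                if symbol == "a":
                    nxt.add(1)        # guess this 'a' is the marked one
            elif s <= n:
                nxt.add(s + 1)        # consume one of the n trailing symbols
        return frozenset(nxt)

    start = frozenset({0})
    seen = {start}
    frontier = [start]
    while frontier:
        current = frontier.pop()
        for symbol in "ab":
            nxt = step(current, symbol)
            if nxt not in seen:
                seen.add(nxt)
                frontier.append(nxt)
    return len(seen)

for n in (3, 9, 13):
    print(n, dfa_state_count(n))
# 3 16
# 9 1024
# 13 16384
```

With n = 13 the pattern is under 20 characters long, yet this construction already needs more states than the default cap of 10000.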

Regexp queries are a recurring source of CPU spikes and slow searches. Walking the slow log to isolate which regex patterns are doing full term-dictionary scans, and which would compile cheaper as a prefix or wildcard query, is the kind of repetitive triage Pulse automates.

Common Mistakes

  1. Forgetting that the pattern is anchored. cat does not match the term concatenate - use .*cat.*.
  2. Running regexp on an analyzed text field and expecting whole-string matching. It matches per token. Add a .keyword sub-field.
  3. Leading .* on a large field, sometimes hidden inside helper code. Audit incoming patterns before running them.
  4. Hitting max_determinized_states and treating it as a regexp bug. The cap is doing its job - rewrite the pattern.
  5. Allowing user-supplied regexes on a public endpoint. Set search.allow_expensive_queries to false cluster-wide.
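Auditing incoming patterns, as suggested in the list above, could start with something like the following Python sketch. Both helper names are hypothetical, the metacharacter set is a conservative guess at Lucene's reserved characters, and the checks are heuristics: they flag patterns without a fixed literal prefix and patterns over the default 1,000-character cap, nothing more.

```python
MAX_REGEX_LENGTH = 1000
# Conservative guess at characters with special meaning in Lucene regex syntax.
REGEX_METACHARS = set('.?+*|{}[]()"\\#@&<>~')

def literal_prefix(pattern):
    """Return the literal characters every matching term must start with."""
    if "|" in pattern:
        return ""  # alternation: be conservative and assume no shared prefix
    end = 0
    while end < len(pattern) and pattern[end] not in REGEX_METACHARS:
        end += 1
    # A following ?, * or {m,n} can make the last literal optional, so drop it.
    if 0 < end < len(pattern) and pattern[end] in "?*{":
        end -= 1
    return pattern[:end]

def audit_pattern(pattern):
    """Return a list of warnings for a user-supplied regexp pattern."""
    warnings = []
    if len(pattern) > MAX_REGEX_LENGTH:
        warnings.append("longer than the default 1,000-character cap")
    if not literal_prefix(pattern):
        warnings.append("no fixed prefix: likely to scan the whole term dictionary")
    return warnings

print(audit_pattern("john[0-9]+"))            # []
print(audit_pattern(".*timeout=[0-9]+ms.*"))  # one warning: no fixed prefix
```

A check like this belongs in the application layer; it complements the cluster-level settings rather than replacing them.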

Find Slow Regexp Queries with Pulse

Pulse is an AI DBA for Elasticsearch and OpenSearch that continuously profiles production query traffic. For regexp queries specifically, Pulse:

  • Identifies regexp queries that hit max_determinized_states, build expensive automatons, or contain a leading .* that forces a full term-dictionary scan per shard
  • Flags regexp clauses running against analyzed text fields where the intent was clearly whole-string matching on a .keyword sub-field
  • Surfaces patterns rejected by search.allow_expensive_queries: false and the services still attempting them
  • Traces each slow regexp back to the calling service via slow-log and APM correlation
  • Recommends concrete rewrites: bounded prefix patterns, migration to the wildcard data type, replacing the regex with a terms query over an enumerable set, or pinning a fixed prefix to anchor the automaton
  • Tracks latency and CPU improvement after the rewrite ships

This converts the manual slow-log triage loop into a continuous optimization workflow.

Try Pulse on your cluster.

Frequently Asked Questions

Q: How is the regexp query different from a wildcard query?
A: A wildcard query supports only * and ? and runs slightly faster on simple patterns. A regexp query supports the full Lucene regex grammar (character classes, alternation, repetition, intervals) but builds and walks a larger automaton, so it is generally more expensive.

Q: Does the regexp query anchor patterns automatically?
A: Yes. The pattern must match the entire term. k.*y is implicitly anchored at both ends, so it matches terms beginning with k and ending with y. Use .*pattern.* for contains-style matching.

Q: How do I make a regexp query case-insensitive?
A: Set case_insensitive: true (supported since Elasticsearch 7.10). For older clusters, lowercase the value at index time with a lowercase normalizer (keyword) or analyzer (text) and lowercase the regex.

Q: Can I run a regexp query on numeric or date fields?
A: No. Regexp queries are designed for text-based fields and operate on indexed terms. Use a range query for numeric, date, and IP fields.

Q: What does max_determinized_states control?
A: It caps the number of states in the deterministic finite automaton built from the regex. The default is 10000. Exceeding it throws an error rather than letting an exponential automaton blow up the JVM heap.

Q: Why are regexp queries rejected on my cluster?
A: Likely search.allow_expensive_queries is set to false. Regexp, wildcard, prefix without index_prefixes, fuzzy, joining, and a few other query types are all considered expensive and blocked when that setting is off.

Q: How do I find which regexp queries are causing CPU spikes in production?
A: Pulse ingests Elasticsearch and OpenSearch slow logs, isolates regexp queries with leading .* or large determinized automatons, correlates each one to the calling service, and recommends a cheaper rewrite - often a prefix query, a wildcard query backed by the wildcard data type, or a bounded terms query.
