What is an Elasticsearch Analyzer? Components, Types, and Configuration

An Elasticsearch analyzer is the component that converts the value of a text field into the tokens stored in the inverted index. Every analyzer runs three stages in order: zero or more character filters, exactly one tokenizer, and zero or more token filters. By default the same analyzer runs at index time and at query time. When the two differ unintentionally, query terms no longer line up with the indexed terms, and search relevance breaks in subtle ways.

How an Elasticsearch Analyzer Works

The three-stage pipeline is fixed. Each stage receives the output of the previous stage and feeds the next:

  1. Character filters transform the raw text stream before tokenization. Examples: html_strip removes HTML tags, mapping substitutes characters, pattern_replace applies a regex.
  2. Tokenizer splits the stream into tokens. The choice of tokenizer defines word boundaries. The default standard tokenizer follows Unicode Text Segmentation (UAX #29).
  3. Token filters modify, add, or remove tokens. lowercase, stop, asciifolding, synonym, stemmer, and ngram are the most common.

You can test any analyzer against a sample string with the _analyze API:

POST _analyze
{
  "analyzer": "standard",
  "text": "The Quick Brown Fox!"
}

The response lists each token, its start/end offsets, position, and type. This is the single most useful debugging tool for analyzer behavior.
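The _analyze API also accepts an inline definition, which makes it possible to exercise all three stages without creating an index. The request below is a minimal sketch; the sample text and the particular combination of filters are assumptions chosen for illustration:

```console
POST _analyze
{
  "char_filter": ["html_strip"],
  "tokenizer": "standard",
  "filter": ["lowercase", "asciifolding"],
  "text": "<p>Café <b>Crème</b> Brûlée</p>"
}
```

With this chain, html_strip removes the tags, the standard tokenizer splits on word boundaries, and the two token filters lowercase the tokens and fold the accents, so the response should contain cafe, creme, and brulee.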

Built-in Analyzers

Elasticsearch ships with several analyzers covering the common cases. Use these before reaching for a custom one.

Analyzer              Tokenizer                Token filters      Typical use
standard (default)    standard                 lowercase          General-purpose Unicode text
simple                lowercase (also splits)  -                  Lowercased, letters-only tokens
whitespace            whitespace               -                  Code, identifiers, log fragments
keyword               keyword (no split)       -                  Treat full input as one token
stop                  lowercase                stop               Drop English stop words
pattern               pattern (regex)          lowercase          Custom split rules via regex
english, french, ...  language-specific        language-specific  Stemming and stop words per language

Language analyzers apply stemming (e.g., "running" -> "run") and language-specific stop word lists. Prefer the analyzer that matches the content's language over standard for any field where users search in natural language.
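To see the difference in practice, the same text can be run through the english analyzer with _analyze. This is a small sketch, and the sample sentence is an assumption:

```console
POST _analyze
{
  "analyzer": "english",
  "text": "The Quick Brown Fox is running"
}
```

Compared with the standard analyzer, the english analyzer should drop the stop words "the" and "is" and reduce "running" to "run". Swapping "english" for "standard" in the request shows the unstemmed tokens side by side.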

Custom Analyzer Configuration

Define a custom analyzer in the index analysis settings, then reference it from a field mapping:

PUT /products
{
  "settings": {
    "analysis": {
      "analyzer": {
        "product_search": {
          "type": "custom",
          "char_filter": ["html_strip"],
          "tokenizer": "standard",
          "filter": ["lowercase", "asciifolding", "english_stop"]
        }
      },
      "filter": {
        "english_stop": { "type": "stop", "stopwords": "_english_" }
      }
    }
  },
  "mappings": {
    "properties": {
      "description": { "type": "text", "analyzer": "product_search" }
    }
  }
}

You can also set search_analyzer separately from analyzer when index-time and query-time processing should differ - the most common case is using edge_ngram at index time and a plain standard analyzer at search time for autocomplete.
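The autocomplete case could be configured roughly as follows. This is a sketch under stated assumptions: the index name products_autocomplete, the filter name autocomplete_edge, the analyzer name autocomplete_index, the name field, and the min_gram/max_gram values are all hypothetical and should be tuned to your data:

```console
PUT /products_autocomplete
{
  "settings": {
    "analysis": {
      "filter": {
        "autocomplete_edge": { "type": "edge_ngram", "min_gram": 2, "max_gram": 15 }
      },
      "analyzer": {
        "autocomplete_index": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": ["lowercase", "autocomplete_edge"]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "name": {
        "type": "text",
        "analyzer": "autocomplete_index",
        "search_analyzer": "standard"
      }
    }
  }
}
```

Prefixes are generated only at index time. The query is left as whole words, so a partial input such as "lap" matches the stored prefix of "laptop" without the query itself being split into prefixes.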

Common Pitfalls with Analyzers

  1. Changing the analyzer on an existing field without reindexing. The analyzer assigned to a field is fixed when the field is mapped and cannot be swapped in place. Redefining a custom analyzer in the index settings (which requires closing and reopening the index) affects only documents indexed afterwards; existing documents keep their old tokens until you reindex.
  2. Mismatched index-time and search-time analyzers. If "running" is stemmed to "run" at index time but the query term "running" is not stemmed, the query finds no match.
  3. Wrong language analyzer for the data. The english analyzer applied to French content removes the wrong stop words and stems incorrectly.
  4. Overcomplicated custom analyzers. The keyword tokenizer plus the lowercase filter solves more problems than people think.
  5. Forgetting that keyword fields are not analyzed. Use the normalizer setting on keyword fields if you need lowercase or asciifolding for exact-match comparison.
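For the last pitfall, a normalizer on a keyword field might look like the sketch below. The index name products_exact, the normalizer name lowercase_ascii, and the brand field are hypothetical names used only for illustration:

```console
PUT /products_exact
{
  "settings": {
    "analysis": {
      "normalizer": {
        "lowercase_ascii": {
          "type": "custom",
          "filter": ["lowercase", "asciifolding"]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "brand": { "type": "keyword", "normalizer": "lowercase_ascii" }
    }
  }
}
```

The field still holds a single term per value, but that term is lowercased and accent-folded at both index and query time, so a term query for "nestle" can match a document that stores "Nestlé".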

Monitoring Analyzer Behavior in Production

Analyzer misconfiguration usually shows up as a relevance regression - users complain that searches return fewer or worse results - rather than an outright failure. Watch indexing latency on text-heavy fields (complex token filters cost CPU), segment sizes (overly aggressive ngramming inflates the index), and zero-result query rates after any mapping change.

Pulse tracks index latency, segment growth, and search relevance signals across your Elasticsearch cluster, and surfaces anomalies that often trace back to analyzer changes. Its agentic root-cause analysis correlates a relevance regression with the recent mapping update that caused it, so analyzer-related issues don't quietly degrade search quality.

Frequently Asked Questions

Q: What is the difference between an analyzer and a tokenizer in Elasticsearch?
A: A tokenizer is one stage inside an analyzer. The analyzer is the full pipeline (character filters + tokenizer + token filters); the tokenizer only splits the input stream into tokens. An analyzer always contains exactly one tokenizer.

Q: What is the default analyzer in Elasticsearch?
A: The default analyzer is the standard analyzer. It uses the standard tokenizer (Unicode Text Segmentation) followed by the lowercase token filter. The stop filter is available but disabled by default.

Q: Can I change an analyzer on an existing Elasticsearch field?
A: Not directly. The analyzer on a text field is fixed once the field is mapped. You can update search_analyzer on an existing field through the update mapping API, but to change the index-time analyzer you need to reindex into a new index with the new mapping.
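A reindex into a new index might look like the following sketch. It assumes the existing index is the products index from the configuration example above; the destination name products_v2 and the choice of the english analyzer are hypothetical:

```console
PUT /products_v2
{
  "mappings": {
    "properties": {
      "description": { "type": "text", "analyzer": "english" }
    }
  }
}

POST _reindex
{
  "source": { "index": "products" },
  "dest": { "index": "products_v2" }
}
```

Once the copy completes, clients need to query the new index. Pointing an index alias at products_v2 is one common way to switch over without changing application code.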

Q: How do I test what an analyzer does to my text?
A: Use the _analyze API: POST /my-index/_analyze with the analyzer name (or inline definition) and the input text. The response shows every token plus its offsets, position, and type.
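When the goal is to check what a mapped field actually does, _analyze can resolve the analyzer from the field name instead of taking an analyzer name. A sketch, assuming the products index and its description field from the configuration example above exist; the sample text is an assumption:

```console
POST /products/_analyze
{
  "field": "description",
  "text": "<b>The Crème</b> of the crop"
}
```

This runs whatever analyzer the mapping assigns to description, so the result reflects the index's real configuration and not an analyzer name you might have mistyped.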

Q: When should I create a custom analyzer instead of using a built-in one?
A: Build a custom analyzer when you need to combine specific char filters, tokenizers, and token filters not covered by the built-ins. Examples: stripping HTML before tokenizing, applying synonyms, using edge_ngrams for autocomplete, or combining language stemming with custom stop word lists.

Q: What is the difference between analyzer and normalizer?
A: analyzer applies to text fields and can produce multiple tokens. normalizer applies to keyword fields and produces exactly one token (no tokenization), but can apply char filters and a limited set of token filters like lowercase. Use a normalizer for case-insensitive exact matching on keyword fields.
