Elasticsearch Tokenizer vs Analyzer: The Difference, Explained

In Elasticsearch, a tokenizer is one stage inside an analyzer. The analyzer is the whole text-processing pipeline: zero or more character filters, exactly one tokenizer, and zero or more token filters. The tokenizer's specific job is to split an incoming character stream into tokens; everything else - HTML stripping, lowercasing, stemming, stop-word removal - happens around it. As a result, "I'm using the standard tokenizer" and "I'm using the standard analyzer" mean very different things in practice.

Quick Comparison

| Aspect | Tokenizer | Analyzer |
| --- | --- | --- |
| Role | Splits text into tokens | Full text-processing pipeline |
| Components | Single component | Char filters + 1 tokenizer + token filters |
| Output | Stream of tokens | Stream of indexed terms |
| Configured via | tokenizer setting | analyzer setting on a field |
| Used in mapping | Indirectly (inside analyzer) | Directly (on text fields) |
| Examples | standard, whitespace, keyword, ngram, path_hierarchy | standard, english, simple, custom analyzers |
| Count per analyzer | Exactly 1 | n/a |
| Customizable | Limited (parameters only) | Fully (mix-and-match components) |

The Analyzer Pipeline

Every Elasticsearch text analyzer runs three stages, in order:

  1. Character filters (0+) - operate on the raw character stream before tokenization. html_strip removes HTML tags, mapping replaces characters, pattern_replace runs a regex.
  2. Tokenizer (exactly 1) - splits the filtered stream into tokens. Each token carries position, offsets, and a type.
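  3. Token filters (0+) - modify the token stream. Common ones include lowercase, stop, stemmer, asciifolding, synonym, ngram, and edge_ngram.

For example, a pipeline with html_strip, the standard tokenizer, lowercase, and an English stemmer processes a snippet like this:

"<p>The Quick Brown Foxes</p>"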
   |
   v  [char_filter: html_strip]
"The Quick Brown Foxes"
   |
   v  [tokenizer: standard]
[The] [Quick] [Brown] [Foxes]
   |
   v  [filter: lowercase]
[the] [quick] [brown] [foxes]
   |
   v  [filter: stemmer (english)]
[the] [quick] [brown] [fox]

The tokenizer alone produced [The, Quick, Brown, Foxes]. The analyzer produced [the, quick, brown, fox]. The difference is the work of the surrounding filters.
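The same flow can be modeled in a few lines of Python. This is only a toy sketch of the three-stage structure, not how Elasticsearch or Lucene implement it: the function names are made up for illustration, the regex tokenizer is a rough stand-in for the standard tokenizer, and the suffix rule is far simpler than the real English stemmer.

```python
import re

def html_strip(text):
    """Character filter: drop anything that looks like an HTML tag (toy version)."""
    return re.sub(r"<[^>]+>", "", text)

def word_tokenizer(text):
    """Tokenizer: split into runs of word characters (rough stand-in for `standard`)."""
    return re.findall(r"\w+", text)

def lowercase(tokens):
    """Token filter: lowercase every token."""
    return [t.lower() for t in tokens]

def toy_stemmer(tokens):
    """Token filter: strip a plural suffix. NOT the real English stemmer."""
    stemmed = []
    for t in tokens:
        if t.endswith(("xes", "ses", "ches", "shes")):
            stemmed.append(t[:-2])
        elif t.endswith("s") and not t.endswith("ss"):
            stemmed.append(t[:-1])
        else:
            stemmed.append(t)
    return stemmed

def analyze(text, char_filters, tokenizer, token_filters):
    """Run the stages in order: char filters, then one tokenizer, then token filters."""
    for char_filter in char_filters:
        text = char_filter(text)
    tokens = tokenizer(text)
    for token_filter in token_filters:
        tokens = token_filter(tokens)
    return tokens

source = "<p>The Quick Brown Foxes</p>"
tokenizer_only = word_tokenizer(html_strip(source))
full_pipeline = analyze(source, [html_strip], word_tokenizer, [lowercase, toy_stemmer])

print(tokenizer_only)  # ['The', 'Quick', 'Brown', 'Foxes']
print(full_pipeline)   # ['the', 'quick', 'brown', 'fox']
```

Note that `analyze` takes lists for the filters but a single callable for the tokenizer, which mirrors the "0+, exactly 1, 0+" shape of the pipeline.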

Built-in Tokenizers

Elasticsearch ships with a range of tokenizers. The right choice depends on what "a token" means for your data.

| Tokenizer | Splits on | Typical use |
| --- | --- | --- |
| standard (default) | Unicode word boundaries (UAX #29) | General-purpose text |
| whitespace | Whitespace only | Identifiers, code, log fragments |
| keyword | Never (whole input = one token) | Already-tokenized text, IDs |
| pattern | Regex matches (default \W+) | Custom split rules |
| letter | Non-letter characters | Letters-only tokens |
| lowercase | Same as letter + lowercases | Simple analyzer's tokenizer |
| ngram | N-character substrings | Partial-match, fuzzy search |
| edge_ngram | N-character prefixes | Autocomplete |
| path_hierarchy | / (or a configured delimiter) | File paths, hierarchical IDs |
| simple_pattern / simple_pattern_split | Limited regex | Specialized splits |
| char_group | Configured set of characters | Domain-specific delimiters |
| thai (built in); nori, kuromoji, etc. (via analysis plugins) | Language-specific rules | Languages without whitespace boundaries (Thai, Japanese, Korean, etc.) |

You configure a tokenizer inside a custom analyzer:

PUT /paths
{
  "settings": {
    "analysis": {
      "analyzer": {
        "path_analyzer": {
          "type":      "custom",
          "tokenizer": "path_hierarchy"
        }
      }
    }
  }
}
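To see what this analyzer would emit, here is a small Python approximation of path_hierarchy's splitting behavior. It is a sketch for building intuition, assuming the default forward direction and no other options; the function name is invented here and edge cases (trailing delimiters, skip or reverse settings) are not modeled.

```python
def toy_path_hierarchy(path, delimiter="/"):
    """Emit one token per level of the path, each token including its ancestors."""
    tokens = []
    # Start searching at index 1 so a leading delimiter does not yield an empty token.
    cut = path.find(delimiter, 1)
    while cut != -1:
        tokens.append(path[:cut])
        cut = path.find(delimiter, cut + 1)
    if path:
        tokens.append(path)
    return tokens

print(toy_path_hierarchy("/usr/local/bin"))
# ['/usr', '/usr/local', '/usr/local/bin']
print(toy_path_hierarchy("one-two-three", delimiter="-"))
# ['one', 'one-two', 'one-two-three']
```

Because every ancestor path becomes its own term, an exact-term lookup for /usr/local can match every document stored beneath that directory.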

When to Customize the Tokenizer vs the Whole Analyzer

Customize just the tokenizer when the default split rules don't fit your data: file paths, code identifiers, or domain-specific delimiters. You're saying "the way to break text into tokens is different here, but the rest of the pipeline is fine."

Customize the full analyzer when you need a different mix of filters: synonyms, stemming, multi-language stop words, or unusual character normalization. Most production cases are full custom analyzers, because real text needs multiple processing stages.

PUT /products
{
  "settings": {
    "analysis": {
      "analyzer": {
        "product_text": {
          "type":        "custom",
          "char_filter": ["html_strip"],
          "tokenizer":   "standard",
          "filter":      ["lowercase", "asciifolding", "english_stop", "english_stemmer"]
        }
      },
      "filter": {
        "english_stop":     { "type": "stop",    "stopwords": "_english_" },
        "english_stemmer":  { "type": "stemmer", "language":  "english"  }
      }
    }
  }
}

Testing Tokenizers and Analyzers

The _analyze API works for both:

# Test a tokenizer alone
POST /_analyze
{
  "tokenizer": "standard",
  "text":      "The quick brown foxes"
}

# Test a full analyzer
POST /_analyze
{
  "analyzer": "english",
  "text":     "The quick brown foxes"
}

# Test a custom pipeline inline
POST /_analyze
{
  "char_filter": ["html_strip"],
  "tokenizer":   "standard",
  "filter":      ["lowercase", "stop"],
  "text":        "<p>The Quick Brown Foxes</p>"
}

The response lists tokens with positions and offsets - which is how you confirm the analyzer is doing what you think it's doing.
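Positions and offsets are worth reading closely because filters can remove tokens without renumbering the survivors. The Python sketch below imitates that for a lowercase + stop pipeline on plain text. It borrows the field names used in the _analyze response (token, start_offset, end_offset, position), but it is a toy model: the stop list is a tiny illustrative subset and the regex is only a stand-in for the standard tokenizer.

```python
import re

# Tiny illustrative subset; the real English stop list is longer.
STOPWORDS = {"a", "an", "and", "of", "the"}

def toy_analyze(text):
    """Lowercase and drop stop words, keeping each token's original position and offsets."""
    tokens = []
    for position, match in enumerate(re.finditer(r"\w+", text)):
        term = match.group().lower()
        if term in STOPWORDS:
            continue  # the token is dropped, but its position is not reused
        tokens.append({
            "token": term,
            "start_offset": match.start(),
            "end_offset": match.end(),
            "position": position,
        })
    return tokens

for token in toy_analyze("The Quick Brown Foxes"):
    print(token)
# {'token': 'quick', 'start_offset': 4, 'end_offset': 9, 'position': 1}
# {'token': 'brown', 'start_offset': 10, 'end_offset': 15, 'position': 2}
# {'token': 'foxes', 'start_offset': 16, 'end_offset': 21, 'position': 3}
```

The first surviving token sits at position 1, not 0. A gap like this in a real _analyze response is a sign that a filter removed a token, which matters for phrase queries.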

Common Confusion Points

  1. "The standard analyzer uses the standard tokenizer." True, but it also adds the lowercase token filter. The standard analyzer produces lowercased tokens; the standard tokenizer alone does not.
  2. "I can use multiple tokenizers in one analyzer." No. Exactly one tokenizer per analyzer. To get multi-tokenizer behavior, use multi-fields with different analyzers.
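  3. "Tokenizers are slower than analyzers." This gets the relationship backwards. An analyzer contains a tokenizer, so it does at least as much work: the tokenizer's cost plus the cost of every character filter and token filter around it.
  4. "keyword field uses the keyword tokenizer." Sort of. A keyword field does index its value as a single token, but it doesn't go through an analyzer at all. By default it applies no analysis; optionally it can use a normalizer, which allows only a restricted set of character filters and token filters and no tokenizer.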

Production Monitoring

Analyzer changes are easy to make and hard to evaluate - search relevance regressions can be subtle, and indexing latency on text-heavy fields shifts with token-filter complexity. Pulse tracks indexing latency, segment growth, and search behavior across your Elasticsearch cluster, and ties anomalies back to recent mapping or analyzer changes. When a relevance regression follows a custom-analyzer rollout, Pulse's automated root-cause analysis points at the change.

Frequently Asked Questions

Q: What is the difference between a tokenizer and an analyzer in Elasticsearch?
A: A tokenizer splits text into tokens (one stage). An analyzer is the full pipeline that wraps a tokenizer with character filters before it and token filters after it. Every analyzer contains exactly one tokenizer.

Q: Can I use multiple tokenizers in one Elasticsearch analyzer?
A: No. An analyzer contains exactly one tokenizer. To get the effect of multiple tokenization strategies, use multi-fields: map the same source value under multiple field names, each with its own analyzer.
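As a rough illustration, the multi-field approach could look like the create-index body below, written as a Python dict that serializes to the JSON you would send. The field name title and the sub-field names english and exact are hypothetical; standard, english, and whitespace are built-in analyzers.

```python
import json

# Hypothetical mapping: one source field, three differently analyzed views of it.
create_index_body = {
    "mappings": {
        "properties": {
            "title": {
                "type": "text",
                "analyzer": "standard",
                "fields": {
                    # Queried as "title.english": stemmed, stop words removed.
                    "english": {"type": "text", "analyzer": "english"},
                    # Queried as "title.exact": split on whitespace only.
                    "exact": {"type": "text", "analyzer": "whitespace"},
                },
            }
        }
    }
}

print(json.dumps(create_index_body, indent=2))
```

Each sub-field is indexed separately, so this trades extra index size for the ability to query title, title.english, or title.exact depending on the match behavior you need.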

Q: What is the default tokenizer in Elasticsearch?
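A: The standard tokenizer, in the sense that the default analyzer (standard) is built on it. A custom analyzer has no implicit tokenizer; you must name one. The standard tokenizer implements Unicode Text Segmentation (UAX #29) and splits on word boundaries, which works well for most Western languages.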

Q: How do I test what a tokenizer or analyzer does?
A: Use the _analyze API. Specify either tokenizer (to test the tokenizer alone) or analyzer (to test the full pipeline), along with the text. The response lists each token, its offsets, and its position.

Q: Can I create a custom tokenizer in Elasticsearch?
A: You can configure existing tokenizers (pattern, char_group, ngram, edge_ngram) with custom parameters. Writing a brand-new tokenizer from scratch requires implementing a Lucene Tokenizer in Java and packaging it as a plugin. For most cases, parameterizing a built-in tokenizer is enough.
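For instance, parameterizing edge_ngram for autocomplete might look like the sketch below. The settings dict mirrors the JSON body of an index's analysis settings; the names autocomplete_tokenizer and autocomplete are made up for this example, while min_gram, max_gram, and token_chars are edge_ngram parameters. The helper function is a simplified Python approximation of what those parameters produce, not the actual implementation.

```python
import json
import re

settings = {
    "analysis": {
        "tokenizer": {
            # A built-in tokenizer type with custom parameters (hypothetical name).
            "autocomplete_tokenizer": {
                "type": "edge_ngram",
                "min_gram": 2,
                "max_gram": 10,
                "token_chars": ["letter", "digit"],
            }
        },
        "analyzer": {
            "autocomplete": {
                "type": "custom",
                "tokenizer": "autocomplete_tokenizer",
                "filter": ["lowercase"],
            }
        },
    }
}

def toy_edge_ngram(text, min_gram=2, max_gram=10):
    """Approximate edge_ngram with token_chars [letter, digit]:
    emit prefixes of each run of letters and digits."""
    grams = []
    for word in re.findall(r"[^\W_]+", text):
        for size in range(min_gram, min(max_gram, len(word)) + 1):
            grams.append(word[:size])
    return grams

print(json.dumps({"settings": settings}, indent=2))
print(toy_edge_ngram("2 Quick Foxes."))
# ['Qu', 'Qui', 'Quic', 'Quick', 'Fo', 'Fox', 'Foxe', 'Foxes']
```

The lone "2" yields nothing because it is shorter than min_gram. The lowercase filter in the analyzer would then lowercase these prefixes, which the toy function does not do.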

Q: Do keyword fields use a tokenizer or analyzer?
A: Neither, by default. keyword fields store the input as a single token without analysis. You can apply a normalizer for case-folding and character normalization, but normalizers don't include a tokenizer (the field stays one token).
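A normalizer setup might look like the following sketch. The dict mirrors a create-index JSON body; the normalizer name lowercase_ascii and the field name sku are hypothetical. The helper function only approximates the effect of lowercase plus asciifolding by stripping combining accents, so it covers fewer characters than the real filters.

```python
import json
import unicodedata

create_index_body = {
    "settings": {
        "analysis": {
            "normalizer": {
                # No tokenizer key: a normalizer only lists char filters and token filters.
                "lowercase_ascii": {
                    "type": "custom",
                    "filter": ["lowercase", "asciifolding"],
                }
            }
        }
    },
    "mappings": {
        "properties": {
            "sku": {"type": "keyword", "normalizer": "lowercase_ascii"}
        }
    },
}

def toy_normalize(value):
    """Lowercase and strip accents; the value stays a single token."""
    decomposed = unicodedata.normalize("NFKD", value.lower())
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

print(json.dumps(create_index_body, indent=2))
print(toy_normalize("Crème Brûlée"))  # creme brulee
```

The space inside the value survives: the whole string is still indexed as one term, only with case and accents normalized.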
