In Elasticsearch, a tokenizer is one stage inside an analyzer. The analyzer is the whole text-processing pipeline (zero or more character filters, exactly one tokenizer, zero or more token filters). The tokenizer's specific job is to split an incoming character stream into tokens; everything else - HTML stripping, lowercasing, stemming, stop-word removal - happens around it. As a result, "I'm using the standard tokenizer" and "I'm using the standard analyzer" are very different statements in practice.
Quick Comparison
| Aspect | Tokenizer | Analyzer |
|---|---|---|
| Role | Splits text into tokens | Full text-processing pipeline |
| Components | Single component | Char filters + 1 tokenizer + token filters |
| Output | Stream of tokens | Stream of indexed terms |
| Configured via | tokenizer setting | analyzer setting on a field |
| Used in mapping | Indirectly (inside analyzer) | Directly (on text fields) |
| Examples | standard, whitespace, keyword, ngram, path_hierarchy | standard, english, simple, custom analyzers |
| Count per analyzer | Exactly 1 | n/a |
| Customizable | Limited (parameters only) | Fully (mix-and-match components) |
The Analyzer Pipeline
Every Elasticsearch text analyzer runs three stages, in order:
- Character filters (0+) - operate on the raw character stream before tokenization. html_strip removes HTML tags, mapping replaces characters, pattern_replace runs a regex.
- Tokenizer (exactly 1) - splits the filtered stream into tokens. Each token carries position, offsets, and a type.
- Token filters (0+) - modify the token stream. lowercase, stop, stemmer, asciifolding, synonym, ngram, edge_ngram are the common ones.

```text
"<p>The Quick Brown Foxes</p>"
        |
        v   [char_filter: html_strip]
"The Quick Brown Foxes"
        |
        v   [tokenizer: standard]
[The] [Quick] [Brown] [Foxes]
        |
        v   [filter: lowercase]
[the] [quick] [brown] [foxes]
        |
        v   [filter: stemmer (english)]
[the] [quick] [brown] [fox]
```
The tokenizer alone produced [The, Quick, Brown, Foxes]. The analyzer produced [the, quick, brown, fox]. The difference is the work of the surrounding filters.
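To watch each stage separately, the _analyze API accepts an explain flag. The request below is a sketch that rebuilds the pipeline from the diagram inline; defining the stemmer as an inline filter object is an assumption made for illustration, not something taken from a saved index.

```json
# Inline pipeline matching the diagram: html_strip -> standard -> lowercase -> english stemmer
POST /_analyze
{
  "char_filter": ["html_strip"],
  "tokenizer": "standard",
  "filter": [
    "lowercase",
    { "type": "stemmer", "language": "english" }
  ],
  "text": "<p>The Quick Brown Foxes</p>",
  "explain": true
}
```

With explain enabled, the response should include a detail section that reports the output of the character filters, the tokenizer, and each token filter in turn, so the progression from [The, Quick, Brown, Foxes] to [the, quick, brown, fox] can be checked stage by stage.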
Built-in Tokenizers
Elasticsearch ships with a range of tokenizers. The right choice depends on what "a token" means for your data.
| Tokenizer | Splits on | Typical use |
|---|---|---|
| standard (default) | Unicode word boundaries (UAX #29) | General-purpose text |
| whitespace | Whitespace only | Identifiers, code, log fragments |
| keyword | Never (whole input = one token) | Already-tokenized text, IDs |
| pattern | Regex match | Custom split rules |
| letter | Non-letter characters | Letters-only tokens |
| lowercase | Same as letter + lowercases | Simple analyzer's tokenizer |
| ngram | N-character substrings | Partial-match, fuzzy search |
| edge_ngram | N-character prefixes | Autocomplete |
| path_hierarchy | / (or configured separator) | File paths, hierarchical IDs |
| simple_pattern / simple_pattern_split | Limited regex | Specialized splits |
| char_group | Configured set of characters | Domain-specific delimiters |
| thai, nori, kuromoji, etc. | Language-specific rules | Languages without whitespace boundaries (Thai, Japanese, Korean, etc.) |
You configure a tokenizer inside a custom analyzer:
```json
PUT /paths
{
  "settings": {
    "analysis": {
      "analyzer": {
        "path_analyzer": {
          "type": "custom",
          "tokenizer": "path_hierarchy"
        }
      }
    }
  }
}
```
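As a quick check, and assuming the paths index above has been created, the analyzer can be run against a sample path. The input string here is only an example.

```json
# Run the index-level path_analyzer against a sample path
POST /paths/_analyze
{
  "analyzer": "path_analyzer",
  "text": "/usr/local/bin"
}
```

For this input the path_hierarchy tokenizer should emit one token per ancestor path: /usr, /usr/local, and /usr/local/bin. That is what lets a term query on /usr/local match everything beneath that directory.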
When to Customize the Tokenizer vs the Whole Analyzer
Customize just the tokenizer when the default split rules don't fit your data: file paths, code identifiers, or domain-specific delimiters. You're saying "the way to break text into tokens is different here, but the rest of the pipeline is fine."
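A minimal sketch of a tokenizer-only customization, assuming log-style text where hyphens, underscores, and colons should all act as delimiters. The index name log_lines and the names delimiter_tokenizer and delimiter_analyzer are hypothetical; char_group and its tokenize_on_chars parameter are the built-in pieces.

```json
# Define a parameterized char_group tokenizer and wrap it in a custom analyzer
PUT /log_lines
{
  "settings": {
    "analysis": {
      "tokenizer": {
        "delimiter_tokenizer": {
          "type": "char_group",
          "tokenize_on_chars": ["whitespace", "-", "_", ":"]
        }
      },
      "analyzer": {
        "delimiter_analyzer": {
          "type": "custom",
          "tokenizer": "delimiter_tokenizer",
          "filter": ["lowercase"]
        }
      }
    }
  }
}

# Inspect how a sample line is split
POST /log_lines/_analyze
{
  "analyzer": "delimiter_analyzer",
  "text": "2024-01-15 ERROR user_id:42"
}
```

Only the split rule changes here; the rest of the pipeline is the ordinary lowercase filter.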
Customize the full analyzer when you need a different mix of filters: synonyms, stemming, multi-language stop words, or unusual character normalization. Most production cases are full custom analyzers, because real text needs multiple processing stages.
```json
PUT /products
{
  "settings": {
    "analysis": {
      "analyzer": {
        "product_text": {
          "type": "custom",
          "char_filter": ["html_strip"],
          "tokenizer": "standard",
          "filter": ["lowercase", "asciifolding", "english_stop", "english_stemmer"]
        }
      },
      "filter": {
        "english_stop": { "type": "stop", "stopwords": "_english_" },
        "english_stemmer": { "type": "stemmer", "language": "english" }
      }
    }
  }
}
```
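Defining the analyzer does nothing until a field references it. The sketch below attaches product_text to a text field and then analyzes sample text through that field; the field name description and the sample text are assumptions for illustration.

```json
# Attach the custom analyzer to a text field
PUT /products/_mapping
{
  "properties": {
    "description": {
      "type": "text",
      "analyzer": "product_text"
    }
  }
}

# Analyze sample text using whatever analyzer the field is mapped with
GET /products/_analyze
{
  "field": "description",
  "text": "<p>Crème brûlée torches</p>"
}
```

Analyzing by field rather than by analyzer name confirms that the mapping, not just the analyzer definition, is what you expect.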
Testing Tokenizers and Analyzers
The _analyze API works for both:
```json
# Test a tokenizer alone
POST /_analyze
{
  "tokenizer": "standard",
  "text": "The quick brown foxes"
}

# Test a full analyzer
POST /_analyze
{
  "analyzer": "english",
  "text": "The quick brown foxes"
}

# Test a custom pipeline inline
POST /_analyze
{
  "char_filter": ["html_strip"],
  "tokenizer": "standard",
  "filter": ["lowercase", "stop"],
  "text": "<p>The Quick Brown Foxes</p>"
}
```
The response lists tokens with positions and offsets - which is how you confirm the analyzer is doing what you think it's doing.
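For reference, the response to the first request (the standard tokenizer on "The quick brown foxes") should look roughly like this. The values are what the standard tokenizer is expected to produce for that input; they are shown as an illustration, not captured from a live cluster.

```json
{
  "tokens": [
    { "token": "The",   "start_offset": 0,  "end_offset": 3,  "type": "<ALPHANUM>", "position": 0 },
    { "token": "quick", "start_offset": 4,  "end_offset": 9,  "type": "<ALPHANUM>", "position": 1 },
    { "token": "brown", "start_offset": 10, "end_offset": 15, "type": "<ALPHANUM>", "position": 2 },
    { "token": "foxes", "start_offset": 16, "end_offset": 21, "type": "<ALPHANUM>", "position": 3 }
  ]
}
```

Note that "The" keeps its capital letter and "foxes" is not stemmed: the tokenizer only splits.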
Common Confusion Points
- "The standard analyzer uses the standard tokenizer." True, but it also adds the lowercase token filter. The standard analyzer produces lowercased tokens; the standard tokenizer alone does not.
- "I can use multiple tokenizers in one analyzer." No. Exactly one tokenizer per analyzer. To get multi-tokenizer behavior, use multi-fields with different analyzers.
- "Tokenizers are slower than analyzers." No. An analyzer does at least as much work as its tokenizer, because it runs the tokenizer plus any character filters and token filters around it.
- "A keyword field uses the keyword tokenizer." Sort of, but keyword fields don't go through the analyzer pipeline at all - they can optionally use a normalizer (which only allows specific char filters and token filters, not tokenizers).
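The first point is easy to verify by sending the same text through the standard tokenizer and the standard analyzer. The sample text is arbitrary.

```json
# Tokenizer only: splitting, no lowercasing
POST /_analyze
{
  "tokenizer": "standard",
  "text": "The QUICK Fox"
}

# Analyzer: same tokenizer plus the lowercase token filter
POST /_analyze
{
  "analyzer": "standard",
  "text": "The QUICK Fox"
}
```

The first request should return The, QUICK, Fox and the second the, quick, fox.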
Production Monitoring
Analyzer changes are easy to make and hard to evaluate - search relevance regressions can be subtle, and indexing latency on text-heavy fields shifts with token-filter complexity. Pulse tracks indexing latency, segment growth, and search behavior across your Elasticsearch cluster, and ties anomalies back to recent mapping or analyzer changes. When a relevance regression follows a custom-analyzer rollout, Pulse's automated root-cause analysis points at the change.
Frequently Asked Questions
Q: What is the difference between a tokenizer and an analyzer in Elasticsearch?
A: A tokenizer splits text into tokens (one stage). An analyzer is the full pipeline that wraps a tokenizer with character filters before it and token filters after it. Every analyzer contains exactly one tokenizer.
Q: Can I use multiple tokenizers in one Elasticsearch analyzer?
A: No. An analyzer contains exactly one tokenizer. To get the effect of multiple tokenization strategies, use multi-fields: map the same source value under multiple field names, each with its own analyzer.
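A minimal multi-field sketch, assuming an index named articles with a title field; the sub-field names whitespace and english are hypothetical labels, while the three analyzers are built in.

```json
# One source value, indexed three ways with three different analyzers
PUT /articles
{
  "mappings": {
    "properties": {
      "title": {
        "type": "text",
        "analyzer": "standard",
        "fields": {
          "whitespace": { "type": "text", "analyzer": "whitespace" },
          "english":    { "type": "text", "analyzer": "english" }
        }
      }
    }
  }
}
```

A query can then target title, title.whitespace, or title.english depending on which tokenization it needs.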
Q: What is the default tokenizer in Elasticsearch?
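A: The standard tokenizer is the de facto default: the standard analyzer, which Elasticsearch applies to text fields when no analyzer is specified, uses it. A custom analyzer has no default and must name its tokenizer explicitly. The standard tokenizer implements Unicode Text Segmentation (UAX #29) and splits on word boundaries, which works well for most Western languages.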
Q: How do I test what a tokenizer or analyzer does?
A: Use the _analyze API. Specify either tokenizer (to test the tokenizer alone) or analyzer (to test the full pipeline), along with the text. The response lists each token, its offsets, and its position.
Q: Can I create a custom tokenizer in Elasticsearch?
A: You can configure existing tokenizers (pattern, char_group, ngram, edge_ngram) with custom parameters. Writing a brand-new tokenizer from scratch requires implementing a Lucene Tokenizer in Java and packaging it as a plugin. For most cases, parameterizing a built-in tokenizer is enough.
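As a sketch of what parameterizing a built-in tokenizer looks like, the request below configures edge_ngram for prefix matching. The index name autocomplete_demo, the names prefix_tokenizer and prefix_analyzer, and the gram sizes are assumptions chosen for illustration.

```json
# Parameterize the built-in edge_ngram tokenizer
PUT /autocomplete_demo
{
  "settings": {
    "analysis": {
      "tokenizer": {
        "prefix_tokenizer": {
          "type": "edge_ngram",
          "min_gram": 2,
          "max_gram": 10,
          "token_chars": ["letter", "digit"]
        }
      },
      "analyzer": {
        "prefix_analyzer": {
          "type": "custom",
          "tokenizer": "prefix_tokenizer",
          "filter": ["lowercase"]
        }
      }
    }
  }
}

# Inspect the prefixes generated for one word
POST /autocomplete_demo/_analyze
{
  "analyzer": "prefix_analyzer",
  "text": "Quick"
}
```

With these settings the word "Quick" should yield qu, qui, quic, and quick.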
Q: Do keyword fields use a tokenizer or analyzer?
A: Neither, by default. keyword fields store the input as a single token without analysis. You can apply a normalizer for case-folding and character normalization, but normalizers don't include a tokenizer (the field stays one token).
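A minimal normalizer sketch, assuming an index named users_demo with a username field; the normalizer name lowercase_ascii is hypothetical. Note that the definition has a filter list but no tokenizer entry.

```json
# A normalizer has filters but no tokenizer
PUT /users_demo
{
  "settings": {
    "analysis": {
      "normalizer": {
        "lowercase_ascii": {
          "type": "custom",
          "filter": ["lowercase", "asciifolding"]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "username": {
        "type": "keyword",
        "normalizer": "lowercase_ascii"
      }
    }
  }
}

# The _analyze API can run a normalizer too; the result is a single token
GET /users_demo/_analyze
{
  "normalizer": "lowercase_ascii",
  "text": "JÖRG Müller"
}
```

The whole input, space included, comes back as one case-folded, accent-folded token.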