An Elasticsearch analyzer is the component that converts a text field into the tokens stored in the inverted index. Every analyzer runs three stages in order: zero or more character filters, exactly one tokenizer, and zero or more token filters. The same analyzer applies at index time and (by default) at query time, which is why a mismatch between the two breaks search relevance in subtle ways.
How an Elasticsearch Analyzer Works
The three-stage pipeline is fixed. Each stage receives the output of the previous stage and feeds the next:
- Character filters transform the raw text stream before tokenization. Examples: html_strip removes HTML tags, mapping substitutes characters, pattern_replace applies a regex.
- Tokenizer splits the stream into tokens. The choice of tokenizer defines word boundaries. The default standard tokenizer follows Unicode Text Segmentation (UAX #29).
- Token filters modify, add, or remove tokens. lowercase, stop, asciifolding, synonym, stemmer, and ngram are the most common.
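The three stages above can be modeled in a few lines of Python. This is only a toy sketch of the data flow, not how Elasticsearch implements analysis: the function names are made up for illustration, the regex tokenizer is a rough stand-in for the standard tokenizer, and the one-word stop list is an assumption.

```python
import re
from typing import Callable, Iterable, List

CharFilter = Callable[[str], str]
Tokenizer = Callable[[str], List[str]]
TokenFilter = Callable[[List[str]], List[str]]


def strip_html(text: str) -> str:
    """Crude stand-in for html_strip: replace anything tag-shaped with a space."""
    return re.sub(r"<[^>]+>", " ", text)


def word_tokenizer(text: str) -> List[str]:
    """Rough stand-in for the standard tokenizer: runs of word characters."""
    return re.findall(r"\w+", text)


def lowercase(tokens: List[str]) -> List[str]:
    return [t.lower() for t in tokens]


def make_stop_filter(stopwords: Iterable[str]) -> TokenFilter:
    stop = set(stopwords)
    return lambda tokens: [t for t in tokens if t not in stop]


def analyze(
    text: str,
    char_filters: List[CharFilter],
    tokenizer: Tokenizer,
    token_filters: List[TokenFilter],
) -> List[str]:
    # Stage 1: character filters rewrite the raw string.
    for char_filter in char_filters:
        text = char_filter(text)
    # Stage 2: exactly one tokenizer splits the string into tokens.
    tokens = tokenizer(text)
    # Stage 3: token filters run in the order they are listed.
    for token_filter in token_filters:
        tokens = token_filter(tokens)
    return tokens


tokens = analyze(
    "<p>The Quick Brown Fox!</p>",
    char_filters=[strip_html],
    tokenizer=word_tokenizer,
    token_filters=[lowercase, make_stop_filter({"the"})],
)
print(tokens)  # ['quick', 'brown', 'fox']
```

Order matters in the last stage: the stop filter here only matches "the" because lowercase runs before it.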
You test any analyzer against a sample string with the _analyze API:
POST _analyze
{
  "analyzer": "standard",
  "text": "The Quick Brown Fox!"
}
The response lists each token, its start/end offsets, position, and type. This is the single most useful debugging tool for analyzer behavior.
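When reading an _analyze response, it helps to line each token up with the slice of input it came from. The sketch below uses a hand-written sample in the shape the API returns for the request above; the helper name token_spans is hypothetical, and in practice the dictionary would come from your HTTP client.

```python
from typing import Dict, List, Tuple

TEXT = "The Quick Brown Fox!"

# Hand-written sample in the shape of an _analyze response for TEXT
# with the standard analyzer.
sample_response: Dict[str, List[dict]] = {
    "tokens": [
        {"token": "the", "start_offset": 0, "end_offset": 3, "type": "<ALPHANUM>", "position": 0},
        {"token": "quick", "start_offset": 4, "end_offset": 9, "type": "<ALPHANUM>", "position": 1},
        {"token": "brown", "start_offset": 10, "end_offset": 15, "type": "<ALPHANUM>", "position": 2},
        {"token": "fox", "start_offset": 16, "end_offset": 19, "type": "<ALPHANUM>", "position": 3},
    ]
}


def token_spans(response: dict, text: str) -> List[Tuple[str, str, int]]:
    """Return (emitted token, original slice, position) for every token."""
    return [
        (t["token"], text[t["start_offset"]:t["end_offset"]], t["position"])
        for t in response["tokens"]
    ]


spans = token_spans(sample_response, TEXT)
for emitted, original, position in spans:
    print(f"{position}: {original!r} -> {emitted!r}")
```

Comparing the original slice with the emitted token makes it obvious which filter changed what: here only the case differs, and the trailing "!" produced no token at all.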
Built-in Analyzers
Elasticsearch ships with several analyzers covering the common cases. Use these before reaching for a custom one.
| Analyzer | Tokenizer | Token filters | Typical use |
|---|---|---|---|
| standard (default) | standard | lowercase | General-purpose Unicode text |
| simple | lowercase (also splits) | - | Lowercased, letters-only tokens |
| whitespace | whitespace | - | Code, identifiers, log fragments |
| keyword | keyword (no split) | - | Treat full input as one token |
| stop | lowercase | stop | Drop English stop words |
| pattern | pattern (regex) | lowercase | Custom split rules via regex |
| english, french, ... | language-specific | language-specific | Stemming and stop words per language |
Language analyzers apply stemming (e.g., "running" -> "run") and language-specific stop word lists. For any field where users search in natural language, prefer the analyzer for that language over standard.
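To get a feel for how differently the built-in analyzers in the table above treat the same input, here is a rough Python approximation of three of them. The function names are hypothetical and the regexes only mimic the documented behavior (whitespace splits on whitespace, simple keeps lowercased runs of letters, keyword emits the input unchanged); for real results, run the _analyze API.

```python
import re
from typing import List


def whitespace_like(text: str) -> List[str]:
    """Approximates the whitespace analyzer: split on whitespace, keep case."""
    return text.split()


def simple_like(text: str) -> List[str]:
    """Approximates the simple analyzer: lowercased runs of letters only."""
    return [t.lower() for t in re.findall(r"[^\W\d_]+", text)]


def keyword_like(text: str) -> List[str]:
    """Approximates the keyword analyzer: the whole input as one token."""
    return [text]


sample = "user_id=42 Failed-Login"
print(whitespace_like(sample))  # ['user_id=42', 'Failed-Login']
print(simple_like(sample))      # ['user', 'id', 'failed', 'login']
print(keyword_like(sample))     # ['user_id=42 Failed-Login']
```

The identifier survives intact only under the whitespace and keyword variants, which is why those suit code and log fragments better than letter-based splitting.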
Custom Analyzer Configuration
Define a custom analyzer in the index analysis settings, then reference it from a field mapping:
PUT /products
{
  "settings": {
    "analysis": {
      "analyzer": {
        "product_search": {
          "type": "custom",
          "char_filter": ["html_strip"],
          "tokenizer": "standard",
          "filter": ["lowercase", "asciifolding", "english_stop"]
        }
      },
      "filter": {
        "english_stop": { "type": "stop", "stopwords": "_english_" }
      }
    }
  },
  "mappings": {
    "properties": {
      "description": { "type": "text", "analyzer": "product_search" }
    }
  }
}
You can also set search_analyzer separately from analyzer when index-time and query-time processing should differ - the most common case is using edge_ngram at index time and a plain standard analyzer at search time for autocomplete.
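A minimal sketch of that autocomplete setup, written as a Python dictionary that could be sent as the body of an index-creation request. The names autocomplete_index, autocomplete_edge, and the name field, as well as the gram sizes 2 and 10, are assumptions for illustration. The small edge_ngrams helper is a toy model of what the filter emits for one token, not Elasticsearch code.

```python
from typing import List

# Index body: edge n-grams at index time, plain standard analysis at search time.
autocomplete_body = {
    "settings": {
        "analysis": {
            "filter": {
                "autocomplete_edge": {"type": "edge_ngram", "min_gram": 2, "max_gram": 10}
            },
            "analyzer": {
                "autocomplete_index": {
                    "type": "custom",
                    "tokenizer": "standard",
                    "filter": ["lowercase", "autocomplete_edge"],
                }
            },
        }
    },
    "mappings": {
        "properties": {
            "name": {
                "type": "text",
                "analyzer": "autocomplete_index",
                "search_analyzer": "standard",
            }
        }
    },
}


def edge_ngrams(token: str, min_gram: int = 2, max_gram: int = 10) -> List[str]:
    """Toy model of an edge n-gram filter: prefixes of one token."""
    upper = min(len(token), max_gram)
    return [token[:n] for n in range(min_gram, upper + 1)]


indexed_terms = edge_ngrams("laptop")
print(indexed_terms)           # ['la', 'lap', 'lapt', 'lapto', 'laptop']
print("lap" in indexed_terms)  # True: the partial query matches an indexed prefix
```

The search side stays un-grammed on purpose. If the query "lap" were also split into prefixes, its "la" term would match every document containing a word that starts with "la", which defeats the point of typing more characters.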
Common Pitfalls with Analyzers
- Changing the analyzer on an existing field without reindexing. The analyzer named in a field's mapping cannot be changed once the field exists. Redefining that analyzer in the index settings (which requires closing the index) only affects documents indexed afterwards; tokens already in the index stay as they were until you reindex.
- Mismatched index-time and search-time analyzers. Stemming "running" to "run" at index time but searching for "running" without stemming returns nothing.
- Wrong language analyzer for the data. The english analyzer applied to French content removes the wrong stop words and stems incorrectly.
- Overcomplicated custom analyzers. keyword plus lowercase solves more problems than people think.
- Forgetting that keyword fields are not analyzed. Use the normalizer setting on keyword fields if you need lowercase or asciifolding for exact-match comparison.
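For the last pitfall, a normalizer definition might look like the sketch below, again as a Python dictionary for an index-creation body. The normalizer name lowercase_ascii and the brand field are hypothetical. The normalize_like function is only an approximation of lowercase plus asciifolding built on Unicode decomposition; the real asciifolding filter covers more characters.

```python
import unicodedata

# Index body: a keyword field whose values are lowercased and ASCII-folded.
normalizer_body = {
    "settings": {
        "analysis": {
            "normalizer": {
                "lowercase_ascii": {
                    "type": "custom",
                    "filter": ["lowercase", "asciifolding"],
                }
            }
        }
    },
    "mappings": {
        "properties": {
            "brand": {"type": "keyword", "normalizer": "lowercase_ascii"}
        }
    },
}


def normalize_like(value: str) -> str:
    """Approximate lowercase + asciifolding; always returns ONE string."""
    decomposed = unicodedata.normalize("NFKD", value.lower())
    # Drop combining marks (accents) left over from the decomposition.
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


print(normalize_like("Café Crème"))  # cafe creme
print(normalize_like("Café Crème") == normalize_like("CAFE CREME"))  # True
```

Unlike an analyzer, nothing is split: the space stays inside the single resulting value, so exact-match comparisons still work on the whole string.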
Monitoring Analyzer Behavior in Production
Analyzer misconfiguration usually shows up as a relevance regression - users complain that searches return fewer or worse results - rather than an outright failure. Watch indexing latency on text-heavy fields (complex token filters cost CPU), segment sizes (overly aggressive ngramming inflates the index), and zero-result query rates after any mapping change.
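As one way to watch the zero-result signal, the sketch below compares the share of queries that returned no hits before and after a mapping change. The hit counts are made-up sample data, and the 0.1 threshold is an arbitrary assumption; a real setup would read these numbers from search logs.

```python
from typing import Sequence


def zero_result_rate(hit_counts: Sequence[int]) -> float:
    """Fraction of queries that returned zero hits (0.0 for an empty sample)."""
    if not hit_counts:
        return 0.0
    return sum(1 for hits in hit_counts if hits == 0) / len(hit_counts)


def is_regression(before: Sequence[int], after: Sequence[int], threshold: float = 0.1) -> bool:
    """Flag the change if the zero-result rate rose by more than the threshold."""
    return zero_result_rate(after) - zero_result_rate(before) > threshold


# Hypothetical per-query hit counts sampled before and after a mapping change.
before_change = [12, 3, 0, 45, 7, 1, 9, 0, 22, 5]
after_change = [12, 0, 0, 45, 0, 1, 0, 0, 22, 5]

print(zero_result_rate(before_change))  # 0.2
print(zero_result_rate(after_change))   # 0.5
print(is_regression(before_change, after_change))  # True
```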
Pulse tracks index latency, segment growth, and search relevance signals across your Elasticsearch cluster, and surfaces anomalies that often trace back to analyzer changes. Its agentic root-cause analysis correlates a relevance regression with the recent mapping update that caused it, so analyzer-related issues don't quietly degrade search quality.
Frequently Asked Questions
Q: What is the difference between an analyzer and a tokenizer in Elasticsearch?
A: A tokenizer is one stage inside an analyzer. The analyzer is the full pipeline (character filters + tokenizer + token filters); the tokenizer only splits the input stream into tokens. An analyzer always contains exactly one tokenizer.
Q: What is the default analyzer in Elasticsearch?
A: The default analyzer is the standard analyzer. It uses the standard tokenizer (Unicode Text Segmentation) followed by the lowercase token filter. The stop filter is available but disabled by default.
Q: Can I change an analyzer on an existing Elasticsearch field?
A: Not directly. The analyzer on a text field is fixed once the field is mapped. You can update search_analyzer dynamically, but to change the index-time analyzer you need to reindex into a new index with the new mapping.
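The reindex route can be sketched as an ordered list of requests. The helper reindex_plan and the index names are hypothetical, and the optional alias step assumes your application reads through an alias that currently points at the old index; each tuple would be sent with whatever HTTP client you use.

```python
from typing import List, Optional, Tuple

Request = Tuple[str, str, dict]


def reindex_plan(
    old_index: str,
    new_index: str,
    new_index_body: dict,
    alias: Optional[str] = None,
) -> List[Request]:
    """Build the (method, path, body) requests for moving to a new analyzer."""
    plan: List[Request] = [
        # 1. Create the new index with the new analysis settings and mapping.
        ("PUT", f"/{new_index}", new_index_body),
        # 2. Copy documents across; they are re-analyzed as they are indexed.
        ("POST", "/_reindex", {"source": {"index": old_index}, "dest": {"index": new_index}}),
    ]
    if alias is not None:
        # 3. Optionally repoint the alias in a single atomic call.
        plan.append(
            ("POST", "/_aliases", {
                "actions": [
                    {"remove": {"index": old_index, "alias": alias}},
                    {"add": {"index": new_index, "alias": alias}},
                ]
            })
        )
    return plan


body = {"mappings": {"properties": {"description": {"type": "text", "analyzer": "english"}}}}
plan = reindex_plan("products_v1", "products_v2", body, alias="products")
for method, path, _ in plan:
    print(method, path)
```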
Q: How do I test what an analyzer does to my text?
A: Use the _analyze API: POST /my-index/_analyze with the analyzer name (or inline definition) and the input text. The response shows every token plus its offsets, position, and type.
Q: When should I create a custom analyzer instead of using a built-in one?
A: Build a custom analyzer when you need a combination of character filters, tokenizer, and token filters that the built-ins do not cover. Examples: stripping HTML before tokenizing, applying synonyms, using edge_ngram for autocomplete, or combining language stemming with custom stop word lists.
Q: What is the difference between analyzer and normalizer?
A: analyzer applies to text fields and can produce multiple tokens. normalizer applies to keyword fields and produces exactly one token (no tokenization), but can apply char filters and a limited set of token filters like lowercase. Use a normalizer for case-insensitive exact matching on keyword fields.