Multi-language search in Elasticsearch is solved by attaching the right language analyzer to each piece of text and then querying the analyzed sub-fields together. Nothing happens automatically: language detection, analyzer selection, and per-language ranking are explicit decisions made by the index designer. The standard pattern is a multi-field mapping with one analyzed text sub-field per supported language, queried via multi_match with field-level boosts that reflect the user's locale or the detected query language.
When to Use Multi-Language Patterns
| Pattern | Best for | Cost |
|---|---|---|
| Multi-field, one sub-field per language | Same content needs to be searchable under multiple analyzers | Index size grows linearly with languages |
| Distinct per-language fields | Documents store separate translations | Documents carry every translation |
| Per-language index | Mono-lingual documents, language known at ingest | Cross-language search needs an alias |
| ICU-based custom analyzer | Non-Latin scripts (CJK, Arabic, Thai) | Requires the analysis-icu plugin |
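For the per-language index pattern in the table, cross-language search usually goes through an alias that spans the per-language indices. The following is a minimal Python sketch that only builds the request body for the _aliases endpoint; the index prefix docs, the language suffixes, and the alias name docs-all are hypothetical names chosen for illustration.

```python
import json


def build_alias_actions(index_prefix, languages, alias):
    """Build a body for POST _aliases that adds each per-language index to one alias."""
    return {
        "actions": [
            {"add": {"index": f"{index_prefix}-{lang}", "alias": alias}}
            for lang in languages
        ]
    }


if __name__ == "__main__":
    # Hypothetical indices docs-en, docs-es, docs-de grouped under docs-all.
    print(json.dumps(build_alias_actions("docs", ["en", "es", "de"], "docs-all"), indent=2))
```

Searching docs-all then fans out to every language index, while a query against docs-es alone stays mono-lingual.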
Configuring Language-Specific Analyzers
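Elasticsearch ships built-in analyzers for roughly 35 languages, each with appropriate tokenization, stop words, and stemming. Use them directly, or configure a named variant when you need to change the stop-word list (stopwords, stopwords_path) or exclude words from stemming (stem_exclusion). The example below defines english_custom to show the override syntax; the title sub-fields themselves use the stock english and spanish analyzers.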
PUT my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "english_custom": {
          "type": "english",
          "stopwords_path": "stopwords/english_custom.txt"
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "title": {
        "type": "text",
        "fields": {
          "english": { "type": "text", "analyzer": "english" },
          "spanish": { "type": "text", "analyzer": "spanish" }
        }
      }
    }
  }
}
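When the list of supported languages changes over time, it can help to generate the sub-field block instead of writing it by hand. The sketch below is one way to do that in Python. The ANALYZERS table and the helper name build_multilang_mapping are assumptions for illustration, and the sub-fields are named after their analyzers to match the title.english and title.spanish convention used in the mapping above.

```python
import json

# Assumed lookup from language code to built-in analyzer name; extend as needed.
ANALYZERS = {"en": "english", "es": "spanish", "de": "german", "fr": "french"}


def build_multilang_mapping(field, lang_codes):
    """Return a mappings body with one analyzed text sub-field per language."""
    subfields = {}
    for code in lang_codes:
        analyzer = ANALYZERS[code]  # raises KeyError for an unsupported code
        subfields[analyzer] = {"type": "text", "analyzer": analyzer}
    return {
        "mappings": {
            "properties": {field: {"type": "text", "fields": subfields}}
        }
    }


if __name__ == "__main__":
    print(json.dumps(build_multilang_mapping("title", ["en", "es"]), indent=2))
```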
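For Chinese, Japanese, and Korean, the built-in cjk analyzer only indexes character bigrams. Install the dedicated plugin for each language (analysis-smartcn for Chinese, analysis-kuromoji for Japanese, analysis-nori for Korean), or use the icu_analyzer from analysis-icu as a script-aware fallback. The same ICU fallback applies to Thai when the built-in thai analyzer is not enough.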
Querying Multi-Language Content
multi_match is the right query for searching across language sub-fields. Boost the user's preferred language above the others so that a match in that language outranks an incidental match produced by another language's stemmer.
GET my_index/_search
{
  "query": {
    "multi_match": {
      "query": "running shoes",
      "fields": ["title.english^3", "title.spanish^1"]
    }
  }
}
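For known mono-lingual content, route the query to a single sub-field (title.spanish) instead of all of them. The type: cross_fields mode is mostly inappropriate for multi-language search: it blends term statistics only across fields that share an analyzer, and fields with different analyzers fall into separate groups that are scored independently. Because every language sub-field has its own analyzer, the blending that cross_fields exists for never happens.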
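The boost-by-locale and single-sub-field routing rules can be sketched as a small request builder. This is an illustrative Python sketch, not a client library API: the function names, the default boost of 3, and the mono_lingual flag are assumptions, and the sub-field names follow the title.english and title.spanish convention used above.

```python
import json


def language_fields(field, languages, preferred, boost=3):
    """Return the multi_match field list with the preferred language boosted."""
    fields = []
    for lang in languages:
        name = f"{field}.{lang}"
        fields.append(f"{name}^{boost}" if lang == preferred else name)
    return fields


def build_search_body(text, field, languages, preferred, mono_lingual=False):
    """Build a search body: one sub-field for mono-lingual content, else multi_match."""
    if mono_lingual:
        # Content language is known, so query only that language's sub-field.
        return {"query": {"match": {f"{field}.{preferred}": text}}}
    return {
        "query": {
            "multi_match": {
                "query": text,
                "fields": language_fields(field, languages, preferred),
            }
        }
    }


if __name__ == "__main__":
    body = build_search_body("running shoes", "title", ["english", "spanish"], "english")
    print(json.dumps(body, indent=2))
```

Leaving type unset keeps the multi_match default (best_fields), which scores each document by its best-matching language sub-field.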
Handling Stop Words, Synonyms, and Sorting
Each language analyzer carries its own stop-word list and stemmer. Sharing lists across languages almost always hurts relevance: a list built for one language removes meaningful words in another and leaves that language's real function words in the index. For example, "die" is a stop word in German but a content word in English. Maintain one list per language. For sortable string fields, use icu_collation_keyword (from analysis-icu) to produce locale-aware sort keys; the default sort on a keyword field compares raw character codes, which places accented letters such as "é" after "z".
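A locale-aware sort field can be sketched as follows. This Python sketch only builds the mapping fragment and the sort clause; the field name name and the sub-field name sort_de are hypothetical, and the analysis-icu plugin is assumed to be installed on every node.

```python
import json


def build_collation_mapping(field, locales):
    """Build mapping properties with one icu_collation_keyword sub-field per locale.

    locales maps a sub-field name to a (language, country) pair, e.g. {"sort_de": ("de", "DE")}.
    """
    subfields = {}
    for subfield, (language, country) in locales.items():
        subfields[subfield] = {
            "type": "icu_collation_keyword",
            "index": False,  # the field is used for sorting only, not for search
            "language": language,
            "country": country,
        }
    return {"properties": {field: {"type": "text", "fields": subfields}}}


def build_sort_clause(field, subfield, order="asc"):
    """Build the sort section of a search body that targets a collation sub-field."""
    return [{f"{field}.{subfield}": {"order": order}}]


if __name__ == "__main__":
    print(json.dumps(build_collation_mapping("name", {"sort_de": ("de", "DE")}), indent=2))
    print(json.dumps({"sort": build_sort_clause("name", "sort_de")}, indent=2))
```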
Operating Multi-Language Indices
Adding a language sub-field can add 30-70% to the field's posting-list size depending on stemmer aggressiveness and stop-word overlap. Refresh and merge budgets need to be revisited when the language count grows. Pulse tracks per-language sub-field index size and query latency, identifying language sub-fields that consume capacity without serving meaningful traffic.
Frequently Asked Questions
Q: Does Elasticsearch detect document language automatically?
A: Not by default. The Elastic Stack ships a language identification model (lang_ident_model_1) usable via the inference ingest processor since 7.6, which classifies text into ~100 languages. Apply it in an ingest pipeline before the document hits the index, and route or analyze based on the detected language.
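As a rough illustration of that answer, the Python sketch below builds an ingest pipeline body that runs lang_ident_model_1 and copies the predicted language code into a top-level field. Several details are assumptions to verify against your version's documentation: the field_map key (early releases used a different name for it), the target_field path _ml.lang_ident, the destination field language, and the fallback logic in subfield_for, which is purely hypothetical.

```python
import json


def build_lang_detect_pipeline(source_field):
    """Build an ingest pipeline body that tags each document with a detected language."""
    return {
        "description": "Detect document language with lang_ident_model_1",
        "processors": [
            {
                "inference": {
                    "model_id": "lang_ident_model_1",
                    "inference_config": {"classification": {"num_top_classes": 1}},
                    # Assumption: the model reads its input from a field named "text".
                    "field_map": {source_field: "text"},
                    "target_field": "_ml.lang_ident",
                }
            },
            {
                # Copy the winning language code to a hypothetical top-level field.
                "rename": {
                    "field": "_ml.lang_ident.predicted_value",
                    "target_field": "language",
                }
            },
        ],
    }


def subfield_for(detected, supported, fallback="english"):
    """Pick the language sub-field for a detected code, or fall back to a default."""
    return supported.get(detected, fallback)


if __name__ == "__main__":
    print(json.dumps(build_lang_detect_pipeline("title"), indent=2))
    print(subfield_for("es", {"en": "english", "es": "spanish"}))
```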
Q: Can I put multiple languages in a single Elasticsearch field?
A: Mechanically yes, in practice no. A single analyzer applied to mixed-language text under-stems some languages and over-stems others, lowering recall everywhere. Use a multi-field mapping with one analyzed sub-field per language instead.
Q: How do I boost results in a specific language?
A: Use field boosts in a multi_match query. "fields": ["title.english^2", "title.spanish"] boosts the English sub-field's score 2x relative to Spanish. Set the boosts based on the user's preferred locale or the detected query language.
Q: How do I handle Chinese, Japanese, or Korean (CJK) text?
A: Install a CJK analyzer plugin: analysis-smartcn (Chinese), analysis-kuromoji (Japanese), or analysis-nori (Korean). The icu_analyzer from analysis-icu is a reasonable fallback when plugins aren't available.
Q: What's the right way to handle stop words across languages?
A: Maintain separate stop-word lists per language analyzer. The built-in language analyzers ship with sensible defaults; override them via stopwords or stopwords_path when your domain needs different exclusions.
Q: How does sorting work for multi-language search results?
A: Default lexicographic sort is wrong for any language with non-ASCII letters. Use icu_collation_keyword (analysis-icu plugin) to generate locale-aware sort keys and sort on that field. Provide separate collation fields per locale when sort order must match the user's language.
Related Reading
- Elasticsearch multi-language search guide: a deeper walk-through with index-time language detection.
- Elasticsearch search_as_you_type field data type: combine autocomplete with per-language analysis.
- Elasticsearch create index with mapping: the right way to declare language sub-fields up front.
- Elasticsearch add field to mapping: add a new language sub-field on an existing index.
- Elasticsearch IllegalArgumentException: mapper conflicts: the error you get when two indices map the same field with different language analyzers.
- Elasticsearch index.mapping.total_fields.limit: each language sub-field counts against this limit.