Multi-language search in Elasticsearch hinges on giving each language its own analyzer chain - tokenization, stop-word removal, and stemming rules differ enough that a single analyzer cannot serve them all without measurably worse recall. The standard approach is to map each text field as a multi-field, with one analyzed sub-field per supported language, and route queries to the right sub-fields based on either user preference or detected document language. This guide covers the three patterns that work in production: per-language sub-fields on one field, separate per-language top-level fields, and per-document index-time language routing.
When to Use Each Multi-Language Pattern
| Pattern | When to use | Trade-off |
|---|---|---|
| Multi-fields, one per language | Same document content needs to be searchable in multiple languages | Index size grows linearly with supported languages |
| Per-language top-level fields | Each document has distinct content per language (translations) | Documents store all languages; queries must target the right one |
| Per-language index, routed at index time | Documents are mono-lingual and the language is known at ingest | Cross-language search needs an alias and cross-index queries |
| icu_analyzer-based custom analyzer | CJK, Thai, Arabic, Hebrew: scripts that lack whitespace word boundaries or need Unicode-aware normalization | Requires the analysis-icu plugin |
Pattern 1: Per-Language Multi-Fields
Map the field once and declare an analyzed sub-field for each language. This is the right choice when the same text needs to be retrievable under multiple language analyzers (for example, a product description searched by users of different locales).
PUT articles
{
  "mappings": {
    "properties": {
      "title": {
        "type": "text",
        "fields": {
          "en": { "type": "text", "analyzer": "english" },
          "fr": { "type": "text", "analyzer": "french" },
          "de": { "type": "text", "analyzer": "german" }
        }
      }
    }
  }
}
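To check that each sub-field really applies its own analyzer chain, the _analyze API can be pointed at a specific sub-field. The two requests below are a minimal sketch, assuming the articles index from the mapping above already exists; the sample text is illustrative only.

```json
GET articles/_analyze
{
  "field": "title.en",
  "text": "running shoes"
}

GET articles/_analyze
{
  "field": "title.de",
  "text": "running shoes"
}
```

Comparing the returned tokens side by side shows how stemming and stop-word handling differ between the two sub-fields for the same input.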
At query time, target the language sub-field your user is most likely searching in and fall back to the others with lower boosts.
GET articles/_search
{
  "query": {
    "multi_match": {
      "query": "running shoes",
      "fields": ["title.en^3", "title.fr^1", "title.de^1"]
    }
  }
}
Pattern 2: Distinct Per-Language Top-Level Fields
When documents store translations (separate title_en, title_fr, title_de), map each as text with its language analyzer. Queries either target a single language field at a time or use multi_match across all of them. This pattern keeps each translation cleanly separated and avoids analyzing the same content multiple times.
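A mapping and query for this pattern could look like the following sketch. The index name articles_translated and the sample query text are assumptions made for illustration; the field names follow the title_en, title_fr, title_de convention described above.

```json
PUT articles_translated
{
  "mappings": {
    "properties": {
      "title_en": { "type": "text", "analyzer": "english" },
      "title_fr": { "type": "text", "analyzer": "french" },
      "title_de": { "type": "text", "analyzer": "german" }
    }
  }
}

GET articles_translated/_search
{
  "query": {
    "match": { "title_fr": "chaussures de course" }
  }
}
```

Targeting a single field with match keeps scoring confined to one language; swapping in multi_match over all three fields gives the cross-language variant.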
Pattern 3: Index-Time Language Routing With lang_ident
For ingest pipelines where the source language is unknown, use the inference processor with the built-in lang_ident_model_1 model (shipped with the Elastic Stack since 7.6) to detect the language. The detected value can then either route the document to a per-language index or populate a language field that downstream logic uses to pick the right analyzer. The lang_ident_model_1 is a model, not an analyzer - the older "lang_ident analyzer" framing in some third-party tutorials is incorrect.
PUT _ingest/pipeline/detect-language
{
  "processors": [
    {
      "inference": {
        "model_id": "lang_ident_model_1",
        "inference_config": { "classification": { "num_top_classes": 1 } },
        "field_map": { "text": "text" }
      }
    }
  ]
}
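The pipeline above only detects the language. One way to add the routing step is sketched below, under several assumptions: the source text lives in a title field (mapped to the model's expected text input), the prediction is written to a hypothetical lang target field, and per-language indices follow a hypothetical articles-<language code> naming scheme. The pipeline name route-by-language is also illustrative.

```json
PUT _ingest/pipeline/route-by-language
{
  "processors": [
    {
      "inference": {
        "model_id": "lang_ident_model_1",
        "target_field": "lang",
        "inference_config": { "classification": { "num_top_classes": 1 } },
        "field_map": { "title": "text" }
      }
    },
    {
      "set": {
        "field": "_index",
        "value": "articles-{{{lang.predicted_value}}}"
      }
    }
  ]
}

POST _ingest/pipeline/route-by-language/_simulate
{
  "docs": [
    { "_index": "articles", "_source": { "title": "Chaussures de course pour le marathon" } }
  ]
}
```

The _simulate call lets you inspect the predicted language and the rewritten _index before any document is actually indexed. In practice you may also want a fallback index for low-confidence predictions, which this sketch does not handle.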
Handling Non-Latin Scripts
Elasticsearch's built-in language analyzers cover most European languages plus Arabic, Persian, Thai, and several others. For Chinese, Japanese, and Korean, the bundled cjk analyzer is limited to indexing overlapping character bigrams rather than real words; production systems use:
- smartcn (Chinese, ships with the analysis-smartcn plugin) or third-party ik.
- kuromoji (Japanese, analysis-kuromoji plugin) for accurate morphological tokenization.
- nori (Korean, analysis-nori plugin).
- icu_analyzer (analysis-icu plugin) as a general fallback that handles script boundaries, Unicode normalization, and word segmentation for many languages without dedicated analyzers.
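These analyzers slot into the same multi-field layout used for European languages. The mapping below is a sketch that assumes the analysis-smartcn, analysis-kuromoji, analysis-nori, and analysis-icu plugins are all installed on every node; the index name articles_cjk and the sub-field names are hypothetical.

```json
PUT articles_cjk
{
  "mappings": {
    "properties": {
      "title": {
        "type": "text",
        "analyzer": "icu_analyzer",
        "fields": {
          "zh": { "type": "text", "analyzer": "smartcn" },
          "ja": { "type": "text", "analyzer": "kuromoji" },
          "ko": { "type": "text", "analyzer": "nori" }
        }
      }
    }
  }
}
```

Here the parent field uses icu_analyzer as the general fallback, while each sub-field applies the dedicated morphological analyzer for its language. Index creation fails if any referenced analyzer's plugin is missing.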
Operating Multi-Language Indices
Multi-language indices are larger than mono-lingual equivalents - roughly N times the analyzed posting list size when you map N sub-fields, plus extra normalization overhead. Refresh interval, merge throttling, and shard sizing all need to be revisited when adding languages. Pulse profiles per-field index size and query latency by language sub-field, surfacing which language sub-fields are paying for themselves in query volume versus which are inflating index size without serving traffic.
Frequently Asked Questions
Q: How do I detect the language of documents at ingest in Elasticsearch?
A: Use the inference ingest processor with the built-in lang_ident_model_1 model (available since Elastic Stack 7.6). It classifies the input text into one of ~100 languages and writes the result to a field you can use to route the document or to pick the right analyzer.
Q: Should I use multi-fields or separate fields for multi-language search?
A: Use multi-fields when the same content needs to be searched under multiple language analyzers. Use separate top-level fields when documents carry distinct translations per language. Use a per-language index when documents are mono-lingual and routing on language is cheap.
Q: How do I handle Chinese, Japanese, or Korean text in Elasticsearch?
A: Install a CJK analyzer plugin: analysis-smartcn or third-party ik for Chinese, analysis-kuromoji for Japanese, analysis-nori for Korean. The analysis-icu plugin's icu_analyzer is a reasonable fallback when you can't install language-specific plugins.
Q: Can I search across all languages in a single query?
A: Yes. A multi_match query targeting every language sub-field with type: most_fields or type: best_fields runs against all of them in parallel. Use field boosts to weight the user's preferred language higher.
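As a sketch of that answer, assuming the articles index and title sub-fields from Pattern 1 and a user whose preferred language is English:

```json
GET articles/_search
{
  "query": {
    "multi_match": {
      "query": "running shoes",
      "type": "most_fields",
      "fields": ["title.en^3", "title.fr", "title.de"]
    }
  }
}
```

With most_fields the scores from every matching sub-field are combined, whereas best_fields takes the score of the single best-matching field; which one ranks better depends on your data.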
Q: How does sorting work with multi-language results?
A: Sorting text alphabetically across languages needs a locale-aware collation. Use the icu_collation_keyword field type (from analysis-icu) to produce a sort-friendly keyword variant with locale-specific ordering, and sort on that field rather than the analyzed text field.
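A minimal sketch of such a mapping, assuming the analysis-icu plugin is installed and German ordering is wanted; the index name articles_sorted and the sub-field name sort_de are hypothetical.

```json
PUT articles_sorted
{
  "mappings": {
    "properties": {
      "title": {
        "type": "text",
        "fields": {
          "sort_de": {
            "type": "icu_collation_keyword",
            "index": false,
            "language": "de",
            "country": "DE"
          }
        }
      }
    }
  }
}

GET articles_sorted/_search
{
  "query": { "match_all": {} },
  "sort": [ { "title.sort_de": "asc" } ]
}
```

Setting index to false keeps the collation sub-field out of the inverted index, since it is only needed for sorting. Each locale you sort by needs its own collation sub-field.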
Q: How do I share stop-word and synonym lists across language sub-fields?
A: You don't. Stop words and synonyms are language-specific. Maintain one list per language and attach it to that language's analyzer. Cross-language shared lists usually break stemming and lower recall on every language they touch.
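One way to keep the lists separate is to define a custom analyzer per language, each with its own stop and synonym filters. The sketch below assumes two languages; every filter name, analyzer name, the index name articles_custom, and the synonym entries are hypothetical placeholders.

```json
PUT articles_custom
{
  "settings": {
    "analysis": {
      "filter": {
        "en_stop": { "type": "stop", "stopwords": "_english_" },
        "en_synonyms": { "type": "synonym", "synonyms": ["sneakers, trainers"] },
        "en_stemmer": { "type": "stemmer", "language": "english" },
        "fr_stop": { "type": "stop", "stopwords": "_french_" },
        "fr_synonyms": { "type": "synonym", "synonyms": ["baskets, tennis"] },
        "fr_stemmer": { "type": "stemmer", "language": "light_french" }
      },
      "analyzer": {
        "en_custom": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": ["lowercase", "en_stop", "en_synonyms", "en_stemmer"]
        },
        "fr_custom": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": ["lowercase", "fr_stop", "fr_synonyms", "fr_stemmer"]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "title": {
        "type": "text",
        "fields": {
          "en": { "type": "text", "analyzer": "en_custom" },
          "fr": { "type": "text", "analyzer": "fr_custom" }
        }
      }
    }
  }
}
```

Placing the synonym filter before the stemmer means each synonym entry is stemmed by its own language's rules. Note that these simplified chains omit steps the built-in analyzers include, such as French elision handling.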
Related Reading
- Elasticsearch multi-language search: a tighter introduction to the same patterns.
- Elasticsearch create index with mapping: declare multi-field language mappings up front.
- Elasticsearch add field to mapping: add a new language sub-field to an existing index.
- Elasticsearch search_as_you_type field data type: combine autocomplete with per-language analysis.
- Elasticsearch index.mapping.total_fields.limit: each language sub-field counts toward this limit.
- Elasticsearch cross index query: query across per-language indices behind an alias.