Elasticsearch Multi-Language Search Guide: Analyzers, Multi-Fields, and Querying

Multi-language search in Elasticsearch hinges on giving each language its own analyzer chain: tokenization, stop-word removal, and stemming rules differ enough that a single analyzer cannot serve them all without measurably worse recall. The standard approach is to map each text field as a multi-field, with one analyzed sub-field per supported language, and route queries to the right sub-fields based on either user preference or detected document language. This guide covers the three patterns that work in production: per-language sub-fields on one field, separate per-language top-level fields, and per-document index-time language routing.

When to Use Each Multi-Language Pattern

| Pattern | When to use | Trade-off |
| --- | --- | --- |
| Multi-fields, one per language | Same document content needs to be searchable in multiple languages | Index size grows linearly with supported languages |
| Per-language top-level fields | Each document has distinct content per language (translations) | Documents store all languages; queries must target the right one |
| Per-language index, routed at index time | Documents are mono-lingual and the language is known at ingest | Cross-language search needs an alias and cross-index queries |
| icu_analyzer-based custom analyzer | Scripts that do not separate words with whitespace (Chinese, Japanese, Thai), or languages such as Hebrew with no built-in analyzer | Requires the analysis-icu plugin |

Pattern 1: Per-Language Multi-Fields

Map the field once and declare an analyzed sub-field for each language. This is the right choice when the same text needs to be retrievable under multiple language analyzers (for example, a product description searched by users of different locales).

PUT articles
{
  "mappings": {
    "properties": {
      "title": {
        "type": "text",
        "fields": {
          "en": { "type": "text", "analyzer": "english" },
          "fr": { "type": "text", "analyzer": "french" },
          "de": { "type": "text", "analyzer": "german" }
        }
      }
    }
  }
}

At query time, target the language sub-field your user is most likely searching in and fall back to the others with lower boosts.

GET articles/_search
{
  "query": {
    "multi_match": {
      "query": "running shoes",
      "fields": ["title.en^3", "title.fr^1", "title.de^1"]
    }
  }
}
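
One way to check what each sub-field actually indexes is the _analyze API, which runs a sample string through the analyzer configured for a given field. The request below is a minimal sketch against the articles index defined above; the sample text is arbitrary.

```console
GET articles/_analyze
{
  "field": "title.en",
  "text": "running shoes"
}
```

Repeating the request with title.fr or title.de shows how the same input is tokenized and stemmed under the other analyzers, which can help explain why a query matches in one sub-field and not in another.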

Pattern 2: Distinct Per-Language Top-Level Fields

When documents store translations (separate title_en, title_fr, title_de), map each as text with its language analyzer. Queries either target a single language field at a time or use multi_match across all of them. This pattern keeps each translation cleanly separated and avoids analyzing the same content multiple times.
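
A mapping for this pattern could look like the sketch below. The field names come from the paragraph above; the index name articles-translated and the sample query text are assumptions made for illustration.

```console
PUT articles-translated
{
  "mappings": {
    "properties": {
      "title_en": { "type": "text", "analyzer": "english" },
      "title_fr": { "type": "text", "analyzer": "french" },
      "title_de": { "type": "text", "analyzer": "german" }
    }
  }
}

GET articles-translated/_search
{
  "query": {
    "match": { "title_fr": "chaussures de course" }
  }
}
```

The match query targets one translation at a time; swapping it for a multi_match over all three fields searches every translation in one request.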

Pattern 3: Index-Time Language Routing With lang_ident

For ingest pipelines where the source language is unknown, use the inference processor with the built-in lang_ident_model_1 model (shipped with the Elastic Stack since 7.6) to detect the language. The detected language can then either route the document to a per-language index or populate a language field that downstream logic uses to pick the right analyzer. The model reads its input from a field named text, so field_map must map your source field to that name. Note that lang_ident_model_1 is a model, not an analyzer; the "lang_ident analyzer" framing in some third-party tutorials is incorrect.

PUT _ingest/pipeline/detect-language
{
  "processors": [
    {
      "inference": {
        "model_id": "lang_ident_model_1",
        "inference_config": { "classification": { "num_top_classes": 1 } },
        "field_map": { "text": "text" }
      }
    }
  ]
}
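
The pipeline above only detects the language. One possible way to act on the result is sketched below: it writes the prediction to an explicit target_field, copies the predicted code into a language field, and rewrites the _index metadata field so the document lands in a per-language index. The pipeline name, the ml.lang_ident target field, the language field, and the articles-<code> index naming are all assumptions for illustration, and the per-language indices (or an index template that matches them) are assumed to exist with suitable analyzers.

```console
PUT _ingest/pipeline/detect-language-and-route
{
  "processors": [
    {
      "inference": {
        "model_id": "lang_ident_model_1",
        "inference_config": { "classification": { "num_top_classes": 1 } },
        "field_map": { "text": "text" },
        "target_field": "ml.lang_ident"
      }
    },
    {
      "rename": {
        "field": "ml.lang_ident.predicted_value",
        "target_field": "language"
      }
    },
    {
      "remove": { "field": "ml" }
    },
    {
      "set": {
        "field": "_index",
        "value": "articles-{{language}}"
      }
    }
  ]
}

POST articles/_doc?pipeline=detect-language-and-route
{
  "text": "Das ist ein kurzer Beispieltext auf Deutsch."
}

GET articles-*/_search
{
  "query": {
    "match": { "text": "Beispieltext" }
  }
}
```

The final request uses an index pattern to search every per-language index at once; an alias that covers the same indices would serve the same purpose. Short or mixed-language inputs can be misclassified, so it may be worth keeping the prediction probability and falling back to a default index below some threshold.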

Handling Non-Latin Scripts

Elasticsearch's built-in language analyzers cover most European languages plus Arabic, Persian, Thai, and several others. For Chinese, Japanese, and Korean, the bundled cjk analyzer is limited to character bigrams rather than dictionary-based word segmentation; production systems typically use:

  • smartcn (Chinese, ships with the analysis-smartcn plugin) or third-party ik.
  • kuromoji (Japanese, analysis-kuromoji plugin) for accurate morphological tokenization.
  • nori (Korean, analysis-nori plugin).
  • icu_analyzer (analysis-icu plugin) as a general fallback that handles script boundaries, Unicode normalization, and word segmentation for many languages without dedicated analyzers.
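
As a rough sketch, the plugin analyzers listed above can be combined with the multi-field pattern from earlier. This assumes the analysis-icu, analysis-kuromoji, analysis-nori, and analysis-smartcn plugins are installed on every node; the index name articles-cjk and the sub-field names are illustrative.

```console
PUT articles-cjk
{
  "mappings": {
    "properties": {
      "title": {
        "type": "text",
        "analyzer": "icu_analyzer",
        "fields": {
          "zh": { "type": "text", "analyzer": "smartcn" },
          "ja": { "type": "text", "analyzer": "kuromoji" },
          "ko": { "type": "text", "analyzer": "nori" }
        }
      }
    }
  }
}
```

Here the root field uses icu_analyzer as the general fallback, while each sub-field applies the dedicated analyzer for its language.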

Operating Multi-Language Indices

Multi-language indices are larger than mono-lingual equivalents - roughly N times the analyzed posting list size when you map N sub-fields, plus extra normalization overhead. Refresh interval, merge throttling, and shard sizing all need to be revisited when adding languages. Pulse profiles per-field index size and query latency by language sub-field, surfacing which language sub-fields are paying for themselves in query volume versus which are inflating index size without serving traffic.

Frequently Asked Questions

Q: How do I detect the language of documents at ingest in Elasticsearch?
A: Use the inference ingest processor with the built-in lang_ident_model_1 model (available since Elastic Stack 7.6). It classifies the input text into one of ~100 languages and writes the result to a field you can use to route the document or to pick the right analyzer.

Q: Should I use multi-fields or separate fields for multi-language search?
A: Use multi-fields when the same content needs to be searched under multiple language analyzers. Use separate top-level fields when documents carry distinct translations per language. Use a per-language index when documents are mono-lingual and routing on language is cheap.

Q: How do I handle Chinese, Japanese, or Korean text in Elasticsearch?
A: Install a CJK analyzer plugin: analysis-smartcn or third-party ik for Chinese, analysis-kuromoji for Japanese, analysis-nori for Korean. The analysis-icu plugin's icu_analyzer is a reasonable fallback when you can't install language-specific plugins.

Q: Can I search across all languages in a single query?
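A: Yes. A multi_match query that lists every language sub-field evaluates all of them in one request. With type: best_fields (the default), the document score comes from the single best-matching field; with type: most_fields, the scores of all matching fields are combined. Use field boosts to weight the user's preferred language higher.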
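
A sketch of such a query against the articles mapping from Pattern 1 follows; the boost value is an arbitrary starting point, not a recommendation.

```console
GET articles/_search
{
  "query": {
    "multi_match": {
      "query": "running shoes",
      "type": "most_fields",
      "fields": ["title", "title.en^3", "title.fr", "title.de"]
    }
  }
}
```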

Q: How does sorting work with multi-language results?
A: Sorting text alphabetically across languages needs a locale-aware collation. Use the icu_collation_keyword field type (from analysis-icu) to produce a sort-friendly keyword variant with locale-specific ordering, and sort on that field rather than the analyzed text field.
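
For example, a German sort key could be declared as a sub-field and used in the sort clause, assuming the analysis-icu plugin is installed. The index name articles-sorted and the sub-field name sort_de are hypothetical.

```console
PUT articles-sorted
{
  "mappings": {
    "properties": {
      "title": {
        "type": "text",
        "fields": {
          "sort_de": {
            "type": "icu_collation_keyword",
            "index": false,
            "language": "de",
            "country": "DE"
          }
        }
      }
    }
  }
}

GET articles-sorted/_search
{
  "query": { "match_all": {} },
  "sort": [ { "title.sort_de": "asc" } ]
}
```

Each locale that needs its own ordering gets its own collation sub-field, since a collation key encodes the rules of exactly one locale.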

Q: How do I share stop-word and synonym lists across language sub-fields?
A: You don't. Stop words and synonyms are language-specific. Maintain one list per language and attach it to that language's analyzer. Cross-language shared lists usually break stemming and lower recall on every language they touch.
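
A minimal sketch of per-language stop-word lists is shown below. It relies on the stopwords parameter that the built-in language analyzers accept; the index name, analyzer names, and word lists are illustrative assumptions.

```console
PUT articles-custom
{
  "settings": {
    "analysis": {
      "analyzer": {
        "english_custom": {
          "type": "english",
          "stopwords": ["a", "an", "the", "of", "for"]
        },
        "french_custom": {
          "type": "french",
          "stopwords": ["le", "la", "les", "de", "pour"]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "title": {
        "type": "text",
        "fields": {
          "en": { "type": "text", "analyzer": "english_custom" },
          "fr": { "type": "text", "analyzer": "french_custom" }
        }
      }
    }
  }
}
```

Synonyms would additionally need a custom analyzer with its own synonym token filter per language; that is left out here to keep the sketch short.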
