Elasticsearch Multi-Language Search: Analyzers, Mappings, and Query Patterns

Multi-language search in Elasticsearch is solved by attaching the right language analyzer to each piece of text and then querying the analyzed sub-fields together. Nothing happens automatically: language detection, analyzer selection, and per-language ranking are explicit decisions made by the index designer. The standard pattern is a multi-field mapping with one analyzed text sub-field per supported language, queried via multi_match with field-level boosts that reflect the user's locale or detected query language.

When to Use Multi-Language Patterns

Pattern | Best for | Cost
Multi-field, one sub-field per language | Same content needs to be searchable under multiple analyzers | Index size grows linearly with languages
Distinct per-language fields | Documents store separate translations | Documents carry every translation
Per-language index | Mono-lingual documents, language known at ingest | Cross-language search needs an alias
ICU-based custom analyzer | Non-Latin scripts (CJK, Arabic, Thai) | Requires the analysis-icu plugin
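
For the per-language index pattern, cross-language search usually goes through an alias that covers every language index. The sketch below builds a request body for the _aliases API; the index and alias names (docs-en, docs-es, docs-de, docs-all) and the helper name are hypothetical, chosen only for illustration.

```python
import json


def build_alias_actions(alias, indices):
    """Return an _aliases request body that adds every index to one alias."""
    return {
        "actions": [{"add": {"index": name, "alias": alias}} for name in indices]
    }


# Hypothetical per-language indices grouped under one searchable alias.
alias_body = build_alias_actions("docs-all", ["docs-en", "docs-es", "docs-de"])
print(json.dumps(alias_body, indent=2))
```

Sending this body with POST _aliases lets a query against docs-all reach all three indices, while a mono-lingual query can still target docs-es directly.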

Configuring Language-Specific Analyzers

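Elasticsearch ships built-in analyzers for ~35 languages, each with appropriate tokenization, stop words, and stemming. Use them directly, override their stop words with the stopwords or stopwords_path parameter, or rebuild one as a custom analyzer when you need to change the stemmer itself. In the example below, english_custom loads a custom stop-word file, while the title sub-fields use the stock english and spanish analyzers; set "analyzer": "english_custom" on a sub-field to apply the custom list.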

PUT my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "english_custom": {
          "type": "english",
          "stopwords_path": "stopwords/english_custom.txt"
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "title": {
        "type": "text",
        "fields": {
          "english": { "type": "text", "analyzer": "english" },
          "spanish": { "type": "text", "analyzer": "spanish" }
        }
      }
    }
  }
}

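For Chinese, Japanese, and Korean, the built-in cjk analyzer only produces character bigrams; install a dedicated plugin (analysis-smartcn, analysis-kuromoji, analysis-nori) for dictionary-based tokenization, or use the icu_analyzer from analysis-icu as a script-aware fallback. Thai has a built-in thai analyzer, with the ICU tokenizer as an alternative.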
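
When the language list grows, the mapping can be generated rather than written by hand. The following is a minimal sketch, not a definitive implementation: the ANALYZERS table and the build_text_field helper are assumptions made for illustration, and the plugin analyzers (smartcn, kuromoji, nori, icu_analyzer) only resolve if the corresponding plugins are installed on every node.

```python
import json

# Sub-field name -> analyzer name. The first three are built in; the rest
# come from the plugins noted in the comments.
ANALYZERS = {
    "english": "english",
    "spanish": "spanish",
    "thai": "thai",
    "chinese": "smartcn",    # analysis-smartcn
    "japanese": "kuromoji",  # analysis-kuromoji
    "korean": "nori",        # analysis-nori
}
FALLBACK_ANALYZER = "icu_analyzer"  # analysis-icu


def build_text_field(languages):
    """Build a text field with one analyzed sub-field per language."""
    sub_fields = {}
    for language in languages:
        analyzer = ANALYZERS.get(language, FALLBACK_ANALYZER)
        sub_fields[language] = {"type": "text", "analyzer": analyzer}
    return {"type": "text", "fields": sub_fields}


# "hebrew" is not in the table, so it falls back to icu_analyzer.
mapping = {
    "mappings": {
        "properties": {
            "title": build_text_field(["english", "japanese", "hebrew"])
        }
    }
}
print(json.dumps(mapping, indent=2))
```

The printed body can be sent with PUT my_index in the same way as the hand-written mapping above.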

Querying Multi-Language Content

multi_match is the right query for searching across language sub-fields. Boost the user's preferred language higher than the others so that matches analyzed in that language outrank incidental matches in the rest.

GET my_index/_search
{
  "query": {
    "multi_match": {
      "query": "running shoes",
      "fields": ["title.english^3", "title.spanish^1"]
    }
  }
}

For known mono-lingual content, route the query to a single sub-field (title.spanish) instead of all of them. The type: cross_fields mode is mostly inappropriate for multi-language search: its term-centric scoring only blends fields that share an analyzer. Sub-fields with different analyzers are split into separate groups and the best-scoring group wins, so the mode buys nothing here.
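
Both query shapes can be produced by one small helper. This is a sketch under stated assumptions: build_title_query and its parameters are hypothetical names, the sub-field names follow the title.english / title.spanish mapping used above, and the default boost of 3 simply mirrors the earlier example.

```python
import json


def build_title_query(text, preferred, languages, boost=3, monolingual=False):
    """Build a search body that favours the preferred language sub-field."""
    if preferred not in languages:
        raise ValueError(f"unsupported language: {preferred}")
    if monolingual:
        # Content is known to be in one language: query that sub-field only.
        return {"query": {"match": {f"title.{preferred}": text}}}
    fields = [
        f"title.{lang}^{boost:g}" if lang == preferred else f"title.{lang}"
        for lang in languages
    ]
    return {"query": {"multi_match": {"query": text, "fields": fields}}}


supported = ["english", "spanish"]
mixed = build_title_query("running shoes", "english", supported)
single = build_title_query("zapatillas", "spanish", supported, monolingual=True)
print(json.dumps(mixed, indent=2))
print(json.dumps(single, indent=2))
```

The preferred language would typically come from the user's locale or from a language detector run on the query text.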

Handling Stop Words, Synonyms, and Sorting

Each language analyzer carries its own stop-word list and stemmer. Sharing lists across languages almost always degrades recall: a stop word in one language can be a content word in another (German "die" is an article, English "die" is a verb), so a shared list removes terms that users actually search for. Maintain one list per language. For sortable string fields, use the icu_collation_keyword field type (from analysis-icu) to produce locale-aware sort keys; the default sort on a keyword field compares raw byte values, which gives the wrong order in any language with accented characters.
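
A sketch of both ideas in one index definition follows. The helper names, the analyzer names (english_shop, german_shop), the sort sub-field names, and the short stop-word lists are all hypothetical. Note that the stopwords parameter replaces the analyzer's default list rather than extending it, and that icu_collation_keyword requires the analysis-icu plugin.

```python
import json


def language_analyzer(language, stopwords=None):
    """Configure a built-in language analyzer, optionally with its own stop list."""
    settings = {"type": language}
    if stopwords is not None:
        settings["stopwords"] = list(stopwords)  # replaces the default list
    return settings


def collation_sort_field(language, country=None):
    """Build a locale-aware sort key field; doc values are kept for sorting."""
    field = {"type": "icu_collation_keyword", "index": False, "language": language}
    if country is not None:
        field["country"] = country
    return field


index_body = {
    "settings": {"analysis": {"analyzer": {
        "english_shop": language_analyzer("english", ["a", "an", "the"]),
        "german_shop": language_analyzer("german", ["der", "die", "das"]),
    }}},
    "mappings": {"properties": {"title": {
        "type": "text",
        "fields": {
            "english": {"type": "text", "analyzer": "english_shop"},
            "german": {"type": "text", "analyzer": "german_shop"},
            "sort_de": collation_sort_field("de", "DE"),
            "sort_sv": collation_sort_field("sv", "SE"),
        },
    }}},
}

# Sort German-locale results on the matching collation sub-field.
sort_clause = {"sort": [{"title.sort_de": {"order": "asc"}}]}
print(json.dumps(index_body, indent=2))
print(json.dumps(sort_clause))
```

Keeping one collation sub-field per locale lets the application pick the sort field that matches the user's language at query time.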

Operating Multi-Language Indices

Adding a language sub-field can add 30-70% to the field's posting-list size depending on stemmer aggressiveness and stop-word overlap. Refresh and merge budgets need to be revisited when the language count grows. Pulse tracks per-language sub-field index size and query latency, identifying language sub-fields that consume capacity without serving meaningful traffic.
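
For rough capacity planning, the 30-70% range can be turned into a simple bound. The sketch below assumes each sub-field contributes independently and linearly, which is a simplification; the function name and the 10 GB base size are hypothetical.

```python
def estimate_field_size(base_gb, language_count, low_pct=30, high_pct=70):
    """Return a (low, high) size range in GB for a field plus its sub-fields."""
    low = base_gb * (100 + low_pct * language_count) / 100
    high = base_gb * (100 + high_pct * language_count) / 100
    return low, high


# A 10 GB field with three language sub-fields.
low_gb, high_gb = estimate_field_size(10, 3)
print(f"Estimated size: {low_gb:.1f} to {high_gb:.1f} GB")
```

Measured per-field sizes should replace this estimate once real data is indexed.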

Frequently Asked Questions

Q: Does Elasticsearch detect document language automatically?
A: Not by default. The Elastic Stack ships a language identification model (lang_ident_model_1) usable via the inference ingest processor since 7.6, which classifies text into ~100 languages. Apply it in an ingest pipeline before the document hits the index, and route or analyze based on the detected language.
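
A minimal sketch of such a pipeline body is shown here, assuming the text lives in a field called title. The helper name, the title_language output field, and the _ml.lang_ident scratch field are illustrative choices, and the exact inference processor options can differ between versions, so check them against the release you run.

```python
import json


def build_lang_ident_pipeline(source_field):
    """Build an ingest pipeline body that stores the detected language code."""
    return {
        "description": "Detect document language with lang_ident_model_1",
        "processors": [
            {
                "inference": {
                    "model_id": "lang_ident_model_1",
                    "inference_config": {"classification": {"num_top_classes": 1}},
                    # The model reads its input from a field named "text".
                    "field_map": {source_field: "text"},
                    "target_field": "_ml.lang_ident",
                }
            },
            {
                # Keep only the predicted language code on the document.
                "rename": {
                    "field": "_ml.lang_ident.predicted_value",
                    "target_field": f"{source_field}_language",
                }
            },
            {"remove": {"field": "_ml"}},
        ],
    }


pipeline = build_lang_ident_pipeline("title")
print(json.dumps(pipeline, indent=2))
```

The body would be registered with PUT _ingest/pipeline/<name>, and the stored language code can then drive index routing or query-time boosts.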

Q: Can I put multiple languages in a single Elasticsearch field?
A: Mechanically yes, in practice no. A single analyzer applied to mixed-language text under-stems some languages and over-stems others, lowering recall everywhere. Use a multi-field mapping with one analyzed sub-field per language instead.

Q: How do I boost results in a specific language?
A: Use field boosts in a multi_match query. "fields": ["title.english^2", "title.spanish"] boosts the English sub-field's score 2x relative to Spanish. Set the boosts based on the user's preferred locale or the detected query language.

Q: How do I handle Chinese, Japanese, or Korean (CJK) text?
A: Install a CJK analyzer plugin: analysis-smartcn (Chinese), analysis-kuromoji (Japanese), or analysis-nori (Korean). The icu_analyzer from analysis-icu is a reasonable fallback when plugins aren't available.

Q: What's the right way to handle stop words across languages?
A: Maintain separate stop-word lists per language analyzer. The built-in language analyzers ship with sensible defaults; override them via stopwords or stopwords_path when your domain needs different exclusions.

Q: How does sorting work for multi-language search results?
A: Default lexicographic sort is wrong for any language with non-ASCII letters. Use icu_collation_keyword (analysis-icu plugin) to generate locale-aware sort keys and sort on that field. Provide separate collation fields per locale when sort order must match the user's language.
