The token_count field data type in Elasticsearch stores the number of tokens in a string field. It's particularly useful when you need to aggregate or sort on the number of tokens in a field without recomputing that information at query time, and it is a more efficient alternative to counting tokens with a script in large-scale operations.
A token_count field works in conjunction with an analyzer, which tokenizes the input text; the analyzer parameter is required in the mapping, and the standard analyzer is a common choice, though you can specify any built-in or custom analyzer.
Example
PUT my-index
{
  "mappings": {
    "properties": {
      "name": {
        "type": "text",
        "fields": {
          "token_count": {
            "type": "token_count",
            "analyzer": "standard"
          }
        }
      }
    }
  }
}
PUT my-index/_doc/1
{
  "name": "John Doe"
}
GET my-index/_search
{
  "aggs": {
    "name_token_count": {
      "avg": {
        "field": "name.token_count"
      }
    }
  }
}
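Because the token count is stored as a numeric value at index time, it can also be used for sorting. For example, ordering documents by the number of tokens in name (using the same mapping as above):

```json
GET my-index/_search
{
  "sort": [
    { "name.token_count": "desc" }
  ]
}
```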
Common issues or misuses
- Forgetting that token_count is based on the analyzer output, not the raw length of the input string.
- Using token_count on fields that don't need token counting, which increases index size unnecessarily.
- Overlooking that changing the analyzer changes the counts: documents indexed before the change keep their old token counts until they are reindexed.
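To illustrate the first point: with the mapping above, the document {"name": "John Doe"} indexes a token count of 2 (two tokens produced by the standard analyzer), not 8 (characters). A range query on the sub-field therefore filters by token count, not string length:

```json
GET my-index/_search
{
  "query": {
    "range": {
      "name.token_count": {
        "gte": 2
      }
    }
  }
}
```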
Frequently Asked Questions
Q: Can I use token_count with nested fields?
A: Yes, you can use token_count with nested fields. It counts the tokens for each nested object separately.
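A sketch of such a mapping, with hypothetical index and field names (comments as a nested object whose body text carries a token_count sub-field):

```json
PUT my-nested-index
{
  "mappings": {
    "properties": {
      "comments": {
        "type": "nested",
        "properties": {
          "body": {
            "type": "text",
            "fields": {
              "token_count": {
                "type": "token_count",
                "analyzer": "standard"
              }
            }
          }
        }
      }
    }
  }
}
```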
Q: Does token_count support multi-fields?
A: Yes, token_count is typically used as a multi-field, allowing you to store both the original text field and its token count separately, as in the example above.
Q: How does token_count handle null values?
A: By default, a null value means no token count is indexed for that document. You can set the null_value parameter in the mapping to index a substitute count instead.
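For example, a standalone token_count field that indexes a count of 0 whenever the value is null (index and field names here are illustrative):

```json
PUT my-other-index
{
  "mappings": {
    "properties": {
      "name_length": {
        "type": "token_count",
        "analyzer": "standard",
        "null_value": 0
      }
    }
  }
}
```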
Q: Can I update the analyzer for a token_count field after indexing?
A: Changing the analyzer for an existing token_count field requires reindexing the data, as the token counts are computed at index time.
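A typical approach is to create a new index with the updated mapping and copy the documents across with the Reindex API (the destination index name here is hypothetical):

```json
POST _reindex
{
  "source": { "index": "my-index" },
  "dest": { "index": "my-index-v2" }
}
```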
Q: Is token_count more efficient than using a script to count tokens?
A: Yes, token_count is generally more efficient than a script, especially for large-scale aggregations or sorting, because the count is pre-computed and stored at index time.