OpenSearch Embeddings: Indexing, Searching, and Managing Vector Data

Embeddings are dense vector representations of data — text, images, or structured records — that capture semantic meaning in a numerical format searchable by similarity. OpenSearch supports embedding-based search through its k-NN plugin, which provides approximate nearest-neighbor search using HNSW, IVF, and other algorithms.

This guide covers the end-to-end workflow: choosing models, generating embeddings, indexing vectors in OpenSearch, and building search pipelines.

Choosing an Embedding Model

The embedding model determines search quality. Choose based on your data type and accuracy requirements:

Text Embedding Models

Model | Dimensions | Use Case | Notes
all-MiniLM-L6-v2 | 384 | General English text | Fast, lightweight, good starting point
all-mpnet-base-v2 | 768 | Higher-quality English text | Better accuracy, 2x compute cost
multilingual-e5-base | 768 | Multilingual text | Supports 100+ languages
BGE-large-en-v1.5 | 1024 | High-quality English retrieval | State-of-the-art, resource-intensive
Cohere embed-v3 | 1024 | Commercial API | Hosted, no infrastructure needed
OpenAI text-embedding-3-small | 1536 | Commercial API | Hosted, widely used

Key trade-off: Higher dimensions generally improve accuracy but increase storage, memory, and search latency linearly.

Generating Embeddings Externally

Generate embeddings in your application before indexing:

from sentence_transformers import SentenceTransformer
from opensearchpy import OpenSearch

model = SentenceTransformer('all-MiniLM-L6-v2')
client = OpenSearch(hosts=[{'host': 'localhost', 'port': 9200}])

documents = [
    {"title": "Python web frameworks", "body": "Flask and Django are popular Python web frameworks..."},
    {"title": "JavaScript runtime", "body": "Node.js is a server-side JavaScript runtime..."}
]

for i, doc in enumerate(documents):
    embedding = model.encode(doc['title'] + ' ' + doc['body'])
    doc['embedding'] = embedding.tolist()
    client.index(index='articles', id=i, body=doc)
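
For larger corpora, the per-document loop above becomes slow. A minimal sketch using the opensearch-py bulk helper, reusing the model, client, and documents from above: it encodes all texts in one batch and sends a single bulk request.

from opensearchpy import helpers

# Encode every document in one batch call, then bulk-index the results
texts = [doc['title'] + ' ' + doc['body'] for doc in documents]
embeddings = model.encode(texts)

actions = [
    {"_index": "articles", "_id": i, "_source": {**doc, "embedding": emb.tolist()}}
    for i, (doc, emb) in enumerate(zip(documents, embeddings))
]
helpers.bulk(client, actions)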

Using OpenSearch ML Commons (Server-Side)

OpenSearch can host embedding models directly, eliminating external infrastructure:

# Register a model
POST /_plugins/_ml/models/_register
{
  "name": "all-MiniLM-L6-v2",
  "version": "1.0.1",
  "model_format": "TORCH_SCRIPT",
  "model_config": {
    "model_type": "bert",
    "embedding_dimension": 384,
    "framework_type": "sentence_transformers"
  },
  "url": "https://artifacts.opensearch.org/models/ml-models/huggingface/sentence-transformers/all-MiniLM-L6-v2/1.0.1/torch_script/sentence-transformers_all-MiniLM-L6-v2-1.0.1-torch_script.zip"
}

# Deploy the model to ML nodes
POST /_plugins/_ml/models/<model_id>/_deploy

Once deployed, use ingest pipelines to generate embeddings automatically during indexing:

PUT /_ingest/pipeline/embedding-pipeline
{
  "processors": [
    {
      "text_embedding": {
        "model_id": "<model_id>",
        "field_map": {
          "body": "body_embedding"
        }
      }
    }
  ]
}

Documents indexed through this pipeline automatically get embeddings generated from the body field.
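
For example, you can reference the pipeline per request or make it the index default; the target index needs a knn_vector mapping for the pipeline's output field (body_embedding here):

# Index a single document through the pipeline
PUT /articles/_doc/1?pipeline=embedding-pipeline
{
  "title": "Python web frameworks",
  "body": "Flask and Django are popular Python web frameworks..."
}

# Or make it the default for every write to the index
PUT /articles/_settings
{
  "index.default_pipeline": "embedding-pipeline"
}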

Index Configuration for Vectors

Creating a k-NN Index

PUT /articles
{
  "settings": {
    "index.knn": true,
    "number_of_shards": 3,
    "number_of_replicas": 1
  },
  "mappings": {
    "properties": {
      "title": { "type": "text" },
      "body": { "type": "text" },
      "embedding": {
        "type": "knn_vector",
        "dimension": 384,
        "method": {
          "name": "hnsw",
          "space_type": "cosinesimil",
          "engine": "nmslib",
          "parameters": {
            "ef_construction": 256,
            "m": 16
          }
        }
      }
    }
  }
}

HNSW Parameters

Parameter | Default | Effect
m | 16 | Graph connectivity. Higher = better recall, more memory. 12–32 is typical.
ef_construction | 512 | Build-time quality. Higher = better graph quality, slower indexing. 128–512 is typical.
ef_search | 100 (query-time) | Search-time quality. Higher = better recall, slower queries. Controlled by a dynamic index setting (see the example below); newer versions also support per-query overrides for some engines.
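
For example, with the nmslib engine, ef_search is set through a dynamic index setting (200 here is an illustrative value, not a recommendation):

PUT /articles/_settings
{
  "index.knn.algo_param.ef_search": 200
}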

Engine Options

Engine | Algorithm | Best For
nmslib | HNSW | General purpose, mature, fast search
faiss | HNSW or IVF | Large-scale, supports PQ compression, GPU training
lucene | HNSW | Simpler setup, integrated with Lucene segments

For most deployments, nmslib with HNSW is the recommended default. Use faiss when you need Product Quantization to reduce memory or have very large vector sets (100M+).

Space Types (Distance Metrics)

Space Type | Use Case
cosinesimil | Text embeddings (most common)
l2 | Euclidean distance (image features, spatial data)
innerproduct | Pre-normalized embeddings, maximum inner product search

Match the space type to how your embedding model was trained. Most text embedding models use cosine similarity.

Querying Embeddings

Basic k-NN Query

POST /articles/_search
{
  "query": {
    "knn": {
      "embedding": {
        "vector": [0.12, -0.34, 0.56, ...],
        "k": 10
      }
    }
  }
}
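
When embeddings are generated client-side, the query vector must come from the same model used at index time. A minimal sketch reusing the model and client from the indexing example:

query_vector = model.encode("how to build REST APIs in Python").tolist()

response = client.search(index='articles', body={
    "size": 10,
    "query": {"knn": {"embedding": {"vector": query_vector, "k": 10}}}
})

# Print the top matches with their similarity scores
for hit in response['hits']['hits']:
    print(hit['_score'], hit['_source']['title'])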

k-NN with Filters

Apply filters to narrow the vector search space:

POST /articles/_search
{
  "query": {
    "knn": {
      "embedding": {
        "vector": [0.12, -0.34, ...],
        "k": 10,
        "filter": {
          "bool": {
            "must": [
              { "term": { "category": "programming" } },
              { "range": { "date": { "gte": "2025-01-01" } } }
            ]
          }
        }
      }
    }
  }
}

Filtered k-NN search applies the filter first, then performs the vector search on the reduced set, which is efficient when filters are selective. Note that the filter option inside the knn query requires the lucene or faiss engine; with nmslib, use a bool query with post-filtering or an exact script-score search instead.

Neural Query (Server-Side Embedding)

If you've deployed a model via ML Commons:

POST /articles/_search
{
  "query": {
    "neural": {
      "embedding": {
        "query_text": "how to build REST APIs in Python",
        "model_id": "<model_id>",
        "k": 10
      }
    }
  }
}

The neural query type generates the query embedding server-side, so your application doesn't need to call the embedding model.

Memory and Storage Planning

Vector data is memory-intensive. Plan capacity carefully:

Memory per vector: dimensions × 4 bytes (float32)

HNSW graph overhead: ~1.5–2x the raw vector size

Total memory estimate:

memory_GB = num_documents × dimensions × 4 × 2 / 1_000_000_000

Documents | Dimensions | Raw Vectors | With HNSW | Recommendation
1M | 384 | 1.5 GB | ~3 GB | Single node, 8 GB+ heap
10M | 384 | 15 GB | ~30 GB | Dedicated k-NN nodes, 64 GB+ RAM
10M | 768 | 30 GB | ~60 GB | Multiple dedicated k-NN nodes
100M | 384 | 150 GB | ~300 GB | Sharded across many nodes, consider PQ
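
The same arithmetic as a small helper, using the ~2x HNSW overhead assumption from the formula above (treat the result as a rough planning number, not an exact figure):

def knn_memory_gb(num_documents, dimensions, overhead=2.0):
    # Raw float32 vectors (4 bytes per dimension) times the HNSW overhead factor
    return num_documents * dimensions * 4 * overhead / 1_000_000_000

print(knn_memory_gb(10_000_000, 384))  # ~30.7 GB, matching the 10M x 384 row above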

Reducing Memory with Product Quantization

For very large vector sets, use Faiss with PQ compression:

PUT /large-index
{
  "settings": { "index.knn": true },
  "mappings": {
    "properties": {
      "embedding": {
        "type": "knn_vector",
        "dimension": 768,
        "method": {
          "name": "ivf",
          "space_type": "l2",
          "engine": "faiss",
          "parameters": {
            "nlist": 1024,
            "nprobes": 10,
            "encoder": {
              "name": "pq",
              "parameters": { "code_size": 32 }
            }
          }
        }
      }
    }
  }
}

PQ reduces memory by 10–30x at the cost of some recall accuracy.

Best Practices

  1. Batch index documents: Index in bulk (1,000–5,000 documents per batch) for efficient graph construction.
  2. Warm up k-NN indices: After indexing, run a few queries to load HNSW graphs into memory before serving production traffic (or use the warmup API shown after this list).
  3. Use dedicated ML nodes: If using ML Commons for server-side embeddings, deploy models to dedicated ML nodes to avoid competing with search and indexing workloads.
  4. Monitor recall: Periodically test search quality against a ground-truth set. HNSW recall typically exceeds 95% with default parameters.
  5. Match embedding model at index and query time: Always use the same model version for indexing and querying. Mixing models produces meaningless similarity scores.
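
For warm-up (item 2 above), the k-NN plugin also exposes a warmup API that loads an index's HNSW graphs into native memory without running queries; articles is the index used throughout this guide:

GET /_plugins/_knn/warmup/articles

# Check graph memory usage afterwards
GET /_plugins/_knn/stats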

Frequently Asked Questions

Q: Can I update embeddings without re-indexing the entire document?

Yes, use the Update API to modify the vector field. However, for bulk re-embedding (e.g., after switching models), re-indexing into a new index is more efficient.
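
For example, a partial update that replaces the vector on one document (the truncated vector stands in for a full 384-dimensional embedding):

POST /articles/_update/1
{
  "doc": {
    "embedding": [0.08, -0.21, ...]
  }
}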

Q: How do I handle documents that are too long for the embedding model?

Most embedding models have a maximum token length (typically 256 or 512 tokens). For longer documents, chunk the text and create one vector per chunk. Store the chunk-to-document mapping and deduplicate at query time.
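
A minimal chunking sketch, using a word-count window as a rough proxy for the model's token limit (the article-chunks index and doc_id field are illustrative names, not part of the setup above):

def chunk_text(text, max_words=200, overlap=50):
    # Overlapping word windows approximate a token-based limit
    words = text.split()
    step = max_words - overlap
    return [' '.join(words[start:start + max_words]) for start in range(0, len(words), step)]

for i, doc in enumerate(documents):
    for j, chunk in enumerate(chunk_text(doc['body'])):
        client.index(index='article-chunks', id=f"{i}-{j}", body={
            "doc_id": i,  # parent document id, used for query-time deduplication
            "title": doc['title'],
            "chunk": chunk,
            "embedding": model.encode(chunk).tolist(),
        })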

Q: Should I use server-side (ML Commons) or client-side embedding generation?

Server-side is simpler to operate but adds latency to indexing and query paths. Client-side gives you more control, allows GPU acceleration, and keeps ML inference load off your OpenSearch cluster. For production at scale, client-side with dedicated inference infrastructure is typically better.

Q: Can I store multiple vector fields in one index?

Yes. You can have multiple knn_vector fields (e.g., title_embedding and body_embedding) and query them independently or combine results.
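
A sketch of such a mapping with the two fields mentioned above (articles-multi is an illustrative index name; the method settings mirror the earlier example):

PUT /articles-multi
{
  "settings": { "index.knn": true },
  "mappings": {
    "properties": {
      "title": { "type": "text" },
      "body": { "type": "text" },
      "title_embedding": {
        "type": "knn_vector",
        "dimension": 384,
        "method": { "name": "hnsw", "space_type": "cosinesimil", "engine": "nmslib" }
      },
      "body_embedding": {
        "type": "knn_vector",
        "dimension": 384,
        "method": { "name": "hnsw", "space_type": "cosinesimil", "engine": "nmslib" }
      }
    }
  }
}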

Q: What happens when I update my embedding model?

You need to re-generate all embeddings and re-index. The old and new model produce incompatible vector spaces. Use index aliases to swap between old and new indices atomically after re-indexing.
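
For example, an atomic alias swap after re-indexing (articles-v1 and articles-v2 are illustrative index names):

POST /_aliases
{
  "actions": [
    { "remove": { "index": "articles-v1", "alias": "articles" } },
    { "add": { "index": "articles-v2", "alias": "articles" } }
  ]
}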
