Embeddings are dense vector representations of data — text, images, or structured records — that capture semantic meaning in a numerical format searchable by similarity. OpenSearch supports embedding-based search through its k-NN plugin, which provides approximate nearest-neighbor search using HNSW, IVF, and other algorithms.
This guide covers the end-to-end workflow: choosing models, generating embeddings, indexing vectors in OpenSearch, and building search pipelines.
Choosing an Embedding Model
The embedding model determines search quality. Choose based on your data type and accuracy requirements:
Text Embedding Models
| Model | Dimensions | Use Case | Notes |
|---|---|---|---|
| all-MiniLM-L6-v2 | 384 | General English text | Fast, lightweight, good starting point |
| all-mpnet-base-v2 | 768 | Higher-quality English text | Better accuracy, 2x compute cost |
| multilingual-e5-base | 768 | Multilingual text | Supports 100+ languages |
| BGE-large-en-v1.5 | 1024 | High-quality English retrieval | State-of-the-art, resource-intensive |
| Cohere embed-v3 | 1024 | Commercial API | Hosted, no infrastructure needed |
| OpenAI text-embedding-3-small | 1536 | Commercial API | Hosted, widely used |
Key trade-off: Higher dimensions generally improve accuracy, but storage and memory grow linearly with dimension, and search latency increases as well.
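Before settling on a model, it can help to confirm the embedding dimension and sanity-check similarity scores on a few representative sentences. A small sentence-transformers sketch along those lines (model names come from the table above; the sample sentences are arbitrary):

```python
from sentence_transformers import SentenceTransformer, util

# Compare two candidate models on the same sentence pair
for name in ('all-MiniLM-L6-v2', 'all-mpnet-base-v2'):
    model = SentenceTransformer(name)
    print(name, 'dimension:', model.get_sentence_embedding_dimension())
    emb = model.encode(['build a REST API in Python', 'Flask tutorial for beginners'])
    # Cosine similarity between the two sentences under this model
    print(name, 'similarity:', util.cos_sim(emb[0], emb[1]).item())
```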
Generating Embeddings Externally
Generate embeddings in your application before indexing:
```python
from sentence_transformers import SentenceTransformer
from opensearchpy import OpenSearch

# Load the embedding model and connect to the cluster
model = SentenceTransformer('all-MiniLM-L6-v2')
client = OpenSearch(hosts=[{'host': 'localhost', 'port': 9200}])

documents = [
    {"title": "Python web frameworks", "body": "Flask and Django are popular Python web frameworks..."},
    {"title": "JavaScript runtime", "body": "Node.js is a server-side JavaScript runtime..."}
]

for i, doc in enumerate(documents):
    # Embed the concatenated title and body, then index the document with its vector
    embedding = model.encode(doc['title'] + ' ' + doc['body'])
    doc['embedding'] = embedding.tolist()
    client.index(index='articles', id=i, body=doc)
```
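For anything beyond a handful of documents, encoding in batches and writing with the bulk helper is substantially faster than one `index()` call per document (this also matches the batching advice under Best Practices). A sketch assuming the same model, client, and document shape as above; the batch size of 256 is illustrative:

```python
from opensearchpy import OpenSearch, helpers
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('all-MiniLM-L6-v2')
client = OpenSearch(hosts=[{'host': 'localhost', 'port': 9200}])

def embed_and_bulk_index(docs, batch_size=256):
    for start in range(0, len(docs), batch_size):
        batch = docs[start:start + batch_size]
        # One batched encode() call per chunk is far cheaper than per-document calls
        vectors = model.encode([d['title'] + ' ' + d['body'] for d in batch])
        actions = [
            {'_index': 'articles', '_id': start + i, '_source': {**doc, 'embedding': vec.tolist()}}
            for i, (doc, vec) in enumerate(zip(batch, vectors))
        ]
        helpers.bulk(client, actions)
```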
Using OpenSearch ML Commons (Server-Side)
OpenSearch can host embedding models directly, eliminating external infrastructure:
```json
# Register a model
POST /_plugins/_ml/models/_register
{
  "name": "all-MiniLM-L6-v2",
  "version": "1.0.1",
  "model_format": "TORCH_SCRIPT",
  "model_config": {
    "model_type": "bert",
    "embedding_dimension": 384,
    "framework_type": "sentence_transformers"
  },
  "url": "https://artifacts.opensearch.org/models/ml-models/huggingface/sentence-transformers/all-MiniLM-L6-v2/1.0.1/torch_script/sentence-transformers_all-MiniLM-L6-v2-1.0.1-torch_script.zip"
}
```
```json
# Deploy the model to ML nodes
POST /_plugins/_ml/models/<model_id>/_deploy
```
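Both calls are asynchronous and return a task ID. A rough Python sketch of the register, poll, deploy flow using opensearch-py's generic `perform_request` transport method; the task-polling field names (`task_id`, `state`, `model_id`) follow the ML Commons task API, and the polling interval is arbitrary:

```python
import time
from opensearchpy import OpenSearch

client = OpenSearch(hosts=[{'host': 'localhost', 'port': 9200}])

def wait_for_task(task_id, poll_seconds=5):
    """Poll an ML Commons task until it finishes and return the task document."""
    while True:
        task = client.transport.perform_request('GET', f'/_plugins/_ml/tasks/{task_id}')
        if task.get('state') in ('COMPLETED', 'FAILED'):
            return task
        time.sleep(poll_seconds)

# Register: the request body mirrors the REST example above
register = client.transport.perform_request(
    'POST', '/_plugins/_ml/models/_register',
    body={
        "name": "all-MiniLM-L6-v2",
        "version": "1.0.1",
        "model_format": "TORCH_SCRIPT",
        "model_config": {
            "model_type": "bert",
            "embedding_dimension": 384,
            "framework_type": "sentence_transformers"
        },
        "url": "https://artifacts.opensearch.org/models/ml-models/huggingface/sentence-transformers/all-MiniLM-L6-v2/1.0.1/torch_script/sentence-transformers_all-MiniLM-L6-v2-1.0.1-torch_script.zip"
    },
)
model_id = wait_for_task(register['task_id'])['model_id']

# Deploy the registered model and wait for the deploy task to complete
deploy = client.transport.perform_request('POST', f'/_plugins/_ml/models/{model_id}/_deploy')
wait_for_task(deploy['task_id'])
```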
Once deployed, use ingest pipelines to generate embeddings automatically during indexing:
```json
PUT /_ingest/pipeline/embedding-pipeline
{
  "processors": [
    {
      "text_embedding": {
        "model_id": "<model_id>",
        "field_map": {
          "body": "body_embedding"
        }
      }
    }
  ]
}
```
Documents indexed through this pipeline automatically get embeddings generated from the body field.
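The pipeline can be attached per request or, more conveniently, set as the index's default pipeline so every write runs through it. A sketch with `opensearch-py` (index, pipeline, and field names follow the examples above):

```python
from opensearchpy import OpenSearch

client = OpenSearch(hosts=[{'host': 'localhost', 'port': 9200}])

# Route all writes to 'articles' through the embedding pipeline by default
client.indices.put_settings(index='articles',
                            body={'index.default_pipeline': 'embedding-pipeline'})

# The text_embedding processor adds body_embedding server-side; no client-side model needed
client.index(index='articles', id=1, body={
    'title': 'Python web frameworks',
    'body': 'Flask and Django are popular Python web frameworks...'
})
```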
Index Configuration for Vectors
Creating a k-NN Index
```json
PUT /articles
{
  "settings": {
    "index.knn": true,
    "number_of_shards": 3,
    "number_of_replicas": 1
  },
  "mappings": {
    "properties": {
      "title": { "type": "text" },
      "body": { "type": "text" },
      "embedding": {
        "type": "knn_vector",
        "dimension": 384,
        "method": {
          "name": "hnsw",
          "space_type": "cosinesimil",
          "engine": "nmslib",
          "parameters": {
            "ef_construction": 256,
            "m": 16
          }
        }
      }
    }
  }
}
```
HNSW Parameters
| Parameter | Default | Effect |
|---|---|---|
| `m` | 16 | Graph connectivity. Higher = better recall, more memory. 12–32 is typical. |
| `ef_construction` | 512 | Build-time quality. Higher = better graph quality, slower indexing. 128–512 is typical. |
| `ef_search` | 100 (query-time) | Search-time quality. Higher = better recall, slower queries. Tune per-query via API. |
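With the nmslib engine used in the mapping above, `ef_search` is typically adjusted through the `index.knn.algo_param.ef_search` index setting rather than in the mapping. A small sketch (the value 200 is just an example, not a recommendation):

```python
from opensearchpy import OpenSearch

client = OpenSearch(hosts=[{'host': 'localhost', 'port': 9200}])

# Widen the search-time candidate list: better recall, slower queries
client.indices.put_settings(index='articles',
                            body={'index.knn.algo_param.ef_search': 200})
```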
Engine Options
| Engine | Algorithm | Best For |
|---|---|---|
| `nmslib` | HNSW | General purpose, mature, fast search |
| `faiss` | HNSW or IVF | Large-scale, supports PQ compression, GPU training |
| `lucene` | HNSW | Simpler setup, integrated with Lucene segments |
For most deployments, nmslib with HNSW is the recommended default. Use faiss when you need Product Quantization to reduce memory or have very large vector sets (100M+).
Space Types (Distance Metrics)
| Space Type | Use Case |
|---|---|
| `cosinesimil` | Text embeddings (most common) |
| `l2` | Euclidean distance: image features, spatial data |
| `innerproduct` | Pre-normalized embeddings, maximum inner product search |
Match the space type to how your embedding model was trained. Most text embedding models use cosine similarity.
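If you choose `innerproduct`, the vectors are generally expected to be L2-normalized before indexing, at which point inner product and cosine similarity produce the same ranking. A small numpy sketch of that normalization step (assuming embeddings arrive as a 2-D float array, one row per document):

```python
import numpy as np

def l2_normalize(vectors: np.ndarray, eps: float = 1e-12) -> np.ndarray:
    """Scale each row to unit length so inner product matches cosine similarity."""
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    return vectors / np.maximum(norms, eps)

# e.g. normalized = l2_normalize(model.encode(texts)) with the model from the earlier examples
```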
Querying Embeddings
Basic k-NN Query
```json
POST /articles/_search
{
  "query": {
    "knn": {
      "embedding": {
        "vector": [0.12, -0.34, 0.56, ...],
        "k": 10
      }
    }
  }
}
```
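In a client-side setup, the query vector comes from the same model that embedded the documents. A sketch that encodes the query text and runs the k-NN query above through `opensearch-py`:

```python
from opensearchpy import OpenSearch
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('all-MiniLM-L6-v2')
client = OpenSearch(hosts=[{'host': 'localhost', 'port': 9200}])

query_vector = model.encode('how to build REST APIs in Python').tolist()
response = client.search(index='articles', body={
    'size': 10,
    'query': {'knn': {'embedding': {'vector': query_vector, 'k': 10}}}
})
for hit in response['hits']['hits']:
    print(hit['_score'], hit['_source']['title'])
```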
k-NN with Filters
Apply filters to narrow the vector search space:
```json
POST /articles/_search
{
  "query": {
    "knn": {
      "embedding": {
        "vector": [0.12, -0.34, ...],
        "k": 10,
        "filter": {
          "bool": {
            "must": [
              { "term": { "category": "programming" } },
              { "range": { "date": { "gte": "2025-01-01" } } }
            ]
          }
        }
      }
    }
  }
}
```
Filtered k-NN search applies the filter first, then performs vector search on the reduced set, which is efficient when filters are selective. Note that the `filter` clause inside a `knn` query relies on the engine's efficient-filtering support (available for the lucene and faiss engines); the nmslib engine does not support it, so with nmslib you would filter results after the vector search instead, for example with `post_filter`.
Neural Query (Server-Side Embedding)
If you've deployed a model via ML Commons:
```json
POST /articles/_search
{
  "query": {
    "neural": {
      "embedding": {
        "query_text": "how to build REST APIs in Python",
        "model_id": "<model_id>",
        "k": 10
      }
    }
  }
}
```
The neural query type generates the query embedding server-side, so your application doesn't need to call the embedding model.
Memory and Storage Planning
Vector data is memory-intensive. Plan capacity carefully:
- Memory per vector: dimensions × 4 bytes (float32)
- HNSW graph overhead: ~1.5–2x the raw vector size
- Total memory estimate: `memory_GB = num_documents × dimensions × 4 × 2 / 1_000_000_000`
| Documents | Dimensions | Raw Vectors | With HNSW | Recommendation |
|---|---|---|---|---|
| 1M | 384 | 1.5 GB | ~3 GB | Single node, 8 GB+ heap |
| 10M | 384 | 15 GB | ~30 GB | Dedicated k-NN nodes, 64 GB+ RAM |
| 10M | 768 | 30 GB | ~60 GB | Multiple dedicated k-NN nodes |
| 100M | 384 | 150 GB | ~300 GB | Sharded across many nodes, consider PQ |
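These table values follow directly from the estimate above; a small helper makes it easy to plug in your own corpus size and dimensionality (the 2x factor is the rough HNSW overhead from the estimate, not a measured value):

```python
def knn_memory_gb(num_documents: int, dimensions: int, overhead: float = 2.0) -> float:
    """Rough k-NN memory estimate: 4 bytes per float32 component times HNSW overhead."""
    return num_documents * dimensions * 4 * overhead / 1_000_000_000

print(knn_memory_gb(10_000_000, 384))    # ~30 GB, matching the table
print(knn_memory_gb(100_000_000, 384))   # ~300 GB
```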
Reducing Memory with Product Quantization
For very large vector sets, use Faiss with PQ compression:
```json
PUT /large-index
{
  "settings": { "index.knn": true },
  "mappings": {
    "properties": {
      "embedding": {
        "type": "knn_vector",
        "dimension": 768,
        "method": {
          "name": "ivf",
          "space_type": "l2",
          "engine": "faiss",
          "parameters": {
            "nlist": 1024,
            "nprobes": 10,
            "encoder": {
              "name": "pq",
              "parameters": { "code_size": 32 }
            }
          }
        }
      }
    }
  }
}
```
PQ reduces memory by 10–30x at the cost of some recall accuracy.
Best Practices
- Batch index documents: Index in bulk (1,000–5,000 documents per batch) for efficient graph construction.
- Warm up k-NN indices: After indexing, run a few queries to load HNSW graphs into memory before serving production traffic (see the warmup sketch after this list).
- Use dedicated ML nodes: If using ML Commons for server-side embeddings, deploy models to dedicated ML nodes to avoid competing with search and indexing workloads.
- Monitor recall: Periodically test search quality against a ground-truth set. HNSW recall typically exceeds 95% with default parameters.
- Match embedding model at index and query time: Always use the same model version for indexing and querying. Mixing models produces meaningless similarity scores.
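For the warm-up step, the k-NN plugin also exposes a warmup API that loads the graphs for the listed indices into memory without running real queries. A minimal sketch using `opensearch-py`'s generic transport method, assuming the `articles` index from earlier:

```python
from opensearchpy import OpenSearch

client = OpenSearch(hosts=[{'host': 'localhost', 'port': 9200}])

# Load the HNSW graphs for the 'articles' index into memory before serving traffic
client.transport.perform_request('GET', '/_plugins/_knn/warmup/articles')
```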
Frequently Asked Questions
Q: Can I update embeddings without re-indexing the entire document?
Yes, use the Update API to modify the vector field. However, for bulk re-embedding (e.g., after switching models), re-indexing into a new index is more efficient.
Q: How do I handle documents that are too long for the embedding model?
Most embedding models have a maximum token length (typically 256 or 512 tokens). For longer documents, chunk the text and create one vector per chunk. Store the chunk-to-document mapping and deduplicate at query time.
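A minimal chunking sketch along those lines, using a fixed word window with overlap (the 200-word window and 50-word overlap are arbitrary; a tokenizer-aware splitter matched to your model is more precise):

```python
def chunk_text(text: str, max_words: int = 200, overlap: int = 50) -> list[str]:
    """Split text into overlapping word windows; embed one vector per chunk."""
    words = text.split()
    step = max_words - overlap
    return [' '.join(words[i:i + max_words])
            for i in range(0, max(len(words) - overlap, 1), step)]

# Index each chunk as its own document, keeping the parent document id so results
# can be deduplicated back to documents at query time, e.g.:
# {'parent_id': doc_id, 'chunk': chunk, 'embedding': model.encode(chunk).tolist()}
```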
Q: Should I use server-side (ML Commons) or client-side embedding generation?
Server-side is simpler to operate but adds latency to indexing and query paths. Client-side gives you more control, allows GPU acceleration, and keeps ML inference load off your OpenSearch cluster. For production at scale, client-side with dedicated inference infrastructure is typically better.
Q: Can I store multiple vector fields in one index?
Yes. You can have multiple knn_vector fields (e.g., title_embedding and body_embedding) and query them independently or combine results.
Q: What happens when I update my embedding model?
You need to re-generate all embeddings and re-index. The old and new model produce incompatible vector spaces. Use index aliases to swap between old and new indices atomically after re-indexing.
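A sketch of that alias swap with `opensearch-py`, assuming the alias `articles` currently points at `articles-v1` and the re-embedded data has been indexed into `articles-v2` (both index names are hypothetical):

```python
from opensearchpy import OpenSearch

client = OpenSearch(hosts=[{'host': 'localhost', 'port': 9200}])

# Atomically repoint the 'articles' alias from the old index to the re-embedded one
client.indices.update_aliases(body={
    'actions': [
        {'remove': {'index': 'articles-v1', 'alias': 'articles'}},
        {'add': {'index': 'articles-v2', 'alias': 'articles'}}
    ]
})
```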