Embeddings are dense vector representations of data — text, images, or structured records — that capture semantic meaning in a numerical format searchable by similarity. OpenSearch supports embedding-based search through its k-NN plugin, which provides approximate nearest-neighbor search using HNSW, IVF, and other algorithms.
This guide covers the end-to-end workflow: choosing models, generating embeddings, indexing vectors in OpenSearch, and building search pipelines.
Choosing an Embedding Model
The embedding model determines search quality. Choose based on your data type and accuracy requirements:
Text Embedding Models
| Model | Dimensions | Use Case | Notes |
|---|---|---|---|
| all-MiniLM-L6-v2 | 384 | General English text | Fast, lightweight, good starting point |
| all-mpnet-base-v2 | 768 | Higher-quality English text | Better accuracy, 2x compute cost |
| multilingual-e5-base | 768 | Multilingual text | Supports 100+ languages |
| BGE-large-en-v1.5 | 1024 | High-quality English retrieval | State-of-the-art, resource-intensive |
| Cohere embed-v3 | 1024 | Commercial API | Hosted, no infrastructure needed |
| OpenAI text-embedding-3-small | 1536 | Commercial API | Hosted, widely used |
Key trade-off: Higher dimensions generally improve accuracy, but storage and memory grow linearly with dimension, and search latency increases as well.
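Before settling on a model, it can help to confirm the embedding dimension and sanity-check similarity scores on a few representative sentences. A small sentence-transformers sketch along those lines (model names come from the table above; the sample sentences are arbitrary):

```python
from sentence_transformers import SentenceTransformer, util

# Compare two candidate models on the same sentence pair
for name in ('all-MiniLM-L6-v2', 'all-mpnet-base-v2'):
    model = SentenceTransformer(name)
    print(name, 'dimension:', model.get_sentence_embedding_dimension())
    emb = model.encode(['build a REST API in Python', 'Flask tutorial for beginners'])
    # Cosine similarity between the two sentences under this model
    print(name, 'similarity:', util.cos_sim(emb[0], emb[1]).item())
```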
Generating Embeddings Externally
Generate embeddings in your application before indexing:
```python
from sentence_transformers import SentenceTransformer
from opensearchpy import OpenSearch

# Load the embedding model and connect to the cluster
model = SentenceTransformer('all-MiniLM-L6-v2')
client = OpenSearch(hosts=[{'host': 'localhost', 'port': 9200}])

documents = [
    {"title": "Python web frameworks", "body": "Flask and Django are popular Python web frameworks..."},
    {"title": "JavaScript runtime", "body": "Node.js is a server-side JavaScript runtime..."}
]

for i, doc in enumerate(documents):
    # Embed the concatenated title and body, then index the document with its vector
    embedding = model.encode(doc['title'] + ' ' + doc['body'])
    doc['embedding'] = embedding.tolist()
    client.index(index='articles', id=i, body=doc)
```
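For anything beyond a handful of documents, encoding in batches and writing with the bulk helper is substantially faster than one `index()` call per document (this also matches the batching advice under Best Practices). A sketch assuming the same model, client, and document shape as above; the batch size of 256 is illustrative:

```python
from opensearchpy import OpenSearch, helpers
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('all-MiniLM-L6-v2')
client = OpenSearch(hosts=[{'host': 'localhost', 'port': 9200}])

def embed_and_bulk_index(docs, batch_size=256):
    for start in range(0, len(docs), batch_size):
        batch = docs[start:start + batch_size]
        # One batched encode() call per chunk is far cheaper than per-document calls
        vectors = model.encode([d['title'] + ' ' + d['body'] for d in batch])
        actions = [
            {'_index': 'articles', '_id': start + i, '_source': {**doc, 'embedding': vec.tolist()}}
            for i, (doc, vec) in enumerate(zip(batch, vectors))
        ]
        helpers.bulk(client, actions)
```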
Using OpenSearch ML Commons (Server-Side)
OpenSearch can host embedding models directly, eliminating external infrastructure:
```json
# Register a model
POST /_plugins/_ml/models/_register
{
  "name": "all-MiniLM-L6-v2",
  "version": "1.0.1",
  "model_format": "TORCH_SCRIPT",
  "model_config": {
    "model_type": "bert",
    "embedding_dimension": 384,
    "framework_type": "sentence_transformers"
  },
  "url": "https://artifacts.opensearch.org/models/ml-models/huggingface/sentence-transformers/all-MiniLM-L6-v2/1.0.1/torch_script/sentence-transformers_all-MiniLM-L6-v2-1.0.1-torch_script.zip"
}
```
```json
# Deploy the model to ML nodes
POST /_plugins/_ml/models/<model_id>/_deploy
```
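Both calls are asynchronous and return a task ID. A rough Python sketch of the register, poll, deploy flow using opensearch-py's generic `perform_request` transport method; the task-polling field names (`task_id`, `state`, `model_id`) follow the ML Commons task API, and the polling interval is arbitrary:

```python
import time
from opensearchpy import OpenSearch

client = OpenSearch(hosts=[{'host': 'localhost', 'port': 9200}])

def wait_for_task(task_id, poll_seconds=5):
    """Poll an ML Commons task until it finishes and return the task document."""
    while True:
        task = client.transport.perform_request('GET', f'/_plugins/_ml/tasks/{task_id}')
        if task.get('state') in ('COMPLETED', 'FAILED'):
            return task
        time.sleep(poll_seconds)

# Register: the request body mirrors the REST example above
register = client.transport.perform_request(
    'POST', '/_plugins/_ml/models/_register',
    body={
        "name": "all-MiniLM-L6-v2",
        "version": "1.0.1",
        "model_format": "TORCH_SCRIPT",
        "model_config": {
            "model_type": "bert",
            "embedding_dimension": 384,
            "framework_type": "sentence_transformers"
        },
        "url": "https://artifacts.opensearch.org/models/ml-models/huggingface/sentence-transformers/all-MiniLM-L6-v2/1.0.1/torch_script/sentence-transformers_all-MiniLM-L6-v2-1.0.1-torch_script.zip"
    },
)
model_id = wait_for_task(register['task_id'])['model_id']

# Deploy the registered model and wait for the deploy task to complete
deploy = client.transport.perform_request('POST', f'/_plugins/_ml/models/{model_id}/_deploy')
wait_for_task(deploy['task_id'])
```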
Once deployed, use ingest pipelines to generate embeddings automatically during indexing:
```json
PUT /_ingest/pipeline/embedding-pipeline
{
  "processors": [
    {
      "text_embedding": {
        "model_id": "<model_id>",
        "field_map": {
          "body": "body_embedding"
        }
      }
    }
  ]
}
```
Documents indexed through this pipeline automatically get embeddings generated from the body field.
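The pipeline can be attached per request or, more conveniently, set as the index's default pipeline so every write runs through it. A sketch with `opensearch-py` (index, pipeline, and field names follow the examples above):

```python
from opensearchpy import OpenSearch

client = OpenSearch(hosts=[{'host': 'localhost', 'port': 9200}])

# Route all writes to 'articles' through the embedding pipeline by default
client.indices.put_settings(index='articles',
                            body={'index.default_pipeline': 'embedding-pipeline'})

# The text_embedding processor adds body_embedding server-side; no client-side model needed
client.index(index='articles', id=1, body={
    'title': 'Python web frameworks',
    'body': 'Flask and Django are popular Python web frameworks...'
})
```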
Index Configuration for Vectors
Creating a k-NN Index
```json
PUT /articles
{
  "settings": {
    "index.knn": true,
    "number_of_shards": 3,
    "number_of_replicas": 1
  },
  "mappings": {
    "properties": {
      "title": { "type": "text" },
      "body": { "type": "text" },
      "embedding": {
        "type": "knn_vector",
        "dimension": 384,
        "method": {
          "name": "hnsw",
          "space_type": "cosinesimil",
          "engine": "nmslib",
          "parameters": {
            "ef_construction": 256,
            "m": 16
          }
        }
      }
    }
  }
}
```
HNSW Parameters
| Parameter | Default | Effect |
|---|---|---|
| `m` | 16 | Graph connectivity. Higher = better recall, more memory. 12–32 is typical. |
| `ef_construction` | 512 | Build-time quality. Higher = better graph quality, slower indexing. 128–512 is typical. |
| `ef_search` | 100 (query-time) | Search-time quality. Higher = better recall, slower queries. Tune per-query via API. |
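With the nmslib engine used in the mapping above, `ef_search` is typically adjusted through the `index.knn.algo_param.ef_search` index setting rather than in the mapping. A small sketch (the value 200 is just an example, not a recommendation):

```python
from opensearchpy import OpenSearch

client = OpenSearch(hosts=[{'host': 'localhost', 'port': 9200}])

# Widen the search-time candidate list: better recall, slower queries
client.indices.put_settings(index='articles',
                            body={'index.knn.algo_param.ef_search': 200})
```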
Engine Options
| Engine | Algorithm | Best For |
|---|---|---|
| `nmslib` | HNSW | General purpose, mature, fast search |
| `faiss` | HNSW or IVF | Large-scale, supports PQ compression, GPU training |
| `lucene` | HNSW | Simpler setup, integrated with Lucene segments |
For most deployments, nmslib with HNSW is the recommended default. Use faiss when you need Product Quantization to reduce memory or have very large vector sets (100M+).
Space Types (Distance Metrics)
| Space Type | Use Case |
|---|---|
| `cosinesimil` | Text embeddings (most common) |
| `l2` | Euclidean distance: image features, spatial data |
| `innerproduct` | Pre-normalized embeddings, maximum inner product search |
Match the space type to how your embedding model was trained. Most text embedding models use cosine similarity.
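If you choose `innerproduct`, the vectors are generally expected to be L2-normalized before indexing, at which point inner product and cosine similarity produce the same ranking. A small numpy sketch of that normalization step (assuming embeddings arrive as a 2-D float array, one row per document):

```python
import numpy as np

def l2_normalize(vectors: np.ndarray, eps: float = 1e-12) -> np.ndarray:
    """Scale each row to unit length so inner product matches cosine similarity."""
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    return vectors / np.maximum(norms, eps)

# e.g. normalized = l2_normalize(model.encode(texts)) with the model from the earlier examples
```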
Querying Embeddings
Basic k-NN Query
```json
POST /articles/_search
{
  "query": {
    "knn": {
      "embedding": {
        "vector": [0.12, -0.34, 0.56, ...],
        "k": 10
      }
    }
  }
}
```
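In a client-side setup, the query vector comes from the same model that embedded the documents. A sketch that encodes the query text and runs the k-NN query above through `opensearch-py`:

```python
from opensearchpy import OpenSearch
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('all-MiniLM-L6-v2')
client = OpenSearch(hosts=[{'host': 'localhost', 'port': 9200}])

query_vector = model.encode('how to build REST APIs in Python').tolist()
response = client.search(index='articles', body={
    'size': 10,
    'query': {'knn': {'embedding': {'vector': query_vector, 'k': 10}}}
})
for hit in response['hits']['hits']:
    print(hit['_score'], hit['_source']['title'])
```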
k-NN with Filters
Apply filters to narrow the vector search space:
```json
POST /articles/_search
{
  "query": {
    "knn": {
      "embedding": {
        "vector": [0.12, -0.34, ...],
        "k": 10,
        "filter": {
          "bool": {
            "must": [
              { "term": { "category": "programming" } },
              { "range": { "date": { "gte": "2025-01-01" } } }
            ]
          }
        }
      }
    }
  }
}
```
Filtered k-NN search applies the filter first, then performs vector search on the reduced set, which is efficient when filters are selective. Note that the `filter` clause inside a `knn` query relies on the engine's efficient-filtering support (available for the lucene and faiss engines); the nmslib engine does not support it, so with nmslib you would filter results after the vector search instead, for example with `post_filter`.
Neural Query (Server-Side Embedding)
If you've deployed a model via ML Commons:
```json
POST /articles/_search
{
  "query": {
    "neural": {
      "embedding": {
        "query_text": "how to build REST APIs in Python",
        "model_id": "<model_id>",
        "k": 10
      }
    }
  }
}
```
The neural query type generates the query embedding server-side, so your application doesn't need to call the embedding model.
Memory and Storage Planning
Vector data is memory-intensive. Plan capacity carefully:
- Memory per vector: dimensions × 4 bytes (float32)
- HNSW graph overhead: ~1.5–2x the raw vector size
- Total memory estimate: `memory_GB = num_documents × dimensions × 4 × 2 / 1_000_000_000`
| Documents | Dimensions | Raw Vectors | With HNSW | Recommendation |
|---|---|---|---|---|
| 1M | 384 | 1.5 GB | ~3 GB | Single node, 8 GB+ heap |
| 10M | 384 | 15 GB | ~30 GB | Dedicated k-NN nodes, 64 GB+ RAM |
| 10M | 768 | 30 GB | ~60 GB | Multiple dedicated k-NN nodes |
| 100M | 384 | 150 GB | ~300 GB | Sharded across many nodes, consider PQ |
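These table values follow directly from the estimate above; a small helper makes it easy to plug in your own corpus size and dimensionality (the 2x factor is the rough HNSW overhead from the estimate, not a measured value):

```python
def knn_memory_gb(num_documents: int, dimensions: int, overhead: float = 2.0) -> float:
    """Rough k-NN memory estimate: 4 bytes per float32 component times HNSW overhead."""
    return num_documents * dimensions * 4 * overhead / 1_000_000_000

print(knn_memory_gb(10_000_000, 384))    # ~30 GB, matching the table
print(knn_memory_gb(100_000_000, 384))   # ~300 GB
```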
Reducing Memory with Product Quantization
For very large vector sets, use Faiss with PQ compression:
```json
PUT /large-index
{
  "settings": { "index.knn": true },
  "mappings": {
    "properties": {
      "embedding": {
        "type": "knn_vector",
        "dimension": 768,
        "method": {
          "name": "ivf",
          "space_type": "l2",
          "engine": "faiss",
          "parameters": {
            "nlist": 1024,
            "nprobes": 10,
            "encoder": {
              "name": "pq",
              "parameters": { "code_size": 32 }
            }
          }
        }
      }
    }
  }
}
```
PQ reduces memory by 10–30x at the cost of some recall accuracy.
Best Practices
- Batch index documents: Index in bulk (1,000–5,000 documents per batch) for efficient graph construction.
- Warm up k-NN indices: After indexing, run a few queries to load HNSW graphs into memory before serving production traffic (see the warmup sketch after this list).
- Use dedicated ML nodes: If using ML Commons for server-side embeddings, deploy models to dedicated ML nodes to avoid competing with search and indexing workloads.
- Monitor recall: Periodically test search quality against a ground-truth set. HNSW recall typically exceeds 95% with default parameters.
- Match embedding model at index and query time: Always use the same model version for indexing and querying. Mixing models produces meaningless similarity scores.
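For the warm-up step, the k-NN plugin also exposes a warmup API that loads the graphs for the listed indices into memory without running real queries. A minimal sketch using `opensearch-py`'s generic transport method, assuming the `articles` index from earlier:

```python
from opensearchpy import OpenSearch

client = OpenSearch(hosts=[{'host': 'localhost', 'port': 9200}])

# Load the HNSW graphs for the 'articles' index into memory before serving traffic
client.transport.perform_request('GET', '/_plugins/_knn/warmup/articles')
```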
Frequently Asked Questions
Q: Can I update embeddings without re-indexing the entire document?
Yes, use the Update API to modify the vector field. However, for bulk re-embedding (e.g., after switching models), re-indexing into a new index is more efficient.
Q: How do I handle documents that are too long for the embedding model?
Most embedding models have a maximum token length (typically 256 or 512 tokens). For longer documents, chunk the text and create one vector per chunk. Store the chunk-to-document mapping and deduplicate at query time.
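A minimal chunking sketch along those lines, using a fixed word window with overlap (the 200-word window and 50-word overlap are arbitrary; a tokenizer-aware splitter matched to your model is more precise):

```python
def chunk_text(text: str, max_words: int = 200, overlap: int = 50) -> list[str]:
    """Split text into overlapping word windows; embed one vector per chunk."""
    words = text.split()
    step = max_words - overlap
    return [' '.join(words[i:i + max_words])
            for i in range(0, max(len(words) - overlap, 1), step)]

# Index each chunk as its own document, keeping the parent document id so results
# can be deduplicated back to documents at query time, e.g.:
# {'parent_id': doc_id, 'chunk': chunk, 'embedding': model.encode(chunk).tolist()}
```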
Q: Should I use server-side (ML Commons) or client-side embedding generation?
Server-side is simpler to operate but adds latency to indexing and query paths. Client-side gives you more control, allows GPU acceleration, and keeps ML inference load off your OpenSearch cluster. For production at scale, client-side with dedicated inference infrastructure is typically better.
Q: Can I store multiple vector fields in one index?
Yes. You can have multiple knn_vector fields (e.g., title_embedding and body_embedding) and query them independently or combine results.
Q: What happens when I update my embedding model?
You need to re-generate all embeddings and re-index. The old and new model produce incompatible vector spaces. Use index aliases to swap between old and new indices atomically after re-indexing.
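A sketch of that alias swap with `opensearch-py`, assuming the alias `articles` currently points at `articles-v1` and the re-embedded data has been indexed into `articles-v2` (both index names are hypothetical):

```python
from opensearchpy import OpenSearch

client = OpenSearch(hosts=[{'host': 'localhost', 'port': 9200}])

# Atomically repoint the 'articles' alias from the old index to the re-embedded one
client.indices.update_aliases(body={
    'actions': [
        {'remove': {'index': 'articles-v1', 'alias': 'articles'}},
        {'add': {'index': 'articles-v2', 'alias': 'articles'}}
    ]
})
```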