Large Language Models (LLMs) can be integrated with Elasticsearch to create powerful AI-enhanced search applications. By combining Elasticsearch's search capabilities with LLMs' natural language understanding, you can build semantic search, question-answering systems, Retrieval-Augmented Generation (RAG) applications, and more intelligent search experiences.
Use Cases for LLMs with Elasticsearch
1. Semantic Search
- Search by meaning rather than exact keywords
- Find conceptually similar documents
- Improve search relevance with embeddings
2. Retrieval-Augmented Generation (RAG)
- Provide context from Elasticsearch to LLMs
- Generate accurate, grounded responses
- Reduce hallucinations with factual data
3. Question Answering
- Natural language queries
- Contextual answers from documents
- Citation and source linking
4. Document Classification
- Automatic categorization
- Content tagging
- Sentiment analysis
5. Content Recommendations
- Similar document suggestions
- Personalized content discovery
- Context-aware recommendations
Key Technologies
Vector Search in Elasticsearch
Dense vectors (the dense_vector field type, available since the 7.x series):
- Store embeddings from LLMs and other embedding models
- Exact (brute-force) similarity scoring via script_score queries
- Cosine similarity, dot product, and Euclidean distance functions
Elasticsearch 8.x enhancements:
- Native k-NN (k-nearest neighbors) search as a top-level search option
- Approximate nearest neighbor (ANN) search backed by the HNSW algorithm
- Hybrid search (combining text and vector scoring)
- Vector quantization for efficiency
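As an example of the quantization option above, recent 8.x releases (8.12+) can store dense vectors with scalar (int8) quantization via index_options. A minimal sketch using the Python client; the index and field names are illustrative, and the other mapping parameters are explained in the setup section below:
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

es.indices.create(
    index="documents-quantized",
    mappings={
        "properties": {
            "content_embedding": {
                "type": "dense_vector",
                "dims": 384,
                "index": True,
                "similarity": "cosine",
                # int8_hnsw quantizes stored vectors to reduce memory usage (8.12+)
                "index_options": {"type": "int8_hnsw"}
            }
        }
    }
)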
Supported Vector Dimensions
Recent Elasticsearch 8.x releases support dense vectors with up to 4096 dimensions:
- Small models: 384-768 dimensions (e.g., MiniLM)
- Medium models: 768-1024 dimensions (e.g., BERT)
- Large models: 1536-4096 dimensions (e.g., OpenAI embeddings)
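Whichever model you choose, its output dimension must match the dims setting in the index mapping defined in the next section. A quick check with sentence-transformers:
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")
# Must equal the "dims" value of the dense_vector field (384 for this model)
print(model.get_sentence_embedding_dimension())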
Setting Up Vector Search
Step 1: Create Index with Vector Field
Define a mapping with dense_vector fields:
PUT /documents
{
  "mappings": {
    "properties": {
      "title": {
        "type": "text"
      },
      "content": {
        "type": "text"
      },
      "title_embedding": {
        "type": "dense_vector",
        "dims": 384,
        "index": true,
        "similarity": "cosine"
      },
      "content_embedding": {
        "type": "dense_vector",
        "dims": 384,
        "index": true,
        "similarity": "cosine"
      }
    }
  }
}
Similarity options:
- cosine: Cosine similarity (recommended for normalized embeddings)
- dot_product: Dot product similarity
- l2_norm: Euclidean distance
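Note that dot_product expects vectors normalized to unit length. A small sketch of normalizing an embedding before indexing (numpy is an extra dependency here):
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")
embedding = model.encode("Elasticsearch is a distributed search engine")

# Unit-length vectors make dot_product behave like cosine similarity
normalized = (embedding / np.linalg.norm(embedding)).tolist()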
Step 2: Generate Embeddings
Using Python with sentence-transformers:
from sentence_transformers import SentenceTransformer
from elasticsearch import Elasticsearch
# Initialize model
model = SentenceTransformer('all-MiniLM-L6-v2')
# Initialize Elasticsearch client
es = Elasticsearch(['http://localhost:9200'])
# Generate embedding
text = "Elasticsearch is a distributed search engine"
embedding = model.encode(text).tolist()
# Index document with embedding
doc = {
"title": "Elasticsearch Overview",
"content": text,
"content_embedding": embedding
}
es.index(index="documents", document=doc)
Using OpenAI embeddings:
import openai
from elasticsearch import Elasticsearch
openai.api_key = "your-api-key"
def get_embedding(text, model="text-embedding-3-small"):
    response = openai.embeddings.create(
        input=text,
        model=model
    )
    return response.data[0].embedding
# Generate and index
text = "Elasticsearch with LLMs enables semantic search"
embedding = get_embedding(text)
doc = {
"title": "LLM Integration",
"content": text,
"content_embedding": embedding
}
es.index(index="documents", document=doc)
Step 3: Perform Vector Search
k-NN search query (the query_vector must contain 384 values to match the mapping above):
POST /documents/_search
{
  "knn": {
    "field": "content_embedding",
    "query_vector": [0.1, 0.2, ...],
    "k": 10,
    "num_candidates": 100
  },
  "fields": ["title", "content"]
}
Python example:
def semantic_search(query_text, k=10):
    # Generate query embedding
    query_embedding = model.encode(query_text).tolist()
    # Search
    response = es.search(
        index="documents",
        knn={
            "field": "content_embedding",
            "query_vector": query_embedding,
            "k": k,
            "num_candidates": 100
        },
        fields=["title", "content"]
    )
    return response['hits']['hits']

# Perform search
results = semantic_search("What is machine learning?")
Hybrid Search (Text + Vector)
Combine traditional text search with vector search for best results:
POST /documents/_search
{
  "query": {
    "bool": {
      "should": [
        {
          "match": {
            "content": {
              "query": "machine learning",
              "boost": 1.0
            }
          }
        }
      ]
    }
  },
  "knn": {
    "field": "content_embedding",
    "query_vector": [0.1, 0.2, ...],
    "k": 10,
    "num_candidates": 100,
    "boost": 2.0
  },
  "size": 10
}
Python implementation:
def hybrid_search(query_text, k=10):
    query_embedding = model.encode(query_text).tolist()
    response = es.search(
        index="documents",
        query={
            "bool": {
                "should": [
                    {
                        "match": {
                            "content": {
                                "query": query_text,
                                "boost": 1.0
                            }
                        }
                    }
                ]
            }
        },
        knn={
            "field": "content_embedding",
            "query_vector": query_embedding,
            "k": k,
            "num_candidates": 100,
            "boost": 2.0
        },
        size=k
    )
    return response['hits']['hits']
Building a RAG Application
Architecture Overview
- User Query → an embedding model encodes the query
- Vector Search → Elasticsearch finds relevant documents
- Context Retrieval → Documents provided to LLM
- Response Generation → LLM generates answer with context
- Return Result → Answer with citations
Implementation Example
import openai
from elasticsearch import Elasticsearch
from sentence_transformers import SentenceTransformer

class RAGSystem:
    def __init__(self):
        self.es = Elasticsearch(['http://localhost:9200'])
        self.model = SentenceTransformer('all-MiniLM-L6-v2')
        openai.api_key = "your-api-key"

    def retrieve_context(self, query, k=5):
        """Retrieve relevant documents from Elasticsearch"""
        query_embedding = self.model.encode(query).tolist()
        response = self.es.search(
            index="documents",
            knn={
                "field": "content_embedding",
                "query_vector": query_embedding,
                "k": k,
                "num_candidates": 100
            },
            fields=["title", "content"],
            source=False  # skip _source; the values come back via "fields"
        )
        documents = []
        for hit in response['hits']['hits']:
            documents.append({
                'title': hit['fields']['title'][0],
                'content': hit['fields']['content'][0],
                'score': hit['_score']
            })
        return documents

    def generate_answer(self, query, context_docs):
        """Generate answer using LLM with context"""
        # Format context
        context = "\n\n".join([
            f"Document {i+1}: {doc['title']}\n{doc['content']}"
            for i, doc in enumerate(context_docs)
        ])
        # Create prompt
        prompt = f"""Answer the following question based on the provided context.
If the answer is not in the context, say "I don't have enough information to answer this question."
Context:
{context}
Question: {query}
Answer:"""
        # Call LLM
        response = openai.chat.completions.create(
            model="gpt-4",
            messages=[
                {"role": "system", "content": "You are a helpful assistant that answers questions based on provided context."},
                {"role": "user", "content": prompt}
            ],
            temperature=0.7,
            max_tokens=500
        )
        return {
            'answer': response.choices[0].message.content,
            'sources': context_docs
        }

    def query(self, question):
        """End-to-end RAG query"""
        # Retrieve relevant documents
        context_docs = self.retrieve_context(question, k=5)
        # Generate answer
        result = self.generate_answer(question, context_docs)
        return result

# Usage
rag = RAGSystem()
result = rag.query("How does Elasticsearch handle scalability?")
print("Answer:", result['answer'])
print("\nSources:")
for doc in result['sources']:
    print(f"- {doc['title']} (score: {doc['score']:.3f})")
Using Elasticsearch ML Models
Elasticsearch NLP Models
Elasticsearch 8.x includes built-in NLP capabilities:
Supported tasks:
- Text embeddings
- Named entity recognition (NER)
- Sentiment analysis
- Text classification
- Zero-shot classification
Deploying a Model in Elasticsearch
Step 1: Upload the model with eland (the snippet follows eland's TransformerModel/PyTorchModel flow; the eland_import_hub_model CLI is an alternative):
from pathlib import Path
from elasticsearch import Elasticsearch
from eland.ml.pytorch import PyTorchModel
from eland.ml.pytorch.transformers import TransformerModel

es = Elasticsearch(['http://localhost:9200'])

# Download the Hugging Face model and convert it for Elasticsearch
tm = TransformerModel(model_id="sentence-transformers/all-MiniLM-L6-v2",
                      task_type="text_embedding")
tmp_path = "models"
Path(tmp_path).mkdir(parents=True, exist_ok=True)
model_path, config, vocab_path = tm.save(tmp_path)

# Upload; the model ID in Elasticsearch becomes
# "sentence-transformers__all-minilm-l6-v2"
ptm = PyTorchModel(es, tm.elasticsearch_model_id())
ptm.import_model(model_path=model_path, config_path=None,
                 vocab_path=vocab_path, config=config)
Step 2: Deploy model:
POST _ml/trained_models/sentence-transformers__all-minilm-l6-v2/deployment/_start?number_of_allocations=1
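To sanity-check the deployment, you can run a test inference (a sketch assuming the 8.x Python client's ml.infer_trained_model helper and the eland-generated model ID above):
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# Should return an inference result containing a 384-dimensional embedding
result = es.ml.infer_trained_model(
    model_id="sentence-transformers__all-minilm-l6-v2",
    docs=[{"text_field": "Elasticsearch is a distributed search engine"}]
)
print(result)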
Step 3: Use model in ingest pipeline:
PUT _ingest/pipeline/text-embeddings
{
  "processors": [
    {
      "inference": {
        "model_id": "sentence-transformers__all-minilm-l6-v2",
        "target_field": "content_embedding",
        "field_map": {
          "content": "text_field"
        }
      }
    }
  ]
}
Index with automatic embedding generation (note that the inference processor writes the vector to content_embedding.predicted_value, which is the field to map as dense_vector when using this pipeline):
POST /documents/_doc?pipeline=text-embeddings
{
  "title": "Elasticsearch ML",
  "content": "Machine learning features in Elasticsearch"
}
Best Practices
1. Embedding Model Selection
Considerations:
- Vector dimensions vs. accuracy trade-off
- Inference speed requirements
- Model size and memory usage
- Language support
- Domain-specific models
Popular models:
- all-MiniLM-L6-v2: 384 dims, fast, good general purpose
- all-mpnet-base-v2: 768 dims, better accuracy
- OpenAI text-embedding-3-small: 1536 dims, high quality
- Domain-specific: Legal-BERT, BioBERT, etc.
2. Performance Optimization
Indexing:
- Batch document indexing
- Use the bulk API (see the bulk-indexing sketch after the filtering example below)
- Optimize vector dimensions
- Consider vector quantization
Search:
- Adjust the num_candidates parameter
- Use filters to reduce the search space
- Cache query embeddings
- Implement query result caching
Example with filtering (a filter inside the knn clause restricts the candidate set before the vector search runs):
POST /documents/_search
{
  "knn": {
    "field": "content_embedding",
    "query_vector": [...],
    "k": 10,
    "num_candidates": 100,
    "filter": {
      "term": {"category": "technology"}
    }
  }
}
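For the bulk-indexing recommendation above, a minimal sketch using the Python client's helpers.bulk together with batched encoding (the index and field names match the earlier examples; the docs list is illustrative):
from elasticsearch import Elasticsearch, helpers
from sentence_transformers import SentenceTransformer

es = Elasticsearch("http://localhost:9200")
model = SentenceTransformer("all-MiniLM-L6-v2")

docs = [
    {"title": "Doc 1", "content": "Elasticsearch is a distributed search engine"},
    {"title": "Doc 2", "content": "Vector search finds semantically similar text"},
]

# Encode all contents in one batch, then index with the bulk API
embeddings = model.encode([d["content"] for d in docs])
actions = [
    {
        "_index": "documents",
        "_source": {**doc, "content_embedding": emb.tolist()}
    }
    for doc, emb in zip(docs, embeddings)
]
helpers.bulk(es, actions)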
3. Cost Management
LLM API costs:
- Cache embeddings (don't regenerate)
- Batch embedding generation
- Use smaller models when appropriate
- Implement rate limiting
Elasticsearch costs:
- Monitor index size
- Use appropriate hardware
- Optimize replica settings
- Implement data lifecycle management
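One way to avoid regenerating embeddings, as suggested above, is a small in-memory cache keyed by a hash of the text (a sketch; swap in Redis or a database for persistence):
import hashlib
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")
_embedding_cache = {}

def cached_embedding(text):
    # Hash the text so the cache key stays small and stable
    key = hashlib.sha256(text.encode("utf-8")).hexdigest()
    if key not in _embedding_cache:
        _embedding_cache[key] = model.encode(text).tolist()
    return _embedding_cache[key]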
4. Quality Assurance
Evaluation metrics:
- Precision@k
- Recall@k
- Mean Reciprocal Rank (MRR)
- Normalized Discounted Cumulative Gain (NDCG)
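A sketch of two of these metrics, assuming you have a list of retrieved document IDs per query and a set of known-relevant IDs:
def precision_at_k(retrieved_ids, relevant_ids, k):
    # Fraction of the top-k results that are relevant
    top_k = retrieved_ids[:k]
    return sum(1 for doc_id in top_k if doc_id in relevant_ids) / k

def reciprocal_rank(retrieved_ids, relevant_ids):
    # 1 / rank of the first relevant result (0 if none is found)
    for rank, doc_id in enumerate(retrieved_ids, start=1):
        if doc_id in relevant_ids:
            return 1.0 / rank
    return 0.0

# Mean Reciprocal Rank is the average of reciprocal_rank over a set of test queries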
A/B testing:
- Compare different embedding models
- Test hybrid vs. pure vector search
- Measure user engagement
- Monitor query performance
Advanced Techniques
1. Re-ranking
Improve results with two-stage retrieval:
def search_with_reranking(query, k=10):
    # Stage 1: Fast retrieval
    candidates = semantic_search(query, k=100)
    # Stage 2: Re-rank with a more sophisticated model
    reranked = cross_encoder_rerank(query, candidates)
    return reranked[:k]
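cross_encoder_rerank is not defined above; one possible implementation uses a sentence-transformers CrossEncoder (the model name and the field access are assumptions based on the earlier semantic_search output):
from sentence_transformers import CrossEncoder

# A small, commonly used re-ranking model (an illustrative choice)
cross_encoder = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def cross_encoder_rerank(query, candidates):
    # Score each (query, document) pair jointly, then sort by that score
    pairs = [(query, hit['fields']['content'][0]) for hit in candidates]
    scores = cross_encoder.predict(pairs)
    ranked = sorted(zip(candidates, scores), key=lambda x: x[1], reverse=True)
    return [hit for hit, _ in ranked]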
2. Query Expansion
Enhance queries with LLM-generated variations:
def expand_query(query):
    response = openai.chat.completions.create(
        model="gpt-4",
        messages=[{
            "role": "user",
            "content": f"Generate 3 alternative phrasings of this query, one per line: {query}"
        }]
    )
    # Parse one alternative per line, stripping any numbering
    alternatives = [
        line.strip().lstrip("0123456789.-) ")
        for line in response.choices[0].message.content.splitlines()
        if line.strip()
    ]
    return [query] + alternatives

def multi_query_search(query):
    expanded_queries = expand_query(query)
    all_results = []
    for q in expanded_queries:
        results = semantic_search(q, k=20)
        all_results.extend(results)
    # Deduplicate by document ID, keeping the best score, then rank
    best = {}
    for hit in all_results:
        doc_id = hit['_id']
        if doc_id not in best or hit['_score'] > best[doc_id]['_score']:
            best[doc_id] = hit
    return sorted(best.values(), key=lambda h: h['_score'], reverse=True)
3. Contextual Compression
Reduce context size while preserving relevance:
def compress_context(query, documents):
    compressed = []
    for doc in documents:
        # Extract most relevant sentences
        sentences = extract_relevant_sentences(doc['content'], query)
        compressed.append({
            'title': doc['title'],
            'content': ' '.join(sentences),
            'score': doc['score']
        })
    return compressed
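extract_relevant_sentences is left undefined above; a simple version can rank sentences by embedding similarity to the query (the naive sentence split and the top_n cutoff are assumptions):
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

def extract_relevant_sentences(text, query, top_n=3):
    # Naive sentence split; use a proper tokenizer (e.g., nltk) in production
    sentences = [s.strip() for s in text.split('.') if s.strip()]
    if not sentences:
        return []
    query_emb = model.encode(query, convert_to_tensor=True)
    sentence_embs = model.encode(sentences, convert_to_tensor=True)
    # Cosine similarity between the query and every sentence
    scores = util.cos_sim(query_emb, sentence_embs)[0]
    top_idx = scores.argsort(descending=True)[:top_n]
    # Keep the original sentence order for readability
    return [sentences[i] for i in sorted(int(i) for i in top_idx)]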
Frequently Asked Questions
Q: What's the difference between text search and vector search?
A: Text search matches keywords and uses statistical relevance, while vector search understands semantic meaning and finds conceptually similar content.
Q: Do I need to choose between text and vector search?
A: No, hybrid search combining both approaches often yields the best results.
Q: Which embedding model should I use?
A: Start with all-MiniLM-L6-v2 for speed and efficiency. Use larger models like all-mpnet-base-v2 or OpenAI embeddings for better accuracy.
Q: How do I handle large documents?
A: Split documents into chunks (e.g., paragraphs or sections), embed each chunk separately, and retrieve at chunk level.
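A minimal chunking sketch for the approach described in this answer (the chunk size and overlap are arbitrary choices):
def chunk_text(text, chunk_size=500, overlap=50):
    # Split into overlapping character windows; embed and index each chunk separately
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        start += chunk_size - overlap
    return chunks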
Q: Can I use multiple embedding models?
A: Yes, store multiple embedding fields and query them separately or combine results.
Q: How expensive is it to run LLMs with Elasticsearch?
A: Embedding generation is a one-time cost per document, and vector search in Elasticsearch is efficient. The main ongoing cost is LLM API calls for answer generation.
Q: Do embeddings need to be regenerated when documents change?
A: Yes, update embeddings when document content changes significantly.
Q: Can I use open-source LLMs instead of OpenAI?
A: Yes, models like Llama 2, Mistral, or Falcon work well and can be self-hosted.
Q: How do I prevent LLM hallucinations?
A: Use RAG to ground responses in retrieved documents, set appropriate temperature, and validate answers against source content.
Q: What's the best way to evaluate search quality?
A: Combine automated metrics (precision, recall) with user feedback and manual relevance assessments.