Using Large Language Models with Elasticsearch

Large Language Models (LLMs) can be integrated with Elasticsearch to create powerful AI-enhanced search applications. By combining Elasticsearch's search capabilities with LLMs' natural language understanding, you can build semantic search, question-answering systems, Retrieval-Augmented Generation (RAG) applications, and more intelligent search experiences.

Use Cases for LLMs with Elasticsearch

1. Semantic Search

  • Search by meaning rather than exact keywords
  • Find conceptually similar documents
  • Improve search relevance with embeddings

2. Retrieval-Augmented Generation (RAG)

  • Provide context from Elasticsearch to LLMs
  • Generate accurate, grounded responses
  • Reduce hallucinations with factual data

3. Question Answering

  • Natural language queries
  • Contextual answers from documents
  • Citation and source linking

4. Document Classification

  • Automatic categorization
  • Content tagging
  • Sentiment analysis

5. Content Recommendations

  • Similar document suggestions
  • Personalized content discovery
  • Context-aware recommendations

Key Technologies

Vector Search in Elasticsearch

Dense vectors (introduced in Elasticsearch 7.3):

  • Store embeddings from LLMs
  • k-NN (k-nearest neighbors) search
  • Cosine similarity calculations
  • Approximate nearest neighbor (ANN) algorithms

Elasticsearch 8.x enhancements:

  • Native vector search
  • Improved performance with HNSW algorithm
  • Hybrid search (combining text and vector)
  • Vector quantization for efficiency (see the sketch after this list)
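
For example, a minimal sketch of a quantized vector mapping from Python (assumes Elasticsearch 8.12 or later, where the int8_hnsw index option is available; the index name and dimensions are illustrative):

from elasticsearch import Elasticsearch

es = Elasticsearch(['http://localhost:9200'])

# int8 scalar quantization cuts vector memory for the HNSW graph roughly 4x
# at a small cost in accuracy
es.indices.create(
    index="documents-quantized",
    mappings={
        "properties": {
            "content": {"type": "text"},
            "content_embedding": {
                "type": "dense_vector",
                "dims": 384,
                "index": True,
                "similarity": "cosine",
                "index_options": {"type": "int8_hnsw"}
            }
        }
    }
)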

Supported Vector Dimensions

Elasticsearch supports vectors up to 4096 dimensions (8.x+):

  • Small models: 384-768 dimensions (e.g., MiniLM)
  • Medium models: 768-1024 dimensions (e.g., BERT)
  • Large models: 1536-4096 dimensions (e.g., OpenAI embeddings)

Step 1: Create Index with Vector Field

Define mapping with dense_vector field:

PUT /documents
{
  "mappings": {
    "properties": {
      "title": {
        "type": "text"
      },
      "content": {
        "type": "text"
      },
      "title_embedding": {
        "type": "dense_vector",
        "dims": 384,
        "index": true,
        "similarity": "cosine"
      },
      "content_embedding": {
        "type": "dense_vector",
        "dims": 384,
        "index": true,
        "similarity": "cosine"
      }
    }
  }
}

Similarity options:

  • cosine: Cosine similarity (the default; works with unnormalized embeddings)
  • dot_product: Dot product similarity (requires unit-length vectors; slightly faster than cosine for normalized embeddings, as shown in the sketch after this list)
  • l2_norm: Euclidean (L2) distance
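
If you plan to use dot_product, sentence-transformers can normalize embeddings at encode time; a minimal sketch (normalize_embeddings is a standard encode() option):

from sentence_transformers import SentenceTransformer

model = SentenceTransformer('all-MiniLM-L6-v2')

# Unit-length output satisfies the dot_product requirement; for cosine,
# normalization is optional
embedding = model.encode(
    "Elasticsearch is a distributed search engine",
    normalize_embeddings=True
).tolist()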

Step 2: Generate Embeddings

Using Python with sentence-transformers:

from sentence_transformers import SentenceTransformer
from elasticsearch import Elasticsearch

# Initialize model
model = SentenceTransformer('all-MiniLM-L6-v2')

# Initialize Elasticsearch client
es = Elasticsearch(['http://localhost:9200'])

# Generate embedding
text = "Elasticsearch is a distributed search engine"
embedding = model.encode(text).tolist()

# Index document with embedding
doc = {
    "title": "Elasticsearch Overview",
    "content": text,
    "content_embedding": embedding
}

es.index(index="documents", document=doc)

Using OpenAI embeddings:

import openai
from elasticsearch import Elasticsearch

openai.api_key = "your-api-key"

def get_embedding(text, model="text-embedding-3-small"):
    response = openai.embeddings.create(
        input=text,
        model=model
    )
    return response.data[0].embedding

# Generate and index
text = "Elasticsearch with LLMs enables semantic search"
embedding = get_embedding(text)

doc = {
    "title": "LLM Integration",
    "content": text,
    "content_embedding": embedding
}

es.index(index="documents", document=doc)

Step 3: Perform Vector Search

k-NN search query:

POST /documents/_search
{
  "knn": {
    "field": "content_embedding",
    "query_vector": [0.1, 0.2, ..., 0.384],
    "k": 10,
    "num_candidates": 100
  },
  "fields": ["title", "content"]
}

Python example:

def semantic_search(query_text, k=10):
    # Generate query embedding
    query_embedding = model.encode(query_text).tolist()

    # Search
    response = es.search(
        index="documents",
        knn={
            "field": "content_embedding",
            "query_vector": query_embedding,
            "k": k,
            "num_candidates": 100
        },
        fields=["title", "content"]
    )

    return response['hits']['hits']

# Perform search
results = semantic_search("What is machine learning?")

Hybrid Search (Text + Vector)

Combine traditional text search with vector search for best results:

POST /documents/_search
{
  "query": {
    "bool": {
      "should": [
        {
          "match": {
            "content": {
              "query": "machine learning",
              "boost": 1.0
            }
          }
        }
      ]
    }
  },
  "knn": {
    "field": "content_embedding",
    "query_vector": [0.1, 0.2, ..., 0.384],
    "k": 10,
    "num_candidates": 100,
    "boost": 2.0
  },
  "size": 10
}

Python implementation:

def hybrid_search(query_text, k=10):
    query_embedding = model.encode(query_text).tolist()

    response = es.search(
        index="documents",
        query={
            "bool": {
                "should": [
                    {
                        "match": {
                            "content": {
                                "query": query_text,
                                "boost": 1.0
                            }
                        }
                    }
                ]
            }
        },
        knn={
            "field": "content_embedding",
            "query_vector": query_embedding,
            "k": k,
            "num_candidates": 100,
            "boost": 2.0
        },
        size=k
    )

    return response['hits']['hits']

Building a RAG Application

Architecture Overview

  1. User Query → LLM generates query embedding
  2. Vector Search → Elasticsearch finds relevant documents
  3. Context Retrieval → Documents provided to LLM
  4. Response Generation → LLM generates answer with context
  5. Return Result → Answer with citations

Implementation Example

import openai
from elasticsearch import Elasticsearch
from sentence_transformers import SentenceTransformer

class RAGSystem:
    def __init__(self):
        self.es = Elasticsearch(['http://localhost:9200'])
        self.model = SentenceTransformer('all-MiniLM-L6-v2')
        openai.api_key = "your-api-key"

    def retrieve_context(self, query, k=5):
        """Retrieve relevant documents from Elasticsearch"""
        query_embedding = self.model.encode(query).tolist()

        response = self.es.search(
            index="documents",
            knn={
                "field": "content_embedding",
                "query_vector": query_embedding,
                "k": k,
                "num_candidates": 100
            },
            fields=["title", "content"],
            _source=False
        )

        documents = []
        for hit in response['hits']['hits']:
            documents.append({
                'title': hit['fields']['title'][0],
                'content': hit['fields']['content'][0],
                'score': hit['_score']
            })

        return documents

    def generate_answer(self, query, context_docs):
        """Generate answer using LLM with context"""
        # Format context
        context = "\n\n".join([
            f"Document {i+1}: {doc['title']}\n{doc['content']}"
            for i, doc in enumerate(context_docs)
        ])

        # Create prompt
        prompt = f"""Answer the following question based on the provided context.
If the answer is not in the context, say "I don't have enough information to answer this question."

Context:
{context}

Question: {query}

Answer:"""

        # Call LLM
        response = openai.chat.completions.create(
            model="gpt-4",
            messages=[
                {"role": "system", "content": "You are a helpful assistant that answers questions based on provided context."},
                {"role": "user", "content": prompt}
            ],
            temperature=0.7,
            max_tokens=500
        )

        return {
            'answer': response.choices[0].message.content,
            'sources': context_docs
        }

    def query(self, question):
        """End-to-end RAG query"""
        # Retrieve relevant documents
        context_docs = self.retrieve_context(question, k=5)

        # Generate answer
        result = self.generate_answer(question, context_docs)

        return result

# Usage
rag = RAGSystem()
result = rag.query("How does Elasticsearch handle scalability?")

print("Answer:", result['answer'])
print("\nSources:")
for doc in result['sources']:
    print(f"- {doc['title']} (score: {doc['score']:.3f})")

Using Elasticsearch ML Models

Elasticsearch NLP Models

Elasticsearch 8.x includes built-in NLP capabilities:

Supported tasks:

  • Text embeddings
  • Named entity recognition (NER)
  • Sentiment analysis
  • Text classification
  • Zero-shot classification

Deploying a Model in Elasticsearch

Step 1: Upload the model with Eland (the eland_import_hub_model CLI performs the same steps from the command line):

import tempfile

from elasticsearch import Elasticsearch
from eland.ml.pytorch import PyTorchModel
from eland.ml.pytorch.transformers import TransformerModel

es = Elasticsearch(['http://localhost:9200'])

# Download the Hugging Face model and trace it to TorchScript
tm = TransformerModel(
    model_id='sentence-transformers/all-MiniLM-L6-v2',
    task_type='text_embedding'
)

with tempfile.TemporaryDirectory() as tmp_dir:
    model_path, config, vocab_path = tm.save(tmp_dir)

    # The resulting Elasticsearch model ID is sentence-transformers__all-minilm-l6-v2
    ptm = PyTorchModel(es, tm.elasticsearch_model_id())
    ptm.import_model(model_path=model_path, config_path=None,
                     vocab_path=vocab_path, config=config)

Step 2: Deploy model:

POST _ml/trained_models/sentence-transformers__all-minilm-l6-v2/deployment/_start?number_of_allocations=1

Step 3: Use model in ingest pipeline:

PUT _ingest/pipeline/text-embeddings
{
  "processors": [
    {
      "inference": {
        "model_id": "all-MiniLM-L6-v2",
        "target_field": "content_embedding",
        "field_map": {
          "content": "text_field"
        }
      }
    }
  ]
}

The inference processor nests its output under the target field, so the embedding lands in content_embedding.predicted_value; map that sub-field as the dense_vector in your index mapping. Index with automatic embedding generation:

POST /documents/_doc?pipeline=text-embeddings
{
  "title": "Elasticsearch ML",
  "content": "Machine learning features in Elasticsearch"
}

Best Practices

1. Embedding Model Selection

Considerations:

  • Vector dimensions vs. accuracy trade-off
  • Inference speed requirements
  • Model size and memory usage
  • Language support
  • Domain-specific models

Popular models:

  • all-MiniLM-L6-v2: 384 dims, fast, good general purpose
  • all-mpnet-base-v2: 768 dims, better accuracy
  • OpenAI text-embedding-3-small: 1536 dims, high quality
  • Domain-specific: Legal-BERT, BioBERT, etc.

2. Performance Optimization

Indexing:

  • Batch document indexing
  • Use the bulk API (see the bulk-indexing sketch after this list)
  • Optimize vector dimensions
  • Consider vector quantization
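
A minimal bulk-indexing sketch with batched embedding generation (reuses the documents index and MiniLM model from the earlier steps; the sample documents are illustrative):

from elasticsearch import Elasticsearch, helpers
from sentence_transformers import SentenceTransformer

es = Elasticsearch(['http://localhost:9200'])
model = SentenceTransformer('all-MiniLM-L6-v2')

docs = [
    {"title": "Scaling", "content": "Elasticsearch scales horizontally with shards"},
    {"title": "Vectors", "content": "Vector search finds semantically similar text"}
]

# Encode all texts in one batch instead of one model call per document
embeddings = model.encode([d["content"] for d in docs], batch_size=64)

actions = (
    {"_index": "documents", "_source": {**doc, "content_embedding": emb.tolist()}}
    for doc, emb in zip(docs, embeddings)
)

# helpers.bulk sends the documents in batched _bulk requests
helpers.bulk(es, actions)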

Search:

  • Adjust num_candidates parameter
  • Use filters to reduce search space
  • Cache query embeddings (see the sketch after this list)
  • Implement query result caching
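
One way to cache query embeddings is an in-process LRU cache; a minimal sketch (a shared cache such as Redis is the equivalent idea across multiple application instances):

from functools import lru_cache

from elasticsearch import Elasticsearch
from sentence_transformers import SentenceTransformer

es = Elasticsearch(['http://localhost:9200'])
model = SentenceTransformer('all-MiniLM-L6-v2')

@lru_cache(maxsize=10_000)
def cached_query_embedding(query_text):
    # Tuples are hashable, so repeated queries skip the encoder entirely
    return tuple(model.encode(query_text).tolist())

def cached_semantic_search(query_text, k=10):
    response = es.search(
        index="documents",
        knn={
            "field": "content_embedding",
            "query_vector": list(cached_query_embedding(query_text)),
            "k": k,
            "num_candidates": 100
        },
        fields=["title", "content"]
    )
    return response['hits']['hits']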

Example with filtering (placing the filter inside the knn clause restricts the approximate search itself rather than post-filtering the results):

POST /documents/_search
{
  "knn": {
    "field": "content_embedding",
    "query_vector": [...],
    "k": 10,
    "num_candidates": 100,
    "filter": {
      "term": {"category": "technology"}
    }
  }
}
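
The same filtered search from Python, a sketch that mirrors the request above (category values are illustrative):

from elasticsearch import Elasticsearch
from sentence_transformers import SentenceTransformer

es = Elasticsearch(['http://localhost:9200'])
model = SentenceTransformer('all-MiniLM-L6-v2')

def filtered_semantic_search(query_text, category, k=10):
    query_embedding = model.encode(query_text).tolist()

    response = es.search(
        index="documents",
        knn={
            "field": "content_embedding",
            "query_vector": query_embedding,
            "k": k,
            "num_candidates": 100,
            # Applied during the ANN search, not after it
            "filter": {"term": {"category": category}}
        },
        fields=["title", "content"]
    )
    return response['hits']['hits']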

3. Cost Management

LLM API costs:

  • Cache embeddings (don't regenerate)
  • Batch embedding generation
  • Use smaller models when appropriate
  • Implement rate limiting

Elasticsearch costs:

  • Monitor index size
  • Use appropriate hardware
  • Optimize replica settings
  • Implement data lifecycle management

4. Quality Assurance

Evaluation metrics:

  • Precision@k (a computation sketch for precision and recall follows this list)
  • Recall@k
  • Mean Reciprocal Rank (MRR)
  • Normalized Discounted Cumulative Gain (NDCG)
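
For instance, precision@k and recall@k can be computed from retrieved document IDs and a labeled set of relevant IDs; a minimal sketch (the relevance labels are assumed to come from your own judgments):

def precision_at_k(retrieved_ids, relevant_ids, k):
    top_k = retrieved_ids[:k]
    hits = sum(1 for doc_id in top_k if doc_id in relevant_ids)
    return hits / k

def recall_at_k(retrieved_ids, relevant_ids, k):
    if not relevant_ids:
        return 0.0
    top_k = retrieved_ids[:k]
    hits = sum(1 for doc_id in top_k if doc_id in relevant_ids)
    return hits / len(relevant_ids)

# Example: 3 of the top 5 results are relevant, out of 4 relevant documents
retrieved = ["d1", "d7", "d3", "d9", "d4"]
relevant = {"d1", "d3", "d4", "d6"}
print(precision_at_k(retrieved, relevant, 5))  # 0.6
print(recall_at_k(retrieved, relevant, 5))     # 0.75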

A/B testing:

  • Compare different embedding models
  • Test hybrid vs. pure vector search
  • Measure user engagement
  • Monitor query performance

Advanced Techniques

1. Re-ranking

Improve results with two-stage retrieval:

def search_with_reranking(query, k=10):
    # Stage 1: Fast retrieval
    candidates = semantic_search(query, k=100)

    # Stage 2: Re-rank with more sophisticated model
    reranked = cross_encoder_rerank(query, candidates)

    return reranked[:k]
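
One way to implement cross_encoder_rerank is with a sentence-transformers cross-encoder; a minimal sketch (the model name is one common choice, not a requirement, and hits are assumed to carry their text in _source):

from sentence_transformers import CrossEncoder

# A cross-encoder scores each (query, passage) pair jointly: slower than
# comparing precomputed embeddings, but usually more accurate for ranking
reranker = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2')

def cross_encoder_rerank(query, candidates):
    pairs = [(query, hit['_source']['content']) for hit in candidates]
    scores = reranker.predict(pairs)
    ranked = sorted(zip(candidates, scores), key=lambda pair: pair[1], reverse=True)
    return [hit for hit, _ in ranked]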

2. Query Expansion

Enhance queries with LLM-generated variations:

def expand_query(query):
    response = openai.chat.completions.create(
        model="gpt-4",
        messages=[{
            "role": "user",
            "content": f"Generate 3 alternative phrasings of this query: {query}"
        }]
    )

    return [query] + parse_alternatives(response)

def multi_query_search(query):
    expanded_queries = expand_query(query)
    all_results = []

    for q in expanded_queries:
        results = semantic_search(q, k=20)
        all_results.extend(results)

    # Deduplicate and rank
    return deduplicate_and_rank(all_results)
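
parse_alternatives and deduplicate_and_rank are left undefined above; minimal sketches, assuming the LLM returns one phrasing per line and that hits are standard Elasticsearch results with _id and _score:

def parse_alternatives(response):
    # Assume the model lists one alternative phrasing per line
    text = response.choices[0].message.content
    return [line.lstrip("-•1234567890. ").strip() for line in text.splitlines() if line.strip()]

def deduplicate_and_rank(all_results):
    # Keep the best-scoring hit per document ID, then sort by score
    best = {}
    for hit in all_results:
        doc_id = hit['_id']
        if doc_id not in best or hit['_score'] > best[doc_id]['_score']:
            best[doc_id] = hit
    return sorted(best.values(), key=lambda h: h['_score'], reverse=True)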

3. Contextual Compression

Reduce context size while preserving relevance:

def compress_context(query, documents):
    compressed = []

    for doc in documents:
        # Extract most relevant sentences
        sentences = extract_relevant_sentences(doc['content'], query)
        compressed.append({
            'title': doc['title'],
            'content': ' '.join(sentences),
            'score': doc['score']
        })

    return compressed
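
extract_relevant_sentences is not defined above; a simple sketch that scores sentences against the query with the same bi-encoder and keeps the top few (the naive split on "." is illustrative; a proper sentence tokenizer would be more robust):

from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer('all-MiniLM-L6-v2')

def extract_relevant_sentences(content, query, top_n=3):
    sentences = [s.strip() for s in content.split(".") if s.strip()]
    if not sentences:
        return []

    query_emb = model.encode(query, convert_to_tensor=True)
    sentence_embs = model.encode(sentences, convert_to_tensor=True)

    # Cosine similarity between the query and every sentence
    scores = util.cos_sim(query_emb, sentence_embs)[0]
    top_idx = scores.argsort(descending=True)[:top_n]

    # Preserve original sentence order for readability
    return [sentences[i] for i in sorted(int(i) for i in top_idx)]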

Frequently Asked Questions

Q: What's the difference between text search and vector search?
A: Text search matches keywords and uses statistical relevance, while vector search understands semantic meaning and finds conceptually similar content.

Q: Do I need to choose between text and vector search?
A: No, hybrid search combining both approaches often yields the best results.

Q: Which embedding model should I use?
A: Start with all-MiniLM-L6-v2 for speed and efficiency. Use larger models like all-mpnet-base-v2 or OpenAI embeddings for better accuracy.

Q: How do I handle large documents?
A: Split documents into chunks (e.g., paragraphs or sections), embed each chunk separately, and retrieve at chunk level.
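
A minimal paragraph-based chunking sketch (the word limit is illustrative):

def chunk_document(text, max_words=200):
    chunks, current = [], []
    for paragraph in text.split("\n\n"):
        words = paragraph.split()
        if current and len(current) + len(words) > max_words:
            chunks.append(" ".join(current))
            current = []
        current.extend(words)
    if current:
        chunks.append(" ".join(current))
    return chunks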

Q: Can I use multiple embedding models?
A: Yes, store multiple embedding fields and query them separately or combine results.

Q: How expensive is it to run LLMs with Elasticsearch?
A: Embedding generation is a one-time cost per document, and vector search in Elasticsearch is efficient. The main ongoing cost is LLM API calls for answer generation.

Q: Do embeddings need to be regenerated when documents change?
A: Yes, update embeddings when document content changes significantly.

Q: Can I use open-source LLMs instead of OpenAI?
A: Yes, models like Llama 2, Mistral, or Falcon work well and can be self-hosted.

Q: How do I prevent LLM hallucinations?
A: Use RAG to ground responses in retrieved documents, set appropriate temperature, and validate answers against source content.

Q: What's the best way to evaluate search quality?
A: Combine automated metrics (precision, recall) with user feedback and manual relevance assessments.
