Perspectives From The Front Row
A front row seat gives you a unique perspective.
And that’s how we feel about the state of Search in 2025.
As a company that provides AI-powered optimization and expert-led support for Elasticsearch and OpenSearch, we live and breathe search.
Hundreds of conversations with cluster maintainers, SREs, DevOps engineers, platform teams, and architects - combined with the unfiltered signals we track from the wider community in Reddit threads, GitHub issues, X posts, and conference keynotes - have given us a uniquely qualified perspective on what the community is talking about.
So as 2025 comes to a close, we’ve collected a list of the main talking points that we’ve observed.
This isn’t a formal industry survey with charts, methodology, or footnotes. It’s a real-life field snapshot from a team that is in the trenches of scalable search technology.
This is what the industry is talking about. And this is what we’ll be keeping an eye on going into 2026.
Licensing Uncertainty Is Forcing Technical Decisions
Licensing has stopped being background noise. It’s now a first-order architectural concern.
Teams are exhausted by shifting terms, unclear boundaries between open and paid features, and the feeling that yesterday’s assumptions may not hold tomorrow. Several engineers told us they felt like they needed legal context just to plan infrastructure.
That pressure is driving real migrations to OpenSearch - including large, production-scale clusters. These moves aren’t trivial, but many teams are choosing migration pain over long-term uncertainty.
That alone says a lot.
Cost Doesn’t Scale Linearly - It Compounds
Search costs don’t grow gradually. They compound.
Storage, replicas, compute, cross-AZ traffic, reindexing, backups - each layer adds cost, and many of the spikes show up unexpectedly. Teams described over-provisioning simply to survive merge cycles or peak indexing periods.
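As a rough illustration (with made-up numbers, not anyone's actual pricing), here's how a modest daily ingest multiplies once replicas, indexing overhead, retention, and cross-AZ traffic are layered on:

```python
# Back-of-envelope cost sketch with illustrative (made-up) numbers.
# None of these rates reflect any specific provider's pricing.

raw_gb_per_day = 200          # raw documents ingested per day
index_overhead = 1.3          # indexing + metadata overhead vs. raw size
replicas = 1                  # one replica copy per primary
retention_days = 30

# On-disk footprint: primaries plus replicas, with indexing overhead
stored_gb = raw_gb_per_day * index_overhead * (1 + replicas) * retention_days

storage_cost = stored_gb * 0.08                    # $/GB-month, illustrative
cross_az_gb = raw_gb_per_day * replicas * 30       # replica traffic crossing AZs
cross_az_cost = cross_az_gb * 0.02                 # $/GB, illustrative
snapshot_cost = stored_gb / (1 + replicas) * 0.02  # snapshots store primaries only

print(f"storage   : {stored_gb:,.0f} GB -> ${storage_cost:,.0f}/mo")
print(f"cross-AZ  : {cross_az_gb:,.0f} GB -> ${cross_az_cost:,.0f}/mo")
print(f"snapshots : ${snapshot_cost:,.0f}/mo")
```

None of the individual line items looks alarming on its own. Multiplied together, they do.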
What surprised many wasn’t just how expensive search became - but how much senior engineering time went into understanding why. Cost management has become an operational discipline, not a finance afterthought.
Performance Fails Quietly Before It Fails Loudly
Few teams see clean failures. What they see is drift.
Latency creeps up. Garbage collection pauses get longer. Queries that were “fine” start timing out under load. By the time users complain, the system is usually already stressed.
Hot shards remain a common root cause. Poor routing keys, uneven tenant behavior, or traffic spikes overload a single node while the rest of the cluster looks healthy.
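To make that concrete, here's a simplified simulation (Python's hash() standing in for the real murmur3 routing hash) of what a low-cardinality routing key does when one tenant dominates traffic:

```python
# Simplified illustration of shard skew from a low-cardinality routing key.
# Real Elasticsearch/OpenSearch routing hashes the routing value (murmur3)
# modulo the shard count; Python's hash() stands in for it here.
from collections import Counter

num_shards = 8

# Hypothetical traffic: one large tenant dominates the workload.
docs = ["tenant-big"] * 9_000 + [f"tenant-{i}" for i in range(1_000)]

shard_counts = Counter(hash(tenant) % num_shards for tenant in docs)

for shard, count in sorted(shard_counts.items()):
    print(f"shard {shard}: {count:>5} docs")
# One shard ends up with ~90% of the documents while
# cluster-level averages still look perfectly healthy.
```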
“Slow” is now treated as downtime - and that’s changed how teams think about risk.
Updates and Reindexing Are Still a Tax
Lucene’s update model continues to surprise people.
Partial updates still mean full rewrites. Soft deletes accumulate. Segment merges eat CPU and I/O. In update-heavy workloads - especially AI-driven ones - this becomes a constant background tax.
Many teams treat reindexing as a dangerous operation that must be scheduled, monitored, and sometimes throttled mid-flight. It’s not optimization - it’s survival.
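One mitigation that comes up again and again is throttling the reindex itself. A minimal sketch, assuming a local cluster and hypothetical index names - the requests_per_second cap can be tightened mid-flight through the rethrottle API:

```python
# Minimal sketch of a throttled, adjustable reindex. Endpoint and index
# names are placeholders. Both Elasticsearch and OpenSearch support
# requests_per_second and the _rethrottle endpoint.
import requests

BASE = "http://localhost:9200"  # placeholder endpoint

# Start the reindex asynchronously, throttled to 500 requests per second.
resp = requests.post(
    f"{BASE}/_reindex",
    params={"wait_for_completion": "false", "requests_per_second": "500"},
    json={"source": {"index": "products-v1"}, "dest": {"index": "products-v2"}},
)
task_id = resp.json()["task"]

# If the cluster starts to struggle, tighten the throttle without restarting.
requests.post(
    f"{BASE}/_reindex/{task_id}/_rethrottle",
    params={"requests_per_second": "100"},
)

# Poll progress via the tasks API.
print(requests.get(f"{BASE}/_tasks/{task_id}").json())
```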
AI and RAG Changed the Cost and Complexity Curve
Vector search demos look great. Production reality is harder.
Adding vector embeddings to existing workloads triggers large-scale reprocessing and significant cluster growth. HNSW graphs consume far more memory than teams initially expect. Several teams told us they discovered mid-project that their JVM heaps - and budgets - were no longer sufficient.
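A rough rule of thumb makes the surprise easy to understand - graph memory scales with vector count, dimensions, and the HNSW connectivity parameter M. Treat the sketch below as a ballpark, not a guarantee:

```python
# Rough rule-of-thumb estimate of the memory an HNSW index needs:
# roughly 1.1 * (4 * dimensions + 8 * M) bytes per vector, where M is the
# graph's maximum connections per layer. Actual usage varies by engine
# and settings -- this is only a ballpark.

def hnsw_memory_gb(num_vectors: int, dimensions: int, m: int = 16) -> float:
    bytes_total = 1.1 * (4 * dimensions + 8 * m) * num_vectors
    return bytes_total / (1024 ** 3)

# 50M documents with 768-dimensional embeddings
print(f"{hnsw_memory_gb(50_000_000, 768):.0f} GB")  # ~164 GB, before replicas
```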
On top of the memory and reprocessing costs:
- chunking strategies are non-trivial
- metadata management becomes fragile
- hybrid search needs constant tuning (see the sketch after this list)
- re-ranking adds latency and orchestration complexity
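To make the tuning point concrete, here's a deliberately simplistic client-side sketch of blending lexical and vector scores. The normalization choice and the blend weight are exactly the knobs that keep needing re-tuning as data and query patterns drift; engines increasingly offer this natively (e.g. OpenSearch's hybrid query with a normalization search pipeline), but the tuning burden is the same:

```python
# Deliberately simplistic client-side hybrid scoring sketch: min-max
# normalize BM25 and vector scores separately, then blend with a weight.
# The weight (alpha) and the normalization technique are the knobs that
# need ongoing tuning as data and query patterns drift.

def min_max(scores: dict[str, float]) -> dict[str, float]:
    lo, hi = min(scores.values()), max(scores.values())
    span = (hi - lo) or 1.0
    return {doc: (s - lo) / span for doc, s in scores.items()}

def hybrid_rank(bm25: dict[str, float], knn: dict[str, float], alpha: float = 0.5):
    b, k = min_max(bm25), min_max(knn)
    docs = set(b) | set(k)
    blended = {d: alpha * k.get(d, 0.0) + (1 - alpha) * b.get(d, 0.0) for d in docs}
    return sorted(blended.items(), key=lambda x: x[1], reverse=True)

# Hypothetical scores for the same query from both retrievers
bm25_scores = {"doc1": 12.4, "doc2": 8.1, "doc3": 2.3}
knn_scores = {"doc2": 0.91, "doc3": 0.88, "doc4": 0.52}
print(hybrid_rank(bm25_scores, knn_scores, alpha=0.6))
```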
There’s no shared playbook yet. Every team is solving these problems independently.
Cluster Operations Still Feel Fragile
Basic lifecycle operations remain stressful:
- removing nodes
- rebalancing shards
- rolling restarts
- schema changes
Even experienced teams rely on internal runbooks, timing windows, and caution to avoid outages. Automation helps - until it doesn’t. Several operators shared stories where “self-healing” scripts made things worse by triggering merge storms or cascading failures.
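One runbook pattern that does translate well into automation-with-context: gate every step on cluster health rather than fixed sleep timers. A minimal sketch against the standard cluster health and settings APIs (endpoint and timeout are placeholders):

```python
# Minimal sketch of a health-gated step from a rolling-restart runbook:
# instead of fixed sleep timers, block until the cluster reports the
# desired status (or give up loudly). Endpoint and timeout are placeholders.
import time
import requests

BASE = "http://localhost:9200"

def wait_for_status(want: str = "green", timeout_s: int = 900) -> None:
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        health = requests.get(f"{BASE}/_cluster/health").json()
        if health["status"] == want and health["relocating_shards"] == 0:
            return
        time.sleep(10)
    raise RuntimeError(f"cluster never reached '{want}' -- stop, don't push on")

# Example: between node restarts, re-enable shard allocation, then wait.
requests.put(
    f"{BASE}/_cluster/settings",
    json={"persistent": {"cluster.routing.allocation.enable": "all"}},
)
wait_for_status("green")
```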
The lesson we heard repeatedly: automation without context is dangerous.
Observability Exists - Insight and Operational Intelligence Do Not
Teams have dashboards. They have metrics. They have alerts.
What they lack is clarity.
During incidents, engineers spend time trying to correlate signals to answer basic questions:
- Is this noise or the start of something serious?
- Which action actually reduces risk?
- Is this a symptom of a larger issue we haven’t identified yet?
At scale, this problem gets worse - not better.
Observability itself becomes expensive, forcing teams to make tradeoffs about what data they keep, for how long, and at what level of detail. Log volume, retention policies, tiered storage, and cold data latency turn visibility into a budget discussion as much as a technical one.
The result is a vicious cycle: teams have more signals than ever, but less usable context, insight, and operational intelligence when they need it most.
Kubernetes Didn’t Magically Fix Stateful Search
Running search on Kubernetes promised flexibility and standardization. In practice, it also introduced new failure modes.
Generic Kubernetes primitives don’t understand shard placement, cluster state, or the blast radius of node-level changes. As a result, routine operations - node upgrades, pod evictions, or storage resizing - can trigger massive data movement and recovery storms if they aren’t carefully coordinated with the search layer.
This gap is why search-aware tooling has emerged around Kubernetes. Operators like the OpenSearch Kubernetes Operator exist to encode domain knowledge that Kubernetes itself lacks: coordinating rollouts with cluster health, sequencing changes safely, and making topology and data placement explicit rather than accidental.
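A concrete example of what coordinating with the search layer looks like: before draining a Kubernetes node that hosts a data pod, ask the cluster to move shards off that node first using the standard allocation-exclusion setting. A minimal sketch, with a placeholder endpoint and a hypothetical node name:

```python
# Minimal sketch of coordinating a Kubernetes node drain with the search
# layer: exclude the data node by name so shards relocate gracefully
# before the pod is evicted. Endpoint and node name are placeholders.
import requests

BASE = "http://localhost:9200"
NODE_NAME = "opensearch-data-3"   # hypothetical pod/node name

# 1. Ask the cluster to move shards off the node (allocation filtering).
requests.put(
    f"{BASE}/_cluster/settings",
    json={"transient": {"cluster.routing.allocation.exclude._name": NODE_NAME}},
)

# 2. Wait until no shards remain on that node before draining the
#    Kubernetes node (e.g. `kubectl drain ...`).
shards = requests.get(f"{BASE}/_cat/shards", params={"format": "json"}).json()
remaining = [s for s in shards if s.get("node") == NODE_NAME]
print(f"{len(remaining)} shards still on {NODE_NAME}")

# 3. Afterwards, clear the exclusion so the replacement node can take shards.
requests.put(
    f"{BASE}/_cluster/settings",
    json={"transient": {"cluster.routing.allocation.exclude._name": None}},
)
```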
Cost-focused architectures introduced their own surprises. Moving data to colder tiers helped control spend, but teams often discovered the latency impact only after queries slowed and dashboards timed out. Explaining those tradeoffs to product stakeholders became an unexpected part of running search.
Kubernetes works for search - but only when paired with tooling and operational awareness that understands how stateful search systems actually behave.
Too Much Knowledge Still Lives in People
The most consistent risk we heard wasn’t technical.
Search clusters still depend heavily on a small number of people who understand historical decisions, tuning choices, and past failures. When those people are unavailable, teams slow down or stall.
Despite automation and tooling, search remains deeply human-dependent infrastructure.
Teams increasingly recognize that mitigating this risk takes capable support partners and tooling that streamlines how cluster maintenance knowledge, communication, and processes are shared.
What This Tells Us Heading Into 2026
Across all of these conversations, a consistent pattern stood out:
Teams are looking for clarity and breathing room.
Predictable licensing. Understandable costs. Safer operations. Monitoring that goes beyond metrics to explain what's happening - and what to do next.
Search has become critical infrastructure for modern applications and AI systems. But operating it is still harder than most teams expect.
Here’s to continued innovation in 2026.