If you know Pulse, you know we're deeply embedded in the OpenSearch ecosystem.
That's not only because our product is built to automate and optimize OpenSearch maintenance, but also because our team is led by an OpenSearch Ambassador and the maintainer of the OpenSearch Kubernetes Operator — and our engineers spend their days helping organizations run OpenSearch and Elasticsearch in production.
Put simply: we spend a lot of time in the community — on Slack, forums, GitHub, meetups — not as bystanders, but as participants. It's how we stay connected to what fellow OpenSearch practitioners actually struggle with.
Recently, it occurred to me that with all of the content and chatter throughout the OpenSearch community, there's no organized way to zoom out and look at what's actually being discussed.
The OpenSearch community is active and generous — people share real configs, real error logs, real frustrations — but the conversations stay fragmented across Slack channels and forum threads.
A question asked in #observability might never reach the person struggling with the same thing in #general. A pattern that's obvious when you look across 10 channels is invisible when you're only in one.
The community's collective knowledge is larger than any individual member's view of it.
So I decided to try something.
I went through the last 90 days of conversation across the OpenSearch Slack workspace and the community forum — thousands of messages from thousands of OpenSearch users across #general, #k8s-operator, #observability, #ml, #vector-search, #dashboards, #dev, and more — and pulled out the recurring themes, the unanswered questions, the pain points that kept showing up across channels, and the signals that hint at where things are heading.
What follows isn't a scientific study. It's a field report from the trenches.
The Big Five OpenSearch Themes
If you compressed 90 days of community conversation into five themes, this is what you'd get.
1. "How do I monitor this thing?"
This was the single most recurring question, in various forms, across multiple channels. And the answers were... mostly silence.
One user posted directly in #general: "How do people monitor OpenSearch 3.x? Performance Analyzer is busted and not planned to be fixed from what I can see" — and linked to the GitHub issue confirming it. Another user asked the same question in #observability, then cross-posted it because the first post didn't get enough traction. Someone else reported that their Prometheus-based monitoring broke entirely after upgrading to 3.x — CPU metrics returning -1, index-level metrics just... gone.
And then there's the user who asked — twice, in two different channels — how to monitor k-NN metrics via Prometheus. No definitive answer either time. The metrics exist via API, but they're not exposed to Prometheus natively.
One community member was so frustrated by this gap that they built and open-sourced their own custom Prometheus exporter for neural search and k-NN plugin metrics. That's either admirable dedication or a sign that something is very broken. Probably both.
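The shape of that workaround is worth seeing, because the missing piece is mostly plumbing. The sketch below is not that community member's exporter; it's a minimal illustration of the approach, assuming a local cluster with basic auth and the prometheus_client library. The one OpenSearch API it touches, GET /_plugins/_knn/stats, is real.

```python
"""Minimal k-NN stats -> Prometheus bridge. A sketch of the approach, not the
community member's exporter. Assumptions: cluster at localhost:9200, basic
auth, `pip install requests prometheus_client`."""
import time

import requests
from prometheus_client import Gauge, start_http_server

HOST = "https://localhost:9200"   # assumption: your cluster endpoint
AUTH = ("admin", "admin")         # assumption: basic auth credentials

# A few of the per-node numeric fields returned by GET /_plugins/_knn/stats.
# (A real exporter would model hit/miss counts as Counters, not Gauges.)
TRACKED = ["graph_memory_usage", "hit_count", "miss_count", "eviction_count"]
GAUGES = {
    name: Gauge(f"opensearch_knn_{name}", f"k-NN plugin stat: {name}", ["node"])
    for name in TRACKED
}

def scrape() -> None:
    stats = requests.get(f"{HOST}/_plugins/_knn/stats", auth=AUTH, verify=False).json()
    for node_id, node_stats in stats.get("nodes", {}).items():
        for name in TRACKED:
            if isinstance(node_stats.get(name), (int, float)):
                GAUGES[name].labels(node=node_id).set(node_stats[name])

if __name__ == "__main__":
    start_http_server(9101)   # Prometheus scrapes http://<host>:9101/metrics
    while True:
        scrape()
        time.sleep(30)
```

Point a Prometheus scrape job at port 9101 and the k-NN cache numbers show up on a dashboard, at the cost of one more sidecar to keep alive.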
The monitoring question isn't going away. If anything, as Performance Analyzer fades from the picture in 3.x, it's going to get louder.
2. Upgrading is scarier than it should be
Upgrade conversations were everywhere — 2.17 to 2.19, 2.18 to 2.19, 2.x to 3.x, and everything in between. The tone ranged from cautiously optimistic to genuinely distressed.
On the optimistic end: one user shared their entire production Helm chart — JVM settings, Prometheus configuration, role separation, cluster-level tuning — and asked the community to validate their 2.18 → 2.19 upgrade approach. That level of openness is one of the best things about this community.
On the distressing end: a user upgraded from 2.16 to 2.17.1 and hit a KNN codec bug that caused Lucene commit failures across multiple indices. Their recovery process? Restore each failed index from a snapshot backup and re-index. One by one. If you've ever had to do this, you know exactly how that feels.
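If you haven't lived through it, the per-index recovery loop looks roughly like the sketch below. The repository, snapshot, and index names are placeholders, but the snapshot restore API and its rename parameters are real.

```python
"""Sketch: restore failed indices one at a time from a snapshot, alongside the
broken originals, so they can be reindexed and swapped in afterwards.
Repository, snapshot, and index names are placeholders."""
import requests

HOST = "https://localhost:9200"   # placeholder endpoint
AUTH = ("admin", "admin")         # placeholder credentials

FAILED_INDICES = ["products-v7", "logs-2024.11.02"]   # whatever Lucene refused to commit

for index in FAILED_INDICES:
    resp = requests.post(
        f"{HOST}/_snapshot/nightly-repo/pre-upgrade-snapshot/_restore",
        json={
            "indices": index,
            "include_global_state": False,
            "rename_pattern": "(.+)",
            "rename_replacement": "restored-$1",   # restore next to the original, then reindex and swap
        },
        auth=AUTH,
        verify=False,
    )
    resp.raise_for_status()
    print(f"restore of {index} accepted")
```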
The forum mirrored this — a thread titled "Update from 2.19 to 3.4.0-1 fails to start" accumulated 8 replies and 74 views, which is high engagement for the OpenSearch forum. The 2.x-to-3.x jump in particular is generating anxiety: breaking changes, deprecated features (hello again, Performance Analyzer), and settings that silently stop working.
3. Vector search is amazing and also breaking my cluster
k-NN and vector search questions have exploded — but the conversations have shifted. They're not "how do I set up vector search" anymore. They're "how do I run 140 million vectors without my circuit breakers catching fire."
One user posted a detailed cry for help: 140+ million vectors on FAISS HNSW, KNNGraphMemoryUsage at 100%, circuit breaker tripping, writes blocked, search queue rejections during load tests. Their questions — about capacity planning, memory limits, on_disk vs. in_memory, shard count impact — are all the right questions. But the community didn't have complete answers.
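On the memory-limit part of that question: the knob most people reach for first is the k-NN circuit breaker limit, which is a dynamic cluster setting. Here's a minimal sketch, with endpoint and credentials as placeholders; raising the limit buys headroom, it does not make 140 million vectors fit.

```python
"""Sketch: check k-NN graph memory pressure, then raise the circuit breaker
limit (a dynamic cluster setting, default 50%). Endpoint and credentials are
placeholders; at this scale the durable fixes are usually more memory,
quantization, or disk-based vectors, not a bigger limit."""
import requests

HOST = "https://localhost:9200"
AUTH = ("admin", "admin")

stats = requests.get(f"{HOST}/_plugins/_knn/stats", auth=AUTH, verify=False).json()
print("circuit_breaker_triggered:", stats.get("circuit_breaker_triggered"))
for node_id, node in stats.get("nodes", {}).items():
    print(node_id, "graph memory:", node.get("graph_memory_usage_percentage"), "%")

# Give the native graph cache a bit more room (applies cluster-wide).
requests.put(
    f"{HOST}/_cluster/settings",
    json={"persistent": {"knn.memory.circuit_breaker.limit": "60%"}},
    auth=AUTH,
    verify=False,
)
```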
Another team shared detailed benchmark results showing that filtered ANN search was outperforming pure ANN search at high index fill rates — the opposite of what you'd expect. They posted the data, the configuration, the test methodology. The question is still open.
And a user created an index without specifying a k-NN engine, ran a filtered query, and got hit with "Engine [NMSLIB] does not support filters." Turns out NMSLIB is the default, and the documentation doesn't make this obvious. Welcome to vector search in production.
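The fix, once you know it, is simply to pick the engine yourself at index creation. A minimal sketch with placeholder names: Lucene (and FAISS on recent versions) supports filtered k-NN queries; NMSLIB does not.

```python
"""Sketch: create a k-NN index with an explicitly chosen engine so filtered
queries work. Index name, field name, dimension, endpoint, and credentials
are placeholders."""
import requests

HOST = "https://localhost:9200"   # placeholder
AUTH = ("admin", "admin")         # placeholder

index_body = {
    "settings": {"index.knn": True},
    "mappings": {
        "properties": {
            "embedding": {
                "type": "knn_vector",
                "dimension": 384,                # match your embedding model
                "method": {
                    "name": "hnsw",
                    "engine": "lucene",          # or "faiss"; both handle filters, nmslib doesn't
                    "space_type": "l2",
                },
            }
        }
    },
}

resp = requests.put(f"{HOST}/filtered-vectors", json=index_body, auth=AUTH, verify=False)
print(resp.json())
```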
The community is past the adoption phase and into the "this works on my laptop but not at scale" phase. Memory management, engine selection, filtered search behavior, and hybrid query optimization are the new hard problems.
4. Kubernetes is the deployment model, and it's complicated
The Kubernetes Operator 3.0 alpha release was one of the bigger community events of the period. It brought quorum-safe rolling restarts, multi-namespace support, TLS hot reloading, and over 100 meaningful changes. The community response was a mix of excitement and "I found a bug on day one" — which is, frankly, exactly what an alpha release is for.
TLS certificate generation issues, StorageClass configuration problems, and questions about controlling dynamic cluster settings via the operator Helm chart were all active threads. One engineer at a well-known open-source foundation asked whether there's any way to manage cluster-level settings declaratively through the operator — the answer, currently, is "not really." Their proposed workaround: a git-sync sidecar that applies settings from a repo. DIY DevOps at its finest.
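Strip away the git-sync machinery and the workaround reduces to "read a settings file from the repo, PUT it to the cluster settings API, repeat." Here's a minimal sketch of that loop, with the file path, endpoint, and credentials as assumptions.

```python
"""Sketch of a settings-sync loop in the spirit of the git-sync workaround:
read desired dynamic settings from a file kept in git and apply them via the
cluster settings API. Paths and credentials are assumptions."""
import json
import time

import requests

HOST = "https://opensearch.example.internal:9200"   # assumption
AUTH = ("admin", "admin")                            # assumption
SETTINGS_FILE = "/config/cluster-settings.json"      # e.g. mounted by a git-sync sidecar

def apply_settings() -> None:
    with open(SETTINGS_FILE) as f:
        desired = json.load(f)                       # {"persistent": {...}, "transient": {...}}
    resp = requests.put(f"{HOST}/_cluster/settings", json=desired, auth=AUTH, verify=False)
    resp.raise_for_status()

if __name__ == "__main__":
    while True:
        apply_settings()
        time.sleep(60)                               # re-apply so manual drift gets corrected
```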
Meanwhile, one of the most impressive posts of the period came from a user running 18 data nodes and 3 dedicated masters on Kubernetes who was seeing 600 MB/s of network traffic from the master node during rolling restarts — because their cluster had 15,000 shards producing 700 MB of cluster state metadata. The cluster was going red during routine maintenance. This is the kind of problem that doesn't show up until you're deep into production at scale, and it's a reminder that Kubernetes makes deployment easier but doesn't make operations simpler.
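If you want a rough sense of whether your own cluster is drifting toward the same wall, the cluster health API already reports the headline numbers. A quick sketch, with endpoint and credentials assumed:

```python
"""Quick shard-count sanity check. Endpoint and credentials are assumptions."""
import requests

HOST = "https://localhost:9200"   # assumption
AUTH = ("admin", "admin")         # assumption

health = requests.get(f"{HOST}/_cluster/health", auth=AUTH, verify=False).json()

shards = health["active_shards"]
data_nodes = health["number_of_data_nodes"]
print(f"status={health['status']} shards={shards} data_nodes={data_nodes}")
print(f"~{shards / max(data_nodes, 1):.0f} shards per data node")
# Rules of thumb vary, but thousands of shards per node usually means bloated
# cluster state and painful rolling restarts, exactly as described above.
```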
5. Alerting is... not great
A steady drumbeat of alerting frustration ran through the period:
An engineer on OpenShift set up a per-query monitor, but after acknowledging an alert, it wouldn't re-trigger even though the underlying condition persisted. They've been looking for a fix since 2021 (they linked to a forum post from that year).
Another user tried to include custom fields in alert notifications, but the rendered messages came through empty. The alert fires, but it tells you nothing useful.
A security analytics user reported that a single failed logon event generated 1,000 alerts. One thousand. Their question — "is there an alert aggregation facility?" — is the kind of thing you ask calmly while screaming internally.
Someone just wanted to send HTML-formatted alert emails instead of plain text. That one remains unsolved.
Native alerting is a consistent pain point, and the community doesn't yet have great answers for it. The gap between "I set up an alert" and "this alert actually helps me respond to incidents" is wider than it should be.
The Questions Nobody Answered
Some of the most interesting signals weren't the questions themselves. They were the questions that got no reply, or got replies that amounted to "I have the same problem."
"How do I monitor k-NN metrics in Prometheus?" — Asked twice, in two channels, by the same person. No answer. The metrics exist via the _plugins/_knn/stats API but aren't exposed natively to Prometheus.
"Why are my deleted documents never clearing through segment merges?" — A user posted detailed analysis showing segments reorganizing but never expunging deletions. After an extended investigation, they wrote: "I'm admittedly at a loss. Can anyone suggest even a place to start?" No resolution.
"Is it expected that filtered ANN outperforms pure ANN at high fill rates?" — Detailed benchmark data, clear methodology, specific configuration. The question remains open.
"How do I manage dynamic cluster settings declaratively through the K8s operator?" — Asked by an engineer at a major open-source organization. No built-in mechanism exists.
"What actually replaces Performance Analyzer in 3.x?" — The meta-question that hangs over everything.
Unanswered questions are where the community needs the most help — and where anyone who can answer builds the most trust.
The Standout Contributions
Not everything was questions and problems. Some community members went above and beyond:
The production config reviewer. Someone shared their complete Kubernetes Helm configuration — JVM settings, Prometheus exporter config, role separation, cluster-level tuning, everything — and asked the community to validate it before a production upgrade. That's vulnerability in service of the community, and it's the kind of post that makes open-source communities work.
The DIY metrics exporter. When the native Prometheus exporter didn't expose k-NN and neural search plugin metrics, someone built a custom exporter, open-sourced it on GitHub, and shared it with the community. Saw a gap, filled it, moved on.
The 503 detective. One engineer documented a multi-week debugging odyssey across four clusters — all experiencing mysterious 503 all_shards_failed errors despite every health indicator showing green. They tried shard reroutes, full node cycling, and index check_on_startup. The ultimate fix? Close the index and reopen it. Their comment — "I'm not sure how I feel about 'turn it off and on again'" — is the most relatable line in the entire 90-day dump.
The Docker Compose builder. Couldn't find a good multi-node Docker Compose setup for OpenSearch, so they built one, published it on GitHub, and asked for feedback. This is how community tooling gets born.
The university student. A CS student announced they're building their final year project around making OpenSearch more user-friendly — "lowering the knowledge required to use it." They asked the community: "What are the annoying parts of OpenSearch that need extra support?" Their proposal defense was that Friday. I hope it went well.
By the Numbers
A snapshot of what the last 90 days looked like:
- Most active channels: #general, #k8s-operator, #observability, #ml, #vector-search
- Most discussed OpenSearch version: 3.4 (the latest release), followed by 2.19
- Most referenced external tool: Prometheus, followed by Grafana, Kafka, and Data Prepper
- Users who asked about monitoring: At least 5 distinct users across different channels — and that's just the ones who posted publicly
- Users who asked the same question in multiple channels (because the first one went unanswered): At least 3
- Users who shared full production configs or detailed error logs publicly: More than a dozen. This community does not hide behind vague descriptions.
- Messages that included complete YAML, JSON, or Helm configs: Surprisingly many. People aren't just describing problems — they're showing their work.
- Number of times someone new introduced themselves with some version of "I'm new to OpenSearch": At least 6, including the university student planning their thesis around it
What (I Think) Is Coming
Based on what I'm seeing in community conversations, here's an educated guess at the trends most likely to keep accelerating:
Vector search operations will become their own discipline. The questions are getting more sophisticated — memory management at scale, engine tuning, benchmark analysis, hybrid query optimization with MMR reranking. This isn't "how do I add vectors to my index" anymore. It's "how do I run vectors reliably in production at 100M+ scale." Expect dedicated tooling, guides, and expertise to emerge around this specific challenge.
The "who monitors the monitor?" problem will get louder. OpenSearch 3.5 just added Prometheus as a first-class data source in Dashboards. The OTel-to-Data-Prepper-to-OpenSearch pipeline is becoming a standard architecture. More teams are using OpenSearch as their observability backend — which means the health of the OpenSearch cluster itself becomes mission-critical in a recursive way. If OpenSearch goes down, you lose visibility into everything. This makes cluster monitoring an existential concern, not a nice-to-have.
Kubernetes will become the default deployment model for self-managed OpenSearch. The operator 3.0 release, the volume of K8s-related questions, and the types of problems being discussed all point in this direction. Docker Compose and bare-metal setups are still common, but the center of gravity is shifting. The flip side: K8s makes deployment easier but makes operational debugging harder. Expect more questions about persistent volumes, resource limits, rolling update strategies, and the interaction between K8s orchestration and OpenSearch's own cluster management.
AI and agentic features will create new operational challenges. The #ml channel was active with model registration issues, Bedrock connector problems, MCP integration questions, and context window management for AI agents running against OpenSearch. As teams deploy AI features on top of their clusters, the operational surface area grows — more models to manage, more resource contention, more things that can break in interesting ways. Meanwhile, the Observability TAG is building an agent evaluation framework with real-time trace visualization. The frontier is moving fast.
What Did I Miss?
Again, this is a snapshot, not a census. There are almost certainly conversations I missed in DMs, threads I didn't follow deeply enough, and signals in channels I didn't cover as thoroughly.
If you're part of the OpenSearch community and you saw a theme I didn't capture — or if you're the person who asked one of the unanswered questions and you've since found an answer — I'd genuinely love to hear about it.
I'm in the Slack channel, and I'm always happy to hear about your burning cluster problems.