About Mark
Mark is an ex-Elastic core developer and Lucene committer, having joined both projects in the early stages of their development. Aside from his open-source work, Mark has worked extensively on commercial search and analytics tools used to help various kinds of investigators chase bad guys.
While he has worked predominantly in back-end functionality, he has a strong interest in front-end design and how to improve search as a conversation with users.
Let’s start from the beginning — how did you get involved in the search tech industry?
I first started experimenting with Lucene back when it was still hosted on SourceForge. At the time, my day job involved writing fairly routine database CRUD applications, but search and open source felt like a much more interesting and open-ended challenge. I’d tinker with Lucene in the evenings, just curious about how it worked and picking up a lot from the project and the community along the way. My first contribution was the highlighter because wanting to know why certain results matched felt like such a common need. The XMLQueryParser was another contribution, mainly to make Lucene more accessible to non-Java clients. Eventually, my employer also took an interest in Lucene, and I spent several years building a proprietary distributed search and analytics platform around it. We had a lot of neat features (geo-search, entity extraction, graphical query building, significant keywords, map/graph/bar visualizations), but had the audacity to ask clients for money. When Elasticsearch came around, we found it hard to compete, so I jumped ship. I was very happy to join Elastic’s core search team and work on open source with such a great team.
Tell us about your current role and what you’re working on these days.
Since leaving Elastic, I’ve mostly been enjoying retirement, but I still find myself tinkering with pet search projects.
There’s been a lot of innovation in the search space over the past few years, and it’s been hard to stay away completely. Vectors offer an exciting complement to traditional ways of representing concepts, such as unstructured text or structured codes. I’ve been exploring how they can play more of a role in the search and discovery process through interactive client-side clustering, and I’m releasing this work as open source.
Could you describe a ‘favorite failure’—a setback that ultimately led to an important lesson or breakthrough in your work?
While working on a bid for the UK’s Police National Database, I ran into a limitation in Lucene that had to be addressed. The source data consisted of structured XML records—people with multiple associated addresses, vehicles, and so on. But Lucene’s indexing model flattened that hierarchy, which led to the problem of “cross-matching”.
For example, a search for someone named “John” with a blue BMW might incorrectly match a record for John with a red BMW and a blue Ford. In Lucene’s eyes, all those tokens lived in a single flat document, so it couldn’t distinguish which attributes belonged together. For law enforcement use, these kinds of false positives aren’t acceptable.
The solution I developed eventually became Lucene’s BlockJoinQuery, now known as nested documents in Elasticsearch. It was a nice workaround that didn’t require changes to the Lucene file formats or existing query types, and it allowed queries to respect the complex document structure while remaining fast.
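A minimal sketch of what that fix looks like from the Elasticsearch side today, built here as a plain Python dict in the shape of the query DSL (the field names `name`, `vehicles.make`, and `vehicles.colour` are invented for illustration). With a `nested` mapping, each vehicle is indexed as its own hidden sub-document, so both attributes must match within the same vehicle:

```python
# Sketch of the nested-document fix in Elasticsearch's query DSL, expressed as
# a plain Python dict. Field names are hypothetical; the key point is that the
# "nested" clause forces make and colour to match inside ONE vehicle.
def person_with_vehicle(name, make, colour):
    """Find a person by name who owns a vehicle matching both make and colour."""
    return {
        "query": {
            "bool": {
                "must": [
                    {"match": {"name": name}},
                    {
                        # Without "nested", a person with a red BMW and a blue
                        # Ford would cross-match a search for a "blue BMW";
                        # with it, both terms must hit the same sub-document.
                        "nested": {
                            "path": "vehicles",
                            "query": {
                                "bool": {
                                    "must": [
                                        {"match": {"vehicles.make": make}},
                                        {"match": {"vehicles.colour": colour}},
                                    ]
                                }
                            },
                        }
                    },
                ]
            }
        }
    }

query = person_with_vehicle("John", "BMW", "blue")
```

The sub-documents are stored as an index-time block alongside their parent, which is why no file-format or query-type changes were needed: the join is resolved by document-ID adjacency at search time.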
What are some of the biggest misconceptions about search that you often encounter?
1. “Vectors are always produced by sending text to a model.” Instead of treating vectors as opaque backend artefacts, we can let users shape and combine them on the client using dynamic search result clustering tools. Vectors aren’t just output—they’re creative material. Think plasticine, not porcelain.
2. “Search UX is keyboard input, then filter down.” The typical flow assumes users type a query and then use mouse clicks or touch gestures on facets to narrow results. But why assume every clicked discovery must be ANDed with the original input? In many cases, adding a facet could usefully expand the results via OR logic. Unfortunately, most user interfaces, and even many database APIs, prohibit this interaction.
3. “Users can’t handle Boolean logic”. Raw Boolean syntax is certainly unfriendly, but the underlying concepts aren’t. I’ve seen a visual, flow-based interface turn analysts and investigators with little search experience into confident power users within minutes. By replacing AND/OR syntax with simple graphical blocks, the interface removes a major usability barrier and also unlocks access to query types that have no representation in text, such as span queries, geo constraints, and vector-based similarity. Text-based query languages limit both who can use them and what can be expressed.
4. “You can have Fast analytics, Accurate results, and Big data—all at once.” In practice, it’s a “pick two” scenario I call the “F.A.B” trilemma. Elasticsearch aggregations were primarily optimised for Fast and Big, sacrificing Accuracy.
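The second misconception above can be made concrete. Here is a sketch of the two ways a clicked facet value could combine with the current query in an Elasticsearch-style bool query (the `genre` field and its values are invented for the example); most UIs hard-wire the "narrow" path and never offer the "broaden" one:

```python
# Two ways a clicked facet value can combine with the current query in a
# bool query. Field and value names are illustrative only.
def apply_facet(current_query, field, value, mode="narrow"):
    if mode == "narrow":
        # Classic filter-down: results must match the query AND the facet.
        return {"bool": {"must": [current_query],
                         "filter": [{"term": {field: value}}]}}
    # Broaden: results may match the query OR the facet.
    return {"bool": {"should": [current_query, {"term": {field: value}}],
                     "minimum_should_match": 1}}

base = {"match": {"title": "jazz"}}
narrowed = apply_facet(base, "genre", "bebop")               # jazz AND bebop
broadened = apply_facet(base, "genre", "bebop", "broaden")   # jazz OR bebop
```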
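The trade-off in the fourth point can be illustrated with a toy simulation. If each shard in a distributed index returns only its local top term, a term that is runner-up on every shard can vanish from the merged result even though it is the true global winner. (The shard contents below are invented; real terms aggregations request more candidates per shard to reduce, but not eliminate, this error.)

```python
from collections import Counter

# Toy illustration of the Fast/Big vs Accurate trade-off in distributed
# top-N term counting. Each shard ships only its local top-1 term.
shards = [
    Counter({"apple": 5, "pear": 4}),
    Counter({"banana": 5, "pear": 4}),
    Counter({"cherry": 5, "pear": 4}),
]

# Accurate (but slow/expensive): merge full counts from every shard.
exact = sum(shards, Counter())  # "pear" wins globally with 12

# Fast and cheap: merge only each shard's local winner.
approx = Counter()
for shard in shards:
    for term, count in shard.most_common(1):
        approx[term] += count

# "pear" never makes any shard's top-1, so the fast view misses it entirely.
```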
How do you envision AI and machine learning impacting search relevance and data insights over the next 2-3 years?
One of my biggest concerns about AI right now is its economic impact on the search ecosystem.
We’ve already seen a move away from the “10 blue links” model, where search engines returned what they believed were the most relevant results. Now AI systems are increasingly layered on top of search, summarising results, providing answers directly and taking over the reading step that users once performed.
This shift has major implications for content providers. Many of them rely on ad revenue driven by page views, and in the pre-AI era, there was a symbiotic relationship: sites like Stack Overflow allowed their content to be indexed because search engines, in return, sent traffic their way. But now, with AI-generated answers being surfaced above or instead of source links, whether in ChatGPT, Google, or elsewhere, that flow of traffic has dropped dramatically (see youtu.be/H5C9EL3C82Y).
It’s a profound disruption, and the business model for content providers is now under real pressure.
Are there any open-source tools or projects—beyond Elasticsearch and OpenSearch—that have significantly influenced your work?
Lucene holds the most significance for me. I have so much admiration for the developers there, and I owe much of my career to Doug Cutting’s decision to give his search engine to the world. It’s amazing to see it is still so relevant after all these years and how it has adapted to encompass geo, temporal, numeric, and vector queries in a single engine.
Is there a log error/alert that particularly terrifies or annoys you?
I always found multi-threading issues to be the worst: you can’t reliably reproduce them, because the exact same sequencing of operations has to occur for them to reveal themselves.
What is a golden tip for optimizing search performance that you’ve picked up in your years of experience?
I used to joke my main job function was “avoider of disk seeks”. Too many random disk seeks (even with SSDs) were always the big performance killer in search, so avoiding them was always a large factor in designing solutions.
What is the most unexpected or unconventional way you’ve seen search technologies applied?
The widespread adoption of a text search tool (Lucene) for logging and analytics was a surprising turn of events for me.
Another unexpected use case was that the word-usage counts held in the index (intended for relevance ranking of search) turned out to be very useful for discovery. Having access to word-usage stats allowed us to identify the statistically significant keywords in search results. This capability helped with everything from recommending related music to flagging uploads of terrorist-training videos.
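A minimal sketch of the idea: compare how often a term appears in a result set (foreground) against the whole index (background), using exactly the kind of word-usage statistics the index already keeps for relevance ranking. The scoring here is a simple log-likelihood-style lift and the counts are invented; real implementations (e.g. Elasticsearch's significant_terms aggregation) use more robust heuristics.

```python
import math

# Toy "significant keywords": surface terms over-represented in the result
# set (foreground) relative to the whole index (background). Counts invented.
def significant_terms(fg_counts, fg_total, bg_counts, bg_total, top_n=3):
    scores = {}
    for term, fg in fg_counts.items():
        bg = bg_counts.get(term, 0) + 1      # +1 smoothing for unseen terms
        fg_rate = fg / fg_total
        bg_rate = bg / bg_total
        if fg_rate > bg_rate:                # keep only over-represented terms
            scores[term] = fg * math.log(fg_rate / bg_rate)
    return sorted(scores, key=scores.get, reverse=True)[:top_n]

# "saxophone" is common in these results but rare index-wide, so it surfaces;
# "the" is equally common everywhere, so it is filtered out.
fg = {"the": 5, "saxophone": 12, "blue": 8}
bg = {"the": 50000, "saxophone": 40, "blue": 5000}
top = significant_terms(fg, 100, bg, 1_000_000)
```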
If you're building something from scratch, what does your ideal search tech stack look like?
Elasticsearch is a natural choice for me - it’s the back end I know most about and I can always get it to do what I want. I tend to use Python to ingest data because it’s a quick way to glue things together.
In the front end, I use the Vue framework, and I’ve developed a set of Vue components that help me get the most out of Elasticsearch. Instead of typing complex Boolean query strings, I can build queries visually by dragging terms from bar charts, significant keyword suggestions, or directly from matching documents into a graphical query builder. Each clause, whether it’s a keyword, tag, vector cluster, or a combination, is represented as a visual block, and I can combine them using AND, OR, and NOT logic in a flow-based layout.
When the query runs, a profiler can highlight which clauses matched and which didn’t, giving me full transparency into how the logic is behaving. It makes it much easier to experiment, refine queries, and explore whatever dataset I’m working with.
Give us a spicy take/controversial opinion on something related to Search
The search industry created a footgun when it gave users analytic capabilities on top of a fuzzy matching engine.
It reminds me of Dr. Evil’s infamous wish for “sharks with frickin’ laser beams”—a highly precise tool paired with something inherently erratic and unpredictable. The analytics layer offers exact counts: how many results match in each group, facet, or bucket. But the underlying matching engine is often fuzzy, using partial term matches, typo tolerance, or vector similarity. Some matches are solid. Others are barely related.
We then take these counts and render them as clean bar charts with hard edges, implying a level of precision and certainty that simply isn’t there. Users can easily misread this as a strong signal of popularity or relevance, when in fact the set might be full of weak, borderline matches.
I used to joke that Kibana’s bar charts shouldn’t have sharp tops and that they should fade out in gradients, reflecting the range of match strengths underneath. I don’t have a clean fix for this. It’s an open design question: how do we visually convey the uncertainty in match strength without losing trust or usability?
Can you suggest a lesser-known book, blog, or resource that would be valuable to others in the community?
One non-technical book that left a big impression on me was Donald Norman’s classic The Design of Everyday Things. We go through life surrounded by examples of bad design (like doors with handles that invite you to pull when they should be pushed) and often blame ourselves when things go wrong.
Norman flips that perspective. He shows how design shapes behaviour, and how thoughtful choices can prevent confusion and errors. That was a light bulb moment for me. His principles apply not just to physical objects but also to user interfaces, APIs, and systems design more broadly. It’s a book that taught me to question assumptions and think more critically about how people interact with the things we build.
Anything else you want to share? Feel free to tell us about a product or project you’re working on or anything else that you think the search community will find valuable
Lately, I’ve been working on an open-source clustering project based on binary vectors. I think this approach has the potential to change how users interact with search engines by making result sets more explorable and conversational. Instead of only working with fixed, rigid inputs like keywords or checkbox filters, users can shape and combine clusters to express intent more fluidly. You can see more at https://qry.codes.
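To give a flavour of why binary vectors suit interactive, client-side use: similarity reduces to XOR plus a popcount (Hamming distance), which is cheap enough to run over a whole result set in the browser. The sketch below uses invented 8-bit codes and a naive greedy grouping; it illustrates the general technique, not the qry.codes implementation.

```python
# Toy binary-vector clustering: Hamming distance via XOR + popcount, then a
# greedy single-link grouping. The 8-bit document codes are invented.
def hamming(a: int, b: int) -> int:
    """Number of differing bits between two binary codes."""
    return bin(a ^ b).count("1")

docs = {"doc1": 0b10110010, "doc2": 0b10110011, "doc3": 0b01001101}

def cluster(docs, max_dist=2):
    """A doc joins a cluster if it is within max_dist of its first member."""
    clusters = []
    for name, code in docs.items():
        for c in clusters:
            if hamming(code, docs[c[0]]) <= max_dist:
                c.append(name)
                break
        else:
            clusters.append([name])
    return clusters
```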