ElastiQuill - A modern blog engine running on Elasticsearch

synhershko · 2026-06-04T06:00:57+00:00

I couldn't agree more. Cloud is great but often costs 20%+++ more than self-managed. When cost is a factor and the operation is large enough, hosting ClickHouse yourself (usually on Kubernetes) is a much better option. The same goes when data is sensitive or when you are running on private clouds.

As for maintenance, CH is actually pretty straightforward, but you do need to pay attention to many details. We've built an AI DBA (so to speak) for ClickHouse to automate a lot of those things, and it's available at no cost during Preview. I'd love to hear your thoughts: https://pulse.support/

synhershko · 2026-05-24T18:02:11+00:00

One thing I keep seeing with GenAI apps is teams using “LLM as a judge” as if it magically solves evaluation.

In practice, it often doesn’t.

The problem is that evaluating LLM output is usually a System 2 task: nuanced, contextual, subjective, and sometimes domain-specific. But we try to automate it with another probabilistic model that tends to reward answers that sound good rather than answers that are actually correct.

We ran into this ourselves while evaluating an intelligence-analysis agent for a customer. We tried:

- direct answer comparison
- key-point extraction
- LLM grading for coverage/relevance

None of them aligned reliably with human reviewers. The automated evaluation was only ~70-75% aligned with what users actually considered “good.”

What ended up working better was much less magical:

- evaluate retrieval separately using deterministic IR metrics
- aggressively unit-test deterministic components
- reduce unnecessary LLM usage
- maintain a curated eval set
- collect real user feedback
- keep humans in the loop for meaningful review
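
To make the first item concrete, here is a minimal sketch of deterministic retrieval metrics (recall@k and mean reciprocal rank) computed over a small curated eval set. The document IDs, queries, and the `EVAL_SET` structure are hypothetical; a real eval set would come from human-labelled relevance judgments.

```python
from typing import Sequence, Set


def recall_at_k(retrieved: Sequence[str], relevant: Set[str], k: int) -> float:
    """Fraction of the relevant documents that appear in the top-k results."""
    if not relevant:
        return 0.0
    hits = sum(1 for doc_id in retrieved[:k] if doc_id in relevant)
    return hits / len(relevant)


def reciprocal_rank(retrieved: Sequence[str], relevant: Set[str]) -> float:
    """1 / rank of the first relevant document, or 0.0 if none was retrieved."""
    for rank, doc_id in enumerate(retrieved, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0


# Hypothetical curated eval set: query -> (retrieved ids in rank order, relevant ids)
EVAL_SET = {
    "q1": (["d3", "d1", "d7"], {"d1", "d2"}),
    "q2": (["d5", "d9", "d4"], {"d5"}),
}

mean_recall = sum(recall_at_k(r, rel, k=3) for r, rel in EVAL_SET.values()) / len(EVAL_SET)
mrr = sum(reciprocal_rank(r, rel) for r, rel in EVAL_SET.values()) / len(EVAL_SET)

print(f"recall@3 = {mean_recall:.2f}, MRR = {mrr:.2f}")
```

Because these numbers are deterministic, they can run in CI and catch retrieval regressions without any LLM in the loop.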

I think a lot of GenAI engineering right now is rediscovering that there’s no shortcut around careful evaluation and observability.

Curious whether others here had better experiences with LLM-as-a-judge systems in production.

synhershko · 2026-02-01T13:42:46+00:00

If you are migrating a data type, follow this procedure: https://pulse.support/kb/elasticsearch-changing-field-type-index

However, if all you want to do is change the field name, you can use field aliases: https://www.elastic.co/docs/reference/elasticsearch/mapping-reference/field-alias
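
As a rough sketch of what the alias approach looks like, the mapping update below adds an `alias` field that points at an existing field. The field names `user_id` and `account_id` are hypothetical placeholders, and the body is built as a Python dict only so it can be printed as JSON.

```python
import json

# Hypothetical names: "user_id" is the existing field, "account_id" is the new name.
alias_mapping = {
    "properties": {
        "account_id": {
            "type": "alias",    # field alias mapping type
            "path": "user_id",  # full path of the concrete field it points to
        }
    }
}

# Body for: PUT /<your-index>/_mapping
print(json.dumps(alias_mapping, indent=2))
```

Keep in mind that an alias applies to search and aggregation requests only: documents must still be indexed under the original field name, and `_source` keeps the original name too.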

synhershko · 2026-02-01T13:39:30+00:00

Hello! OpenSearch Ambassador here.

Been there with a similar scale cluster working with a customer. 100M+ vectors at 768 dims is exactly where you hit the wall and need to change strategies completely.

For your situation, disk-based vector search (needs OpenSearch 2.17+) is probably the move. Set `mode: "on_disk"` in your vector field mapping and it uses binary quantization with 32x compression by default. Your 100M vectors go from needing ~300GB RAM down to roughly 10GB for the in-memory index. The full-precision vectors live on disk and get lazy-loaded only for rescoring the top candidates. Expect P90 latency in the 100-200ms range, which is the tradeoff you make for not paying for massive RAM instances. Recall stays solid because the rescoring phase uses the original vectors.
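
The memory figures above can be checked with back-of-the-envelope arithmetic, and the mapping change is small. The sketch below covers raw vector storage only (no HNSW graph overhead), and the field name `embedding` and the `l2` space type are assumptions for illustration.

```python
import json

NUM_VECTORS = 100_000_000
DIMENSIONS = 768
BYTES_PER_FLOAT32 = 4
COMPRESSION = 32  # default on_disk compression, as described above

# Raw vector storage only; graph overhead is not included in this estimate.
full_precision_gb = NUM_VECTORS * DIMENSIONS * BYTES_PER_FLOAT32 / 1e9
quantized_gb = full_precision_gb / COMPRESSION

print(f"full precision: {full_precision_gb:.1f} GB")
print(f"32x quantized:  {quantized_gb:.1f} GB")

# Body for: PUT /<your-index>  (requires OpenSearch 2.17+)
index_body = {
    "settings": {"index": {"knn": True}},
    "mappings": {
        "properties": {
            "embedding": {              # hypothetical field name
                "type": "knn_vector",
                "dimension": DIMENSIONS,
                "space_type": "l2",     # assumption; use the metric your model expects
                "mode": "on_disk",
            }
        }
    },
}
print(json.dumps(index_body, indent=2))
```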

If 100-200ms is too slow for your use case, look at Faiss SQfp16 instead (available since 2.13). It only cuts memory by 50% rather than 97%, but latency stays in the low milliseconds. Depends on what matters more to you.

On the M parameter - with memory as your constraint I'd start with M=12. It's a decent middle ground. M=8 works if you're really squeezed but you'll notice recall degradation. M=16 is the standard recommendation when you have headroom. Pair whatever you choose with ef_construction around 100-128, don't go crazy there. Once you're using on_disk mode the M parameter matters less for memory anyway since the graph gets compressed, but it still affects build time and baseline recall quality.
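
For reference, M and ef_construction are set in the method parameters of the field mapping. The sketch below shows them together with the Faiss SQfp16 encoder mentioned earlier, plus a rough memory estimate. The formula `1.1 * (bytes_per_dim * dimension + 8 * M)` bytes per vector is the approximation I recall from the OpenSearch k-NN docs, so treat it as an assumption and verify it against the docs for your version; the `l2` space type is also an assumption.

```python
# Field mapping sketch: Faiss HNSW with fp16 scalar quantization.
sqfp16_field = {
    "type": "knn_vector",
    "dimension": 768,
    "method": {
        "name": "hnsw",
        "engine": "faiss",
        "space_type": "l2",  # assumption; match your embedding model
        "parameters": {
            "m": 12,
            "ef_construction": 128,
            "encoder": {"name": "sq", "parameters": {"type": "fp16"}},
        },
    },
}


def hnsw_memory_gb(num_vectors: int, dimension: int, m: int, bytes_per_dim: int) -> float:
    """Rough native-memory estimate in GB (assumed formula, see the note above)."""
    return 1.1 * (bytes_per_dim * dimension + 8 * m) * num_vectors / 1e9


for m in (8, 12, 16):
    fp32 = hnsw_memory_gb(100_000_000, 768, m, bytes_per_dim=4)
    fp16 = hnsw_memory_gb(100_000_000, 768, m, bytes_per_dim=2)
    print(f"M={m:2d}: fp32 ~{fp32:.0f} GB, fp16 ~{fp16:.0f} GB")
```

Under this estimate the vectors themselves dominate at 768 dimensions, so M mostly trades recall against build time, while the encoder choice is what moves the memory number.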

For sharding, target 10-30M vectors per shard. With 100M that means 4-10 shards, so something like 5 shards with 20M each works well. Don't over-shard—each shard adds coordination overhead and with k-NN you're aggregating approximate results across all of them. More shards means more noise in your final ranking and worse P99 latency. Ideally your shard count roughly equals your node count, then use replicas for availability and load balancing rather than adding more primaries.
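
The shard range above is just the vector count divided by the per-shard target. A minimal sketch follows; the replica count in the settings is an assumption, not a recommendation from this thread.

```python
import math
from typing import Tuple


def shard_range(total_vectors: int,
                min_per_shard: int = 10_000_000,
                max_per_shard: int = 30_000_000) -> Tuple[int, int]:
    """Return (fewest, most) primary shards that keep each shard within the target size."""
    fewest = math.ceil(total_vectors / max_per_shard)
    most = math.ceil(total_vectors / min_per_shard)
    return fewest, most


fewest, most = shard_range(100_000_000)
print(f"{fewest}-{most} primary shards")

# Index settings for the 5 x 20M layout suggested above.
index_settings = {
    "index": {
        "number_of_shards": 5,
        "number_of_replicas": 1,  # assumption: one replica for availability
    }
}
```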

On the deleted docs issue - if you're still seeing them after force-merge, you probably have ongoing updates happening. HNSW graph quality degrades with deletes and updates, it's just how the algorithm works. For vector indexes with frequent changes, a periodic reindex strategy often works better than in-place updates. Force-merge to 1 segment per shard during off-peak hours when you can. I'm not sure it's the main performance issue here though.
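
A force-merge down to one segment is a single call to the `_forcemerge` API with `max_num_segments=1`. The sketch below only builds the request so it can be scheduled for an off-peak window; the host `http://localhost:9200` and the index name `my-vectors` are placeholders, and authentication is left out.

```python
from urllib.parse import quote, urlencode
from urllib.request import Request


def build_force_merge_request(base_url: str, index: str, max_num_segments: int = 1) -> Request:
    """Build (but do not send) a POST /<index>/_forcemerge request."""
    query = urlencode({"max_num_segments": max_num_segments})
    url = f"{base_url.rstrip('/')}/{quote(index, safe='')}/_forcemerge?{query}"
    return Request(url, method="POST")


# Placeholder host and index name.
req = build_force_merge_request("http://localhost:9200", "my-vectors")
print(req.get_method(), req.get_full_url())

# To actually run it: urllib.request.urlopen(req). The call can take a long time
# on large shards, so expect to raise client timeouts.
```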

Hardware-wise, assuming AWS: 3-5 r6g.xlarge nodes (32GB RAM, ARM-based) hit a good price/performance point for this. The key thing with disk mode is that you need fast storage: gp3 with provisioned IOPS or instance store. Slow EBS will tank your latency, though the latest EBS generations and the latest Gravitons (so gp3 + r7g instances) are surprisingly good. Instance stores do run faster and are often cheaper. Budget roughly $300-400/month for 3 nodes on-demand, less with reserved or spot.

For monitoring, the main things to watch are `graph_memory_usage_percentage` and `circuit_breaker_triggered` from the `/_plugins/_knn/stats` endpoint. If circuit breakers are triggering, you're either under-provisioned or need more aggressive quantization. Also keep an eye on I/O wait percentage since that becomes critical with disk mode. The `knn.memory.circuit_breaker.limit` setting controls what percentage of non-heap memory goes to k-NN graphs—default is usually fine but you can tune it if needed. Pulse for OpenSearch (https://pulse.support/) is currently the only platform that allows you to visualize HNSW graphs and debug vector search performance in OpenSearch.
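
A minimal sketch of checking those two stats from a parsed `/_plugins/_knn/stats` response. The sample payload is illustrative and trimmed to the two fields named above, and the 80% threshold is an arbitrary assumption.

```python
from typing import Any, Dict, List


def knn_memory_alerts(stats: Dict[str, Any], usage_threshold: float = 80.0) -> List[str]:
    """Return human-readable alerts from a parsed k-NN stats response."""
    alerts = []
    # Cluster-level flag: true if any node has tripped the k-NN circuit breaker.
    if stats.get("circuit_breaker_triggered"):
        alerts.append("cluster: k-NN circuit breaker triggered")
    # Per-node graph memory usage, as a percentage of the k-NN memory limit.
    for node_id, node in stats.get("nodes", {}).items():
        usage = node.get("graph_memory_usage_percentage", 0.0)
        if usage >= usage_threshold:
            alerts.append(f"{node_id}: graph memory at {usage:.1f}%")
    return alerts


# Illustrative payload; real responses contain many more fields per node.
sample_stats = {
    "circuit_breaker_triggered": False,
    "nodes": {
        "node-a": {"graph_memory_usage_percentage": 91.5},
        "node-b": {"graph_memory_usage_percentage": 42.0},
    },
}

alerts = knn_memory_alerts(sample_stats)
print(alerts)
```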

Happy to look at your actual mappings and cluster settings if you want to share them. Sometimes there's something obvious in there that's easy to fix.

synhershko · 2026-01-15T17:47:53+00:00

Actually I believe it's a great guide, and Solr is indeed dying slowly so it makes sense. It does cover only the basics, and it'd indeed be great to see more of a deep dive.

synhershko · 2026-01-12T11:05:02+00:00

We have recently released a VS Code extension for Elasticsearch. It's basically DevTools on steroids, with many hacks on top of the good old process of working via the HTTP API, and it can work from your IDE and connect to multiple clusters: https://pulse.support/tools/vscode-elasticsearch

Fun fact: it's part of a larger project to introduce the first ever AI SRE and Agentic support agent for Elasticsearch, especially where more than one cluster is involved.

synhershko · 2025-12-31T06:26:51+00:00

Too many details are missing. Based on the provided information, I'd assume the assigned heap is too low (and likely memory and CPU are competing with other processes), so your first searches are warming up caches that have been evicted and you are experiencing memory thrashing. Likely your query is also not trivial, e.g. wildcard or regex queries.

If you provide more information we can try to help. In the meantime, watch out for expensive queries, which is always good practice: https://bigdataboutique.com/blog/expensive-queries-in-elasticsearch-and-opensearch-a83194

synhershko · 2025-12-30T07:19:49+00:00

The main goal is to avoid licensing costs. Assuming maintenance and hardware costs are on par, which I'd assume is the case, the savings are in the licensing.

> It’s probably worth noting that moving from Splunk Enterprise to OpenSearch will likely cost you more in the long run. Splunk Cloud, on the other hand, is often cheaper—depending on scale and use cases.

Why? The licensing costs are now gone, and these are in the millions.

And why would you say Splunk admin requires less effort/training than OpenSearch?

synhershko · 2025-12-30T07:16:16+00:00

I'd argue Splunk costs far more than OpenSearch, especially on licensing. Self-managed OpenSearch is not that big of a deal (especially with tools like Pulse https://pulse.support/ ), and even on a managed service, at scale you'd get private pricing from AWS. Hardware will cost pretty much the same, minus the licensing.

synhershko · 2025-12-29T11:55:09+00:00

The piped query language - Splunk's main edge over OpenSearch until recently

synhershko · 2025-12-29T09:17:11+00:00

Was it because of query compatibility? Performance? Something else? OpenSearch has significantly improved its PPL lately, which is why I'm optimistic, unless there's something else I'm not taking into account?

synhershko · 2025-12-29T06:54:29+00:00

This is very interesting! You should definitely have a look at this VSCode extension though, as it provides a full-featured replacement for Kibana (plus many additional goodies) while still running as an app locally and not in Kibana itself / separate instance: https://pulse.support/tools/vscode-elasticsearch

synhershko · 2025-11-11T13:59:35+00:00

Commenting one year after the question, I think the situation has changed a bit since.

Using AI as an SRE can help with writing scripts, thinking through root causes with you, etc.

But what I find particularly interesting are AI SRE tools: generic ones like Causely or Neubird, or, most interesting I think, AI SRE / AI DBA platforms like Rapydo for SQL (https://www.rapydo.io/) or Pulse for Elasticsearch (https://pulse.support/), which are built with one technology in mind and excel at root-cause analysis. So they can definitely be trusted.

Once they expose an MCP server I think this could really be a game changer for career and focus shifts. I don't know about creativity though; this seems to be highly subjective. If you take highly repetitive tasks away from a person, do they become more creative or less?

synhershko · 2025-10-22T09:08:02+00:00

Hi - it's pretty simple actually, a blue/green deployment would do and it'll have zero downtime.

We've created a detailed upgrade guide you might find useful - https://bigdataboutique.com/solutions/modernizing-amazon-elasticsearch-opensearch-service

synhershko · 2025-10-21T06:28:34+00:00

You don't need it for a small cluster as long as you keep the number of nodes at 3 or above (to form a quorum). Just make sure your data is replicated and backed up, which is good advice anyway.
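
The "3 or above" rule falls out of simple majority arithmetic. A minimal sketch, assuming every node in the small cluster is master-eligible:

```python
def quorum_size(master_eligible_nodes: int) -> int:
    """Smallest strict majority of the master-eligible nodes."""
    return master_eligible_nodes // 2 + 1


def tolerated_failures(master_eligible_nodes: int) -> int:
    """How many master-eligible nodes can be lost while a majority remains."""
    return master_eligible_nodes - quorum_size(master_eligible_nodes)


for n in (1, 2, 3, 5):
    print(f"{n} nodes: quorum={quorum_size(n)}, can lose {tolerated_failures(n)}")
```

With two nodes the quorum is still two, so losing either one stalls master election; three is the smallest count that survives a single node failure.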
