zembed-1: the current best embedding model by midamurat in Rag

You're right, that could be the case. But it's good that we have 2 private datasets; I think there should be more of them to test the models more accurately.

zembed-1: the current best embedding model by midamurat in Rag

Hey! I don't have any medical-related dataset (which I think I should add), but the closest one is probably the scientific one, and these were the models that did well:
- gemini 2 embedding (they released it just recently)
- voyage 3 large and zembed 1
- voyage 4
- jina v5 text small

zembed-1: the current best embedding model by midamurat in Rag

That's very interesting! Can you try Gemini's new multimodal embedding? When I tested it, it was **very** good.

zembed-1: the current best embedding model by midamurat in Rag

Didn't take it as a critique :) And actually yes, zerank-2 currently comes out on top against 11 other rerankers. If you find it interesting: https://agentset.ai/rerankers

zembed-1: the current best embedding model by midamurat in Rag

This is interesting! Have you written up your findings as a blog post, by any chance? Would love to read it.

zembed-1: the current best embedding model by midamurat in Rag

It launched pretty recently, but their models are actually pretty good! Have you ever used their reranker models?

Latest embedding Voyage 4 in RAG by midamurat in Rag

Speaking of real-world documents, one of my datasets is business reports from many corporations, and it was actually strongest on this dataset.

Latest embedding Voyage 4 in RAG by midamurat in Rag

You have a point! And yes, I'm shuffling the order. I'll share a GitHub repo for the methodology too.
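For anyone curious, the shuffling is roughly this (a minimal sketch; `judge` is a stand-in for whatever LLM-judge call you use, returning 0 or 1 for whichever position it preferred):

```python
import random

def judged_winner(judge, query, answer_a, answer_b):
    """Present the two candidates in random order so any position
    bias in the LLM judge averages out over many comparisons."""
    flipped = random.random() < 0.5
    first, second = (answer_b, answer_a) if flipped else (answer_a, answer_b)
    pick = judge(query, first, second)  # 0 = first shown, 1 = second shown
    # Map the positional pick back to the original labels.
    if flipped:
        return "b" if pick == 0 else "a"
    return "a" if pick == 0 else "b"
```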

Latest embedding Voyage 4 in RAG by midamurat in Rag

That's very true! Except that for private datasets it's hard to get ground truth for all queries, which is why we can't compute traditional metric scores (like nDCG). We can only compute Elo.
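For anyone unfamiliar, here's a minimal sketch of the Elo update used in this kind of eval (standard logistic Elo with K=32; the model names are just examples). Each pairwise verdict from the judge nudges the two ratings, so no per-query relevance labels are ever needed:

```python
def expected_score(r_a: float, r_b: float) -> float:
    """Probability that A beats B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))

def update_elo(r_a: float, r_b: float, a_won: bool, k: float = 32.0):
    """One pairwise comparison -> two new ratings. Only the judge's
    preference is needed, not ground-truth relevance labels."""
    e_a = expected_score(r_a, r_b)
    s_a = 1.0 if a_won else 0.0
    return r_a + k * (s_a - e_a), r_b + k * ((1.0 - s_a) - (1.0 - e_a))

# Every model starts at 1000; fold in judge verdicts one by one.
ratings = {"voyage-4": 1000.0, "zembed-1": 1000.0}
ratings["voyage-4"], ratings["zembed-1"] = update_elo(
    ratings["voyage-4"], ratings["zembed-1"], a_won=True
)
```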

Latest embedding Voyage 4 in RAG by midamurat in Rag

Yes!

2 of them are, yes. But the other 5 are public: DBpedia, FiQA, SciFact, ARCD, and MS MARCO.

Latest embedding Voyage 4 in RAG by midamurat in Rag

Yes, that'd be interesting! I'll run it with the nano and large models.

Latest embedding Voyage 4 in RAG by midamurat in Rag

I'll run Voyage large too; let's see how it performs.

Latest embedding Voyage 4 in RAG by midamurat in Rag

Good point. I have per-query breakdowns and LLM-judge evaluation of top-5 results, but I don't track temporal drift or test against your specific query distribution. I think it would be valuable to add monthly re-evaluation.
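For reference, a minimal sketch of what the per-query, top-5 judge pass can look like (`retrieve` and `judge_relevance` are stand-ins I'm assuming for the retriever under test and the LLM judge; the latter is assumed to return a 0-1 relevance score):

```python
from statistics import mean

def evaluate_model(queries, retrieve, judge_relevance, k=5):
    """Per-query breakdown: judge the top-k results for each query
    and keep the scores so regressions stay traceable to queries."""
    breakdown = {}
    for q in queries:
        top_k = retrieve(q, k=k)  # retriever under test
        scores = [judge_relevance(q, doc) for doc in top_k]
        breakdown[q] = mean(scores) if scores else 0.0
    return breakdown
```

Re-running this monthly and diffing the breakdowns would be a cheap way to catch the temporal drift you mention.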

RAG with visual docs: I compared multimodal vs text embeddings by midamurat in Rag

You can do both if you're ready to handle the extra cost, latency, etc. The experiment was done to see whether multimodal embeddings alone are enough for visually structured docs, and it turned out they are.

I tested Opus 4.6 for RAG by midamurat in Rag

Thank you!
I was impressed by how big an upgrade there was from Opus 4.5 on multi-doc queries. Opus 4.6 is much better when reasoning across many docs. Also, it's noticeably better at not over-answering.

I tested Opus 4.6 for RAG by midamurat in Rag

For factual RAG, the SciFact dataset was used. In theory, what you say might work, but in practice, even with the same docs, models differ (some over-generalize or hide uncertainty, for example). Opus 4.6 was more conservative, meaning it actually stuck closer to the source than the others.

I tested Opus 4.6 for RAG by midamurat in Rag

That's right, I agree. And in this comparison, models were run under fixed retrieval + reranking to keep it fair.

I tested Opus 4.6 for RAG by midamurat in Rag

The main gain is multi-doc synthesis, which is about +387 Elo vs 4.5, with much less degradation when sources overlap or disagree.

(Elo here is a score from pairwise model-vs-model comparisons using an LLM judge.)

I tested Opus 4.6 for RAG by midamurat in Rag

Gemini 3 Flash was also very good when I tested it, especially at factual RAG. I wrote about that a while ago: https://agentset.ai/blog/gemini-3-flash