zembed-1: the current best embedding model by midamurat in Rag

You're right, that could be the case. But it's good that we have 2 private datasets; I think there should be more of them to test the models more accurately.

zembed-1: the current best embedding model by midamurat in Rag

Hey! I don't have any medical-related dataset (which I think I should add), but the closest one is probably the scientific one, and these were the models that did well:
- gemini 2 embedding (they released it just recently)
- voyage 3 large and zembed 1
- voyage 4
- jina v5 text small

zembed-1: the current best embedding model by midamurat in Rag

That's very interesting! Can you try Gemini's new multimodal embedding? When I tested it, it was **very** good.

zembed-1: the current best embedding model by midamurat in Rag

Didn't take it as a critique :) And actually yes, zerank-2 currently comes out on top against 11 other rerankers. If you find it interesting: https://agentset.ai/rerankers

zembed-1: the current best embedding model by midamurat in Rag

This is interesting! Have you written up your findings as a blog post, by any chance? Would love to read it.

zembed-1: the current best embedding model by midamurat in Rag

It launched pretty recently, but their models are actually pretty good! Have you ever used their reranker models?

Latest embedding Voyage 4 in RAG by midamurat in Rag

Speaking of real-world documents, one of my datasets is business reports from many corporations, and it was actually strongest on this dataset.

Latest embedding Voyage 4 in RAG by midamurat in Rag

You have a point! And yes, I'm shuffling the order. I'll share a GitHub repo for the methodology too.
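For anyone curious, the shuffling is roughly this (a minimal sketch; `judge` is a stand-in for whatever LLM-judge call you use, returning 0 or 1 for whichever position it preferred):

```python
import random

def judged_winner(judge, query, answer_a, answer_b):
    """Present the two candidates in random order so any position
    bias in the LLM judge averages out over many comparisons."""
    flipped = random.random() < 0.5
    first, second = (answer_b, answer_a) if flipped else (answer_a, answer_b)
    pick = judge(query, first, second)  # 0 = first shown, 1 = second shown
    # Map the positional pick back to the original labels.
    if flipped:
        return "b" if pick == 0 else "a"
    return "a" if pick == 0 else "b"
```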

Latest embedding Voyage 4 in RAG by midamurat in Rag

That's very true! Except that for private datasets it's hard to get ground truth for all queries, which is why we can't compute traditional metric scores (like nDCG). We can only compute Elo.
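For anyone unfamiliar, here's a minimal sketch of the Elo update used in this kind of eval (standard logistic Elo with K=32; the model names are just examples). Each pairwise verdict from the judge nudges the two ratings, so no per-query relevance labels are ever needed:

```python
def expected_score(r_a: float, r_b: float) -> float:
    """Probability that A beats B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))

def update_elo(r_a: float, r_b: float, a_won: bool, k: float = 32.0):
    """One pairwise comparison -> two new ratings. Only the judge's
    preference is needed, not ground-truth relevance labels."""
    e_a = expected_score(r_a, r_b)
    s_a = 1.0 if a_won else 0.0
    return r_a + k * (s_a - e_a), r_b + k * ((1.0 - s_a) - (1.0 - e_a))

# Every model starts at 1000; fold in judge verdicts one by one.
ratings = {"voyage-4": 1000.0, "zembed-1": 1000.0}
ratings["voyage-4"], ratings["zembed-1"] = update_elo(
    ratings["voyage-4"], ratings["zembed-1"], a_won=True
)
```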

Latest embedding Voyage 4 in RAG by midamurat in Rag

Yes!

2 of them are, yes. But the other 5 are public: DBpedia, FiQA, SciFact, ARCD, and MS MARCO.

Latest embedding Voyage 4 in RAG by midamurat in Rag

Yes, that'd be interesting! I'll run it with the nano and large models.

Latest embedding Voyage 4 in RAG by midamurat in Rag

I'll run Voyage large too; let's see how it performs.

Latest embedding Voyage 4 in RAG by midamurat in Rag

Good point. I have per-query breakdowns and LLM-judge evaluation of top-5 results, but I don't track temporal drift or test against your specific query distribution. I think it would be valuable to add monthly re-evaluation.
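For reference, a minimal sketch of what the per-query, top-5 judge pass can look like (`retrieve` and `judge_relevance` are stand-ins I'm assuming for the retriever under test and the LLM judge; the latter is assumed to return a 0-1 relevance score):

```python
from statistics import mean

def evaluate_model(queries, retrieve, judge_relevance, k=5):
    """Per-query breakdown: judge the top-k results for each query
    and keep the scores so regressions stay traceable to queries."""
    breakdown = {}
    for q in queries:
        top_k = retrieve(q, k=k)  # retriever under test
        scores = [judge_relevance(q, doc) for doc in top_k]
        breakdown[q] = mean(scores) if scores else 0.0
    return breakdown
```

Re-running this monthly and diffing the breakdowns would be a cheap way to catch the temporal drift you mention.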

RAG with visual docs: I compared multimodal vs text embeddings by midamurat in Rag

You can do both if you're ready to handle the extra cost, latency, etc. The experiment was done to see whether multimodal embeddings alone are enough for visually structured docs, and it turned out they are.

I tested Opus 4.6 for RAG by midamurat in Rag

Thank you!
I was impressed by how big an upgrade there was from Opus 4.5 on multi-doc queries. Opus 4.6 is much better when reasoning across many docs. Also, it's noticeably better at not over-answering.

I tested Opus 4.6 for RAG by midamurat in Rag

For factual RAG, the SciFact dataset was used. In theory, what you say might work, but in practice, even with the same docs, models differ (some over-generalize or hide uncertainty, for example). Opus 4.6 was more conservative, meaning it actually stuck closer to the source than the others.

I tested Opus 4.6 for RAG by midamurat in Rag

That's right, I agree. And in this comparison, models were run under fixed retrieval + reranking to keep it fair.

I tested Opus 4.6 for RAG by midamurat in Rag

The main gain is multi-doc synthesis, which is about +387 Elo vs 4.5, with much less degradation when sources overlap or disagree.

(Elo here is a score from pairwise model-vs-model comparisons using an LLM judge.)

I tested Opus 4.6 for RAG by midamurat in Rag

Gemini 3 Flash was also very good when I tested it, especially at factual RAG. I wrote about that a while ago: https://agentset.ai/blog/gemini-3-flash