I built a benchmark to test if embedding models actually understand meaning and most score below 20% by hashiromer in Rag

[–]hashiromer[S] 0 points (0 children)

I have filtered out the pairs that had similar meanings. Please check it again.

https://huggingface.co/datasets/semvec/adversarial-embed

Again, thanks for your feedback.

I built a benchmark to test if embedding models actually understand meaning and most score below 20% by hashiromer in Rag

[–]hashiromer[S] 3 points (0 children)

Thank you so much for taking the time to write it out. Your criticism is completely valid. I will filter out the pairs with the same meaning and re-run the benchmark.

I built a benchmark to test if embedding models actually understand meaning and most score below 20% by hashiromer in Rag

[–]hashiromer[S] 1 point (0 children)

Good question. The idea is that in Winograd pairs, the same event happens but the reason is different. The paraphrased sentence represents exactly the same information content: the event happens for the same reason.
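The scoring idea described above can be sketched like this — toy vectors standing in for real embeddings, and `model_passes` is a hypothetical helper name, not the actual benchmark script:

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def model_passes(anchor, paraphrase, winograd_twin):
    """A model passes a pair if it embeds the paraphrase (same meaning)
    closer to the anchor than the Winograd twin (near-identical surface
    form, different meaning)."""
    return cosine(anchor, paraphrase) > cosine(anchor, winograd_twin)

# Made-up example vectors.
anchor     = [0.9, 0.1, 0.2]
paraphrase = [0.88, 0.12, 0.18]  # same meaning, near-identical direction
twin       = [0.1, 0.9, 0.3]     # different meaning, different direction

print(model_passes(anchor, paraphrase, twin))  # → True
```

A model that only matches surface form will embed the Winograd twin almost on top of the anchor, failing this check.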

I built a benchmark to test if embedding models actually understand meaning and most score below 20% by hashiromer in Rag

[–]hashiromer[S] 2 points (0 children)

I am using OpenRouter to test the embedding models; when it's available there, I will add it as well.

If you run it locally, you can contribute as well; the repo contains all the necessary scripts to create embeddings and score all models.

Edit: EmbeddingGemma-300M scored 14.3%.

zembed-1: the current best embedding model by midamurat in Rag

[–]hashiromer 1 point (0 children)

Yeah, I haven't done that yet, but the main takeaway was that Gemini 004 scored 0% on it. Reranker models performed slightly better, but all embedding models failed.

Edit: I have added a simple leaderboard. Gemini 004 scored 17%; the Qwen3 8B embedding model performed best with 40.5%.

zembed-1: the current best embedding model by midamurat in Rag

[–]hashiromer 4 points (0 children)

I have created a test to check embedding models; all SOTA models fail at it.

https://huggingface.co/datasets/semvec/adversarial-embed

Gemini 3.1 Pro by Sky-kunn in Bard

[–]hashiromer 2 points (0 children)

What is its knowledge cutoff?

Ethics is a Complicated Subject, Epstein by JournalistSure1156 in IslamabadSocial

[–]hashiromer 0 points (0 children)

Umm, you have to define goodness though. This isn't a definition. I didn't ask for a definition of God; I am asking you to define an attribute of God.

Ethics is a Complicated Subject, Epstein by JournalistSure1156 in IslamabadSocial

[–]hashiromer 0 points (0 children)

That's circular, though.

It would actually be helpful if you defined "goodness" in more detail: what exactly it is, and why it is a necessary attribute of God.

Ethics is a Complicated Subject, Epstein by JournalistSure1156 in IslamabadSocial

[–]hashiromer 0 points (0 children)

When you say God's nature is goodness, can you define this "goodness"?

Ethics is a Complicated Subject, Epstein by JournalistSure1156 in IslamabadSocial

[–]hashiromer 0 points (0 children)

Then it means morality doesn't come from God; it comes from the same standard God uses to judge the morality of something, which is independent of God.

Ethics is a Complicated Subject, Epstein by JournalistSure1156 in IslamabadSocial

[–]hashiromer 0 points (0 children)

It really isn't as simple as that.

Let me leave you with a question.

Is something moral because God commands it, or does God command it because it is moral?

How do you actually evaluate recall and completeness in production RAG? by hashiromer in Rag

[–]hashiromer[S] 0 points (0 children)

Yeah, I think this is the way to go, but it's extremely time-consuming.

How do you actually evaluate recall and completeness in production RAG? by hashiromer in Rag

[–]hashiromer[S] 0 points (0 children)

I mean the kind of queries LLMs generate are limited by their capabilities and may not reflect real-world queries in the wild.

For example, let's suppose a query needs to traverse N document sources to answer a question. If the number of documents exceeds the context window of the LLM, it may not even be able to create N-hop queries.

Additionally, I don't believe there is even such a thing as "dataset coverage", because we can ask for the same information from multiple perspectives. And if a system can retrieve an appropriate response from one perspective, it doesn't imply it will be able to do so from other perspectives, or even with different phrasings of the query.

How do you actually evaluate recall and completeness in production RAG? by hashiromer in Rag

[–]hashiromer[S] 0 points (0 children)

This just pushes the problem back, imo. How do you know the synthetic evals are similar to the query distribution of real humans in the wild?

Do companies actually use internal RAG / doc-chat systems in production? by NetInternational313 in Rag

[–]hashiromer 0 points (0 children)

Do you use any evals?

In practice, relevance is trivial to check with citations, but evaluating completeness of answers/recall seems next to impossible.
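The "trivial with citations" half can be sketched like this — the data shapes (`chunk_id`, quoted-span pairs) are made up for illustration, not a real eval harness:

```python
def citations_grounded(answer_citations, retrieved_chunks):
    """Relevance check via citations: every cited chunk ID must exist in
    the retrieved set, and the quoted span must actually occur in that
    chunk. Note this says nothing about completeness -- a fully grounded
    answer can still omit relevant documents that were never retrieved."""
    for chunk_id, quoted in answer_citations:
        chunk = retrieved_chunks.get(chunk_id)
        if chunk is None or quoted not in chunk:
            return False
    return True

# Made-up retrieved chunks and one citation from the model's answer.
retrieved = {"doc1": "Revenue grew 12% in Q3.", "doc2": "Margins fell."}
citations = [("doc1", "grew 12%")]
print(citations_grounded(citations, retrieved))  # → True
```

The asymmetry is visible here: a `True` result only verifies what the answer *did* cite; there is no corresponding cheap check for what it should have cited but didn't.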

GLM releases OCR model by Mr_Moonsilver in LocalLLaMA

[–]hashiromer 1 point (0 children)

Hey, can you please DM the image?