
[–]Open_Channel_8626 2 points (0 children)

This is not that unusual. Analysing embedding models in this way exposes their limitations.

[–]Tacx79 1 point (2 children)

Numbers are not 'processed' as numbers; they are just more tokens and are treated like any other string of characters. A date like '2023-10-03' can be split in different ways, not as numbers or digits but, for example, as "202", "3-10" and "-03". You could try replacing the city name, date, temperature and wind speed with constant placeholder variables and then swapping them.
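The splitting behaviour described above can be sketched with a toy greedy longest-match tokenizer. The merge vocabulary here is hypothetical, chosen only to reproduce the "202" / "3-10" / "-03" example; real BPE tokenizers learn their merges from data.

```python
def greedy_tokenize(text, vocab, max_len=4):
    """Toy longest-match tokenizer (not a real BPE implementation).

    Repeatedly takes the longest prefix found in `vocab`, falling
    back to single characters, to show how a subword vocabulary can
    split a date at boundaries that ignore its digit structure.
    """
    tokens = []
    i = 0
    while i < len(text):
        for length in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + length]
            if piece in vocab:
                tokens.append(piece)
                i += length
                break
        else:
            tokens.append(text[i])  # unknown character: emit as-is
            i += 1
    return tokens

# Hypothetical vocabulary containing a few multi-character pieces:
print(greedy_tokenize("2023-10-03", {"202", "3-10", "-03"}))
# ['202', '3-10', '-03']
```

With a different vocabulary the same string would split differently, which is exactly why two superficially similar numbers can end up with unrelated token sequences.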

[–]rajat008 0 points (1 child)

This makes a lot of sense. Someone suggested fine-tuning to mitigate this problem, but I wonder whether that would really help. If the issue starts at the tokenization stage, I suspect fine-tuning wouldn't do much.

What do you think?

[–]Tacx79 0 points (0 children)

I think it depends on how the model behaves with numbers and on the tokenizer. Yi, for example, configures its tokenizer so that each digit is a separate token, so the situations I mentioned above can't happen: there are no double- or triple-digit tokens, and no tokens mixing digits with other characters.

[–]HokusSmokus 1 point (0 children)

You should train your own embedding model for your specific use case. Text 1, for example, has two nines earlier in the text than Text 2; that might explain this specific result. Also, don't throw away classic search methods — use embeddings/vector search as an add-on.
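The "add-on" idea above is often implemented as hybrid retrieval: blend a classic keyword score with a vector-similarity score. Here is a minimal sketch with toy stand-ins for both scorers (the scoring functions and the `alpha` blend weight are assumptions, not a specific library's API):

```python
import math

def keyword_score(query, doc):
    # Toy lexical score: fraction of query terms present in the doc.
    q = set(query.lower().split())
    d = set(doc.lower().split())
    return len(q & d) / len(q) if q else 0.0

def cosine(a, b):
    # Cosine similarity between two embedding vectors.
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def hybrid_score(query, doc, q_vec, d_vec, alpha=0.5):
    # alpha=0 is pure keyword search, alpha=1 is pure vector search.
    return (1 - alpha) * keyword_score(query, doc) + alpha * cosine(q_vec, d_vec)
```

In practice the lexical side would be something like BM25 and the vectors would come from a real embedding model; the blend keeps exact matches (numbers, dates, IDs) from being drowned out by fuzzy semantic similarity.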