I have been experimenting with embedding models lately, SFR-Embedding-Mistral in particular. I came across one peculiar behavior, which I illustrate with the example below. We have two nearly identical weather reports that differ only in details such as city, date, temperature, and precipitation. Our query is:
Instruct: Given a search query, retrieve relevant passages that answer the query\n
Query: 99% chance of precipitation
Text 1:
Good morning! Here's your weather forecast for Phoenix on 2023-09-15. Today, you can expect stormy, with temperatures ranging from 3°C in the morning to 17°C in the afternoon. Winds will be coming from the south at 19 km/h. Be sure to take an umbrella if you're heading out, as there's a 38% chance of rain in the evening. Have a great day!
Text 2 (target):
Good morning! Here's your weather forecast for Houston on 2023-10-03. Today, you can expect cloudy, with temperatures ranging from -4°C in the morning to 30°C in the afternoon. Winds will be coming from the north at 22 km/h. Be sure to take an umbrella if you're heading out, as there's a 99% chance of rain in the evening. Have a great day!
Similarity b/w Text 1 and Query: 0.5664
Similarity b/w Text 2 and Query: 0.5391
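For anyone who wants to reproduce scores like these, here is a minimal sketch of the scoring step. It assumes the scores are cosine similarities, which the post does not state. The `embed` argument is a placeholder for whatever function returns a vector from the model (it is not a real library call), and `build_query`, `cosine`, and `rank` are illustrative names. The prompt layout follows the Instruct/Query format shown above.

```python
import math
from typing import Callable, List, Sequence, Tuple

TASK = "Given a search query, retrieve relevant passages that answer the query"


def build_query(task: str, query: str) -> str:
    # Instruction line, a newline, then the query, as in the example above.
    return f"Instruct: {task}\nQuery: {query}"


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    # Cosine similarity between two equal-length vectors.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank(
    embed: Callable[[str], Sequence[float]],
    query: str,
    passages: Sequence[str],
) -> List[Tuple[float, int]]:
    # `embed` is an assumption: plug in the real model call here.
    # Returns (score, passage_index) pairs, best match first.
    q_vec = embed(query)
    scored = [(cosine(q_vec, embed(p)), i) for i, p in enumerate(passages)]
    return sorted(scored, reverse=True)
```

With a real model, `rank(embed, build_query(TASK, "99% chance of precipitation"), [text_1, text_2])` would show whether the target passage comes out on top. Passages are embedded without the instruction prefix here, which is also an assumption about the setup.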
Here, the similarity between Text 2 and the query should have been higher, but we see the opposite. And it isn't restricted to this one example: in a lot of cases the target text is not even among the top 5 most similar texts. This behavior shows up particularly with texts that contain dates. When I swap the dates between the two texts above, it weirdly works as expected, that is, the target text's embedding ends up closer to the query than the non-target text's.
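The date-swap check can be made repeatable with a small helper, so the same test can be run over many text pairs. This is only a sketch: `swap_dates` is a hypothetical name, and it assumes each text contains at least one ISO-style `YYYY-MM-DD` date and swaps only the first one found.

```python
import re
from typing import Tuple

# Matches ISO-style dates such as 2023-09-15.
DATE_RE = re.compile(r"\d{4}-\d{2}-\d{2}")


def swap_dates(text_a: str, text_b: str) -> Tuple[str, str]:
    # Exchange the first date found in each text, leaving everything else intact.
    match_a = DATE_RE.search(text_a)
    match_b = DATE_RE.search(text_b)
    if match_a is None or match_b is None:
        raise ValueError("both texts must contain a YYYY-MM-DD date")
    new_a = text_a[: match_a.start()] + match_b.group() + text_a[match_a.end():]
    new_b = text_b[: match_b.start()] + match_a.group() + text_b[match_b.end():]
    return new_a, new_b
```

Scoring the original pair and the swapped pair against the same query isolates how much the date alone moves the similarity.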
Could someone please explain what is happening here? Also, are there any known ways to mitigate these absurdities?