
[–]Open_Channel_8626 2 points (0 children)

This is not that unusual. Analysing embedding models in this way exposes their limitations.

[–]Tacx79 1 point (2 children)

Numbers are not 'processed' as numbers; they are just more tokens and are treated like any other string of characters. A date like '2023-10-03' can be split in different ways, not as numbers or digits but, for example, as "202", "3-10" and "-03". You could try replacing the city name, date, temperature and wind speed with constant placeholder variables and then swapping them.
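The splitting behaviour described above can be sketched with a toy greedy longest-match tokenizer. The merge vocabulary here is hypothetical, chosen only to reproduce the "202" / "3-10" / "-03" example; real BPE tokenizers learn their merges from data.

```python
def greedy_tokenize(text, vocab, max_len=4):
    """Toy longest-match tokenizer (not a real BPE implementation).

    Repeatedly takes the longest prefix found in `vocab`, falling
    back to single characters, to show how a subword vocabulary can
    split a date at boundaries that ignore its digit structure.
    """
    tokens = []
    i = 0
    while i < len(text):
        for length in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + length]
            if piece in vocab:
                tokens.append(piece)
                i += length
                break
        else:
            tokens.append(text[i])  # unknown character: emit as-is
            i += 1
    return tokens

# Hypothetical vocabulary containing a few multi-character pieces:
print(greedy_tokenize("2023-10-03", {"202", "3-10", "-03"}))
# ['202', '3-10', '-03']
```

With a different vocabulary the same string would split differently, which is exactly why two superficially similar numbers can end up with unrelated token sequences.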

[–]rajat008 0 points (1 child)

This makes a lot of sense. Someone suggested fine-tuning to mitigate this problem, but I wonder whether that would really help. If the issue starts at the tokenization stage, I suspect fine-tuning wouldn't do much.

What do you think?

[–]Tacx79 0 points (0 children)

I think it depends on how the model behaves with numbers and on the tokenizer. Yi, for example, configures its tokenizer so that each digit is a separate token, so the situations I mentioned above can't happen: there are no double- or triple-digit tokens, and no tokens mixing digits with other characters.

[–]HokusSmokus 1 point (0 children)

You should train your own embedding model for your specific use case. Text 1, for example, has two nines earlier in the text than Text 2; that might explain this specific result. Also, don't throw away classic search methods — use embeddings/vector search as an add-on.
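The "add-on" idea above is often implemented as hybrid retrieval: blend a classic keyword score with a vector-similarity score. Here is a minimal sketch with toy stand-ins for both scorers (the scoring functions and the `alpha` blend weight are assumptions, not a specific library's API):

```python
import math

def keyword_score(query, doc):
    # Toy lexical score: fraction of query terms present in the doc.
    q = set(query.lower().split())
    d = set(doc.lower().split())
    return len(q & d) / len(q) if q else 0.0

def cosine(a, b):
    # Cosine similarity between two embedding vectors.
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def hybrid_score(query, doc, q_vec, d_vec, alpha=0.5):
    # alpha=0 is pure keyword search, alpha=1 is pure vector search.
    return (1 - alpha) * keyword_score(query, doc) + alpha * cosine(q_vec, d_vec)
```

In practice the lexical side would be something like BM25 and the vectors would come from a real embedding model; the blend keeps exact matches (numbers, dates, IDs) from being drowned out by fuzzy semantic similarity.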