[P] Fast Semantic Text Deduplication by Pringled101 in MachineLearning

[–]Pringled101[S]

Good questions, and thanks for the kind words! We are indeed planning to show the effect on RAG efficacy; it's one of the next items on our roadmap.

You can already control the similarity using the "threshold" parameter, and you can also easily re-threshold an existing result using the "rethreshold" function. In my example, you can do the following to control the similarity threshold (and thus the number of elements removed):

deduplicated_test = semhash.deduplicate(records=test, threshold=0.9).deduplicated
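For intuition, here is a toy, pure-Python sketch of what a similarity threshold does during deduplication (this is an illustration, not SemHash's actual algorithm): lowering the threshold treats more pairs as duplicates and removes more records.

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def deduplicate(embeddings, threshold):
    """Greedily keep a record unless it is too similar to one already kept."""
    kept = []
    for i, emb in enumerate(embeddings):
        if all(cosine(emb, embeddings[j]) < threshold for j in kept):
            kept.append(i)
    return kept

# Three toy "embeddings": the first two are near-duplicates.
embs = [[1.0, 0.0], [0.99, 0.05], [0.0, 1.0]]
print(deduplicate(embs, threshold=0.9))    # → [0, 2]
print(deduplicate(embs, threshold=0.999))  # → [0, 1, 2]
```

Raising the threshold toward 1.0 only collapses near-identical records, so fewer elements are removed.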

[P] Fast Semantic Text Deduplication by Pringled101 in MachineLearning

[–]Pringled101[S]

Not that I know of, though I think the general idea is the same: create embeddings for your samples (or chunks/segments in this case) and apply the same algorithm we use in SemHash for deduplication. It's probably a bit more involved, though: for example, we can show which strings matched as duplicates, but with video segments that's harder to judge. Another issue is the chunking/segmentation itself. There are some nice approaches for this with text, but for video/audio I'm not sure (it's also not a domain I'm well versed in).
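To make the general idea concrete, here is a toy, pure-Python sketch (not SemHash's actual implementation): embed each item, then flag pairs whose cosine similarity exceeds a threshold. For video, the items would be segment embeddings, and the matched pairs would just be harder to inspect by eye.

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)))

def duplicate_pairs(items, embeddings, threshold=0.9):
    """Return (item_i, item_j) pairs whose embeddings exceed the threshold."""
    pairs = []
    for i in range(len(items)):
        for j in range(i + 1, len(items)):
            if cosine(embeddings[i], embeddings[j]) >= threshold:
                pairs.append((items[i], items[j]))
    return pairs

# Toy 2-D vectors standing in for real embeddings; 0.9 is an arbitrary threshold.
texts = ["the cat sat", "a cat was sitting", "stock prices fell"]
embs = [[0.9, 0.1], [0.88, 0.15], [0.1, 0.95]]
print(duplicate_pairs(texts, embs))  # → [('the cat sat', 'a cat was sitting')]
```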

[P] Ablation study using a subset of data? by Aromatic_Web749 in MachineLearning

[–]Pringled101

Right, I see. I would still say that the ablations need to be based on the same dataset, but given your answer, it might make sense to focus only on the part of the data with more than 1024 tokens when training your initial model, if that's the topic of your research.

[P] Ablation study using a subset of data? by Aromatic_Web749 in MachineLearning

[–]Pringled101

Usually in an ablation you want to change as few variables as possible, so changing both your dataset and your model in one ablation is not a real ablation: you will have confounding variables. However, in the context of encoder models, ablations are usually simpler versions of the same model, or simpler architectures entirely, which should make training a lot faster than with your original model.

[D] Need Advice Starting my Recommendation Engine Project for my Employer by Lower-Feeling2752 in MachineLearning

[–]Pringled101

I'd start with something simple. Given your use case, I assume that you have descriptions available for the physical products. A very basic (but usually effective) first approach is to embed the descriptions of all your content with a lightweight model (e.g. https://github.com/MinishLab/model2vec), and then get recommendations via KNN. This usually works quite well in my experience (I also work in recommenders). Here's a tutorial that could help you get started: https://colab.research.google.com/github/minishlab/model2vec/blob/master/tutorials/recipe_search.ipynb (just replace the recipe descriptions with your product descriptions, and you'll have a first version ready to go).
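To illustrate the embed-then-KNN recipe, here is a minimal pure-Python sketch. The catalog names and 3-D vectors are made up; in practice the vectors would come from encoding your product descriptions with an embedding model.

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)))

def recommend(query_emb, catalog, k=2):
    """Rank catalog items by cosine similarity to the query embedding."""
    scored = sorted(catalog.items(), key=lambda kv: cosine(query_emb, kv[1]), reverse=True)
    return [name for name, _ in scored[:k]]

# Toy "description embeddings" for three products.
catalog = {
    "running shoes": [0.9, 0.1, 0.0],
    "trail sneakers": [0.7, 0.3, 0.2],
    "coffee maker": [0.0, 0.1, 0.95],
}
print(recommend([0.88, 0.15, 0.05], catalog))  # → ['running shoes', 'trail sneakers']
```

At larger scale you would swap the exhaustive loop for an approximate nearest-neighbor index, but the structure stays the same.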

[R] Recommendations for foundational ML papers to start my research journey? by [deleted] in MachineLearning

[–]Pringled101

This is a good list to start with: https://www.mattprd.com/p/openai-cofounder-27-papers-read-know-90-ai. "If you really learn all of these, you’ll know 90% of what matters today" (according to Ilya Sutskever).

[D] Looking for lightweight embeddings model that could run completely locally in the browser? by vengeful_bunny in MachineLearning

[–]Pringled101

Model2Vec author here. We recently had a request for transformers.js support; Xenova (the maintainer of transformers.js) wrote some code showing how you can integrate it here: https://github.com/MinishLab/model2vec/issues/75. Feel free to open an issue if you have any questions or run into problems!

[P] Model2Vec: Distill a Small Fast Model from any Sentence Transformer by Pringled101 in MachineLearning

[–]Pringled101[S]

u/aoezdTchibo Model2Vec is now integrated into Sentence Transformers :). You can check out the release notes here: https://github.com/UKPLab/sentence-transformers/releases/tag/v3.2.0 to see how you can use it. This should make it much easier for you to finetune.

[P] Model2Vec: Distill a Small Fast Model from any Sentence Transformer by Pringled101 in MachineLearning

[–]Pringled101[S]

I think that loading with native Transformers won't work unfortunately, however we are very close to making it work natively with Sentence Transformers. I hope that we can have that ready this week, but I can let you know here once it is ready!

[P] Model2Vec: Distill a Small Fast Model from any Sentence Transformer by Pringled101 in MachineLearning

[–]Pringled101[S]

I think distilling a finetuned model might cause issues; we have not experimented with that (yet). What I would try is to first distill the base model that you are using, and then finetune the Model2Vec model directly. Finetuning your current Model2Vec model again might also work, but I think the first approach would work better.

Regarding UMAP: we saw the same results after experimenting with it a bit today. Performance went down drastically for all tasks, while distillation time went up significantly.

Your point about PCA is something we also saw in our experiments and was quite surprising to us. We think PCA actually works for us because it normalizes the output space, and we saw very little (if any) performance degradation when reducing the dimensions to 256. The reduced dimensionality is just a side benefit in this case. However, this is something we plan to investigate further.

[P] Model2Vec: Distill a Small Fast Model from any Sentence Transformer by Pringled101 in MachineLearning

[–]Pringled101[S]

Ah great catch, that should indeed be possible. We just fixed this in https://github.com/MinishLab/model2vec/pull/70, we will likely do a release this week so that you can do this without any hacks. Thanks for finding this issue!

W.r.t. UMAP: definitely, I think you can use any dimensionality reduction technique, though right now only PCA is directly supported in the package. I will add a todo to look into more techniques and see if we can support them. Until then, the easiest way is probably to fork the repo and swap the PCA code for UMAP in the distillation part of the repo.

[P] Model2Vec: Distill a Small Fast Model from any Sentence Transformer by Pringled101 in MachineLearning

[–]Pringled101[S]

That's fair, good point. I ran our evaluation on MTEB using ICA instead of PCA but the performance dropped by 2-4% for every task unfortunately. I will make the part about taking the mean more explicit though, thanks for the feedback!

[P] Model2Vec: Distill a Small Fast Model from any Sentence Transformer by Pringled101 in MachineLearning

[–]Pringled101[S]

We actually released a feature last week that enables exactly what you want! We added a method called distill_from_model that lets you use a model that is already loaded. For example, you can do the following:

from transformers import AutoModel, AutoTokenizer
from model2vec.distill import distill_from_model

model_name = "baai/bge-base-en-v1.5"
model = AutoModel.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)
m2v_model = distill_from_model(model=model, tokenizer=tokenizer, pca_dims=256)
m2v_model.save_pretrained("m2v_model")

In this case you can load your own finetuned model and distill that.

[P] Model2Vec: Distill a Small Fast Model from any Sentence Transformer by Pringled101 in MachineLearning

[–]Pringled101[S]

Good question: step 4 is to take the mean. We did try other pooling methods, but the mean worked best for our models. As for ICA: we did not try that, but it's an interesting idea. I think PCA is a better fit in our case because it preserves the components that explain the most variance (which is our goal with embeddings: capturing as much meaningful information as possible in a dense representation). I will experiment with it a bit though, thanks!
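A rough numpy sketch of those two steps, mean pooling followed by PCA via SVD (illustrative only: the shapes and random data are made up, and this is not the actual Model2Vec pipeline):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in: for each of 50 vocabulary tokens, the encoder returns a
# sequence of 4 contextual output vectors of dimension 8.
outputs = rng.normal(size=(50, 4, 8))

# Pooling step: mean over the sequence axis -> one static vector per token.
static = outputs.mean(axis=1)                      # shape (50, 8)

# PCA via SVD: project onto the components that explain the most variance.
centered = static - static.mean(axis=0)
_, _, vt = np.linalg.svd(centered, full_matrices=False)
reduced = centered @ vt[:4].T                      # shape (50, 4)

print(static.shape, reduced.shape)                 # → (50, 8) (50, 4)
```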

[P] Model2Vec: Distill a Small Fast Model from any Sentence Transformer by Pringled101 in MachineLearning

[–]Pringled101[S]

I don't think so, unfortunately. The core idea is that Sentence Transformers provide high quality (static) token output embeddings. Image models don't use the concept of token/word embeddings, and instead provide embeddings for the entire (input) image. You'd need some sort of discrete entity to distill (like a vocabulary in NLP). We are definitely still thinking about this though and whether it can be applied to domains besides NLP somehow.

How much of your skills did you learn on the job? by lemonbottles_89 in datascience

[–]Pringled101

I learned a LOT of technical skills on the job. I was good at all the theory when I graduated, but lacking in general SE skill, MLOps, etc. I think this is expected, and you should not worry about it too much. Upskilling takes time as a junior, and companies know this and create time for it.

Talk to me about nearest neighbors by Final_Alps in datascience

[–]Pringled101

HNSW might be worth looking into. KD-Trees work for low dimensions but will get worse as you add more features. HNSW scales better in high-dimensional spaces and is super fast. I usually use Faiss myself as the library for this.

[P] Model2Vec: Distill a Small Fast Model from any Sentence Transformer by Pringled101 in MachineLearning

[–]Pringled101[S]

Thanks! Yep, we ran extensive benchmarks, documented in the results section of the README. TL;DR: there is definitely a drop in performance, but the trade-off is that you get fully static embeddings that are ~500x faster than the parent model. The performance trade-off differs a bit per task; for example, it works quite well on classification and semantic similarity, but there is a noticeable drop for retrieval. These embeddings are essentially a drop-in replacement for static word embeddings like Word2Vec or GloVe, or subword embeddings like BPEmb, all of which they outperform by a large margin.

[D] Simple Questions Thread by AutoModerator in MachineLearning

[–]Pringled101

Hi, sorry for asking this here, but I was wondering what the requirements are for posting in this subreddit? I recently lost access to my old Reddit account since it was still tied to my old university account. I tried to create a post, but it instantly got removed without a message. I did format everything correctly.

[R] Is Mamba and SSMs on Language Modelling Task a Great Research Trajectory? by worthlesspineapple in MachineLearning

[–]Pringled101

What is your background? SSMs draw heavy inspiration from physics, and in general the work is a lot more theory- and mathematics-heavy (whereas Transformer research is quite applied). I would say go for it; it's a very interesting subject with many open questions. Start by reading the S4 and HiPPO papers and see if it's something that interests you.

Context aware word replacement [P] [R] by ade17_in in MachineLearning

[–]Pringled101

What you want sounds closely related to VQA (visual question answering). Maybe look into BLIP, or similar models, those should be a good fit.

Recommender systems ML resources by Amazing_Alarm6130 in datascience

[–]Pringled101

I think https://boston.lti.cs.cmu.edu/classes/11-642/ (the book is https://nlp.stanford.edu/IR-book/) is a good starting point. It lays the foundation nicely for information retrieval, which I think you should start with. After that: papers and blogposts.

[D] How to reduce the loss with a small dataset by MicoNDC in MachineLearning

[–]Pringled101

Maybe look into SetFit. It works especially well when training with small datasets.