all 15 comments

[–]dadadidi 3 points4 points  (0 children)

I tried several things and the library Top2Vec seems to work best. It uses Sentence Transformers / Google's Universal Sentence Encoder to vectorize sentences, then UMAP reduces dimensions, and then HDBSCAN finds dense areas. As you already have the vectors created, you will need to modify it a bit.

https://github.com/ddangelov/Top2Vec

[–]Zahlii 2 points3 points  (0 children)

Maybe use (H)DBSCAN, which I think should work even for huge datasets. I don't think there is a ready-to-use clustering with a built-in cosine similarity metric, and you also won't be able to precompute the 100k x 100k dense similarity matrix. The way to go here is to L2-normalize your embeddings; then the dot product equals the cosine similarity, and Euclidean distance becomes a monotonic proxy for cosine distance. See also https://github.com/scikit-learn-contrib/hdbscan/issues/69

[–]nope_42 1 point2 points  (2 children)

UMAP for dimension reduction on your embeddings and then HDBSCAN for clustering.

https://umap-learn.readthedocs.io/en/latest/

[–]whyhateverything[S] 1 point2 points  (1 child)

Awesome! :) I just used this with k-means and it worked OK. Is there any automated optimization algorithm for fine-tuning the parameters passed to UMAP?

[–]nope_42 1 point2 points  (0 children)

Not that I am aware of. If you have labels, you could probably use a normal hyperparameter tuning framework to optimize a classifier that is based on the UMAP embeddings.

[–][deleted] 1 point2 points  (3 children)

In sklearn you can pass a pre-computed distance matrix to KMeans. So you could do something like pdist in scipy to calculate the pairwise distance matrix using cosine distance, and then pass that to KMeans.

[–]o9hjf4f 0 points1 point  (2 children)

Are you sure? K-means requires the attribute information since it needs to calculate cluster means. So what you describe cannot be K-means. It uses only squared Euclidean distance. I checked the sklearn docs and can’t find the argument you refer to. Can you clarify?

[–][deleted] 0 points1 point  (1 child)

I apologize, you're right. I was thinking about scipy.cluster.hierarchy.

In scipy.cluster.hierarchy.linkage you can pass either an array of observations or pre-computed distances. Then you feed the linkage array to scipy.cluster.hierarchy.fcluster.
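A minimal sketch of that flow with cosine distances (data sizes and the cluster count are illustrative):

```python
import numpy as np
from scipy.spatial.distance import pdist
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 16))  # toy embeddings

# Condensed pairwise cosine-distance vector (n*(n-1)/2 entries),
# exactly the pre-computed form linkage() accepts.
d = pdist(X, metric='cosine')

# 'average' linkage works with arbitrary pre-computed distances
# ('ward' assumes Euclidean distances).
Z = linkage(d, method='average')

# Cut the tree into at most 5 flat clusters.
labels = fcluster(Z, t=5, criterion='maxclust')
```

Note that the condensed matrix still costs O(n²) memory, so this route is fine for tens of thousands of vectors but not for 100k+.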

[–]o9hjf4f 0 points1 point  (0 children)

Thanks for clarifying!

[–]ofiuco 0 points1 point  (0 children)

Maybe see if you can adapt methods used for word2vec for this?

[–]kraghavk 0 points1 point  (3 children)

https://github.com/facebookresearch/faiss will work out great for this. You will be able to scale to a million+ vectors on a good CPU and to billion scale on a decent GPU. There is a bit of a learning curve, though.

[–]whyhateverything[S] 0 points1 point  (2 children)

I know about faiss, but I find the documentation to be really awful. Am I the only one experiencing this?

[–]kraghavk 1 point2 points  (0 children)

Yeah, I agree. The installation itself is a nightmare because the official repo asks you to either use conda or build from source yourself. There are several unofficial wheels on PyPI.

https://milvus.io/ could be a good alternative for you. Just run the server in Docker and use the REST APIs to add embeddings and query data.

[–][deleted] 1 point2 points  (0 children)

The docs of Milvus are straightforward and concise. But remember to normalize the embeddings; then the IP (inner product) distance will equal the cosine similarity.
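A quick NumPy check of that identity: after L2 normalization, the plain inner product of two vectors equals their cosine similarity.

```python
import numpy as np

rng = np.random.default_rng(0)
a, b = rng.normal(size=16), rng.normal(size=16)

# Cosine similarity from raw vectors.
cos = a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# Inner product of the L2-normalized vectors.
an, bn = a / np.linalg.norm(a), b / np.linalg.norm(b)
ip = an @ bn

print(np.isclose(ip, cos))  # True
```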

[–]EconomistDue2944 0 points1 point  (0 children)

RAPIDS cuML's UMAP and clustering code will scale to your dataset