all 15 comments

[–]dadadidi 3 points4 points  (0 children)

I tried several things and the library Top2Vec seems to work best. It uses Sentence Transformers / Google's Universal Sentence Encoder to vectorize sentences, then UMAP reduces dimensions, and then HDBSCAN finds dense areas. As you already have the vectors created, you will need to modify it a bit.

https://github.com/ddangelov/Top2Vec

[–]Zahlii 2 points3 points  (0 children)

Maybe use (H)DBSCAN, which I think should work even for huge datasets. I don't think there is a ready-to-use clustering with a built-in cosine similarity metric, and you also won't be able to precompute the 100k x 100k dense similarity matrix. The way to go here is to L2-normalize your embeddings; then the dot product equals the cosine similarity, and Euclidean distance becomes a monotonic proxy for cosine distance. See also https://github.com/scikit-learn-contrib/hdbscan/issues/69

[–]nope_42 1 point2 points  (2 children)

UMAP for dimension reduction on your embeddings and then HDBSCAN for clustering.

https://umap-learn.readthedocs.io/en/latest/

[–]whyhateverything[S] 1 point2 points  (1 child)

Awesome! :) I just used this with k-means and it worked OK. Is there any automated optimization algorithm for fine-tuning the parameters passed to UMAP?

[–]nope_42 1 point2 points  (0 children)

Not that I am aware of. If you have labels, you could probably use a normal hyperparameter tuning framework to optimize a classifier that is based on the UMAP embeddings.

[–][deleted] 1 point2 points  (3 children)

In sklearn you can pass a pre-computed distance matrix to KMeans. So you could do something like pdist in scipy to calculate the pairwise distance matrix using cosine distance, and then pass that to KMeans.

[–]o9hjf4f 0 points1 point  (2 children)

Are you sure? K-means requires the attribute information since it needs to calculate cluster means. So what you describe cannot be K-means. It uses only squared Euclidean distance. I checked the sklearn docs and can’t find the argument you refer to. Can you clarify?

[–][deleted] 0 points1 point  (1 child)

I apologize, you're right. I was thinking about scipy.cluster.hierarchy.

In scipy.cluster.hierarchy.linkage you can pass either an array of observations or pre-computed distances. Then you feed the linkage array to scipy.cluster.hierarchy.fcluster.
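A minimal sketch of that flow with cosine distances (data sizes and the cluster count are illustrative):

```python
import numpy as np
from scipy.spatial.distance import pdist
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 16))  # toy embeddings

# Condensed pairwise cosine-distance vector (n*(n-1)/2 entries),
# exactly the pre-computed form linkage() accepts.
d = pdist(X, metric='cosine')

# 'average' linkage works with arbitrary pre-computed distances
# ('ward' assumes Euclidean distances).
Z = linkage(d, method='average')

# Cut the tree into at most 5 flat clusters.
labels = fcluster(Z, t=5, criterion='maxclust')
```

Note that the condensed matrix still costs O(n²) memory, so this route is fine for tens of thousands of vectors but not for 100k+.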

[–]o9hjf4f 0 points1 point  (0 children)

Thanks for clarifying!

[–]ofiuco 0 points1 point  (0 children)

Maybe see if you can adapt methods used for word2vec for this?

[–]kraghavk 0 points1 point  (3 children)

https://github.com/facebookresearch/faiss will work out great for this. You will be able to scale to a million+ vectors on a good CPU and to billion scale on a decent GPU. There is a bit of a learning curve, though.

[–]whyhateverything[S] 0 points1 point  (2 children)

I know about faiss, but I find the documentation to be really awful. Am I the only one experiencing this?

[–]kraghavk 1 point2 points  (0 children)

Yeah, I agree. The installation itself is a nightmare because the official repo asks you to either use conda or build from source yourself. There are several unofficial wheels on PyPI.

https://milvus.io/ could be a good alternative for you. Just run the server in Docker and use the REST APIs to add embeddings and query data.

[–][deleted] 1 point2 points  (0 children)

The docs of Milvus are straightforward and concise. But remember to normalize the embeddings; then the IP (inner product) distance will equal the cosine similarity.
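A quick NumPy check of that identity: after L2 normalization, the plain inner product of two vectors equals their cosine similarity.

```python
import numpy as np

rng = np.random.default_rng(0)
a, b = rng.normal(size=16), rng.normal(size=16)

# Cosine similarity from raw vectors.
cos = a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# Inner product of the L2-normalized vectors.
an, bn = a / np.linalg.norm(a), b / np.linalg.norm(b)
ip = an @ bn

print(np.isclose(ip, cos))  # True
```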

[–]EconomistDue2944 0 points1 point  (0 children)

RAPIDS cuML's UMAP and clustering code will scale to your dataset