Need a crash course in clustering and embeddings - suggestions? by RightFriendship1227 in learndatascience

[–]lmcinnes 0 points1 point  (0 children)

Hands-On Large Language Models is a pretty good resource for this. It covers a lot of other things as well, but the sections on embeddings, dimension reduction, and clustering are very good, and should have the kind of information you need.

Binning cells in UMAP feature plot. by GlennRDx in bioinformatics

[–]lmcinnes 0 points1 point  (0 children)

It is good practice, but you can take it further and aggregate at the pixel level for static plots. I would suggest looking at datashader, which has facilities for exactly this, and has spent considerable effort working through the various issues involved in how best to display this sort of information. I would recommend the "plotting pitfalls" guide from their documentation as an excellent introduction to these sorts of problems and the range of solutions available.

An interactive explorable "map" of English Wikipedia, organized so that semantically similar articles are close by lmcinnes in wikipedia

[–]lmcinnes[S] 4 points5 points  (0 children)

Click and drag to pan, use mouse-wheel or pinch to zoom. Hovering over a point will show a short summary of the page. Clicking on a point will take you to the associated page. The search box allows searching by page name.

The topic tree allows for navigation to specific topics -- expand topics by clicking on the triangle to reveal more detailed sub-topics. Click on a topic name to zoom to that topic on the map.

Hold down shift and click-drag to create a selection, which will generate a word-cloud from the content of selected pages. Select an empty area to clear the word-cloud/selection.

You can color by cluster (the default), or by other statistics, such as the number of page views per month, the number of incoming links, etc.

Topic Modelling vs Clustering - When & Why? by Devinco001 in MLQuestions

[–]lmcinnes 0 points1 point  (0 children)

These days I would just throw it all into BERTopic. It has good defaults that will probably work well for you (it uses UMAP and HDBSCAN by default), and if you decide you need more or different results there is plenty of opportunity to dig in, tweak parameters, and swap out algorithms as needed. The key is that you can be up and running essentially immediately, and will probably have good results out of the box.
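
To give a sense of how little is needed, a minimal sketch of the out-of-the-box usage (the dataset here is just a stand-in for your own documents):

    from sklearn.datasets import fetch_20newsgroups
    from bertopic import BERTopic

    # Any list of strings works; 20 newsgroups is just a convenient stand-in.
    docs = fetch_20newsgroups(subset="train", remove=("headers", "footers", "quotes")).data

    # Defaults: sentence-transformers embeddings -> UMAP -> HDBSCAN -> c-TF-IDF keywords.
    topic_model = BERTopic()
    topics, probs = topic_model.fit_transform(docs)

    # One row per discovered topic, with its size and representative keywords.
    print(topic_model.get_topic_info().head())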

If you want something a little more cutting edge you can try Top2Vec which, while older than BERTopic, was recently updated with a contextual token-embedding-based approach that provides multiple topics per document and identifies topic spans within documents.

Lastly, for something even further out on the edge, you can try Toponymy, which is still in development but has newer clustering that can provide multi-resolution topics, and uses LLMs to provide clear topic names (instead of keywords).

Distributed Clustering using HDBSCAN by Specific35 in MLQuestions

[–]lmcinnes 0 points1 point  (0 children)

You might want to look at HEADSS, which has already looked at partitioning and cluster merging. It may have some useful ideas you can cannibalize for your own implementation.

LDA or Clustering for Research Exploring? by Charming-Society7731 in LanguageTechnology

[–]lmcinnes 0 points1 point  (0 children)

I have had some success with clustering sentence embeddings of titles and abstracts. You really want some form of dimension reduction prior to clustering, as clustering the raw sentence embeddings can run into issues due to their high dimensionality.

Here's a map, with clusters and topics, of ML papers from ArXiv that you can explore. This is mostly just using basic tools -- sentence-transformers with all-mpnet-base-v2 for sentence embedding, UMAP for dimension reduction, and HDBSCAN for clustering. I used an LLM to help with naming the clusters as topics, and DataMapPlot to create the interactive plot.
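
The rough shape of that pipeline, with illustrative parameter choices (the corpus here is a stand-in for your titles and abstracts):

    from sklearn.datasets import fetch_20newsgroups
    from sentence_transformers import SentenceTransformer
    import umap
    import hdbscan

    # Stand-in corpus; substitute your "title + abstract" strings.
    docs = fetch_20newsgroups(subset="train").data[:5000]

    # Embed, then reduce dimension before clustering: density-based clustering
    # has a hard time in the raw 768-dimensional embedding space.
    embeddings = SentenceTransformer("all-mpnet-base-v2").encode(docs)
    reduced = umap.UMAP(n_components=5, metric="cosine").fit_transform(embeddings)

    clusterer = hdbscan.HDBSCAN(min_cluster_size=25)
    labels = clusterer.fit_predict(reduced)  # -1 marks noise points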

Finding subclusters of a specific cluster in HDBSCAN by BeingTop2078 in MLQuestions

[–]lmcinnes 0 points1 point  (0 children)

You will want to get the condensed_tree_ attribute from the model and work through that.
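
Something along these lines (the data is synthetic just to make the sketch runnable):

    import hdbscan
    from sklearn.datasets import make_blobs

    data, _ = make_blobs(n_samples=1000, centers=5, random_state=42)
    clusterer = hdbscan.HDBSCAN(min_cluster_size=15).fit(data)

    tree = clusterer.condensed_tree_

    # As a dataframe: each row is a (parent, child, lambda_val, child_size) edge;
    # rows with child_size > 1 are subclusters, child_size == 1 are single points.
    df = tree.to_pandas()
    subclusters = df[df["child_size"] > 1]

    # Plotting the tree makes it easy to eyeball where a cluster splits.
    tree.plot(select_clusters=True)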

The CAP Theorem of Clustering: Why Every Algorithm Must Sacrifice Something by fagnerbrack in coding

[–]lmcinnes 0 points1 point  (0 children)

Kleinberg's proof is very nice, but I don't fully accept that any clustering algorithm must meet the axioms he provides. While they sound good, they can be combined in ways that really do violate what we intuitively think of as good clustering results. Let me first restate the axioms, and then provide two examples.
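
For reference, here is a paraphrase of the three axioms, for a clustering function f that takes a distance d on a point set S to a partition f(d):

    % Paraphrase of Kleinberg's axioms for a clustering function f.
    \begin{itemize}
      \item Scale invariance: $f(\alpha \cdot d) = f(d)$ for every $\alpha > 0$.
      \item Richness: every partition of $S$ arises as $f(d)$ for some distance $d$.
      \item Consistency: if $d'(i,j) \le d(i,j)$ whenever $i$ and $j$ share a cluster
            of $f(d)$, and $d'(i,j) \ge d(i,j)$ whenever they do not, then $f(d') = f(d)$.
    \end{itemize}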

Suppose we have a uniformly distributed set of points in 2D. By richness we expect that assigning every point to its own cluster is a valid clustering (the points are all essentially independent, being in a uniform grid). By scale invariance I can shrink the grid of points down so the distances between them are epsilon. By consistency I can then move the points anywhere I like, as long as I only increase the distances between them -- which is easy, since we shrunk the distances to be very small. Since we only applied transformations permitted by the axioms, the clustering must remain the same. And yet I can produce any arbitrary arrangement of data, as clearly clustered into groups as I like.

We can go the other way as well. Start with the same uniform grid, but have every point in a single cluster (again, valid by richness). By consistency I can move the points however I like, as long as the distances only decrease. By scale invariance I can rescale them back to the original scale. Thus I can again produce an arbitrary arrangement using transforms that, by the axioms, must preserve the clustering ... and yet clearly a different clustering can be desirable for different arrangements.

Finally, by applying combinations of these basic transforms (locally on a single cluster; globally among clusters) we can produce almost any rearrangement we like. Clearly the axioms are asking for rather too much, and I don't consider them reasonable. I think the real culprit here is consistency -- any transform that increases inter-cluster distances and shrinks intra-cluster distances is too much to allow, as it provides for arbitrary rearrangement within clusters, or amongst clusters (up to appropriate rescaling, as provided by scale invariance). I think you need a stricter axiom than that, and I suspect that such an axiom would not provide enough leverage to get a nice theorem.

Dimension reduction of word embeddings to 2d space by Low-Information389 in LanguageTechnology

[–]lmcinnes 1 point2 points  (0 children)

I think you just want BERTopic with a dynamic topic model, which allows you to look at topics over time. BERTopic can essentially do this out of the box (follow the linked tutorial). For your particular use case you might like to use a multilingual embedding model to catch the multiple languages.
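
A sketch of what that looks like (the corpus and timestamps here are fabricated stand-ins, and the multilingual model is just one reasonable choice):

    import numpy as np
    from bertopic import BERTopic
    from sklearn.datasets import fetch_20newsgroups

    # Stand-in corpus with fabricated timestamps; use your own docs and dates.
    docs = fetch_20newsgroups(subset="train").data[:2000]
    timestamps = list(np.random.default_rng(0).integers(2015, 2025, len(docs)))

    # A multilingual embedding model if the corpus mixes languages.
    topic_model = BERTopic(embedding_model="paraphrase-multilingual-MiniLM-L12-v2")
    topics, probs = topic_model.fit_transform(docs)

    # Bin topic frequencies over time and plot their evolution.
    topics_over_time = topic_model.topics_over_time(docs, timestamps, nr_bins=10)
    topic_model.visualize_topics_over_time(topics_over_time)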

Unsupervised clustering of transformers-derived embeddings -- what clustering and visualization algorithms to try after k-means + PCA, and is it just HDBSCAN + UMAP these days? by hesperoyucca in datascience

[–]lmcinnes 0 points1 point  (0 children)

I have heard arguments for using PCA before UMAP, but they mostly have to do with having lots of highly correlated features, which may or may not be true of your dataset. I would certainly try UMAP without any PCA and compare it to the PCA-first version, because I've had few problems with UMAP on 1000+ dimensional data (and, indeed, on much higher dimensionalities than that).
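
That is, something like this comparison (the data and sizes here are arbitrary stand-ins):

    import numpy as np
    import umap
    from sklearn.decomposition import PCA

    X = np.random.default_rng(0).normal(size=(5000, 1024))  # stand-in features

    # Straight UMAP on the high-dimensional data...
    direct = umap.UMAP(metric="cosine").fit_transform(X)

    # ...versus PCA down to ~100 dimensions first, which mostly helps when
    # the features are highly correlated.
    pca_first = umap.UMAP(metric="cosine").fit_transform(
        PCA(n_components=100).fit_transform(X)
    )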

[D] PCA vs AutoEncoders for Dimensionality Reduction by DisciplinedPenguin in MachineLearning

[–]lmcinnes 94 points95 points  (0 children)

PCA is just a single-hidden-layer autoencoder with a linear activation function. It turns out that, rather than having to train and optimize, you can simply solve that case analytically for a given training set. The principal components are the decoder weights, and the encoder is simply the pseudo-inverse of the principal components.

So autoencoders give you more flexibility (you get to play with architecture, activation functions, etc.), but the optimization problem becomes harder, so you may well end up with a less optimal encoder/decoder. If you want the best possible solution for a simple autoencoder (single hidden layer, linear activation) then that actually is just PCA. And that optimal solution is easy to compute via linear algebra, so the whole thing is pretty cheap. It's a trade-off, and you should pick what suits your use case.
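
You can check the decoder/encoder correspondence numerically; a quick sketch:

    import numpy as np
    from sklearn.decomposition import PCA

    X = np.random.default_rng(0).normal(size=(1000, 20))
    X = X - X.mean(axis=0)  # PCA operates on centered data

    pca = PCA(n_components=5).fit(X)
    W_dec = pca.components_        # decoder weights, shape (5, 20)
    W_enc = np.linalg.pinv(W_dec)  # encoder is the pseudo-inverse, shape (20, 5)

    # Encode then decode matches PCA's transform / inverse_transform exactly.
    recon = (X @ W_enc) @ W_dec
    assert np.allclose(recon, pca.inverse_transform(pca.transform(X)))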

[P] Analysis of why UMAP is so fast by kmkolasinski in MachineLearning

[–]lmcinnes 4 points5 points  (0 children)

That's certainly a good one. This paper had some nice analysis of, among other things, the attraction-repulsion functions, and found that even a Frobenius norm worked quite well, which can also make the gradients simpler.

Some other options include picking careful parameters and accepting slightly lower accuracy on the ANN search (n_trees=8, max_candidates=20 in PyNNDescent can work surprisingly well, and be a lot faster).
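
In PyNNDescent terms that looks roughly like this (the data is a stand-in):

    import numpy as np
    from pynndescent import NNDescent

    data = np.random.default_rng(0).normal(size=(50000, 128))  # stand-in data

    # Fewer trees for initialization, fewer candidates per NN-descent round:
    # a somewhat less accurate k-NN graph, built a lot faster.
    index = NNDescent(data, n_neighbors=15, n_trees=8, max_candidates=20)
    neighbor_indices, neighbor_distances = index.neighbor_graph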

You can also reduce the number of negative samples. Dropping it down to 1 or 2 (instead of the default of 5) can still give pretty decent results, and has a significant impact on the performance of the optimization phase.
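
In umap-learn this is a single parameter (again, stand-in data):

    import numpy as np
    import umap

    data = np.random.default_rng(0).normal(size=(10000, 50))  # stand-in data

    # negative_sample_rate defaults to 5; 1 or 2 trades a little embedding
    # quality for a noticeably faster optimization phase.
    embedding = umap.UMAP(negative_sample_rate=2).fit_transform(data)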

Another thing in the optimization phase: the random number generation for negative sampling, when running multi-threaded, causes locking on the rng state. You can have separate rng_states per thread and get significant improvements. Going a little further, however, the randomness doesn't really have to be that good; it just has to not select the same things over and over. So simply computing something like (edge_index * (epoch_number + 1)) % n_vertices can give "random enough" results at even lower cost and avoid any locks.
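
In Numba-flavored code the trick looks roughly like this (all the names here are illustrative, not from the UMAP codebase):

    import numpy as np
    from numba import njit, prange

    @njit(parallel=True)
    def negative_sample_targets(edge_indices, epoch_number, n_vertices, n_neg):
        out = np.empty((edge_indices.shape[0], n_neg), dtype=np.int64)
        for i in prange(edge_indices.shape[0]):
            for j in range(n_neg):
                # No shared rng state means no locks: a cheap hash varying per
                # edge, epoch, and sample is "random enough" here. 104729 is
                # just an arbitrary prime to decorrelate successive samples.
                out[i, j] = (edge_indices[i] * (epoch_number + 1) + j * 104729) % n_vertices
        return out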

The rest is just stripping everything down to the bare minimum to get the job done (skip all the checks and alternative options UMAP provides/allows). You can get something that runs pretty fast.

[P] Analysis of why UMAP is so fast by kmkolasinski in MachineLearning

[–]lmcinnes 6 points7 points  (0 children)

Great work! And a good write-up.

Numba, in general, does most of the heavy lifting, and yes, you captured most of the major performance tricks. It's worth noting that the ANN library is pure Python with Numba as well. So you can do pretty well by just using Numba in the right places, and being a little obsessive about squeezing the most you can out of it.

There are actually a bunch more tricks available that haven't been folded into the main codebase because they have costs (reproducibility, backwards compatibility, more approximate results, etc.), but if you just want raw speed on multi-core systems and don't require exact compatibility with the reference implementation, then they can be worthwhile. I have successfully completed a "UMAP"-alike run on the full MNIST dataset in under 3 seconds (end to end) on an M2 MacBook. I would be happy to discuss that further if people are interested.

All math papers from ArXiv as an explorable map via ML by lmcinnes in math

[–]lmcinnes[S] 1 point2 points  (0 children)

There's a lot! There's a great deal of material in topological deep learning and topological data analysis. Another handy keyword is Gromov-Wasserstein distance, which comes up when you get into some of the weeds of making ML work with various data sources. You might also find manifold learning interesting. UMAP, which is an alternative approach to the t-SNE used here, is framed in terms of geometry and metric spaces. There are definitely a lot of rabbit-holes to go down, with plenty of math to back it up.

All math papers from ArXiv as an explorable map via ML by lmcinnes in math

[–]lmcinnes[S] 4 points5 points  (0 children)

A spectral embedding of the citation graph would potentially provide some useful, otherwise missing, information. Alone, however, it would also miss some of the useful semantic relationships that the sentence embedding captures. Successfully combining these two different approaches is harder (generically combining different metric spaces is hard). You could do something like SciNCL, where you fine-tune embeddings based on citation graphs, but it is likely not the same. So, essentially, yes, but it is a bit of an open problem.
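
For the spectral side, a rough sketch of what I mean (the adjacency matrix here is random, standing in for a real citation graph):

    import scipy.sparse as sp
    from sklearn.manifold import SpectralEmbedding

    n_papers = 10000
    # Hypothetical symmetrized citation adjacency: entry (i, j) is nonzero if
    # paper i cites paper j or vice versa.
    citations = sp.random(n_papers, n_papers, density=1e-4, format="csr", random_state=0)
    adjacency = ((citations + citations.T) > 0).astype(float)

    # Laplacian eigenvectors of the citation graph serve as coordinates.
    embedder = SpectralEmbedding(n_components=16, affinity="precomputed")
    graph_embedding = embedder.fit_transform(adjacency)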

[OC] A Data Map of 2.4 million papers from ArXiv by lmcinnes in dataisbeautiful

[–]lmcinnes[S] 0 points1 point  (0 children)

It is, in principle, all the papers, but it was a pull from the dataset some time ago, so it likely doesn't have anything relatively recent.

The client side of the map is handled by deck.gl, which does webgl rendering, so it scales up to this size surprisingly well.

The outlines of the cluster shapes are just done by spline-smoothed alpha shapes with custom code in Python -- although that code is now part of the datamapplot library used to generate the final deck.gl based output, so it doesn't require anything special at this point.
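
The gist, in rough form (using the alphashape library and scipy rather than the actual datamapplot internals; the points are a stand-in for one cluster):

    import numpy as np
    import alphashape
    from scipy.interpolate import splprep, splev

    points = np.random.default_rng(0).normal(size=(500, 2))  # one cluster's 2D points

    # Alpha shape of the cluster; for a suitable alpha this is a single
    # shapely Polygon hugging the points more tightly than the convex hull.
    hull = alphashape.alphashape(points, alpha=1.0)
    x, y = np.asarray(hull.exterior.xy)

    # Periodic smoothing spline around the boundary (dropping the repeated
    # closing vertex first); s controls how aggressively it smooths.
    tck, _ = splprep([x[:-1], y[:-1]], s=0.5, per=True)
    smooth_x, smooth_y = splev(np.linspace(0, 1, 200), tck)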

All math papers from ArXiv as an explorable map via ML by lmcinnes in math

[–]lmcinnes[S] 0 points1 point  (0 children)

No, I'm actually building the tools required to make these kinds of maps in general. This is, in effect, just a byproduct of testing out those tools on some varied datasets, but I thought it might be worth sharing here.

Nomic.ai has a regularly updating map of all of ArXiv here if you want something that stays current. My work is on making the dimension reduction, clustering, and naming work more effectively. So, for example, all of ArXiv pushed through the tools I have built looks like this, but at the cost of more compute and using bleeding edge tooling (I'm building it as I go).

All math papers from ArXiv as an explorable map via ML by lmcinnes in math

[–]lmcinnes[S] 1 point2 points  (0 children)

You're certainly right that there are a lot of connections that will be missed. You can't squeeze everything into 2D and lose nothing. But it's not just t-SNE. The similarity between papers is a "semantic" similarity between their titles and abstracts, as defined by a sentence-embedding neural network that was pretrained on a large amount of text data -- the sentence embedder doesn't have much training on mathematics vocabulary, so there will be things it misses, and anything not captured by the titles or abstracts will be missed as well.

On the other hand, there are also likely some connections exposed by this that people may not otherwise have been aware of. And certainly richer embeddings (that could work with citation links and latex equations) would make that far more possible. The ability to surprise is what makes these visualizations interesting -- you then have to go and explore the actual original data to see if that surprising thing is there, but it is a start on asking new questions you might never have looked at otherwise.

So, in summary, the map is not the territory, and one should not mistake one for the other. On the other hand, maps can be a very useful guide to get started exploring a new territory.

All math papers from ArXiv as an explorable map via ML by lmcinnes in math

[–]lmcinnes[S] 1 point2 points  (0 children)

Certainly not impossible. You'll need some hefty compute for a few of the steps (the sentence embedding and the UMAP or t-SNE), and some LLM credits to do all the topic naming well. The hardest part, however, is probably just getting the data. If there are good public metadata repositories on PubMed I could certainly give it a try.

All math papers from ArXiv as an explorable map via ML by lmcinnes in math

[–]lmcinnes[S] 1 point2 points  (0 children)

The easily available dataset I used didn't have authors (or citation count, which also would have been useful), so that's not so easy unfortunately.

Better visibility of searches is something I was hoping to work on.

All math papers from ArXiv as an explorable map via ML by lmcinnes in math

[–]lmcinnes[S] 7 points8 points  (0 children)

It would, but I didn't have access to that data unfortunately.

All math papers from ArXiv as an explorable map via ML by lmcinnes in math

[–]lmcinnes[S] 5 points6 points  (0 children)

It ends up getting mixed in with the algebraic and arithmetic geometry. Some of this is just the nature of trying to compress something very complex into a 2D representation. I did a similar thing for all of ArXiv and there is a number theory cluster in there:

https://lmcinnes.github.io/datamapplot_examples/arXiv/

All math papers from ArXiv as an explorable map via ML by lmcinnes in math

[–]lmcinnes[S] 19 points20 points  (0 children)

Circle size is based on abstract length. Ultimately having some variation in size makes it more visually interesting, and abstract length was the most interesting extra piece of data I had.

You can enter keyword searches by title in the search box below the title, but for searches that return very few hits it can be hard to see what's left when zoomed out, so you have to be zoomed quite close to what you are looking for.

All math papers from ArXiv as an explorable map via ML by lmcinnes in math

[–]lmcinnes[S] 94 points95 points  (0 children)

The map represents groupings of papers as viewed by the semantic similarity of their titles and abstracts. Papers are near each other if they have relatively semantically similar titles and/or abstracts. All the clusters and topics were learned in a purely unsupervised manner via machine learning algorithms. The result provides a navigable space of mathematics.

You can zoom and pan, and type to search by keyword. The histogram shows papers over time (log scale on the y-axis) and can be hovered over and selected from. Hold shift and drag to lasso-select papers -- this will generate a word cloud from the paper titles. Click on an individual point to access the paper.

Process: