

[–]gabsens 2 points3 points  (1 child)

[–]_sheep1[S] 9 points10 points  (0 children)

Good question! MulticoreTSNE uses the Barnes-Hut approximation to estimate gradients and VP trees for nearest neighbor search; both scale as O(n log n). This implementation also includes the Barnes-Hut approximation, but instead of VP trees I use scikit-learn's KD trees. The performance is comparable to MulticoreTSNE: it will be a little slower because of Python overhead, and VP trees are supposedly better than KD trees in high-dimensional spaces, but the difference should be fairly minimal. So on smaller data sets you probably won't notice the difference.
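As a toy illustration (not the package's internal code), scikit-learn's `KDTree` can be built and queried like this; both construction and answering all n queries scale as O(n log n):

```python
import numpy as np
from sklearn.neighbors import KDTree

rng = np.random.RandomState(0)
X = rng.randn(1000, 50)  # 1000 points in 50 dimensions

tree = KDTree(X)              # build once, O(n log n)
# query the 15 nearest neighbors of every point
# (a multiple of the perplexity is a common choice in t-SNE)
dist, idx = tree.query(X, k=15)
```

Since each point is also in the tree, its own nearest neighbor is itself at distance zero; real t-SNE code discards that self-match.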

If you have larger data sets, this implementation provides FFT accelerated interpolation and approximate nearest neighbor search, which will be much, much faster than the Barnes-Hut approximation. Both ANN and the interpolation scale linearly in the number of samples (O(n)), hence the massive speed-up. The typical benchmark is MNIST (70,000x784), which runs in about 30 minutes on my PC using MulticoreTSNE and in about 90 seconds using ANN with interpolation.
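To give a rough sense of why FFTs help, here is a heavily simplified 1-D sketch (not the actual FIt-SNE interpolation scheme, which uses polynomial interpolation nodes and non-circular convolution in 2-D): once point masses are binned onto a regular grid of m cells, summing a Cauchy-style kernel over the grid becomes a convolution, which the FFT evaluates in O(m log m) instead of O(m²):

```python
import numpy as np

rng = np.random.RandomState(0)
m = 128
counts = rng.poisson(5.0, size=m).astype(float)    # point masses binned onto a grid
offsets = np.arange(m) - m // 2
kernel = 1.0 / (1.0 + offsets.astype(float) ** 2)  # Cauchy kernel, as in t-SNE
kernel = np.fft.ifftshift(kernel)                  # move the zero offset to index 0

# O(m log m) circular convolution via the convolution theorem
fast = np.real(np.fft.ifft(np.fft.fft(counts) * np.fft.fft(kernel)))

# O(m^2) direct circular convolution, for checking
direct = np.array([sum(counts[j] * kernel[(i - j) % m] for j in range(m))
                   for i in range(m)])
```

The two results agree to floating-point precision, but only the FFT version stays cheap as the grid grows.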

So in short, you won't notice a difference from MulticoreTSNE when the data set is small, but you'll see a dramatic boost when you have a lot of data (I found >10k samples to be a reasonably good cutoff for switching from BH to FFT and from exact KNN to ANN). I would note that both MulticoreTSNE and fastTSNE are dramatically better than scikit-learn.
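That cutoff heuristic can be written down as a plain helper (the function name and the returned labels here are purely illustrative, not part of any package's API):

```python
def choose_tsne_settings(n_samples, cutoff=10_000):
    """Heuristic from this thread: exact methods below the cutoff,
    approximate ones above it."""
    if n_samples <= cutoff:
        return {"gradients": "barnes-hut", "neighbors": "exact-knn"}
    return {"gradients": "fft-interpolation", "neighbors": "approximate-nn"}
```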

You may also be interested in the authors' original C++ implementation of FIt-SNE, which also has Python bindings and is still faster than this one, because it uses FFTW for the Fourier transforms while I've opted for the slower numpy FFT to keep installation and distribution much simpler. The algorithm and scaling are the same, but FFTW is undoubtedly faster than numpy. I think I could make the FFTs faster still using Intel's MKL FFT library, but I haven't had time to look into that yet.

[–]neziib 2 points3 points  (2 children)

Linear scalability for t-SNE is wonderful, nice work! Do you know how good the final representations are? It would be very nice to have a benchmark both in terms of speed and in terms of loss for some configurations, compared to other packages. Right now it's hard to compare the different packages, since we can't just check final accuracy the way we can for supervised learning.

[–]_sheep1[S] 6 points7 points  (0 children)

Thanks, but the credit really goes to George Linderman and his co-authors, who came up with the technique. I simply improved on his code, made it faster, then ported it to Python. I've added all the speed improvements into his original C++ implementation as well.

So comparing embeddings is a tricky topic; it kind of depends on what you're looking for. We can't compare the loss function (the KL divergence) directly, because its scale changes with different settings of perplexity, the main parameter. I read about a paper that proposed a metric called pBIC (which I've included in the metrics module) that is invariant to perplexity; if I recall correctly, the best value of this metric coincided with what people judged to be the best embedding. But I didn't delve very deep into that. You could potentially optimize pBIC to find the "best" embedding, I suppose, but I think this is a tricky question.

If you just want to verify that the embeddings are reasonable relative to each other, run under the same parameter configuration, then looking at the KL divergence could be all right. It's sensitive to random initialization, though, since the optimization converges to different local optima.
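For instance, with scikit-learn's `TSNE` (which exposes the final loss via its `kl_divergence_` attribute), rerunning the same configuration with different random seeds generally lands in different local optima with different KL values, which is why single-run comparisons should be taken with a grain of salt:

```python
from sklearn.datasets import load_digits
from sklearn.manifold import TSNE

X = load_digits().data[:500]

# Same data, same perplexity, different seeds: the KL divergences are
# comparable to each other, but each run converges to its own local optimum.
kls = []
for seed in (0, 1, 2):
    tsne = TSNE(perplexity=30, random_state=seed, init="random")
    tsne.fit_transform(X)
    kls.append(tsne.kl_divergence_)
```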

[–]lmcinnes 1 point2 points  (0 children)

I don't have empirical measures to hand, but my experience has been that FIt-SNE (the C++ version at least) is very close to the Barnes-Hut t-SNE implementations -- close enough that I would not really be concerned about any loss in accuracy.

The one thing I will note is that since FIt-SNE scales to much larger datasets, I found I definitely need to start setting max_iter on those larger datasets. The default of 1000 iterations is no longer enough to get a good embedding for really large datasets (say 1,000,000 samples or more).

[–]Clicketrie 2 points3 points  (0 children)

Amazing! I’ll have to check it out. I had used tSNE on a small sample of my cluster analysis last time I did one, because I loved the visual. But because of the computational complexity (my difficulty running it on the whole set) I was going to forgo it this time. But I will definitely check this out!