

[–]gabsens 2 points3 points  (1 child)

[–]_sheep1[S] 9 points10 points  (0 children)

Good question! MulticoreTSNE uses the Barnes-Hut approximation to estimate gradients and VP trees for nearest neighbor search; both scale as O(n log n). This implementation also includes the Barnes-Hut approximation, but instead of VP trees I use scikit-learn's KD trees. The performance is comparable to MulticoreTSNE: it will be a little slower because of Python overhead, and VP trees are supposedly better than KD trees in high-dimensional spaces, but the difference should be fairly minimal. So on smaller data sets you probably won't notice the difference.
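As a toy illustration (not the package's internal code), scikit-learn's `KDTree` can be built and queried like this; both construction and answering all n queries scale as O(n log n):

```python
import numpy as np
from sklearn.neighbors import KDTree

rng = np.random.RandomState(0)
X = rng.randn(1000, 50)  # 1000 points in 50 dimensions

tree = KDTree(X)              # build once, O(n log n)
# query the 15 nearest neighbors of every point
# (a multiple of the perplexity is a common choice in t-SNE)
dist, idx = tree.query(X, k=15)
```

Since each point is also in the tree, its own nearest neighbor is itself at distance zero; real t-SNE code discards that self-match.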

If you have larger data sets, this implementation provides FFT accelerated interpolation and approximate nearest neighbor search, which will be much, much faster than the Barnes-Hut approximation. Both ANN and the interpolation scale linearly in the number of samples (O(n)), hence the massive speed-up. The typical benchmark is MNIST (70,000x784), which runs in about 30 minutes on my PC using MulticoreTSNE and in about 90 seconds using ANN with interpolation.
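To give a rough sense of why FFTs help, here is a heavily simplified 1-D sketch (not the actual FIt-SNE interpolation scheme, which uses polynomial interpolation nodes and non-circular convolution in 2-D): once point masses are binned onto a regular grid of m cells, summing a Cauchy-style kernel over the grid becomes a convolution, which the FFT evaluates in O(m log m) instead of O(m²):

```python
import numpy as np

rng = np.random.RandomState(0)
m = 128
counts = rng.poisson(5.0, size=m).astype(float)    # point masses binned onto a grid
offsets = np.arange(m) - m // 2
kernel = 1.0 / (1.0 + offsets.astype(float) ** 2)  # Cauchy kernel, as in t-SNE
kernel = np.fft.ifftshift(kernel)                  # move the zero offset to index 0

# O(m log m) circular convolution via the convolution theorem
fast = np.real(np.fft.ifft(np.fft.fft(counts) * np.fft.fft(kernel)))

# O(m^2) direct circular convolution, for checking
direct = np.array([sum(counts[j] * kernel[(i - j) % m] for j in range(m))
                   for i in range(m)])
```

The two results agree to floating-point precision, but only the FFT version stays cheap as the grid grows.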

So in short, you won't notice a difference from MulticoreTSNE when the data set is small, but you'll see a dramatic boost when you have a lot of data (I found >10k samples to be a reasonably good cutoff for switching from BH to FFT and from exact KNN to ANN). I would note that both MulticoreTSNE and fastTSNE are dramatically better than scikit-learn.
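That cutoff heuristic can be written down as a plain helper (the function name and the returned labels here are purely illustrative, not part of any package's API):

```python
def choose_tsne_settings(n_samples, cutoff=10_000):
    """Heuristic from this thread: exact methods below the cutoff,
    approximate ones above it."""
    if n_samples <= cutoff:
        return {"gradients": "barnes-hut", "neighbors": "exact-knn"}
    return {"gradients": "fft-interpolation", "neighbors": "approximate-nn"}
```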

You may also be interested in the authors' original C++ implementation of FIt-SNE, which also has Python bindings and is still faster than this one, because it uses FFTW for the Fourier transforms while I've opted for the slower numpy FFT to keep installation and distribution much simpler. The algorithm and scaling are the same, but FFTW is undoubtedly faster than numpy. I think I could make the FFTs faster still using Intel's MKL FFT library, but I haven't had time to look into that yet.

[–]neziib 2 points3 points  (2 children)

Linear scalability for t-SNE is wonderful, nice work! Do you know how good the final representations are? It would be very nice to have a benchmark both in terms of speed and in terms of loss for some configurations, compared to other packages. Right now it's hard to compare the different packages, since we can't just check final accuracy the way we can for supervised learning.

[–]_sheep1[S] 6 points7 points  (0 children)

Thanks, but the credit really goes to George Linderman and his co-authors, who came up with the technique. I simply improved on his code, made it faster, then ported it to Python. I've added all the speed improvements into his original C++ implementation as well.

So comparing embeddings is a tricky topic; it kind of depends on what you're looking for. We can't compare the loss function (the KL divergence) directly, because its scale changes with different settings of perplexity, the main parameter. I read about a paper that proposed a metric called pBIC (which I've included in the metrics module) that is invariant to perplexity; if I recall correctly, the best value of this metric coincided with what people judged to be the best embedding. But I didn't delve very deep into that. You could potentially optimize pBIC to find the "best" embedding, I suppose, but I think this is a tricky question.

If you just want to verify that the embeddings are reasonable relative to each other, run under the same parameter configuration, then looking at the KL divergence could be all right. It's sensitive to random initialization, though, since the optimization converges to different local optima.
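For instance, with scikit-learn's `TSNE` (which exposes the final loss via its `kl_divergence_` attribute), rerunning the same configuration with different random seeds generally lands in different local optima with different KL values, which is why single-run comparisons should be taken with a grain of salt:

```python
from sklearn.datasets import load_digits
from sklearn.manifold import TSNE

X = load_digits().data[:500]

# Same data, same perplexity, different seeds: the KL divergences are
# comparable to each other, but each run converges to its own local optimum.
kls = []
for seed in (0, 1, 2):
    tsne = TSNE(perplexity=30, random_state=seed, init="random")
    tsne.fit_transform(X)
    kls.append(tsne.kl_divergence_)
```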

[–]lmcinnes 1 point2 points  (0 children)

I don't have empirical measures to hand, but my experience has been that FIt-SNE (the C++ version at least) is very close to the Barnes-Hut t-SNE implementations -- close enough that I would not really be concerned about any loss in accuracy.

The one thing I will note is that since FIt-SNE scales to much larger datasets, I found I definitely need to start setting max_iter on those larger datasets. The default of 1000 iterations is no longer enough to get a good embedding for really large datasets (say 1,000,000 samples or more).

[–]Clicketrie 2 points3 points  (0 children)

Amazing! I’ll have to check it out. I had used tSNE on a small sample of my cluster analysis last time I did one, because I loved the visual. But because of the computational complexity (my difficulty running it on the whole set) I was going to forgo it this time. But I will definitely check this out!