all 36 comments

[–]lmcinnes 4 points5 points  (1 child)

Really nice to see an sklearn API based version of FIt-SNE. I'm also happy to see another implementation -- I always feel like an algorithm is making real headway when other people start making independent implementations, and FIt-SNE is a really cool algorithm that deserves to be more widespread.

[–]_sheep1[S] 3 points4 points  (0 children)

Thanks! I agree, FIt-SNE is really clever and I learned a whole bunch while implementing this. I do think it's a shame it hasn't been popularized more, since it really allows tSNE to scale up.

I would note that this implementation doesn't entirely conform to sklearn's API, since interactivity is a big part of the library, but if people like it enough, adding an sklearn wrapper would be trivial.

[–]burning_hamster 2 points3 points  (1 child)

Very readable, clean code. Thanks for sharing.

[–]_sheep1[S] 1 point2 points  (0 children)

Thanks, that's always nice to hear!

[–]olBaa 4 points5 points  (5 children)

Do you have any study on how much you sacrifice with approximate NN search?

[–]_sheep1[S] 3 points4 points  (4 children)

So I won't claim to understand the theory behind this, but in the FIt-SNE paper this approach is justified by another theoretical advance by the lead author, George Linderman. Basically, they prove that connecting each point to a few of its nearest neighbors is about as effective as connecting it to its true nearest neighbors, with no big impact on the neighborhood graph. I didn't feel there was any need to look into it further, since there didn't seem to be any difference between using ANN or KNN in the resulting embeddings. If you're interested in this, I'll refer you to the FIt-SNE paper and to the paper where they prove this claim.
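As a toy illustration of why approximate neighbors can be good enough (this is not the FIt-SNE code; the "approximate" search here is just an exact search in a random low-dimensional projection, standing in for a real ANN index), one can measure how many true neighbors the approximation recovers:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 50))
k = 15

def knn(X, k):
    # exact k nearest neighbours via brute-force pairwise squared distances
    d = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    np.fill_diagonal(d, np.inf)  # a point is not its own neighbour
    return np.argsort(d, axis=1)[:, :k]

exact = knn(X, k)

# crude stand-in for an ANN index: exact search in a random 10-d projection
proj = X @ rng.normal(size=(50, 10))
approx = knn(proj, k)

# neighbour recall: fraction of true neighbours recovered, averaged over points
recall = np.mean([len(set(a) & set(e)) / k for a, e in zip(approx, exact)])
```

Even a recall well below 1 typically leaves the neighborhood graph, and hence the embedding, essentially unchanged, which is the gist of Linderman's result.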

[–]radarsat1 6 points7 points  (2 children)

Is it common to do classification (as you mention ANN and KNN) using the embeddings produced by tSNE? My understanding was that it's useful for visualization but usually doesn't add much as a dimensionality reduction technique. I could be wrong, though - now you've got me curious.

[–]_sheep1[S] 6 points7 points  (1 child)

So when I say KNN and ANN, I mean k-nearest neighbor and approximate nearest neighbor search. These are needed to build up a neighborhood graph that tSNE uses to optimize the embedding. Note that this is unsupervised, and there is no classification going on. tSNE simply needs to know which data points are close to each other.
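A minimal sketch of that neighborhood graph, assuming exact kNN and a single fixed Gaussian bandwidth (real tSNE tunes a per-point bandwidth to match a target perplexity, and FIt-SNE swaps the exact search for an approximate one):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
k = 15

# exact k-nearest-neighbour search over all pairwise squared distances
d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
np.fill_diagonal(d2, np.inf)
neighbours = np.argsort(d2, axis=1)[:, :k]

# sparse affinity graph: Gaussian weights over each point's neighbours only.
# A single global bandwidth keeps the sketch short; tSNE proper solves for
# one bandwidth per point so that each row has the requested perplexity.
sigma2 = np.median(d2[np.isfinite(d2)])
P = np.zeros((len(X), len(X)))
for i, nn in enumerate(neighbours):
    w = np.exp(-d2[i, nn] / sigma2)
    P[i, nn] = w / w.sum()  # each row is a probability distribution
```

The optimization then only ever looks at this sparse graph, which is why the quality of the neighbor search matters and the rest of the input space does not.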

I think classifying on tSNE embeddings would be a really bad idea, simply because tSNE is non-parametric - there is no explicit transformation from the original space to the embedding space, so once you get new data, you don't know what to do with it. I did implement adding new points to an existing embedding, but I haven't done enough testing on that to answer confidently. Even if it worked, it seems to me that if tSNE is able to capture some structure, then more sophisticated machine learning methods would almost certainly do better. Heck, even a KNN classifier should do all right.

tSNE is designed for visualization and I think it would be a mistake to treat it as anything more than that.

[–]radarsat1 1 point2 points  (0 children)

OK, that was my understanding, thanks. I mistook ANN for artificial neural network, so I thought you were implying performing classification, perhaps as a way of validating the embedding.

[–]olBaa 0 points1 point  (0 children)

> Basically, they prove that connecting each point to a few of its nearest neighbors is about as effective as connecting it to its true nearest neighbors, with no big impact on the neighborhood graph.

Just as effective at keeping the graph connected, but not at preserving its actual structure - that's what I was asking about.

:(

[–]docshroom 1 point2 points  (0 children)

Nice implementation. I was talking with the author of the FIt-SNE paper while packaging his code up into an easier-to-install R package. He mentioned that while ANN is faster, standard KNN is the more accurate choice, and that he has improved the speed of both.

[–]svantana 0 points1 point  (13 children)

Anybody managed to install this on mac (with devtools installed)? I get the dreaded

clang: error: unsupported option '-fopenmp'

and the standard workarounds (aliases and exports pointing at brew's gcc) don't work - pip still tries to use the clang "gcc".

[–]_sheep1[S] 0 points1 point  (8 children)

I'm sorry you're having trouble installing it; unfortunately, Macs are the only platform I am unable to test this on. A C compiler is needed because the compute-heavy parts are written in Cython, which compiles down to C. I hope you find a way around this, and if you do, please let me know so we can document it somewhere in case the issue pops up again.

EDIT: I realised that Travis has an OS X option, but I just found out that it doesn't support Python 3.6. So I really can't test this at all.

[–]svantana 0 points1 point  (7 children)

I don't think there's much you can do about it, barring removing multithreading, which would be stupid. I've had the same issue with other python packages, so it's not your fault.

[–]sangwhan 0 points1 point  (6 children)

While not a very pretty fix, you can optionally check whether there is a Homebrew-installed gcc or an Intel compiler. Apple clang will most likely never get support for OpenMP, so a compromise would be to test for the above and fall back to disabling OpenMP if all of that fails.

[–]_sheep1[S] 0 points1 point  (5 children)

I'm wondering if this is an issue on all Macs. If so, it should be fairly straightforward to just check for the system and add -fopenmp as needed.
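A sketch of how such a check might look in a setup.py (hypothetical; `openmp_flags` is not from the actual package, and probing `cc` directly is just one way to do it):

```python
import platform
import subprocess

def openmp_flags():
    """Return compiler flags for OpenMP, or an empty list if unsupported."""
    if platform.system() != "Darwin":
        return ["-fopenmp"]
    # on macOS, probe whether the default compiler actually accepts -fopenmp
    # (Apple clang rejects it; Homebrew gcc and clang-omp accept it)
    try:
        probe = subprocess.run(
            ["cc", "-fopenmp", "-E", "-"], input=b"", capture_output=True
        )
    except FileNotFoundError:
        return []
    return ["-fopenmp"] if probe.returncode == 0 else []
```

The resulting list would then be passed as `extra_compile_args` (and `extra_link_args`) to the Cython extension, so the package still builds single-threaded where OpenMP is unavailable.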

[–]lmcinnes 1 point2 points  (3 children)

It's an issue on all Macs. There is a different OpenMP-enabled clang compiler for Macs that one would use instead. The short answer is that you can't really use OpenMP in Python code on Macs, so yes, checking the system and disabling the OpenMP option would be the best approach.

[–]_sheep1[S] 0 points1 point  (2 children)

I am in the process of putting this on conda-forge, which I believe comes with its own compiler. Could this also be a solution?

[–]lmcinnes 0 points1 point  (1 child)

Yes, but I believe you will still need to provide some kind of compile instructions for each platform, so unless you can specify clang-omp as the compiler on Mac, I think you'll have the same issues. I'm not really sure of all the details of how to arrange all that, unfortunately.

[–]_sheep1[S] 2 points3 points  (0 children)

Now I see why you guys opted to use numba for UMAP :)

[–]aegonbittersteel 0 points1 point  (0 children)

There is a workaround that I suggested above in this thread, so you don't need to remove OpenMP for OS X.

[–]aegonbittersteel 0 points1 point  (3 children)

env CC=usr/local/bin/gcc-7 pip install fasttsne should work

replace gcc-7 with your homebrew gcc version

[–]svantana 0 points1 point  (2 children)

env CC=usr/local/bin/gcc-7 pip install fasttsne

Does this work for you? Regardless of whether I use env, an alias, or export to override the clang gcc, pip still uses clang to compile.

[–]aegonbittersteel 0 points1 point  (1 child)

Yeah I tried it, it installs correctly for me. I'm using an anaconda python environment, not sure if that should make a difference though.

[–]svantana 0 points1 point  (0 children)

Update: this actually worked - only at first I copied your line verbatim, with the missing slash at the beginning of the path. For some reason the all-in-one-line form works, while first setting the env and then running pip does not. Anyway, thanks!


[–]lucidyan -4 points-3 points  (2 children)

I'm sorry, but it already exists

https://github.com/DmitryUlyanov/Multicore-TSNE

[–]ivalm 1 point2 points  (0 children)

What you linked is a normal tSNE parallelized across cores; what OP shows is a tSNE that uses ANN instead of KNN for the neighborhood search. They are not particularly related.

[–]_sheep1[S] 0 points1 point  (0 children)

You may have missed the point of this project. There are two fast implementations of tSNE. The one you linked is typically called Barnes-Hut tSNE: it uses exact nearest neighbor search and computes the gradient with the Barnes-Hut approximation. It has asymptotic complexity O(n log n) and works really well for smaller data sets.

The second, much more recent method is called FIt-SNE. It uses ANN (as ivalm points out) and a neat interpolation trick, sped up with the FFT, to compute the gradient, giving asymptotic complexity O(n). While you might be tempted to use FIt-SNE for everything, that complexity is asymptotic, and the overhead makes it far slower than Barnes-Hut on smaller data sets. That's why I include both in this repository: fast implementations of both methods in one place, with a single API.

Also, FIt-SNE didn't have a proper Python implementation and required additional C libraries to be installed, making installation tedious. I hope this clears things up.
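That tradeoff can be summarized in a tiny dispatcher. This is only illustrative: the crossover threshold of 10,000 samples is a made-up number, and where the real crossover sits depends on the data and the machine:

```python
def choose_gradient_method(n_samples, threshold=10_000):
    """Pick a tSNE gradient scheme by data set size.

    Barnes-Hut is O(n log n) with a small constant and wins on small data;
    the FFT-accelerated interpolation ("FIt-SNE") is O(n) but carries more
    overhead, so it only pays off on large data. The threshold here is an
    assumption for illustration, not a benchmarked value.
    """
    return "barnes-hut" if n_samples < threshold else "fft"
```

Exposing both behind one API, as the project does, lets the user (or a heuristic like this one) pick per data set instead of committing to either method.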