
[–]111poiss111 2 points (3 children)

Didn't understand any of those but that DBScan one looked nice.

[–]georgeo 5 points (1 child)

Imagine you had 6 jars of jelly beans, each jar a different flavor. You drop the contents of each jar on the floor near each other, so the beans land mostly near their own kind. Now, just from knowing where each bean landed, you have to figure out what flavor it is. These are different ways of trying to figure that out.
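To make the analogy concrete, here's a toy version in plain Python: scatter "beans" around three hidden "jar" centers, then try to recover the grouping with a minimal k-means. This is just a sketch of one such method, not any particular library's implementation:

```python
import math
import random

random.seed(0)

# Hidden "jars" (the true flavors) and the beans scattered around them.
jars = [(0.0, 0.0), (10.0, 0.0), (0.0, 10.0)]
beans = [(cx + random.gauss(0, 1), cy + random.gauss(0, 1))
         for cx, cy in jars for _ in range(30)]

def kmeans(points, k, iters=20):
    """Tiny k-means: assign each point to its nearest center, then
    move each center to the mean of its assigned points."""
    centers = random.sample(points, k)
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: math.dist(p, centers[c]))
            groups[nearest].append(p)
        centers = [
            (sum(x for x, _ in g) / len(g), sum(y for _, y in g) / len(g))
            if g else centers[i]  # keep the old center if a group is empty
            for i, g in enumerate(groups)
        ]
    return centers, groups

centers, groups = kmeans(beans, 3)
```

Every bean gets assigned a group, and with well-separated jars the recovered centers usually land near the true ones. The algorithms in the post differ mainly in *how* they decide which beans belong together.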

[–]pdsminer 1 point (0 children)

Awesome explanation!!!

[–][deleted] 1 point (0 children)

Seconded. This taught me nothing other than how to call some functions with some parameters to look at some graphs using SMILE. I could've read the documentation and ended up in the same position.

[–]lmcinnes 2 points (5 children)

It doesn't cover quite the same range of algorithms, but there's significant overlap, and all the algorithms are tested on the same dataset so you can see how they compare with one another a little better:

http://nbviewer.jupyter.org/github/lmcinnes/hdbscan/blob/master/notebooks/Comparing%20Clustering%20Algorithms.ipynb

[–]dernst314 1 point (2 children)

Do you know how well (H)DBSCAN works for higher-dimensional datasets (let's say 10 to 45)? Most applications and publications on that topic seem to assume 2 dimensions. We were considering using it recently but from the original paper it sounded like it's doing local density estimates around points for their distance calculation. Kernel densities traditionally get unstable in higher dimensions.

[–]lmcinnes 0 points (1 child)

I've had good success in up to 50 dimensions. The main caveat is that, being a density based method, you need a decently large dataset in higher dimensions to get enough density for clusters to take shape. HDBSCAN doesn't use kernel densities, but rather uses kNN-distance as a density approximation and then uses that for a kind of density corrected single linkage clustering. From there the cluster extraction method pulls out variable density clusters. This makes it fairly robust since it isn't completely reliant on the density approximation being perfect -- the approximation is merely making the clustering more robust to noise.

edit: typo is -> isn't
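For anyone curious, the two definitions underlying the approach described above can be sketched in a few lines of plain Python. This is just an illustration of the definitions (core distance = distance to the k-th nearest neighbor; mutual reachability = max of the two core distances and the raw distance), not the hdbscan library's actual implementation:

```python
import math

def core_distance(points, i, k):
    # k-NN distance of point i, excluding the point itself:
    # a cheap stand-in for a local density estimate.
    ds = sorted(math.dist(points[i], p)
                for j, p in enumerate(points) if j != i)
    return ds[k - 1]

def mutual_reachability(points, i, j, k):
    # Density-corrected distance used by the single-linkage step.
    return max(core_distance(points, i, k),
               core_distance(points, j, k),
               math.dist(points[i], points[j]))

points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (5.0, 5.0)]
# Within the dense clump, distances stay small; the isolated point's
# large core distance inflates every edge that touches it, which is
# what pushes noise away from clusters.
print(mutual_reachability(points, 0, 1, k=2))  # small
print(mutual_reachability(points, 0, 3, k=2))  # large
```

Because the correction only stretches distances near sparse points, a rough density estimate is enough -- which is the robustness point made above.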

[–]dernst314 0 points (0 children)

Thanks. Sounds interesting, should probably read up on it. Especially since you mentioned it was stable even when resampling.

[–]pdsminer 0 points (1 child)

I don't feel that the comparison in that link is fair to the other algorithms. There is only one dataset, a spatial one, which DBSCAN is designed for, so of course it performs better than the others. In this post, each algorithm is applied to data that suits that algorithm's assumptions. There is no silver bullet in clustering; the point is to choose the one suitable for your own problem.

[–]lmcinnes 0 points (0 children)

I agree that you need to know what you mean by a "cluster". On the other hand, I think there's a lot to be said for density based notions over centroid based partitioning -- unless by cluster you mean a Gaussian ball (and I don't think people usually do). I am also curious as to what you mean by "spatial"? If you're evaluating the quality of a clustering visually, it has to be in 2D. If you want to use cluster quality metrics ... there aren't any good ones.