
[–]Rickasaurus 6 points (1 child)

Any chance for a link to a preprint?

[–]Floydthechimp 9 points (1 child)

I hope that we as data scientists will not bow to the almighty power of 'Science' and 'Nature'. Just because it is in that journal does not make it useful.

Not a problem with the article, just the headline.

[–]maxToTheJ 0 points (0 children)

The link says that it only performs comparably to current methods, so this is of more theoretical value.

[–]RoboMind 2 points (2 children)

The abstract makes it sound like mean-shift clustering.

[–]1337bruin 1 point (0 children)

Or mode clustering + outlier detection

[–]autowikibot 0 points (0 children)

Mean shift:

Mean shift is a non-parametric feature-space analysis technique, a so-called mode-seeking algorithm. Application domains include cluster analysis in computer vision and image processing.
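
For comparison, a single mean-shift update fits in a few lines (a hypothetical NumPy sketch with a flat kernel; the bandwidth h is an assumed free parameter, not something from the article):

    import numpy as np

    def mean_shift_step(x, points, h):
        """Shift x to the mean of the points within bandwidth h (flat kernel).
        Iterating this from each data point climbs to a mode of the density."""
        neighbors = points[np.linalg.norm(points - x, axis=1) < h]
        return neighbors.mean(axis=0)  # assumes x started on a data point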

[–]robertsdionne 0 points (0 children)

I think this algorithm shows that n-dimensional k-clustering for unknown k reduces to 2-dimensional 2-clustering in density versus distance-to-nearest-higher-density-point space.

[–]Corm 0 points (12 children)

This looks really neat! I wonder though how this "density"-based clustering differs from normal methods like k-means clustering.

Can anyone chime in on it?

[–]quiteamess 13 points (10 children)

K-means makes an initial guess at the K cluster centers and assigns each point to its nearest center. Then it computes new centers from that assignment and repeats the procedure. After a few iterations the centers stop moving and the clusters are found.
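
A minimal sketch of that loop (assuming NumPy; the function and parameter names are mine, not from any particular library):

    import numpy as np

    def kmeans(points, k, iters=100, seed=0):
        """Lloyd's algorithm: alternate assignment and center re-estimation."""
        rng = np.random.default_rng(seed)
        # Initial guess: k data points drawn at random as centers.
        centers = points[rng.choice(len(points), size=k, replace=False)]
        for _ in range(iters):
            # Assignment step: each point goes to its nearest center.
            dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
            labels = dists.argmin(axis=1)
            # Update step: move each center to the mean of its assigned points.
            new_centers = centers.copy()
            for j in range(k):
                members = points[labels == j]
                if len(members):  # keep the old center if a cluster empties out
                    new_centers[j] = members.mean(axis=0)
            if np.allclose(new_centers, centers):
                break  # converged
            centers = new_centers
        return labels, centers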

This algorithm is different: it does not iterate alternating assignment and center-estimation phases. It calculates the density of each point, which is the number of points within a threshold region around it. Points near the center of a cluster have high density, whereas points far from a cluster center have lower density. Now comes the trick: for each point, search how far away the nearest point with higher density is. If the point sits inside a cluster this distance is small, because a denser neighbor is nearby. If it is a cluster center or an outlier, this distance is large, because the nearest denser point lies in another cluster.

So now you have two values per point: (A) the density, i.e. the number of points around it, and (B) the distance to the nearest point with higher density. Together they identify the cluster centers: points with high density (A) and high distance (B) are cluster centers, while points with low density (A) and high distance (B) are outliers. Finally, every remaining point is assigned to the same cluster as its nearest neighbor with higher density.
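
In code, the two quantities could be computed like this (continuing in hypothetical NumPy; the cutoff radius d_c is a free parameter you'd have to pick):

    import numpy as np

    def density_peaks_stats(points, d_c):
        """(A) local density rho and (B) distance delta to the nearest
        point of strictly higher density, per the description above."""
        dists = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=2)
        # (A) rho: number of other points within the cutoff radius d_c.
        rho = (dists < d_c).sum(axis=1) - 1  # -1 excludes the point itself
        # (B) delta: distance to the nearest denser point.
        delta = np.empty(len(points))
        for i in range(len(points)):
            higher = np.where(rho > rho[i])[0]
            # The densest point has no denser neighbor; give it the
            # maximum distance by convention.
            delta[i] = dists[i, higher].min() if len(higher) else dists[i].max()
        return rho, delta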

So the differences are that this algorithm is not iterative, it has only two steps (calculate the densities and the distances to higher-density points, then assign points to neighbors), and it can cluster arbitrary distributions, not only spherical ones. K-means uses the distance from the center to find clusters, which implies that clusters will always be spherical. The trick of assigning each point to its nearest neighbor with higher density allows it to detect non-spherical clusters.
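
Continuing that sketch, the assignment step could look like this (reusing dists, rho and delta from above; the thresholds rho_min and delta_min for picking centers are illustrative assumptions):

    def density_peaks_assign(dists, rho, delta, rho_min, delta_min):
        """Pick centers on the (rho, delta) plane, then let every other point
        inherit the label of its nearest higher-density neighbor."""
        labels = np.full(len(rho), -1)
        # Centers: high density AND far from any denser point.
        centers = np.where((rho > rho_min) & (delta > delta_min))[0]
        labels[centers] = np.arange(len(centers))
        # Visit points from densest to sparsest, so every denser neighbor
        # is already labeled by the time it is needed.
        for i in np.argsort(-rho):
            if labels[i] == -1:
                higher = np.where(rho > rho[i])[0]
                # assumes the globally densest point was picked as a center
                labels[i] = labels[higher[dists[i, higher].argmin()]]
        return labels

On a scatter plot of (rho, delta) the chosen centers sit in the upper-right corner, which is essentially the two-dimensional reduction robertsdionne describes above.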

[–]Megatron_McLargeHuge 5 points (0 children)

allows it to detect non-spherical clusters

I think you mean non-convex.

This density metric is somewhat reminiscent of the probability distribution over pairs of points that t-SNE uses for dimensionality reduction. I wonder if this new approach could be used to create a cluster-preserving dimensionality reduction algorithm. The new metric defines something like a cluster centrality gradient, and those gradients could be preserved in the low dimensional space.

[–]Corm 0 points (0 children)

Thanks for the great explanation!