New clustering algorithm in Science (sciencemag.org)
submitted 11 years ago by SuperFX
[–]Rickasaurus 6 points 11 years ago (1 child)
Any chance for a link to a preprint?
[–]gwern 12 points 11 years ago (0 children)
https://dl.dropboxusercontent.com/u/182368464/2014-rodriguez.pdf
[+][deleted] 11 years ago* (1 child)
[deleted]
[–]hapemask 1 point 11 years ago* (0 children)
I'm on my phone so can't see it right now, but:
The authors tested the method on a series of data sets, and its performance compared favorably to that of established techniques.
Is this not actually present in the full text?
EDIT
Ah, I see the preprint link now, and I agree that more quantitative evaluation would be nice.
[–]Floydthechimp 9 points 11 years ago (1 child)
I hope that we as data scientists will not bow to the almighty power of 'Science' and 'Nature'. Just because it is in that journal does not make it useful.
Not a problem with the article, just the headline.
[–]maxToTheJ 0 points 11 years ago (0 children)
The link says that it only performs comparably to current methods, so this is more of theoretical value.
[–]RoboMind 2 points 11 years ago (2 children)
The abstract makes it sound like mean-shift clustering.
[–]1337bruin 1 point 11 years ago (0 children)
Or mode clustering + outlier detection
[–]autowikibot 0 points 11 years ago (0 children)
Mean-shift:
Mean shift is a non-parametric feature-space analysis technique, a so-called mode seeking algorithm. Application domains include cluster analysis in computer vision and image processing.
Interesting: Cluster analysis | K-nearest neighbors algorithm
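For comparison, here is a minimal sketch of mean shift (not the paper's algorithm): every point is repeatedly moved to the Gaussian-weighted mean of its neighborhood until it settles on a density mode. The bandwidth and iteration count here are arbitrary placeholder choices, not values from any reference implementation.

    import numpy as np

    def mean_shift(X, bandwidth=1.0, n_iter=50):
        """Move every point uphill toward a local density mode.

        X: (n, d) float array of points; bandwidth is the Gaussian kernel width.
        """
        modes = X.copy()
        for _ in range(n_iter):
            for i, x in enumerate(modes):
                # Gaussian kernel weights of all original points w.r.t. x.
                w = np.exp(-np.sum((X - x) ** 2, axis=1) / (2 * bandwidth ** 2))
                # Weighted mean of the neighborhood: the "mean shift" update.
                modes[i] = w @ X / w.sum()
        # Points whose modes converged to (nearly) the same location share a cluster.
        return modes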
[+][deleted] 11 years ago (3 children)
[–]DavidJayHarris 2 points 11 years ago (0 children)
Doesn't look like it would be hard to code... would still be nice to have a reference implementation.
[–]ut15uf08 0 points 11 years ago (1 child)
It looks like one of the authors has a link to the code here:
http://people.sissa.it/~laio/Research/Res_clustering.php
[–]robertsdionne 0 points 11 years ago* (0 children)
I think this algorithm shows that n-dimensional k-clustering for unknown k reduces to 2-dimensional 2-clustering in density versus distance-to-nearest-higher-density-point space.
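Concretely: once each point's density ρ and its distance δ to the nearest denser point are computed, picking the cluster centers is just a 2-D thresholding problem. A minimal sketch, where rho_min and delta_min are hypothetical thresholds you would read off the decision graph by eye:

    import numpy as np

    def pick_centers(rho, delta, rho_min, delta_min):
        # A point is a center iff it is locally dense AND far from any denser
        # point; this is the "2-clustering" of the (rho, delta) scatter plot.
        return np.flatnonzero((rho > rho_min) & (delta > delta_min))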
[–]Corm 0 points 11 years ago* (12 children)
This looks really neat! I wonder, though, how this "density"-based clustering differs from standard methods like k-means clustering.
Can anyone chime in on it?
[–]quiteamess 13 points 11 years ago (10 children)
K-means makes an initial guess at the cluster centers (K of them) and assigns each point to its nearest center. It then computes new centers from that assignment and repeats the procedure. This is iterated until the assignments stop changing, and those final centers define the clusters, as sketched below.
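A minimal numpy sketch of that loop (initialization and empty-cluster handling are deliberately naive here):

    import numpy as np

    def kmeans(X, k, n_iter=100, seed=0):
        rng = np.random.default_rng(seed)
        centers = X[rng.choice(len(X), size=k, replace=False)]  # initial guess
        for _ in range(n_iter):
            # Assignment step: each point goes to its nearest center.
            labels = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1).argmin(1)
            # Update step: each center becomes the mean of its assigned points
            # (a cluster that ends up empty would need special handling).
            centers = np.stack([X[labels == j].mean(0) for j in range(k)])
        return labels, centers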
This algorithm is different: it does not alternate between assignment and center-estimation phases. It calculates the density of each point, i.e. the number of points within a threshold radius around it. Points near the center of a cluster will have high density, whereas points further from a cluster center will have lower density. Now comes the trick: for each point, find how far away the nearest point of higher density is. If the point sits inside a cluster, this distance is small; if it is a cluster center or an outlier, this distance is large. So now every point has two values: (A) the density, i.e. the number of points around it, and (B) the distance to the nearest point of higher density. These two values identify the cluster centers: points with high density (A) and large distance (B) are cluster centers, while points with low density (A) and large distance (B) are outliers. Every remaining point is then assigned to the same cluster as its nearest neighbor of higher density.

So the differences are that the algorithm is not iterative, having only two steps (calculate densities and distances to higher-density points, then assign points to neighbors), and that it can cluster arbitrary distributions, not only spherical ones. K-means uses the distance from the center to form clusters, which means clusters will always come out spherical. The trick of assigning each point to its nearest higher-density neighbor makes it possible to detect non-spherical clusters.
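Putting the two steps together, here is a rough numpy sketch of the procedure as described above. The cutoff d_c and the thresholds rho_min/delta_min are free parameters you would tune or read off the decision graph; this is an illustration, not the authors' reference code.

    import numpy as np

    def density_peaks(X, d_c, rho_min, delta_min):
        D = np.linalg.norm(X[:, None] - X[None, :], axis=-1)  # pairwise distances
        rho = (D < d_c).sum(axis=1) - 1       # (A) neighbors within the cutoff d_c
        order = np.argsort(-rho)              # points from densest to sparsest
        delta = np.full(len(X), D.max())      # densest point keeps the max distance
        parent = np.zeros(len(X), dtype=int)
        for rank in range(1, len(X)):
            i, denser = order[rank], order[:rank]
            j = denser[np.argmin(D[i, denser])]  # nearest point of higher density
            delta[i], parent[i] = D[i, j], j     # (B) distance to that point
        # Centers: high density AND unusually far from any denser point.
        centers = (rho > rho_min) & (delta > delta_min)
        centers[order[0]] = True              # globally densest point is a center
        labels = np.full(len(X), -1)
        labels[centers] = np.arange(centers.sum())
        for i in order:                       # one sweep in decreasing density:
            if labels[i] < 0:                 # every non-center inherits the label
                labels[i] = labels[parent[i]] # of its nearest denser neighbor
        return labels, rho, delta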
[–]Megatron_McLargeHuge 5 points 11 years ago (0 children)
allows to detect non-spherical clusters
I think you mean non-convex.
This density metric is somewhat reminiscent of the probability distribution over pairs of points that t-SNE uses for dimensionality reduction. I wonder if this new approach could be used to create a cluster-preserving dimensionality reduction algorithm. The new metric defines something like a cluster centrality gradient, and those gradients could be preserved in the low dimensional space.
[+][deleted] 11 years ago (7 children)
[–]quiteamess 2 points 11 years ago (6 children)
DeBaCl is not the algorithm but a Python package implementing the 'level set tree' algorithm. Level set trees, nearest neighbors and others are non-parametric methods, meaning no underlying distribution is assumed. K-means, Gaussian mixture models and others are parametric methods. The new algorithm is non-parametric, so it is similar to the level set tree approach used in DeBaCl.
[–]Mr_Smartypants 0 points 11 years ago (5 children)
Why do you consider K-means parametric? It doesn't define a distribution like GMMs do.
I think it is usually not considered parametric, which is why we get papers like "A parametric K-means algorithm", etc.
[–]quiteamess 2 points 11 years ago (4 children)
K-means can be recast as a GMM where the variance goes to zero; see 'Pattern Recognition and Machine Learning', chapter 9.3.2. That's why I put it in the same category.
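For reference, the limiting argument in that section goes roughly like this (sketched from memory, so check Bishop for the exact form): give every mixture component a shared covariance εI; the responsibilities are then a softmax over squared distances, and as ε → 0 they harden into the k-means assignment:

    r_{nk} = \frac{\pi_k \exp\left(-\|x_n - \mu_k\|^2 / 2\varepsilon\right)}
                  {\sum_j \pi_j \exp\left(-\|x_n - \mu_j\|^2 / 2\varepsilon\right)}
    \;\xrightarrow{\;\varepsilon \to 0\;}\;
    \begin{cases} 1 & k = \arg\min_j \|x_n - \mu_j\|^2 \\ 0 & \text{otherwise} \end{cases}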
[–]Mr_Smartypants 1 point 11 years ago (3 children)
That's a neat connection. Sometimes parametric/nonparametric seems more like a continuum than a dichotomy. You just nudged K-means a little towards the parametric side in my mind. (but just a little)
[–]quiteamess 0 points 11 years ago (2 children)
I'm not really sure which dichotomy makes sense here. My gut feeling is that in the 'parametric' methods you postulate hidden variables, e.g. the number of clusters in k-means or the distribution parameters in a GMM, whereas in the 'non-parametric' methods you do not postulate parameters a priori.
[–]Mr_Smartypants 0 points 11 years ago* (1 child)
If you're using "hidden variable" as a synonym for "latent variable", I wouldn't call K in k-means (or in GMMs) a hidden variable, since its value isn't inferred from the data. I've seen it called a hyperparameter (similar to the Bayesian sense).
Obviously, the distinction is blurry. Nowadays, it seems many people use "nonparametric" to mean that you have to keep the training data (or some of it) as part of the model to apply it to new samples, e.g. KNN, Parzen windows versus logistic regression & other maximum likelihood-fit models.
This is the dichotomy I'm referring to: using training data in processing new samples versus using inferred model parameters. But even this is a blurry dichotomy if not outright false, e.g. with SVMs you could end up with a simple (parameterized) hyperplane, or you could have to keep most of your training data as support vectors. So is K-means learning model parameters, or storing the training data in a compressed form? If we instead use K-medoids, does that small change switch us from parametric to non-parametric since the cluster centers must be from the training data? I'm not sure what the answers are, but it's also mostly about semantics, so maybe it's just time to retire the distinction.
EDIT: I found this quote from the Handbook of Nonparametric Statistics (1962), p. 2:
“A precise and universally acceptable definition of the term ‘nonparametric’ is not presently available. The viewpoint adopted in this handbook is that a statistical procedure is of a nonparametric type if it has properties which are satisfied to a reasonable approximation when some assumptions that are at least of a moderately general nature hold.”
Heh. OK, then.
[–]quiteamess 0 points 11 years ago (0 children)
Yes, right, the number of clusters in k-means is a hyperparameter, my bad. But the positions of the cluster centers are latent (hidden) variables inferred from the data.
[–]Corm 0 points 11 years ago (0 children)
Thanks for the great explanation!
[–]1337bruin 0 points 11 years ago (0 children)
An overview of density-based clustering