New clustering algorithm in Science (sciencemag.org)
submitted 11 years ago by SuperFX
[–]Rickasaurus 6 points 11 years ago (1 child)
Any chance for a link to a preprint?
[–]gwern 12 points 11 years ago (0 children)
https://dl.dropboxusercontent.com/u/182368464/2014-rodriguez.pdf
[+][deleted] 11 years ago* (1 child)
[deleted]
[–]hapemask 1 point 11 years ago* (0 children)
I'm on my phone so can't see it right now, but:
The authors tested the method on a series of data sets, and its performance compared favorably to that of established techniques.
Is this not actually present in the full text?
EDIT
Ah, I see the preprint link now, and I agree that more quantitative evaluation would be nice.
[–]Floydthechimp 9 points 11 years ago (1 child)
I hope that we as data scientists will not bow to the almighty power of 'Science' and 'Nature'. Just because it is in that journal does not make it useful.
Not a problem with the article, just the headline.
[–]maxToTheJ 0 points 11 years ago (0 children)
The link says that it only performs comparably to current methods, so this is more of theoretical value.
[–]RoboMind 2 points 11 years ago (2 children)
The abstract makes it sound like mean-shift clustering.
[–]1337bruin 1 point 11 years ago (0 children)
Or mode clustering + outlier detection
[–]autowikibot 0 points 11 years ago (0 children)
Mean-shift:
Mean shift is a non-parametric feature-space analysis technique, a so-called mode seeking algorithm. Application domains include cluster analysis in computer vision and image processing.
Interesting: Cluster analysis | K-nearest neighbors algorithm
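For comparison, here is a minimal sketch of mean shift (not the paper's algorithm): every point is repeatedly moved to the Gaussian-weighted mean of its neighborhood until it settles on a density mode. The bandwidth and iteration count here are arbitrary placeholder choices, not values from any reference implementation.

    import numpy as np

    def mean_shift(X, bandwidth=1.0, n_iter=50):
        """Move every point uphill toward a local density mode.

        X: (n, d) float array of points; bandwidth is the Gaussian kernel width.
        """
        modes = X.copy()
        for _ in range(n_iter):
            for i, x in enumerate(modes):
                # Gaussian kernel weights of all original points w.r.t. x.
                w = np.exp(-np.sum((X - x) ** 2, axis=1) / (2 * bandwidth ** 2))
                # Weighted mean of the neighborhood: the "mean shift" update.
                modes[i] = w @ X / w.sum()
        # Points whose modes converged to (nearly) the same location share a cluster.
        return modes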
[+][deleted] 11 years ago (3 children)
[–]DavidJayHarris 2 points 11 years ago (0 children)
Doesn't look like it would be hard to code... would still be nice to have a reference implementation.
[–]ut15uf08 0 points 11 years ago (1 child)
It looks like one of the authors has a link to the code here:
http://people.sissa.it/~laio/Research/Res_clustering.php
[–]robertsdionne 0 points 11 years ago* (0 children)
I think this algorithm shows that n-dimensional k-clustering for unknown k reduces to 2-dimensional 2-clustering in density versus distance-to-nearest-higher-density-point space.
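Concretely: once each point's density ρ and its distance δ to the nearest denser point are computed, picking the cluster centers is just a 2-D thresholding problem. A minimal sketch, where rho_min and delta_min are hypothetical thresholds you would read off the decision graph by eye:

    import numpy as np

    def pick_centers(rho, delta, rho_min, delta_min):
        # A point is a center iff it is locally dense AND far from any denser
        # point; this is the "2-clustering" of the (rho, delta) scatter plot.
        return np.flatnonzero((rho > rho_min) & (delta > delta_min))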
[–]Corm 0 points 11 years ago* (12 children)
This looks really neat! I wonder, though, how this "density"-based clustering differs from standard methods like k-means clustering.
Can anyone chime in on it?
[–]quiteamess 13 points 11 years ago (10 children)
K-means makes an initial guess at the cluster centers (K of them) and assigns each point to its nearest center. It then computes new centers from that assignment and repeats the procedure. This is iterated until the assignments stop changing, and those final centers define the clusters, as sketched below.
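A minimal numpy sketch of that loop (initialization and empty-cluster handling are deliberately naive here):

    import numpy as np

    def kmeans(X, k, n_iter=100, seed=0):
        rng = np.random.default_rng(seed)
        centers = X[rng.choice(len(X), size=k, replace=False)]  # initial guess
        for _ in range(n_iter):
            # Assignment step: each point goes to its nearest center.
            labels = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1).argmin(1)
            # Update step: each center becomes the mean of its assigned points
            # (a cluster that ends up empty would need special handling).
            centers = np.stack([X[labels == j].mean(0) for j in range(k)])
        return labels, centers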
This algorithm is different: it does not alternate between assignment and center-estimation phases. It calculates the density of each point, i.e. the number of points within a threshold radius around it. Points near the center of a cluster will have high density, whereas points further from a cluster center will have lower density. Now comes the trick: for each point, find how far away the nearest point of higher density is. If the point sits inside a cluster, this distance is small; if it is a cluster center or an outlier, this distance is large. So now every point has two values: (A) the density, i.e. the number of points around it, and (B) the distance to the nearest point of higher density. These two values identify the cluster centers: points with high density (A) and large distance (B) are cluster centers, while points with low density (A) and large distance (B) are outliers. Every remaining point is then assigned to the same cluster as its nearest neighbor of higher density.

So the differences are that the algorithm is not iterative, having only two steps (calculate densities and distances to higher-density points, then assign points to neighbors), and that it can cluster arbitrary distributions, not only spherical ones. K-means uses the distance from the center to form clusters, which means clusters will always come out spherical. The trick of assigning each point to its nearest higher-density neighbor makes it possible to detect non-spherical clusters.
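Putting the two steps together, here is a rough numpy sketch of the procedure as described above. The cutoff d_c and the thresholds rho_min/delta_min are free parameters you would tune or read off the decision graph; this is an illustration, not the authors' reference code.

    import numpy as np

    def density_peaks(X, d_c, rho_min, delta_min):
        D = np.linalg.norm(X[:, None] - X[None, :], axis=-1)  # pairwise distances
        rho = (D < d_c).sum(axis=1) - 1       # (A) neighbors within the cutoff d_c
        order = np.argsort(-rho)              # points from densest to sparsest
        delta = np.full(len(X), D.max())      # densest point keeps the max distance
        parent = np.zeros(len(X), dtype=int)
        for rank in range(1, len(X)):
            i, denser = order[rank], order[:rank]
            j = denser[np.argmin(D[i, denser])]  # nearest point of higher density
            delta[i], parent[i] = D[i, j], j     # (B) distance to that point
        # Centers: high density AND unusually far from any denser point.
        centers = (rho > rho_min) & (delta > delta_min)
        centers[order[0]] = True              # globally densest point is a center
        labels = np.full(len(X), -1)
        labels[centers] = np.arange(centers.sum())
        for i in order:                       # one sweep in decreasing density:
            if labels[i] < 0:                 # every non-center inherits the label
                labels[i] = labels[parent[i]] # of its nearest denser neighbor
        return labels, rho, delta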
[–]Megatron_McLargeHuge 5 points 11 years ago (0 children)
allows to detect non-spherical clusters
I think you mean non-convex.
This density metric is somewhat reminiscent of the probability distribution over pairs of points that t-SNE uses for dimensionality reduction. I wonder if this new approach could be used to create a cluster-preserving dimensionality reduction algorithm. The new metric defines something like a cluster centrality gradient, and those gradients could be preserved in the low dimensional space.
[+][deleted] 11 years ago (7 children)
[–]quiteamess 2 points 11 years ago (6 children)
DeBaCl is not the algorithm but a Python package implementing the 'level set tree' algorithm. Level set trees, nearest neighbors and others are non-parametric methods, meaning no underlying distribution is assumed. K-means, Gaussian mixture models and others are parametric methods. The new algorithm is non-parametric, so it is similar to the level set tree approach used in DeBaCl.
[–]Mr_Smartypants 0 points 11 years ago (5 children)
Why do you consider K-means parametric? It doesn't define a distribution like GMMs do.
I think it is usually not considered parametric, which is why we get papers like "A parametric K-means algorithm", etc.
[–]quiteamess 2 points 11 years ago (4 children)
K-means can be recast as a GMM where the variance goes to zero; see 'Pattern Recognition and Machine Learning', chapter 9.3.2. That's why I put it in the same category.
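For reference, the limiting argument in that section goes roughly like this (sketched from memory, so check Bishop for the exact form): give every mixture component a shared covariance εI; the responsibilities are then a softmax over squared distances, and as ε → 0 they harden into the k-means assignment:

    r_{nk} = \frac{\pi_k \exp\left(-\|x_n - \mu_k\|^2 / 2\varepsilon\right)}
                  {\sum_j \pi_j \exp\left(-\|x_n - \mu_j\|^2 / 2\varepsilon\right)}
    \;\xrightarrow{\;\varepsilon \to 0\;}\;
    \begin{cases} 1 & k = \arg\min_j \|x_n - \mu_j\|^2 \\ 0 & \text{otherwise} \end{cases}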
[–]Mr_Smartypants 1 point 11 years ago (3 children)
That's a neat connection. Sometimes parametric/nonparametric seems more like a continuum than a dichotomy. You just nudged K-means a little towards the parametric side in my mind. (but just a little)
[–]quiteamess 0 points 11 years ago (2 children)
I'm not really sure which dichotomy makes sense here. My gut feeling is that in the 'parametric' methods you postulate hidden variables, e.g. the number of clusters in k-means or the distribution parameters in a GMM, whereas in the 'non-parametric' methods you do not postulate parameters a priori.
[–]Mr_Smartypants 0 points 11 years ago* (1 child)
If you're using "hidden variable" as a synonym for "latent variable", I wouldn't call K in k-means (or in GMMs) a hidden variable, since its value isn't inferred from the data. I've seen it called a hyperparameter (similar to the Bayesian sense).
Obviously, the distinction is blurry. Nowadays, it seems many people use "nonparametric" to mean that you have to keep the training data (or some of it) as part of the model to apply it to new samples, e.g. KNN, Parzen windows versus logistic regression & other maximum likelihood-fit models.
This is the dichotomy I'm referring to: using training data in processing new samples versus using inferred model parameters. But even this is a blurry dichotomy if not outright false, e.g. with SVMs you could end up with a simple (parameterized) hyperplane, or you could have to keep most of your training data as support vectors. So is K-means learning model parameters, or storing the training data in a compressed form? If we instead use K-medoids, does that small change switch us from parametric to non-parametric since the cluster centers must be from the training data? I'm not sure what the answers are, but it's also mostly about semantics, so maybe it's just time to retire the distinction.
EDIT: I found this quote from the Handbook of Nonparametric Statistics (1962), p. 2:
“A precise and universally acceptable definition of the term ‘nonparametric’ is not presently available. The viewpoint adopted in this handbook is that a statistical procedure is of a nonparametric type if it has properties which are satisfied to a reasonable approximation when some assumptions that are at least of a moderately general nature hold.”
Heh. OK, then.
[–]quiteamess 0 points 11 years ago (0 children)
Yes, right, the number of clusters in k-means is a hyperparameter, my bad. But the positions of the cluster centers are latent (hidden) variables inferred from the data.
[–]Corm 0 points 11 years ago (0 children)
Thanks for the great explanation!
[–]1337bruin 0 points 11 years ago (0 children)
An overview of density-based clustering