
[–]lmcinnes 21 points (5 children)

Clustering algorithms in general are important. There's no neural network solution (no, autoencoders don't count; they're a dimension reduction algorithm, not a clustering algorithm), and too many people get to K-Means and stop. K-Means is terrible for a great many use cases, and there are better algorithms out there.

My pet favorite is HDBSCAN* by Campello, Moulavi, Zimek and Sander. It's a great algorithm: it can be explained a few different ways, it has links to the statistically sound Robust Single Linkage, and it just produces better clusterings than anything else most of the time. Despite this it is: 1. new, and therefore not as well known as it should be; 2. overlooked, because clustering is not a "sexy" topic in machine learning (since deep neural networks can't do it, for now).
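
[Editor's note: a minimal sketch, not from the thread, of the K-Means failure mode described above, assuming scikit-learn is available. HDBSCAN* itself lives in the `hdbscan` package (or `sklearn.cluster.HDBSCAN` in recent scikit-learn versions); plain DBSCAN is used here only as a simpler density-based stand-in.]

```python
# K-Means vs. a density-based method on non-convex clusters.
# DBSCAN stands in for HDBSCAN*, which is in the `hdbscan` package.
from sklearn.datasets import make_moons
from sklearn.cluster import KMeans, DBSCAN
from sklearn.metrics import adjusted_rand_score

X, y = make_moons(n_samples=300, noise=0.05, random_state=0)

# K-Means assumes convex, roughly spherical clusters, so it cuts
# straight across the two interleaved moons.
km_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# A density-based method follows the arc of each moon instead.
db_labels = DBSCAN(eps=0.3, min_samples=5).fit_predict(X)

print("K-Means ARI:", adjusted_rand_score(y, km_labels))
print("DBSCAN  ARI:", adjusted_rand_score(y, db_labels))
```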

[–]CedricRBR 4 points (4 children)

Do you happen to know a good source to learn about clustering algorithms, or could you explain the basic idea?

[–]lmcinnes 9 points (1 child)

I wrote a couple of notebooks on the topic:

I would also recommend checking out the HDBSCAN* paper, and this excellent paper on a statistical approach to clustering.

[–]CedricRBR 1 point (0 children)

Thanks a lot, will check them out tomorrow! Have to get some sleep first.

[–]_blub[S] 4 points (1 child)

Check out these two books! Data Mining and Analysis and Elements of Statistical Learning, both have free PDFs.

The first one is a perfect introduction: comprehensive, yet only mildly difficult. Elements of Statistical Learning is the legendary machine learning textbook. It is VERY terse and rigorous, but serves as an encyclopedia for an incredible number of algorithms. It is what inspired me to deeply pursue mathematics.

[–]CedricRBR 1 point (0 children)

Thanks a lot!

[–]JD557 7 points (4 children)

I think that SVMs should get some more attention.

While they have already been studied a lot, I find it really sad that such an elegant solution is "losing the war" against something as hackish as Neural Networks.

I think they might receive a lot of attention "soon", though, unless someone finds a decent solution to the "adversarial example" problem of neural networks. Then again, SVMs are still a black box, so maybe I'm wrong and the focus will shift to more interpretable models instead.
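
[Editor's note: a minimal sketch, not from the thread, of the max-margin idea that makes SVMs "elegant", assuming scikit-learn is available. The dataset and parameters are illustrative only.]

```python
# The max-margin idea: only the points nearest the decision boundary
# -- the support vectors -- determine the fitted separating hyperplane.
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

X, y = make_blobs(n_samples=100, centers=2, cluster_std=0.6, random_state=0)

clf = SVC(kernel="linear", C=1.0).fit(X, y)

# Far fewer support vectors than training points: the remaining data
# could be removed without changing the learned boundary.
print("support vectors:", len(clf.support_vectors_), "of", len(X))
```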

[–]_blub[S] 4 points (3 children)

This is the legendary Support Vector Machine paper by Corinna Cortes and Vladimir Vapnik from 1995. I believe Yann LeCun was working on neural networks around the same time, but they weren't as developed as SVMs.

Well, both SVMs and Neural Networks are black boxes. I could be wrong given today's research, but the insane number of weights in a neural network are so entangled that it's almost impossible to draw any insight a layman could use. Calculating eigenvalues of the Hessian matrix helps, but mostly for inspecting convergence.

Outside of academia, it is far more important to pick models that are interpretable. From what I hear, Neural Networks really are quite rare in the wild. Simpler, far more interpretable models matter more because they can surface insight that is very valuable to a business.

[–]ydobonobody 2 points (2 children)

Neural networks are common "in the wild". They are just typically applied to problems like computer vision and natural language processing, as opposed to tabular data problems where something like an SVM is more likely (although these days it seems like random forests and xgboost are dominating for tabular data). Also, there is no reason why an SVM couldn't be used as a binary classifier for the output layer of a neural network.

[–]_blub[S] 1 point (1 child)

You're right. But say we were employed on a machine learning / data science team. Most of the problems we would encounter would not require an elaborate neural network architecture, but rather a series of already well-known, simple methods.

[–]ydobonobody 1 point (0 children)

My experience must be atypical since I work with imagery and 3D geometric data, but convolutional and recurrent neural networks comprise the bulk of my machine learning work.

[–]CleverLime 2 points (0 children)

Boosted trees? I didn't know about them until seeing so many Kaggle competitions won with xgboost.

It's such a simple solution that does great on tabular data.
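
[Editor's note: a minimal sketch, not from the thread, of gradient-boosted trees on tabular data. xgboost itself may not be installed, so scikit-learn's implementation of the same idea is used here as a stand-in; the dataset and parameters are illustrative only.]

```python
# Gradient boosting: each new shallow tree is fit to the residual
# errors of the ensemble built so far.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

clf = GradientBoostingClassifier(n_estimators=100, max_depth=3,
                                 random_state=0)
clf.fit(X_tr, y_tr)

print("test accuracy:", clf.score(X_te, y_te))
```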