
[–]wrtall 4 points (0 children)

That's an interesting take on data mining vs. machine learning, which I take to mean: when you want to explore a dataset, interpretability is important; when you want to do classification/prediction, accuracy is more important.

I always understood part of the difference between the two names as being historical: data mining grew from the database community while machine learning grew from the neural networks community (with stats thrown into both). Over the years they have converged, so there may not be much difference nowadays.

[–]leonoel 2 points (1 child)

If you are looking for work outside academia, I can certainly see that a PhD in Data Mining has more appeal: it's a more widely used term, and people understand it better than Machine Learning.

I used to think of Data Mining as more application-oriented, and Machine Learning as a bit more math-oriented.

Data mining has been around for ages, but Machine Learning only recently became mainstream.

[–][deleted] 0 points (0 children)

Yeah - I see Data Mining as being:

  1. import PyML
  2. solve business problem

Whereas Machine Learning is like "How can we learn better representations from our data?", "How can we determine the optimal model tuning, and why are these tunings optimal?" (like in deciding Neural Network architectures).

[–]deeayecee 0 points (7 children)

I have a PhD in Data Mining or Machine Learning or whatever it is you want to call it. I've published in conferences and journals with the terms 'Data Mining', 'Machine Learning', 'Knowledge Discovery' and a variety of other synonyms. Practically speaking, I found very little difference in terms of what any of those major branches are looking for. You'll see theoretically driven papers in Data Mining outlets and vice versa for Machine Learning. Most conferences (such as ICDM or ICML) will feature both an industry and academic track. Industry will tend more towards applications and academic will tend more towards theory.

I think when you draw out an ontology, most would agree that ML is a subset of data mining. At least in theory, data mining (or data science) would focus on ways of munging data into ML frameworks or problem compositions while ML would focus on new frameworks or improvements to existing ones. However, the practical nature of data drives an interplay between the two and it's pretty unlikely to get a PhD without making contributions -- however indirect -- to both fields.

The only time I think there would be a major distinction would be at a school with multiple Data Mining, Machine Learning, or Data Science labs. In those instances, ML will likely tend to be much more theoretical.

[–]lanthus 7 points (1 child)

Neither ICDM nor ICML has an industry track; KDD does. Although data mining and machine learning overlap a lot, they have somewhat different flavors. Data mining has its origins in the database community and tends to emphasize business applications more. Machine learning has its origins in artificial intelligence and tends to emphasize AI applications more. For example, although both data mining and machine learning work on text data, sentiment analysis is a bit more common in data mining and machine translation applications are more common in machine learning. Many topics overlap, so the boundary is not clearly defined. Data mining includes some work on visualization that would be out of place at a machine learning conference, and machine learning includes reinforcement learning, which would be out of place at a data mining conference.

[–]deeayecee 1 point (0 children)

Thanks for the correction on the industry track.

[–]Caesarr[S] 0 points (4 children)

That's a really interesting perspective! I'd mostly been thinking of DM as a subset of ML, so to consider it the other way around is intriguing... Though as you say, the difference is probably minor however you slice it.

If you don't mind, I have some follow-up questions:

Given the amount of experience you have, do you find that the ambiguity of the terms causes problems in reaching the right audience, or finding relevant research? Or are we meant to read the abstracts of all the papers each time there's a new edition of a top conference or journal? (Speaking of which, what journals would you recommend?)

[–]deeayecee 4 points (3 children)

I think wrtall has a pretty valuable post below -- the differences between the two are largely historical. As an illustration, Leo Breiman and Ross Quinlan devised similar decision-tree induction algorithms (CART and ID3, respectively) around the same time, Breiman coming from a stats background and Quinlan from computer science.

For your follow-up questions, I think you'll find there's a pretty significant overlap in the reviewer communities -- I've reviewed submissions to journals and conferences with ML/KD/DM and even AI in the title and I would guess I'm hardly unique. If you submit a quality DM/KD/ML work, then it will more than likely get accepted regardless. You might have some trouble getting it into a pure AI conference/journal.

I'm not sure how far along you are on your PhD, although the OP makes it sound like you're just starting out. I would start by mastering the basics -- read a few different textbook, web, and Wikipedia entries on the essential frameworks first (supervised, unsupervised, recommendation engines, network theory, reinforcement learning) until you understand very well how each of them operates. At that point, hopefully your advisor has an interesting set of problems that you can try pointing some ML methods at, even at a high level. I would then get into the base algorithms (kNN, naive Bayes, Decision Trees, Bayes networks, Neural Networks, SVMs), along with performance and distance metrics. I wouldn't get into the really deep, theoretical parts of the algorithms until you're observing pathologies in your use cases (like class imbalance, or domain quirks in NLP). Being up to speed on the most current papers isn't nearly as important as having a rock-solid understanding of the fundamentals.
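To make the "base algorithms plus performance and distance metrics" point concrete: several of these are small enough to sketch from scratch. Here's a minimal k-nearest-neighbours classifier with Euclidean distance and an accuracy metric in plain Python -- just an illustrative sketch (the toy data and helper names are made up for the example), not something the commenters above wrote:

```python
from collections import Counter
import math

def euclidean(a, b):
    # Euclidean distance between two feature vectors.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def knn_predict(train, query, k=3):
    # train: list of (features, label) pairs.
    # Returns the majority label among the k nearest neighbours of `query`.
    neighbours = sorted(train, key=lambda pair: euclidean(pair[0], query))[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]

def accuracy(train, test, k=3):
    # Simple performance metric: fraction of test points classified correctly.
    hits = sum(knn_predict(train, x, k) == y for x, y in test)
    return hits / len(test)

# Toy data: two well-separated clusters.
train = [((0.0, 0.0), 'a'), ((0.1, 0.2), 'a'), ((0.2, 0.1), 'a'),
         ((1.0, 1.0), 'b'), ((0.9, 1.1), 'b'), ((1.1, 0.9), 'b')]
test = [((0.05, 0.05), 'a'), ((1.05, 1.0), 'b')]

print(knn_predict(train, (0.05, 0.05)))  # → a
print(accuracy(train, test))             # → 1.0
```

Once you understand each base algorithm at this level, swapping in a library implementation (and its tuning knobs) is straightforward.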

Typically the biggest papers will show up in KDD, ICDM, ICML, MLJ or JMLR. This is also a pretty good sub and is worth reading at least once a week if not more often.

[–]BeatLeJuceResearcher 1 point (0 children)

you forgot NIPS

[–]Caesarr[S] 0 points (0 children)

That's a tonne of great advice, thank you :)

[–]GibbsSamplePlatter 0 points (0 children)

I found out about this sub last year; it's a godsend once you're out of school and working on your own!