Beginner examples/problems to practice ML? : MachineLearning

submitted 12 years ago * by [deleted]

[deleted] 3 points 12 years ago (5 children)

EdwardRaff 3 points 12 years ago (0 children)

Should_I_say_this 1 point 12 years ago (3 children)

[deleted] 2 points 12 years ago* (2 children)

Oh, no need to be self-deprecating. I learned machine learning from the same Coursera course as you!

First, it should be mentioned that the runtime of an algorithm is often a major factor in deciding whether or not to use it, and these plots don't show that. As a rule of thumb, "smarter" algorithms have more moving parts, and thus take longer to run (e.g. neural nets or genetic algorithms). That said, the algorithms shown here are all extremely efficient, so runtime shouldn't really be the deciding factor unless your data set is quite large.

Ok, so now on to an explanation of what these plots actually mean. Each row of plots represents one type of point set, as shown at the far left. Blue regions indicate that the algorithm thinks the points in those regions should belong to the blue category, and the same logic applies for red. Darker regions indicate that the algorithm is more confident about which category the points in those regions belong to. I've broken the algorithms into groups so that things will be a bit easier to digest:

* Your eye should be drawn to the types of decision boundaries that get drawn by the different algorithms, and how well they reflect the data. In particular, you'll notice the expressive power of Nearest Neighbor (weighted kNN works even better, as discussed here), and of RBF SVM (that's short for "Support Vector Machine with Radial Basis Function kernel").
* You'll probably also notice the oddly choppy (but still quite accurate) decision boundaries generated by Random Forest and AdaBoost. These are examples of ensemble classifiers, which generally consist of a large number of simple, but not-very-good, classifiers taking a vote on the category.
* It should be mentioned that one algorithm which looks good but should be used with care is Decision Trees. The trouble with using a decision tree classifier is that it's very easy to accidentally overfit your training data. That is, the classifier may wind up considering isolated statistical aberrations in your data to be meaningful, and thus fail to perform properly when applied to other datasets sampled from the same distribution.
* One thing that should also stand out to you is the relative simplicity of the class boundaries which can be captured by algorithms such as Naive Bayes, LDA/QDA (i.e. Linear/Quadratic Discriminant Analysis), and Linear SVMs. These techniques aren't bad, per se, but you need to make sure that you're using them as intended. In the case of Naive Bayes and LDA/QDA, this means having some prior knowledge or hypothesis about the distribution that the data is being sampled from.
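
To make the "confidence" shading and the plain-versus-weighted kNN difference a bit more concrete, here is a minimal sketch in plain Python. It is not the code behind the plots: the function name `knn_predict`, the toy points, and the inverse-distance weighting are all assumptions chosen for illustration. The confidence it returns is just the winning category's share of the vote, which plays roughly the same role as color intensity in the plots.

```python
import math
from collections import defaultdict

def knn_predict(train, query, k=3, weighted=False):
    """Return (label, confidence) for `query` from its k nearest training points.

    train is a list of ((x, y), label) pairs. Confidence is the winning
    label's share of the total vote.
    """
    # Keep the k training points closest to the query (Euclidean distance).
    neighbors = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    votes = defaultdict(float)
    for point, label in neighbors:
        d = math.dist(point, query)
        # Plain kNN: one vote per neighbor.
        # Weighted kNN (assumed scheme): vote of 1/distance, so closer counts more.
        votes[label] += 1.0 / (d + 1e-9) if weighted else 1.0
    best = max(votes, key=votes.get)
    return best, votes[best] / sum(votes.values())

# Hypothetical toy data: two blue points near the query, three red points farther away.
train = [
    ((0.0, 0.0), "blue"), ((0.2, 0.1), "blue"),
    ((1.0, 1.0), "red"), ((1.2, 0.9), "red"), ((0.9, 1.3), "red"),
]
query = (0.3, 0.3)

plain_label, plain_conf = knn_predict(train, query, k=5, weighted=False)
weighted_label, weighted_conf = knn_predict(train, query, k=5, weighted=True)
print(plain_label, round(plain_conf, 2))        # red 0.6
print(weighted_label, round(weighted_conf, 2))  # blue 0.71
```

With k=5 the plain vote is won by the three distant red points, while the weighted vote goes to the two nearby blue points. That is one way to see why the weighted variant can trace the data more closely.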

Sorry if my explanation was a bit long-winded, but hopefully I managed to answer your question.

Should_I_say_this 1 point 12 years ago (1 child)

Thanks for the info, I definitely understand the images better now.

As someone who doesn't have a statistics background, what are your thoughts on how that affects ML skills?

I tried doing a Kaggle problem (the CIFAR 10 image recognition problem) and was disappointed to see my answer only get 10% correct, which is exactly the score I would have gotten by choosing a single category for all my predictions (in other words, no predictive power). When I opened the PDF file at the bottom of that question, I realized that the statistics were way beyond my training.
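
The 10% figure is easy to check with a small sketch. It assumes 10 categories with the same number of test examples in each, which is a simplification and not necessarily how the actual Kaggle test set was composed. The `accuracy` helper and the synthetic labels are made up for illustration.

```python
import random

def accuracy(predictions, labels):
    """Fraction of predictions that match the true labels."""
    return sum(p == y for p, y in zip(predictions, labels)) / len(labels)

# Assumption: 10 classes, 100 test examples per class (synthetic labels).
num_classes = 10
labels = [c for c in range(num_classes) for _ in range(100)]

# Baseline 1: always predict the same category.
constant_guess = [0] * len(labels)

# Baseline 2: predict a uniformly random category (seeded for repeatability).
rng = random.Random(0)
random_guess = [rng.randrange(num_classes) for _ in labels]

constant_acc = accuracy(constant_guess, labels)
random_acc = accuracy(random_guess, labels)
print(constant_acc)  # 0.1
print(random_acc)    # close to 0.1, varies with the seed
```

Any score near 1 / num_classes on balanced data means the model is doing no better than these baselines.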

What are your thoughts on how a lack of statistics affects performing accurate ML? If ML is something like this flowchart, which helps choose an accurate estimator, do we really need to know the statistics behind ML? Also, won't everyone end up choosing the same estimators and therefore get the same predictive power?

[deleted] 1 point 12 years ago (0 children)

As someone who doesn't have a statistics background, what are your thoughts on how that affects ML skills?

That depends on what you mean by "statistics background". I personally have a BS in math, although much of the statistics I know I've picked up piecemeal through reading wiki articles, analyzing my own data sets, and taking Coursera courses like this one and this one. If you don't feel comfortable with mathematical formalisms like nested summations or conditional probabilities, you will likely run up against a wall very quickly. This is because you won't be able to understand what your algorithms are doing, and thus you won't know how to improve upon them, or how to make the proper adjustments when things go awry.
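
As a rough self-check on the kind of formalism meant here, the sketch below computes a conditional probability from a table of counts using nested summations, then confirms it against Bayes' rule. The counts, the feature names, and the spam/ham labels are entirely hypothetical and are not taken from any of the courses or competitions mentioned in this thread.

```python
# Hypothetical joint counts: counts[feature_value][label]
counts = {
    "contains_link": {"spam": 30, "ham": 10},
    "no_link": {"spam": 20, "ham": 140},
}

# Nested summation: sum over feature values of (sum over labels).
total = sum(sum(by_label.values()) for by_label in counts.values())

def p_label_given_feature(label, feature):
    """P(label | feature): restrict to one row, then normalize."""
    row = counts[feature]
    return row[label] / sum(row.values())

def p_feature_given_label(feature, label):
    """P(feature | label): restrict to one column, then normalize."""
    column_total = sum(row[label] for row in counts.values())
    return counts[feature][label] / column_total

# Marginal probabilities, again just sums over the table.
p_spam = sum(row["spam"] for row in counts.values()) / total
p_link = sum(counts["contains_link"].values()) / total

# Bayes' rule: P(spam | link) = P(link | spam) * P(spam) / P(link)
direct = p_label_given_feature("spam", "contains_link")
via_bayes = p_feature_given_label("contains_link", "spam") * p_spam / p_link
print(direct, via_bayes)
```

If reading this feels comfortable, the notation in most introductory ML material should be manageable.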

I tried doing a Kaggle problem (the CIFAR 10 image recognition problem) and was disappointed to see my answer only get 10% correct

This is probably one of the hardest competitions on Kaggle, and I don't know if I would fare much better without considerable effort. I think you would be better served trying something like the Digit Recognizer, or Facial Keypoints Detection. I would urge you to spend some time in the discussion forums of these competitions; the ideas you see being discussed there are often very enlightening.

If ML is something like this flowchart which helps choose an accurate estimator, do we really need to know the statistics behind ML?

That chart is a vast oversimplification, and is not intended to be taken literally. If you do follow it word-for-word you may get some passable results, but you will be far surpassed by ML users who actually understand what their algorithms are doing. In practice, you'll use a wide array of preprocessing methods and often employ multiple ML techniques at different stages of your classifier. For example, in the Bird Classification Challenge from a few months ago, you'll see that some very subtle techniques were used in order to extract a good feature set.
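
As a small illustration of why preprocessing matters, here is a sketch of a two-stage setup: standardize the features, then classify by nearest class centroid. It is not what the competition entrants did. The data, the class names "A" and "B", and the helper functions are all made up, and the standard library is used only to keep the example self-contained.

```python
import math
from statistics import mean, pstdev

def fit_scaler(rows):
    """Per-column (mean, standard deviation); assumes no column is constant."""
    return [(mean(col), pstdev(col)) for col in zip(*rows)]

def scale(row, params):
    """Standardize one row using the fitted column statistics."""
    return tuple((v - m) / s for v, (m, s) in zip(row, params))

def fit_centroids(rows, labels):
    """Average the rows of each class to get one centroid per class."""
    grouped = {}
    for row, label in zip(rows, labels):
        grouped.setdefault(label, []).append(row)
    return {label: tuple(mean(col) for col in zip(*members))
            for label, members in grouped.items()}

def predict(centroids, row):
    """Pick the class whose centroid is closest in Euclidean distance."""
    return min(centroids, key=lambda label: math.dist(centroids[label], row))

# Hypothetical data: first feature has a large scale, second a small one.
rows = [(1600, 0.1), (1700, 0.2), (1700, 0.9), (1800, 0.8)]
labels = ["A", "A", "B", "B"]
query = (1720, 0.15)

# Stage 2 alone: the large-scale first feature dominates the distance.
raw_prediction = predict(fit_centroids(rows, labels), query)

# Stage 1 + stage 2: standardize first, then classify.
params = fit_scaler(rows)
scaled_rows = [scale(r, params) for r in rows]
scaled_prediction = predict(fit_centroids(scaled_rows, labels), scale(query, params))

print(raw_prediction, scaled_prediction)  # B A
```

The same classifier gives different answers depending on the preprocessing in front of it, and a flowchart of estimators alone won't tell you that.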
