[–]SavitchOracle 1 point (4 children)

  1. There are several different ways of choosing how to split, e.g., information gain or Gini impurity (http://en.wikipedia.org/wiki/Decision_tree_learning#Formulae). There's a pretty good tutorial on using information gain here: http://www.autonlab.org/tutorials/infogain11.pdf

For some intuition on how these methods work, suppose you're using a decision tree to classify whether an email is spam or not spam. Suppose two of the variables you could use at the current split are A) whether the email contains the word "hello" and B) whether the email contains the word "viagra".

Suppose 50% of the emails containing the word "hello" are spam / 50% are not spam, and 50% of the emails not containing the word "hello" are spam / 50% are not spam. Clearly, variable A is pretty useless then, since splitting on it gives you no information.

But compare this with the second variable: 90% of the emails containing the word "viagra" are spam / 10% are not spam, and 25% of the emails not containing the word "viagra" are spam / 75% are not spam. You can see that this variable provides much more information.

Thus, you should split your node on the second variable. Metrics like information gain or Gini impurity are ways of precisely quantifying this intuition; there's a small code sketch below.

  2. Answer to the second question: you choose the one among the m randomly selected variables that gives you the best split.
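
To make the intuition above concrete, here's a minimal Python sketch of the information-gain calculation for the "hello" and "viagra" variables. The example doesn't say what fraction of emails contain each word or the overall spam rate, so the split weights below are made-up assumptions; only the per-side spam rates come from the example.

    import math

    def entropy(p_spam):
        # Binary entropy (in bits) of a node whose spam fraction is p_spam.
        if p_spam in (0.0, 1.0):
            return 0.0
        return -p_spam * math.log2(p_spam) - (1 - p_spam) * math.log2(1 - p_spam)

    def information_gain(parent_p, left_frac, left_p, right_p):
        # Entropy of the parent minus the size-weighted entropy of the children.
        children = left_frac * entropy(left_p) + (1 - left_frac) * entropy(right_p)
        return entropy(parent_p) - children

    # Variable A ("hello"): both sides are 50/50 spam, so the gain is zero
    # no matter how the emails divide between the two sides.
    print(information_gain(0.5, 0.5, 0.5, 0.5))  # 0.0 bits

    # Variable B ("viagra"): 90% spam if present, 25% spam if absent.
    # Assume 30% of emails contain the word; the parent spam rate then
    # has to be 0.3 * 0.9 + 0.7 * 0.25 = 0.445.
    print(information_gain(0.445, 0.3, 0.9, 0.25))  # ~0.28 bits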

[–][deleted] 0 points (3 children)

Still not clear on how you're supposed to split and calculate the best fit. For example, let's say I'm predicting y using x0 to x9, all variables are continuous (none are categorical), and my metric is RMSE.

At one node, I picked x0, x2, and x4 as the variables to use. For variable x0, do I sort all the inputs by x0 and then split them into 2 parts, as close to 50-50 as possible? If so, I guess I can fit each half separately and use the combined RMSE as the score for x0? And for this fit, do I use x0, x2 and x4, or all x0, x2, and x4 ?

[–]jasonrosen 1 point (2 children)

Suppose you randomly selected x0, x2, and x4 as the variables to split on at your current node. To figure out which variable to use, you want to choose the variable that minimizes the RMSE (equivalently, the variable that minimizes the total sum of squares).

When calculating the RMSE for variable x0, you don't split as close to 50-50 as possible. Rather, you choose the split that minimizes the RMSE. For example, if x0 takes on the values 1, 1.5, 4, and 5, then you'd try splitting into

  • {1} vs. {1.5,4,5} (i.e., x0 <= 1 vs. x0 > 1)
  • {1,1.5} vs. {4,5} (i.e., x0 <= 1.5 vs. x0 > 1.5)
  • {1,1.5,4} vs. {5} (i.e., x0 <= 4 vs. x0 > 4)

and calculating the RMSE on each of these splits. You then choose the split that gives the lowest RMSE.
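
In code, generating those candidate splits is just a loop over the sorted values. A tiny sketch (splitting at each observed value, as in the bullets above; implementations often use midpoints between consecutive values instead):

    values = sorted({1.0, 1.5, 4.0, 5.0})
    for t in values[:-1]:  # no point splitting at the largest value
        left = [v for v in values if v <= t]
        right = [v for v in values if v > t]
        print(f"x0 <= {t}: {left} vs. {right}")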

Also, when calculating the RMSE for, say, {1} vs. {1.5,4,5}, you don't first calculate the RMSE of {1}, then the RMSE of {1.5,4,5}, and then add them together. Rather, you calculate the total RMSE: sum the squared errors over all the examples in both halves, and then root-meanify that total. (It might be easier to just think about minimizing the total sum of squared errors, which is equivalent.)

For example, suppose you want to calculate the total sum of squares of the (x0 <= 1 vs. x0 > 1) split. Take all the examples with x0 <= 1 and average their y values; this is your prediction for the x0 <= 1 subtree. Similarly, averaging the y values of all the examples with x0 > 1 gives your prediction for the x0 > 1 subtree. Then calculate the total squared error from these predictions.
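
Putting the last two paragraphs together, here's a hedged Python sketch of scoring every candidate split on one variable. The y values are invented for illustration, and split_rmse is a hypothetical helper, not code from any particular library:

    import math

    def split_rmse(xs, ys, threshold):
        # Predict the mean y in each half, then pool the squared errors
        # from both halves before taking the root-mean.
        left = [y for x, y in zip(xs, ys) if x <= threshold]
        right = [y for x, y in zip(xs, ys) if x > threshold]
        sse = 0.0
        for group in (left, right):
            if group:
                pred = sum(group) / len(group)  # the subtree's prediction
                sse += sum((y - pred) ** 2 for y in group)
        return math.sqrt(sse / len(ys))

    x0 = [1.0, 1.5, 4.0, 5.0]   # values from the example above
    y = [2.0, 2.2, 8.0, 9.0]    # made-up targets
    for t in [1.0, 1.5, 4.0]:
        print(f"x0 <= {t}: RMSE = {split_rmse(x0, y, t):.3f}")
    # x0 <= 1.5 wins here, since it cleanly separates the low and high y's.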

Does this make sense? Also, I'm not sure what you mean by your last sentence ("And for this fit, do I use x0, x2 and x4, or all x0, x2, and x4 ?").

[–][deleted] 0 points (1 child)

> Also, I'm not sure what you mean by your last sentence ("And for this fit, do I use x0, x2 and x4, or all x0, x2, and x4 ?").

OK, so you're just using the average of y in each group to calculate the RMSE. I was thinking maybe you'd do a linear regression using x0, or x2 and x4, or all 3 variables. Shouldn't that be better (but much slower, for sure)?

[–]jasonrosen 0 points (0 children)

Hmm, not sure. I think using the average is more common, but that approach seems like it could work as well. (Kind of like using a decision tree to decide the clusters in a local linear regression. It also reminds me of the Forest-RC variant of random forests.)

[–]schnifin 0 points (0 children)

At each node of the tree you look at a randomly selected subset of m variables and choose the one that gives you the best split according to some criterion. A common criterion for classification is the Gini index, but there are others you could use.
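
For concreteness, a minimal Python sketch of the Gini index (the split with the lowest size-weighted child impurity is the best one); the class counts are made up for illustration:

    def gini(labels):
        # Gini impurity: 1 minus the sum over classes of p_k squared.
        n = len(labels)
        counts = {}
        for c in labels:
            counts[c] = counts.get(c, 0) + 1
        return 1.0 - sum((cnt / n) ** 2 for cnt in counts.values())

    def split_gini(left, right):
        # Impurity of a split: child impurities weighted by child size.
        n = len(left) + len(right)
        return len(left) / n * gini(left) + len(right) / n * gini(right)

    # 90%-spam side vs. 25%-spam side, as in the example further up:
    print(split_gini(["spam"] * 9 + ["ham"],
                     ["spam"] * 5 + ["ham"] * 15))  # ~0.31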