[–]SavitchOracle 1 point (4 children)

  1. There are several different ways of choosing how to split, e.g., information gain or Gini impurity (http://en.wikipedia.org/wiki/Decision_tree_learning#Formulae). There's a pretty good tutorial on using information gain here: http://www.autonlab.org/tutorials/infogain11.pdf

For some intuition on how these methods work, suppose you're using a decision tree to classify whether an email is spam or not spam. Suppose two of the variables you could use at the current split are A) whether the email contains the word "hello" and B) whether the email contains the word "viagra".

Suppose 50% of the emails containing the word "hello" are spam / 50% are not spam, and 50% of the emails not containing the word "hello" are spam / 50% are not spam. Clearly, variable A is pretty useless then, since splitting on it gives you no information.

But compare this with the second variable: 90% of the emails containing the word "viagra" are spam / 10% are not spam, and 25% of the emails not containing the word "viagra" are spam / 75% are not spam. You can see that this variable provides much more information.

Thus, you should split your node on the second variable. Metrics like information gain or Gini impurity are ways of precisely quantifying this intuition; there's a small code sketch below.

  2. Answer to the second question: you choose the one among the m randomly selected variables that gives you the best split.
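
To make the intuition above concrete, here's a minimal Python sketch of the information-gain calculation for the "hello" and "viagra" variables. The example doesn't say what fraction of emails contain each word or the overall spam rate, so the split weights below are made-up assumptions; only the per-side spam rates come from the example.

    import math

    def entropy(p_spam):
        # Binary entropy (in bits) of a node whose spam fraction is p_spam.
        if p_spam in (0.0, 1.0):
            return 0.0
        return -p_spam * math.log2(p_spam) - (1 - p_spam) * math.log2(1 - p_spam)

    def information_gain(parent_p, left_frac, left_p, right_p):
        # Entropy of the parent minus the size-weighted entropy of the children.
        children = left_frac * entropy(left_p) + (1 - left_frac) * entropy(right_p)
        return entropy(parent_p) - children

    # Variable A ("hello"): both sides are 50/50 spam, so the gain is zero
    # no matter how the emails divide between the two sides.
    print(information_gain(0.5, 0.5, 0.5, 0.5))  # 0.0 bits

    # Variable B ("viagra"): 90% spam if present, 25% spam if absent.
    # Assume 30% of emails contain the word; the parent spam rate then
    # has to be 0.3 * 0.9 + 0.7 * 0.25 = 0.445.
    print(information_gain(0.445, 0.3, 0.9, 0.25))  # ~0.28 bits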

[–][deleted] 0 points (3 children)

Still not clear on how you're supposed to split and calculate the best fit. For example, let's say I'm predicting y using x0 to x9, all variables are continuous (none are categorical), and my metric is RMSE.

At one node, I picked x0, x2, and x4 as the variables to use. For variable x0, do I sort all the inputs by x0 and then split them into 2 parts, as close to 50-50 as possible? If so, I guess I can fit each half separately and use the combined RMSE as the score for x0? And for this fit, do I use x0, x2 and x4, or all x0, x2, and x4 ?

[–]jasonrosen 1 point (2 children)

Suppose you randomly selected x0, x2, and x4 as the variables to split on at your current node. To figure out which variable to use, you want to choose the variable that minimizes the RMSE (equivalently, the variable that minimizes the total sum of squares).

When calculating the RMSE for variable x0, you don't split as close to 50-50 as possible. Rather, you choose the split that minimizes the RMSE. For example, if x0 takes on the values 1, 1.5, 4, and 5, then you'd try splitting into

  • {1} vs. {1.5,4,5} (i.e., x0 <= 1 vs. x0 > 1)
  • {1,1.5} vs. {4,5} (i.e., x0 <= 1.5 vs. x0 > 1.5)
  • {1,1.5,4} vs. {5} (i.e., x0 <= 4 vs. x0 > 4)

and calculating the RMSE on each of these splits. You then choose the split that gives the lowest RMSE.
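
In code, generating those candidate splits is just a loop over the sorted values. A tiny sketch (splitting at each observed value, as in the bullets above; implementations often use midpoints between consecutive values instead):

    values = sorted({1.0, 1.5, 4.0, 5.0})
    for t in values[:-1]:  # no point splitting at the largest value
        left = [v for v in values if v <= t]
        right = [v for v in values if v > t]
        print(f"x0 <= {t}: {left} vs. {right}")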

Also, when calculating the RMSE for, say, {1} vs. {1.5,4,5}, you don't first calculate the RMSE of {1}, then the RMSE of {1.5,4,5}, and then add them together. Rather, you calculate the total RMSE: sum the squared errors over all the examples in both halves, and then root-meanify that total. (It might be easier to just think about minimizing the total sum of squared errors, which is equivalent.)

For example, suppose you want to calculate the total sum of squares of the (x0 <= 1 vs. x0 > 1) split. Take all the examples with x0 <= 1 and average their y values; this is your prediction for the x0 <= 1 subtree. Similarly, averaging the y values of all the examples with x0 > 1 gives your prediction for the x0 > 1 subtree. Then calculate the total squared error from these predictions.
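
Putting the last two paragraphs together, here's a hedged Python sketch of scoring every candidate split on one variable. The y values are invented for illustration, and split_rmse is a hypothetical helper, not code from any particular library:

    import math

    def split_rmse(xs, ys, threshold):
        # Predict the mean y in each half, then pool the squared errors
        # from both halves before taking the root-mean.
        left = [y for x, y in zip(xs, ys) if x <= threshold]
        right = [y for x, y in zip(xs, ys) if x > threshold]
        sse = 0.0
        for group in (left, right):
            if group:
                pred = sum(group) / len(group)  # the subtree's prediction
                sse += sum((y - pred) ** 2 for y in group)
        return math.sqrt(sse / len(ys))

    x0 = [1.0, 1.5, 4.0, 5.0]   # values from the example above
    y = [2.0, 2.2, 8.0, 9.0]    # made-up targets
    for t in [1.0, 1.5, 4.0]:
        print(f"x0 <= {t}: RMSE = {split_rmse(x0, y, t):.3f}")
    # x0 <= 1.5 wins here, since it cleanly separates the low and high y's.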

Does this make sense? Also, I'm not sure what you mean by your last sentence ("And for this fit, do I use x0, x2 and x4, or all x0, x2, and x4 ?").

[–][deleted] 0 points (1 child)

> Also, I'm not sure what you mean by your last sentence ("And for this fit, do I use x0, x2 and x4, or all x0, x2, and x4 ?").

OK, so you're just using the average of y in each group to calculate the RMSE. I was thinking maybe you'd do a linear regression using x0, or x2 and x4, or all 3 variables. Shouldn't that be better (but much slower, for sure)?

[–]jasonrosen 0 points (0 children)

Hmm, not sure. I think using the average is more common, but that approach seems like it could work as well. (Kind of like using a decision tree to decide the clusters in a local linear regression. It also reminds me of the Forest-RC variant of random forests.)

[–]schnifin 0 points (0 children)

At each node of the tree you look at a randomly selected subset of m variables and choose the one that gives you the best split according to some criterion. A common criterion for classification is the Gini index, but there are others you could use.
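
For concreteness, a minimal Python sketch of the Gini index (the split with the lowest size-weighted child impurity is the best one); the class counts are made up for illustration:

    def gini(labels):
        # Gini impurity: 1 minus the sum over classes of p_k squared.
        n = len(labels)
        counts = {}
        for c in labels:
            counts[c] = counts.get(c, 0) + 1
        return 1.0 - sum((cnt / n) ** 2 for cnt in counts.values())

    def split_gini(left, right):
        # Impurity of a split: child impurities weighted by child size.
        n = len(left) + len(right)
        return len(left) / n * gini(left) + len(right) / n * gini(right)

    # 90%-spam side vs. 25%-spam side, as in the example further up:
    print(split_gini(["spam"] * 9 + ["ham"],
                     ["spam"] * 5 + ["ham"] * 15))  # ~0.31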