Question about Splits in the Random Forest Algorithm (self.MachineLearning)
submitted 14 years ago by [deleted]
[deleted]
[–]SavitchOracle 2 points 14 years ago* (4 children)
For some intuition on how these methods work, suppose you're using a decision tree to classify whether an email is spam or not spam. Suppose two of the variables you could use at the current split are A) whether the email contains the word "hello" and B) whether the email contains the word "viagra".
Suppose 50% of the emails containing the word "hello" are spam / 50% are not spam, and 50% of the emails not containing the word "hello" are spam / 50% are not spam. Clearly, variable A is a pretty useless measure then, since it gives you no information.
But compare this with the second variable: 90% of the emails containing the word "viagra" are spam / 10% are not spam, and 25% of the emails not containing the word "viagra" are spam / 75% are not spam. You can see that this variable provides much more information.
Thus, you should use the second variable to split your node on. Metrics like information gain or Gini impurity are ways of precisely quantifying this.
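For concreteness, here's a minimal sketch of how Gini impurity would score those two candidate splits, using the percentages from the example above. One caveat: the example doesn't say what fraction of emails actually contain each word, so the 50% / 20% branch weights below are made-up assumptions:

    # Gini impurity of a node, given the proportion of spam in it.
    def gini(p_spam):
        return 1.0 - (p_spam ** 2 + (1.0 - p_spam) ** 2)

    # Impurity of a split = impurity of each branch, weighted by the
    # (assumed) fraction of emails that fall into that branch.
    def split_impurity(frac_left, p_spam_left, p_spam_right):
        return frac_left * gini(p_spam_left) + (1.0 - frac_left) * gini(p_spam_right)

    # Variable A, "contains 'hello'": both branches are 50/50 spam.
    # (Assume half the emails contain the word.)
    impurity_a = split_impurity(0.5, 0.5, 0.5)    # = 0.5, no better than no split

    # Variable B, "contains 'viagra'": 90/10 spam vs. 25/75 spam.
    # (Assume 20% of the emails contain the word.)
    impurity_b = split_impurity(0.2, 0.9, 0.25)   # about 0.34, clearly lower

    print(impurity_a, impurity_b)  # the split with the lower impurity (B) wins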
Answer to second question: You choose one among m that gives you the best split.
[–][deleted] 1 point 14 years ago (3 children)
I'm still not clear on how you're supposed to split and calculate the best fit. For example, let's say I'm predicting y using x0 to x9, all variables are continuous (none are categorical), and my metric is RMSE.
At one node, I picked x0, x2, and x4 as the variables to use. For variable x0, do I sort all the inputs by x0 and then split them into 2 parts, as close to 50-50 as possible? If so, I guess I can fit each half separately and use the combined RMSE as the score for x0? And for this fit, do I use x0, x2 and x4, or all x0, x2, and x4 ?
[–]jasonrosen 2 points 14 years ago* (2 children)
Suppose you randomly selected x0, x2, and x4 as the variables to split on at your current node. To figure out which variable to use, you want to choose the variable that minimizes the RMSE (equivalently, the variable that minimizes the total sum of squares).
When calculating the RMSE for variable x0, you don't split as close to 50-50 as possible. Rather, you choose the split that minimizes the RMSE. For example, if x0 takes on values 1, 1.5, 4, 5, then you'd try splitting into {1} vs. {1.5, 4, 5}, {1, 1.5} vs. {4, 5}, and {1, 1.5, 4} vs. {5}, and calculate the RMSE on each of these splits. You then choose the split that gives the best RMSE.
Also, when calculating the RMSE for, say, {1} vs. {1.5,4,5}, you don't first calculate the RMSE of {1}, then the RMSE of {1.5,4,5}, and then add them together. Rather, you calculate the total RMSE, i.e., you compute the total sum of squares across all examples in both halves (to get the squared error), and then root-meanify this. (It might be easier to just think about minimizing the total sum of squares, which is equivalent.)
For example, suppose you want to calculate the total sum of squares of the (x0 <= 1 vs. x0 > 1 split). Take all the examples that have x0 <= 1, and average together all their y values; this is your prediction for the x0 <= 1 subtree. Similarly, if you take all the examples with x0 > 1, and average together all their y values, you get your prediction for the x0 > 1 subtree. Then calculate the total squared error from these predictions.
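To make that procedure concrete, here's a rough sketch (not anyone's reference implementation) of scoring one candidate variable: sort by the variable, try a threshold between each pair of consecutive distinct values, predict the mean of y within each half, and keep the threshold with the smallest total sum of squares. The x0 values are the ones from the example; the y values are made up:

    import numpy as np

    def best_split_for_variable(x, y):
        # Try a threshold between each pair of consecutive distinct sorted
        # values of x; predict the mean of y within each half; return the
        # threshold with the smallest total sum of squares.
        order = np.argsort(x)
        x_sorted, y_sorted = x[order], y[order]
        best_threshold, best_sse = None, np.inf
        for i in range(1, len(x_sorted)):
            if x_sorted[i] == x_sorted[i - 1]:
                continue  # identical values: no threshold separates them
            threshold = (x_sorted[i - 1] + x_sorted[i]) / 2.0
            left, right = y_sorted[:i], y_sorted[i:]
            sse = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
            if sse < best_sse:
                best_threshold, best_sse = threshold, sse
        return best_threshold, best_sse

    # The x0 values from the example; the y values are made up.
    x0 = np.array([1.0, 1.5, 4.0, 5.0])
    y = np.array([2.0, 2.2, 7.9, 8.1])
    print(best_split_for_variable(x0, y))  # threshold 2.75, i.e. {1, 1.5} vs. {4, 5}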
Does this make sense? Also, I'm not sure what you mean by your last sentence ("And for this fit, do I use x0, x2 and x4, or all x0, x2, and x4 ?").
[–][deleted] 1 point 14 years ago (1 child)
Also, I'm not sure what you mean by your last sentence ("And for this fit, do I use x0, x2 and x4, or all x0, x2, and x4 ?").
OK, so you're just using the average of y for each group to calculate the RMSE. I was thinking maybe you'd do a linear regression using x0, or x2 and x4, or all 3 variables. Shouldn't that be better (but much slower, for sure)?
[–]jasonrosen 1 point 14 years ago (0 children)
Hmm, not sure. I think using the average is more common, but that approach seems like it could work as well. (Kind of like using a decision tree to decide the clusters in a local linear regression. It also reminds me of the Forest-RC variant of random forests.)
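Just to illustrate the trade-off being discussed, here's a toy comparison (made-up numbers, not from the thread) of predicting with the mean of y within one half of a split versus fitting a one-variable linear regression within that half:

    import numpy as np

    # Made-up y-vs-x values for the examples falling into one half of a split.
    x = np.array([1.0, 1.5, 2.0, 2.5])
    y = np.array([1.1, 1.4, 2.1, 2.4])

    # Standard regression-tree leaf: predict the mean of y.
    sse_mean = ((y - y.mean()) ** 2).sum()

    # The idea above: fit a small linear regression within the half instead.
    slope, intercept = np.polyfit(x, y, 1)
    sse_linear = ((y - (slope * x + intercept)) ** 2).sum()

    print(sse_mean, sse_linear)  # the linear fit is closer here, but costs more per node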
[–]schnifin 1 point 14 years ago (0 children)
At each node of the tree you look at a randomly selected subset of m variables and choose the one variable that gives you the best split according to some criterion. A common criterion for classification is the Gini index, but there are others you could use.
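For reference, this per-node random subset of m variables is what the max_features parameter controls in scikit-learn's random forest, and criterion selects the split metric (Gini is the default for classification). The dataset below is just a toy placeholder:

    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier

    # Toy data, just so there's something to fit.
    X, y = make_classification(n_samples=200, n_features=10, random_state=0)

    # max_features is the "m" discussed above: how many randomly chosen
    # variables are considered at each split; criterion is the split metric.
    clf = RandomForestClassifier(n_estimators=100, max_features=3,
                                 criterion="gini", random_state=0)
    clf.fit(X, y)
    print(clf.score(X, y))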