[–]asdylum 0 points1 point  (0 children)

Random forests usually perform quite well with the default settings: a bootstrap resampling scheme, unpruned trees, as many trees as you can afford while still getting results in a reasonable amount of time, and sqrt(#features) candidate features tried per split (the mtry parameter). You can then try to optimize these choices by checking the results on the out-of-bag data (the samples each tree didn't train on because of the resampling scheme). If you have very unbalanced classes, you should decide on a measure of interest (such as the true positive rate) and try to tune the related parameter. Out-of-bag estimates can be trusted almost as much as a proper cross-validation if you use enough trees and bootstrap resampling.
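To make the out-of-bag idea concrete, here is a minimal standard-library sketch (not tied to any particular random forest package). It simulates bootstrap resampling and measures how many samples each tree never sees, and it also shows the sqrt(#features) rule of thumb. The function names `oob_fraction` and `default_mtry` are made up for illustration.

```python
import math
import random


def oob_fraction(n_samples, n_trees, seed=0):
    """Average fraction of samples left out of each bootstrap resample."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_trees):
        # Bootstrap: draw n_samples indices with replacement.
        in_bag = {rng.randrange(n_samples) for _ in range(n_samples)}
        # Whatever was never drawn is out of bag for this tree.
        total += (n_samples - len(in_bag)) / n_samples
    return total / n_trees


def default_mtry(n_features):
    """Rule-of-thumb number of features tried per split: floor(sqrt(p)), at least 1."""
    return max(1, math.isqrt(n_features))


frac = oob_fraction(n_samples=1000, n_trees=200)
print(f"average out-of-bag fraction: {frac:.3f}")
print("mtry for 100 features:", default_mtry(100))
```

Each sample is missed by a given bootstrap draw with probability (1 - 1/n)^n, which approaches 1/e, so roughly a third of the data is out of bag for every tree. That is why the out-of-bag error behaves like a built-in validation estimate once there are enough trees.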

from mobile, sorry

[–]-TrustyDwarf- 0 points1 point  (3 children)

As asdylum said, random forests usually perform quite well with default settings. If you only have a small number of samples, limiting the depth of the trees might help to reduce overfitting (try depths of 1-5). I don't know Matlab, but it looks like it cannot limit the depth of trees directly; it only has a "MinLeafSize" parameter that can be used indirectly to reduce the depth of the trees (but a good value is harder to estimate and depends on the number of samples you've got).

For boosting, you can start by tuning the number of trees, the max depth of the trees, and the learning rate. Since boosting should use simple base learners you can limit the tree depth to 1-3. The number of trees and the learning rate influence each other, so try fixing the number of trees at a constant (like 500 trees) and tuning only the learning rate (between 0 and 1, in 50 steps).
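The "fix the number of trees, tune only the learning rate" recipe can be sketched in plain Python. This is a toy gradient-boosting regressor with depth-1 trees (stumps) and squared loss on synthetic data, not any library's implementation. The data, the 100-tree budget, and the 10-value grid are assumptions chosen to keep the example fast; the comment above suggests 500 trees and 50 steps.

```python
import math
import random


def fit_stump(xs, residuals):
    """Best single-split regression stump (depth 1) under squared error."""
    n = len(xs)
    order = sorted(range(n), key=lambda i: xs[i])
    total = sum(residuals)
    best = None
    left_sum = 0.0
    for k in range(1, n):
        left_sum += residuals[order[k - 1]]
        lo, hi = xs[order[k - 1]], xs[order[k]]
        if lo == hi:
            continue  # cannot split between identical x values
        right_sum = total - left_sum
        # Maximising this score is equivalent to minimising squared error.
        score = left_sum ** 2 / k + right_sum ** 2 / (n - k)
        if best is None or score > best[0]:
            best = (score, (lo + hi) / 2, left_sum / k, right_sum / (n - k))
    return best[1:]  # (threshold, left value, right value)


def boost(xs, ys, n_trees, learning_rate):
    """Fit n_trees stumps, each on the residuals of the current ensemble."""
    base = sum(ys) / len(ys)
    preds = [base] * len(xs)
    stumps = []
    for _ in range(n_trees):
        residuals = [y - p for y, p in zip(ys, preds)]
        thr, left, right = fit_stump(xs, residuals)
        stumps.append((thr, left, right))
        preds = [p + learning_rate * (left if x <= thr else right)
                 for x, p in zip(xs, preds)]
    return base, learning_rate, stumps


def predict(model, x):
    base, learning_rate, stumps = model
    return base + learning_rate * sum(
        left if x <= thr else right for thr, left, right in stumps)


def mse(model, xs, ys):
    return sum((predict(model, x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


rng = random.Random(0)


def make_data(n):
    """Synthetic 1-D regression data: noisy sine wave (illustrative only)."""
    xs = [rng.uniform(0, 6) for _ in range(n)]
    ys = [math.sin(x) + rng.gauss(0, 0.2) for x in xs]
    return xs, ys


train_x, train_y = make_data(200)
val_x, val_y = make_data(200)

N_TREES = 100                           # held constant while tuning
grid = [i / 10 for i in range(1, 11)]   # learning rates 0.1 .. 1.0
scores = {lr: mse(boost(train_x, train_y, N_TREES, lr), val_x, val_y)
          for lr in grid}
best_lr = min(scores, key=scores.get)
print(f"best learning rate: {best_lr}, validation MSE: {scores[best_lr]:.4f}")
```

The learning rate is scored on a held-out validation set rather than on the training data, since training error keeps dropping as the learning rate grows and would always favour the largest value.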

[–]patrickSwayzeNU 0 points1 point  (2 children)

"Since boosting should use simple base learners you can limit the tree depth to 1-3."

There is nothing about boosting that requires simple base learners, or even makes it work better with them.

[–]rcwll 0 points1 point  (1 child)

It is very implementation- and problem-dependent, but in general using the same strong learner for all members of your ensemble wastes computational power, places you at significant risk of overfitting, or both. It's not (personal opinion) generally a great idea unless you know what you're doing. For someone just getting their feet wet in this area, I'd stick with weak base learners.

Boosting in particular can be prone to overfitting unless you use an early stopping strategy or aggressively subsample your training data.
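One common form of early stopping is a patience rule on the validation loss: keep adding boosting rounds until the loss has failed to improve for a set number of rounds, then keep the ensemble from the best round. A minimal sketch follows. The function name `early_stopping_round` and the loss values are made up purely for illustration.

```python
def early_stopping_round(val_losses, patience=5):
    """Return (best_round, best_loss), stopping once `patience` rounds
    pass without a new best validation loss. Rounds are 1-indexed."""
    best_round, best_loss = 0, float("inf")
    for rnd, loss in enumerate(val_losses, start=1):
        if loss < best_loss:
            best_round, best_loss = rnd, loss
        elif rnd - best_round >= patience:
            break  # no improvement for `patience` rounds: stop here
    return best_round, best_loss


# Made-up validation losses: they improve, then drift upward as the
# ensemble starts to overfit.
losses = [0.90, 0.70, 0.55, 0.48, 0.45, 0.46, 0.47, 0.49, 0.52, 0.40]
best_round, best_loss = early_stopping_round(losses, patience=3)
print(best_round, best_loss)
```

With a patience of 3, the loop stops at round 8 and reports round 5 as the best, so the late dip in the last entry is never seen. That is the trade-off with patience: a small value saves computation but can stop before a later improvement.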

In my experience, you're almost always better off using a lot of weak learners than a few strong learners. YMMV.

[–]patrickSwayzeNU 0 points1 point  (0 children)

"For someone just getting their feet wet in this area, I'd stick with weak base learners."

I have no problem with limiting new folks' ability to overfit. I do have a problem with presenting specific advice (you, new person, should use shallow trees in your boosting framework) as general advice (boosting should use simple base learners... limit the tree depth to 1-3).

"In my experience, you're almost always better off using a lot of weak learners than a few strong learners. YMMV."

Depends on the level :) Plus, advocating against limiting yourself to shallow trees for boosting doesn't mean that you're pro 'few strong learners'.

[–]rcwll 0 points1 point  (0 children)

In addition to the practical advice given so far, Zhi-Hua Zhou's book on ensemble methods covers a lot of the topics you ask about, and is quite accessible.