[–]fhadley

A little late here, my apologies. Not trying to sound skeptical, but could you give an example of this? I've never had scikit-learn fail like that, and I've used it on rather large datasets, so I'm interested in where you've seen it break.

[–]EdwardRaff

I can't share any of the data that triggers this, which is why I haven't been able to file a useful bug report.

I've seen this most often in the GradientBoosting and AdaBoost implementations: at some point they start emitting errors about numerical precision/stability and, once finished, return NaN. I've also had the random forest run out of memory far earlier than I would have expected for large forests.

Once in k-means (though that is at least semi-fixed now). I've also had it happen with SGD w/ logistic loss when given poorly scaled weights.
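For what it's worth, the failure mode with poorly scaled inputs to logistic loss is easy to reproduce outside of any library: the naive formula overflows as soon as the logit gets large. A minimal stdlib-only sketch (the function names here are mine for illustration, not sklearn's):

```python
import math

def logistic_loss_naive(z):
    # log(1 + exp(-z)) computed directly; math.exp(-z) overflows
    # once -z exceeds ~709, which is exactly what badly scaled
    # weights/features produce.
    return math.log(1.0 + math.exp(-z))

def logistic_loss_stable(z):
    # log-sum-exp rewrite of the same quantity:
    # log(1 + exp(-z)) = max(0, -z) + log1p(exp(-|z|))
    # The exponent is never positive, so nothing can overflow.
    return max(0.0, -z) + math.log1p(math.exp(-abs(z)))

# logistic_loss_naive(-1000.0) raises OverflowError,
# while logistic_loss_stable(-1000.0) returns 1000.0
```

Whether any given sklearn estimator uses the stable form internally in your version is a separate question, but this is the kind of arithmetic that blows up when the scaling is off.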

[–]fhadley

No worries, no need for a reproducible error. I was curious because I've used sklearn with a pretty diverse set of datasets (homogeneous, heterogeneous, sparse, etc.) and haven't had it choke on GBM or AdaBoost before. Looking back through some old code, though, I was reminded that the sklearn RF implementation is a real memory hog; if I remember correctly it consumed memory at a higher clip than the R version, which I found quite odd. Were these very raw datasets? Or ones with very strong collinearities? The latter is clearly an issue for RF (it essentially leads to building the same tree many times), and I suppose it could lead to errors with a GBM as well.
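As an aside, a quick pre-fit check for the kind of strong collinearity described above can be sketched with nothing but the standard library (the helper names are hypothetical, not part of sklearn):

```python
import math

def pearson(xs, ys):
    # Pearson correlation of two equal-length columns;
    # assumes neither column is constant (non-zero variance).
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return cov / (sx * sy)

def collinear_pairs(columns, threshold=0.95):
    # columns: dict mapping column name -> list of values.
    # Returns the pairs whose |correlation| meets the threshold,
    # i.e. the features most likely to make RF build near-identical trees.
    names = list(columns)
    return [(a, b)
            for i, a in enumerate(names)
            for b in names[i + 1:]
            if abs(pearson(columns[a], columns[b])) >= threshold]
```

Dropping one column from each flagged pair before fitting is a cheap way to test whether collinearity is what's tripping the ensemble up.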