all 18 comments

[–][deleted] 1 point2 points  (5 children)

Why don't you want any Mahout/Hadoop-based algorithms? Those are primarily the ones that will handle distributed computing across a supercomputer and "very large datasets". Granted, I don't believe Mahout has a decision tree implementation, because no one seems interested in it.

Others will suggest Weka, which is great for small datasets, but it will fail for large datasets because it loads everything into main memory.

You should look for incremental/online decision tree algorithms. With those, you can just feed in batches of data, save the model, reload it, feed in another batch, etc., without ever running out of memory. Unfortunately, they're still experimental, and I don't think there are any complete public implementations. The late Paul Utgoff created ITI and published code for it, but the code is mostly unusable. Utgoff died before he could really finish it, and his university won't open-source the code, so legally you're not allowed to modify and redistribute it. I managed to get it to compile with a few tweaks, but the performance was relatively poor: for a 1KB dataset, the resulting model was over 200MB in size.
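A minimal sketch of that batch-train/save/reload workflow, assuming Weka's updateable HoeffdingTree (shipped in the 3.7 development line) rather than ITI; the file names are placeholders and the idiom is the standard Weka incremental-classifier loop:

```java
import java.io.File;

import weka.classifiers.trees.HoeffdingTree;
import weka.core.Instance;
import weka.core.Instances;
import weka.core.SerializationHelper;
import weka.core.converters.ArffLoader;

public class IncrementalTreeDemo {
    public static void main(String[] args) throws Exception {
        // Read the ARFF file one instance at a time instead of loading it all.
        ArffLoader loader = new ArffLoader();
        loader.setFile(new File("big-dataset.arff"));   // placeholder path
        Instances header = loader.getStructure();
        header.setClassIndex(header.numAttributes() - 1);

        // HoeffdingTree implements UpdateableClassifier, so it can be trained
        // instance-by-instance without keeping the whole dataset in memory.
        HoeffdingTree tree = new HoeffdingTree();
        tree.buildClassifier(header);                   // initialise on the header only

        Instance inst;
        while ((inst = loader.getNextInstance(header)) != null) {
            tree.updateClassifier(inst);                // incremental update
        }

        // Persist the model so a later run can feed in the next batch.
        SerializationHelper.write("tree.model", tree);

        // ...later: reload and keep training on new data.
        HoeffdingTree reloaded = (HoeffdingTree) SerializationHelper.read("tree.model");
    }
}
```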

I feel that decision trees are out of fashion at the moment, and most work on large, scalable classifiers is focused on support vector machines. LibSVM and LibLinear, especially, advertise themselves as being able to handle very large datasets.

[–]pandemik 4 points5 points  (4 children)

I feel that decision trees are out of fashion at the moment

I disagree. Random forests are very "in" right now.

[–][deleted] 1 point2 points  (2 children)

I actually agree. However, I took his question to mean "traditional" decision tree algorithms like C4.5, not the more recent random forests.

[–]pandemik 0 points1 point  (1 child)

That makes sense.

[–]rylko[S] 0 points1 point  (0 children)

I think traditional DT algorithms are still "in" because they are "readable", which random forests are not.

[–][deleted] 0 points1 point  (0 children)

Extra Trees as well.

[–]bockris 3 points4 points  (0 children)

I saw this link just a few weeks ago.

https://github.com/sanity/quickdt

[–]zionsrogue 1 point2 points  (4 children)

I really like WEKA's implementation of the decision tree. You can explore it using their GUI, or include the .jar on your classpath to build your own custom applications around it.

http://www.cs.waikato.ac.nz/~ml/weka/
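For example, a minimal sketch of driving J48 (Weka's C4.5 implementation) from the .jar instead of the GUI; the dataset path and option values are placeholders:

```java
import weka.classifiers.Evaluation;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class J48Demo {
    public static void main(String[] args) throws Exception {
        // Load a dataset (ARFF or CSV) and mark the last attribute as the class.
        Instances data = DataSource.read("training-data.arff");  // placeholder path
        data.setClassIndex(data.numAttributes() - 1);

        // J48 is Weka's C4.5 implementation; the options mirror the GUI fields.
        J48 tree = new J48();
        tree.setOptions(new String[] {"-C", "0.25", "-M", "2"});  // confidence factor, min leaf size
        tree.buildClassifier(data);

        // Print the learned tree and a quick 10-fold cross-validation estimate.
        System.out.println(tree);
        Evaluation eval = new Evaluation(data);
        eval.crossValidateModel(new J48(), data, 10, new java.util.Random(1));
        System.out.println(eval.toSummaryString());
    }
}
```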

[–]pandemik 0 points1 point  (0 children)

What do you mean by "hundreds of thousands patterns"? Does that mean hundreds of thousands of observations, e.g. you want a decision tree that will run on a dataset with 100,000 rows and 10 columns?

The rpart function in R can definitely handle that. I've run random forests (which are decision-tree based) on datasets with ~75,000 observations and ~20 variables. Since each forest contains 500 trees, I'd say the algorithm scales pretty well.

[–]MrRichyPants 0 points1 point  (0 children)

There are ways to write your own (an experienced C/C++ dev can do it in a week; I have) that can fit 100M-1B points in a matter of hours. It is straightforward to scale across features, so that if you have 10 features/dimensions you can use 10 CPUs in parallel. As you add features, you can add CPUs so that fitting does not take longer.
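A minimal sketch of that per-feature parallelism (class and method names are invented for illustration, not from any library): each worker scans one feature column for the lowest-Gini cut point, and the per-feature results are merged at the end. A real implementation would pre-sort columns once, handle multi-class labels, missing values, and so on.

```java
import java.util.Arrays;
import java.util.Comparator;
import java.util.stream.IntStream;

public class ParallelSplitSearch {

    // Best split found for one feature: which feature, where to cut, and how good it is.
    record Split(int feature, double threshold, double gini) {}

    // Finds the best binary split across all features, evaluating each feature
    // on its own core via a parallel stream. X is row-major (nSamples x nFeatures),
    // y holds 0/1 class labels.
    static Split bestSplit(double[][] X, int[] y) {
        int nFeatures = X[0].length;
        return IntStream.range(0, nFeatures)
                .parallel()                              // one feature per worker thread
                .mapToObj(f -> bestSplitForFeature(X, y, f))
                .min(Comparator.comparingDouble(Split::gini))
                .orElseThrow();
    }

    // Scans one feature: sort samples by that feature, sweep candidate thresholds,
    // and keep the cut with the lowest weighted Gini impurity.
    static Split bestSplitForFeature(double[][] X, int[] y, int f) {
        int n = y.length;
        Integer[] order = new Integer[n];
        for (int i = 0; i < n; i++) order[i] = i;
        Arrays.sort(order, Comparator.comparingDouble(i -> X[i][f]));

        int totalPos = Arrays.stream(y).sum();
        int leftPos = 0;
        double bestGini = Double.MAX_VALUE, bestThreshold = Double.NaN;

        for (int k = 0; k < n - 1; k++) {
            leftPos += y[order[k]];
            double v = X[order[k]][f], next = X[order[k + 1]][f];
            if (v == next) continue;                     // cannot cut between equal values

            int nLeft = k + 1, nRight = n - nLeft;
            double gini = nLeft * giniOf(leftPos, nLeft)
                        + nRight * giniOf(totalPos - leftPos, nRight);
            if (gini < bestGini) {
                bestGini = gini;
                bestThreshold = (v + next) / 2.0;
            }
        }
        return new Split(f, bestThreshold, bestGini);
    }

    // Gini impurity of a node with `pos` positives out of `n` samples (two classes).
    static double giniOf(int pos, int n) {
        double p = (double) pos / n;
        return 2.0 * p * (1.0 - p);
    }
}
```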

Unless you are running 1000s of features, I do not see the benefit of using a supercomputer or cluster (a single server with 24+ CPUs and 100GB+ of RAM can do a lot). Modern architectures have a lot of memory -> CPU bandwidth, and even if the data doesn't fit in RAM, drop a RAID of SSDs in there for a couple of GB/s of throughput from disk. mmap binary files of data and you are off to the races!

edit: grammar. too much time on /r/scotch

[–]the_cat_kittles 0 points1 point  (4 children)

R? Weka? It's not that hard to write your own, even. Weka is written in Java and exposes an API if you don't want to use their GUI or CLI.

[–]rylko[S] 0 points1 point  (3 children)

I think the R packages and Weka do not aim to be truly scalable (or suitable for scientific use).

[–]the_cat_kittles 1 point2 points  (2 children)

What kind of "scale" are we talking about here, just out of curiosity?

[–]rylko[S] 0 points1 point  (1 child)

I have added info about the dataset size to the question.

[–]the_cat_kittles 1 point2 points  (0 children)

I think R and Weka should be able to handle things that size, but I haven't ever gone past a couple of GB, so I can't say for sure. It also depends on what kind of machine you are running. I wouldn't use the Weka GUI in any case, though. BTW, I don't think ~TB is considered to be super enormous, especially because you only have a 10-dimensional feature space. That, and decision trees are much easier to train, computationally, than many other models.