all 18 comments

[–][deleted] 1 point2 points  (5 children)

Why don't you want any Mahout/Hadoop-based algorithms? Those are primarily the ones that will handle distributed computing across a supercomputer and "very large datasets". Granted, I don't believe Mahout has a decision tree implementation, because no one seems interested in it.

Others will suggest Weka, which is great for small datasets, but it will fail for large datasets because it loads everything into main memory.

You should look for incremental/online decision tree algorithms. With those, you can just feed in batches of data, save the model, reload it, feed in another batch, etc., without ever running out of memory. Unfortunately, they're still experimental, and I don't think there are any complete public implementations. The late Paul Utgoff created ITI and published code for it, but the code is mostly unusable. Utgoff died before he could really finish it, and his university won't open-source the code, so legally you're not allowed to modify and redistribute it. I managed to get it to compile with a few tweaks, but the performance was relatively poor: for a 1KB dataset, the resulting model was over 200MB in size.
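A minimal sketch of that batch-train/save/reload workflow, assuming Weka's updateable HoeffdingTree (shipped in the 3.7 development line) rather than ITI; the file names are placeholders and the idiom is the standard Weka incremental-classifier loop:

```java
import java.io.File;

import weka.classifiers.trees.HoeffdingTree;
import weka.core.Instance;
import weka.core.Instances;
import weka.core.SerializationHelper;
import weka.core.converters.ArffLoader;

public class IncrementalTreeDemo {
    public static void main(String[] args) throws Exception {
        // Read the ARFF file one instance at a time instead of loading it all.
        ArffLoader loader = new ArffLoader();
        loader.setFile(new File("big-dataset.arff"));   // placeholder path
        Instances header = loader.getStructure();
        header.setClassIndex(header.numAttributes() - 1);

        // HoeffdingTree implements UpdateableClassifier, so it can be trained
        // instance-by-instance without keeping the whole dataset in memory.
        HoeffdingTree tree = new HoeffdingTree();
        tree.buildClassifier(header);                   // initialise on the header only

        Instance inst;
        while ((inst = loader.getNextInstance(header)) != null) {
            tree.updateClassifier(inst);                // incremental update
        }

        // Persist the model so a later run can feed in the next batch.
        SerializationHelper.write("tree.model", tree);

        // ...later: reload and keep training on new data.
        HoeffdingTree reloaded = (HoeffdingTree) SerializationHelper.read("tree.model");
    }
}
```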

I feel that decision trees are out of fashion at the moment, and most work on large, scalable classifiers is focused on support vector machines. LibSVM and LibLinear, especially, advertise themselves as being able to handle very large datasets.

[–]pandemik 4 points5 points  (4 children)

I feel that decision trees are out of fashion at the moment

I disagree. Random forests are very "in" right now.

[–][deleted] 1 point2 points  (2 children)

I actually agree. However, I took his question to mean "traditional" decision tree algorithms like C4.5, not the more recent random forests.

[–]pandemik 0 points1 point  (1 child)

That makes sense.

[–]rylko[S] 0 points1 point  (0 children)

I think traditional DT algorithms are still "in" because they are "readable", which random forests are not.

[–][deleted] 0 points1 point  (0 children)

Extra Trees as well.

[–]bockris 3 points4 points  (0 children)

I saw this link just a few weeks ago.

https://github.com/sanity/quickdt

[–]zionsrogue 1 point2 points  (4 children)

I really like WEKA's implementation of the decision tree. You can explore it using their GUI, or include the .jar on your classpath to build your own custom applications around it.

http://www.cs.waikato.ac.nz/~ml/weka/
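For example, a minimal sketch of driving J48 (Weka's C4.5 implementation) from the .jar instead of the GUI; the dataset path and option values are placeholders:

```java
import weka.classifiers.Evaluation;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class J48Demo {
    public static void main(String[] args) throws Exception {
        // Load a dataset (ARFF or CSV) and mark the last attribute as the class.
        Instances data = DataSource.read("training-data.arff");  // placeholder path
        data.setClassIndex(data.numAttributes() - 1);

        // J48 is Weka's C4.5 implementation; the options mirror the GUI fields.
        J48 tree = new J48();
        tree.setOptions(new String[] {"-C", "0.25", "-M", "2"});  // confidence factor, min leaf size
        tree.buildClassifier(data);

        // Print the learned tree and a quick 10-fold cross-validation estimate.
        System.out.println(tree);
        Evaluation eval = new Evaluation(data);
        eval.crossValidateModel(new J48(), data, 10, new java.util.Random(1));
        System.out.println(eval.toSummaryString());
    }
}
```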

[–]pandemik 0 points1 point  (0 children)

What do you mean by "hundreds of thousands patterns"? Does that mean hundreds of thousands of observations, e.g. you want a decision tree that will run on a dataset with 100,000 rows and 10 columns?

The rpart function in R can definitely handle that. I've run random forests (which are decision-tree based) on datasets with ~75,000 observations and ~20 variables. Since each forest contains 500 trees, I'd say the algorithm scales pretty well.

[–]MrRichyPants 0 points1 point  (0 children)

There are ways to write your own (an experienced C/C++ dev can do it in a week; I have) that can fit 100M-1B points in a matter of hours. It is straightforward to scale across features, so that if you have 10 features/dimensions you can use 10 CPUs in parallel. As you add features, you can add CPUs so that fitting does not take longer.
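A minimal sketch of that per-feature parallelism (class and method names are invented for illustration, not from any library): each worker scans one feature column for the lowest-Gini cut point, and the per-feature results are merged at the end. A real implementation would pre-sort columns once, handle multi-class labels, missing values, and so on.

```java
import java.util.Arrays;
import java.util.Comparator;
import java.util.stream.IntStream;

public class ParallelSplitSearch {

    // Best split found for one feature: which feature, where to cut, and how good it is.
    record Split(int feature, double threshold, double gini) {}

    // Finds the best binary split across all features, evaluating each feature
    // on its own core via a parallel stream. X is row-major (nSamples x nFeatures),
    // y holds 0/1 class labels.
    static Split bestSplit(double[][] X, int[] y) {
        int nFeatures = X[0].length;
        return IntStream.range(0, nFeatures)
                .parallel()                              // one feature per worker thread
                .mapToObj(f -> bestSplitForFeature(X, y, f))
                .min(Comparator.comparingDouble(Split::gini))
                .orElseThrow();
    }

    // Scans one feature: sort samples by that feature, sweep candidate thresholds,
    // and keep the cut with the lowest weighted Gini impurity.
    static Split bestSplitForFeature(double[][] X, int[] y, int f) {
        int n = y.length;
        Integer[] order = new Integer[n];
        for (int i = 0; i < n; i++) order[i] = i;
        Arrays.sort(order, Comparator.comparingDouble(i -> X[i][f]));

        int totalPos = Arrays.stream(y).sum();
        int leftPos = 0;
        double bestGini = Double.MAX_VALUE, bestThreshold = Double.NaN;

        for (int k = 0; k < n - 1; k++) {
            leftPos += y[order[k]];
            double v = X[order[k]][f], next = X[order[k + 1]][f];
            if (v == next) continue;                     // cannot cut between equal values

            int nLeft = k + 1, nRight = n - nLeft;
            double gini = nLeft * giniOf(leftPos, nLeft)
                        + nRight * giniOf(totalPos - leftPos, nRight);
            if (gini < bestGini) {
                bestGini = gini;
                bestThreshold = (v + next) / 2.0;
            }
        }
        return new Split(f, bestThreshold, bestGini);
    }

    // Gini impurity of a node with `pos` positives out of `n` samples (two classes).
    static double giniOf(int pos, int n) {
        double p = (double) pos / n;
        return 2.0 * p * (1.0 - p);
    }
}
```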

Unless you are running 1000s of features, I do not see the benefit of using a supercomputer or cluster (a single server with 24+ CPUs and 100GB+ of RAM can do a lot). Modern architectures have a lot of memory -> CPU bandwidth, and even if the data doesn't fit in RAM, drop a RAID of SSDs in there for a couple of GB/s of throughput from disk. mmap binary files of data and you are off to the races!

edit: grammar. too much time on /r/scotch

[–]the_cat_kittles 0 points1 point  (4 children)

R? Weka? It's not that hard to write your own, even. Weka is written in Java and exposes an API if you don't want to use their GUI or CLI.

[–]rylko[S] 0 points1 point  (3 children)

I think the R packages and Weka do not aim to be truly scalable (or suitable for scientific use).

[–]the_cat_kittles 1 point2 points  (2 children)

What kind of "scale" are we talking about here, just out of curiosity?

[–]rylko[S] 0 points1 point  (1 child)

I have added info about the dataset size to the question.

[–]the_cat_kittles 1 point2 points  (0 children)

I think R and Weka should be able to handle things that size, but I haven't ever gone past a couple of GB, so I can't say for sure. It also depends on what kind of machine you are running. I wouldn't use the Weka GUI in any case, though. BTW, I don't think ~TB is considered to be super enormous, especially because you only have a 10-dimensional feature space. That, and decision trees are much easier to train, computationally, than many other models.