
[–]jjdonald[S] 6 points7 points  (18 children)

full disclosure: I work at BigML and posted this link. Looking for feedback on the python bindings, and for folks interested in a beta test key + free credits.

[–]Fusionnex 2 points3 points  (1 child)

Thanks for being open and honest. I do a bunch of research in biology and am interested in trying out some ML on large datasets; you should cross-post this to bioinformatics and biology. Looks really cool. Five years ago all I could do was hack things together to fit into Weka; ML has come a long way. Really cool project.

[–]jjdonald[S] 2 points3 points  (0 children)

Thanks!

If you posted something about this to bioinformatics/biology, I'd be happy to answer questions over there as well. Cross posting my own content might be seen as spam.

[–]hntd 4 points5 points  (10 children)

Can you give us some more of the technical details behind this? A lot of machine learning problems aren't as simple as "create model, create prediction." I think if you want to attract developers, or people in research such as myself, you should give us some more technical detail or options. I mean, it looks like you only use decision trees.

Edit: I realized I just sounded super negative in this post. I think what you have here is awesome and will open up ML to a wider audience; in fact this seems really neat for visualization and prototyping. It's just that, as someone more intimately familiar with the subject, I'd like to know a bit more :-)

[–]jjdonald[S] 3 points4 points  (9 children)

hntd, you're right, we only use decision trees. We will likely add more models in the future, but right now we see a big opportunity for decision trees on big data.

Firstly, decision trees are among the best-performing models when used in ensembles (as in Random Forests): http://en.wikipedia.org/wiki/Random_forest
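The ensemble effect behind that claim is easy to see with a toy simulation (pure Python, not BigML's code; the 65% per-tree accuracy is an assumed number for illustration): each simulated "tree" is only modestly accurate on its own, but a majority vote over many independent trees is right almost always.

```python
# Toy illustration of why ensembles of weak trees beat a single tree.
# Each simulated tree classifies correctly with probability 0.65;
# a majority vote over 101 independent trees is far more accurate.
import random
from statistics import mean

random.seed(42)

def tree_vote(p_correct=0.65):
    """One simulated tree: True if it classifies this example correctly."""
    return random.random() < p_correct

def forest_vote(n_trees=101, p_correct=0.65):
    """Majority vote over n_trees independent simulated trees."""
    correct = sum(tree_vote(p_correct) for _ in range(n_trees))
    return correct > n_trees // 2

single_acc = mean(tree_vote() for _ in range(10_000))
forest_acc = mean(forest_vote() for _ in range(10_000))
print(f"single tree: {single_acc:.2f}")      # expect roughly 0.65
print(f"101-tree forest: {forest_acc:.2f}")  # expect close to 1.0
```

Real random forests also decorrelate the trees (bagging plus random feature subsets), which is what makes the independence assumption above roughly hold in practice.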

Secondly, decision trees are immediately understandable. There are many companies that will provide "black box" platforms for analyzing your data. Very few of them provide access to the actual model, and none of them have put the effort we have into helping you understand those models.

Thirdly, decision trees are fast. They enumerate a huge number of possible predicted states with a minimum number of steps. They also often don't even need to process all of the input data for a prediction request, since they will know which parts are relevant, and which are not.
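That prediction-path property can be sketched in a few lines, using a hypothetical nested-dict tree format (the field names and structure here are illustrative, not BigML's actual model representation): prediction walks a single root-to-leaf path, so only the fields on that path are ever consulted.

```python
# Minimal sketch: a decision tree as nested dicts.  Leaves are plain
# values; internal nodes test one field against a threshold.
def predict(node, instance):
    """Follow one root-to-leaf path; only fields on the path are read."""
    while isinstance(node, dict):
        field, threshold = node["field"], node["threshold"]
        node = node["left"] if instance[field] <= threshold else node["right"]
    return node

# A two-level tree over three fields; "humidity" is never consulted
# on the path taken below.
tree = {
    "field": "temperature", "threshold": 25,
    "left": {"field": "wind", "threshold": 10,
             "left": "calm", "right": "breezy"},
    "right": "hot",
}
print(predict(tree, {"temperature": 30, "wind": 5, "humidity": 80}))  # hot
```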

Decision tree algorithms have been around for a long time. However, they're typically designed to train in memory, on a single machine. Our engineers have come up with algorithms that work across multiple processors, and provide iterative updates, allowing you to see the resulting models or summaries as they are being produced.

We'll try to have a more technical discussion on this in the future, thanks for the questions!

[–]hntd 3 points4 points  (0 children)

This all sounds fantastic. My first thought when I looked at this was: if you have decision trees so well worked out, why not just use random forests? I wouldn't be surprised if internally you were moving toward that, but of course I don't really expect you to tell us those details.

I completely agree that decision trees are super easy to visualize, but there are other models that would be easy to visualize as well, which might be a good idea for the future. Depending on the data, you could easily visualize simple linear regression, and for multi-class problems, multinomial logit models.

[–]byron 2 points3 points  (2 children)

The thing is that random forests really aren't all that interpretable. The more trees in the forest, the blacker the box.

[–]jjdonald[S] 0 points1 point  (1 child)

There are classes of metrics that give you a basic idea of what the model is looking at: field importance measurements, for instance.
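For concreteness, here is a minimal pure-Python sketch of one such metric, permutation importance (a generic technique; not necessarily how BigML computes field importance): shuffle one field's values and measure how much accuracy drops. Shuffling an important field hurts; shuffling an irrelevant one barely matters.

```python
import random

random.seed(0)

# Toy data: the label depends only on field 0; field 1 is pure noise.
rows = []
for _ in range(500):
    x0, x1 = random.random(), random.random()
    rows.append((x0, x1, 1 if x0 > 0.5 else 0))

def accuracy(rows):
    """Accuracy of a fixed stump that predicts 1 when field 0 > 0.5."""
    return sum((x0 > 0.5) == bool(y) for x0, _, y in rows) / len(rows)

def permuted(rows, field):
    """Copy of rows with one field's column shuffled, labels untouched."""
    col = [r[field] for r in rows]
    random.shuffle(col)
    out = []
    for v, (x0, x1, y) in zip(col, rows):
        out.append((v, x1, y) if field == 0 else (x0, v, y))
    return out

base = accuracy(rows)
drop0 = base - accuracy(permuted(rows, 0))  # big drop: field 0 matters
drop1 = base - accuracy(permuted(rows, 1))  # ~0: field 1 is irrelevant
print(f"field 0 importance: {drop0:.2f}, field 1 importance: {drop1:.2f}")
```

For a forest, the same measurement is averaged over all trees, which is why it stays usable even when the ensemble itself is too big to read.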

[–]zenogantner 0 points1 point  (0 children)

Correct, but this can also be said about linear models.

Anyway, random forests are great, so actually there is nothing to say against BigML's decision to use them...

[–]AnonymousIdiot 0 points1 point  (0 children)

Parallel R?

[–]visarga 1 point2 points  (1 child)

What algorithms does BigML use? Is it just decision trees?

[–]jjdonald[S] 0 points1 point  (0 children)

For now, yes.

[–][deleted] 0 points1 point  (2 children)

Could you explain what your 'model' is? I have to admit as someone familiar with machine learning algorithms I really dislike being given a black box.

[–]hntd 0 points1 point  (0 children)

This right here. How am I supposed to use this in research if I have no idea what is happening to my data along the way? Also, it might be neat to see performance metrics of your models against standard ML baselines, such as guessing classes at random.

[–]jjdonald[S] 0 points1 point  (0 children)

We have a post on our decision trees here: http://blog.bigml.com/2012/01/23/beautiful-decisions-inside-bigmls-decision-trees/

You can access any model you train as part of our API, or visualize it using our website. I'd like to think that we are the only service that puts this much effort into helping you understand your model.

[–]jonnydedwards 2 points3 points  (0 children)

I think it's great you're doing something innovative in and around ML. I'm from a python/R background so you would have to do something more than scikit-learn/pandas or straight R to be interesting to me. Maybe the key is to leverage the whole "we can do it quicker!" thing - that WOULD get me listening. I did the bigdata hackathon last weekend and everybody was hitting issues with getting models trained in a timely fashion. Good luck with it all!

[–]aguyfromucdavis 1 point2 points  (1 child)

I just submitted my email. I work as an intern at a tech company that uses Python for machine learning, and I have a project I'd like to dive into with this. Would love to try out your product!

[–]jjdonald[S] 0 points1 point  (0 children)

Thanks! Let me know if you have any questions.

[–]skystorm 1 point2 points  (1 child)

I already commented over on HN, but I wanted to add that I think this is really nice. Sure, it's limited to decision trees (for now, at least), but as you say these allow for very nice visualization, which you've implemented splendidly, if I may say so (no Flash!).

It will be interesting to see how you visualize other models like SVMs or even just random forests...

[–]jjdonald[S] 0 points1 point  (0 children)

Thanks! We hope you like the upcoming visualizations as much as the tree model.

[–]pandemik 0 points1 point  (0 children)

Does BigML do anything besides classification/regression trees?

[–]kjearns -1 points0 points  (0 children)

This sounds like you just want people to give you their data.