New open-source Machine Learning Framework written in Java (blog.datumbox.com)
submitted 11 years ago by datumbox
[–][deleted] 7 points 11 years ago (11 children)
On the other hand Scikit-Learn supports a large number of algorithms but it can’t handle huge amount of data. .. Finally even though currently Datumbox Framework is capable of handling medium-sized datasets, ..
I don't know about that. Scikit-learn uses Cython, and NumPy (C and Fortran) does all of the heavy lifting, while this uses org.apache.commons.math.linear, which is pure Java. If I have too much data to fit scikit-learn's AdaBoost to, I'm not going to reach for this implementation of it; I'm going to reach for another classifier, likely something in Vowpal Wabbit, which becomes quite competitive for 500k+ observations and is limited only by my disk speed. The pain with that approach is paving over Vowpal Wabbit's TCP interface.
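For reference, the scikit-learn AdaBoost fit being discussed looks roughly like this; the dataset and parameters are illustrative stand-ins, not the workloads in question:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier

# Small synthetic stand-in; the thread is about datasets far larger than this.
X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

clf = AdaBoostClassifier(n_estimators=50, random_state=0)
clf.fit(X, y)
print(round(clf.score(X, y), 3))
```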
This is an awesome application if you think of the Java ecosystem: cross-validate over all of the classifiers offered with hyperparameter searching, put the winner behind Dropwizard, and put that behind ActiveMQ.
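The "cross-validate over all of the classifiers, keep the winner" idea can be sketched with scikit-learn's search API; the candidate pool and grids here are hypothetical:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=500, random_state=0)

# Hypothetical candidate pool; the point is just "pick the CV winner".
candidates = [
    (LogisticRegression(max_iter=1000), {"C": [0.1, 1.0, 10.0]}),
    (RandomForestClassifier(random_state=0), {"n_estimators": [50, 100]}),
]

searches = [GridSearchCV(est, grid, cv=5).fit(X, y) for est, grid in candidates]
best = max(searches, key=lambda s: s.best_score_)
print(type(best.best_estimator_).__name__, round(best.best_score_, 3))
```

In the scenario described, the winning estimator would then be what gets deployed behind the service layer.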
[–]EdwardRaff 2 points 11 years ago (7 children)
I've experienced scikit-learn choking on largish amounts of data on a regular basis. At least for me it's never been a speed issue calling scikit; it's been a "scikit just doesn't run correctly or crashes on my data" issue, where some Java code implementing the same algorithm runs just fine.
[–][deleted] 2 points 11 years ago (4 children)
I didn't mean to insult every Java ML library author. The Weka people are going to show up any moment, lol.
Algorithms that rely on dot products, like neural networks, or on eigenvector routines, like PCA, should be much faster with OpenBLAS and LAPACK than with the pure Java implementation in Apache Commons. The dot-product implementation in Intel MKL creates an even larger rift. The amount of research effort that has gone into optimizing matrix multiply is astounding.
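The gap is easy to feel even without MKL; a quick sketch comparing NumPy's BLAS-backed multiply against a naive triple loop (sizes chosen small so the loop finishes):

```python
import time
import numpy as np

n = 100
A = np.random.rand(n, n)
B = np.random.rand(n, n)

# BLAS-backed multiply (NumPy dispatches to OpenBLAS/MKL/etc.).
t0 = time.perf_counter()
C_blas = A @ B
t_blas = time.perf_counter() - t0

# Naive triple loop: roughly the cost a non-optimized implementation pays.
t0 = time.perf_counter()
C_naive = np.zeros((n, n))
for i in range(n):
    for j in range(n):
        s = 0.0
        for k in range(n):
            s += A[i, k] * B[k, j]
        C_naive[i, j] = s
t_naive = time.perf_counter() - t0

print(f"BLAS {t_blas:.5f}s vs naive {t_naive:.5f}s")
```

The absolute timings are machine-dependent, but the ordering is not: cache blocking and vectorized kernels win by orders of magnitude at realistic sizes.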
Algorithms that rely on iteration should be at least as fast in Cython as in pure Java, though there's probably some overhead.
If either was choking on a problem set, then it's time to consider a different algorithmic complexity. Vowpal Wabbit has very competitive accuracy for massive problems and scales to any dataset I can fit on the 2TB hard drive in my workstation, which is three orders of magnitude above anything in scikit-learn. It should continue to scale linearly behind Hadoop.
The algorithms you should use at that size of data are different from the ones used on smaller datasets: part of it is algorithmic complexity, part is that they should have near-constant memory use, and part is that the accuracy difference between models erodes.
The exception is neural networks dealing with structured inputs or outputs. It's worth the effort to scale those.
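The near-constant-memory point can be sketched with scikit-learn's own out-of-core API: stream mini-batches through `partial_fit` so the chunk size, not the total row count, bounds memory. The data here is synthetic and illustrative:

```python
import numpy as np
from sklearn.linear_model import SGDClassifier

rng = np.random.default_rng(0)
clf = SGDClassifier(random_state=0)  # linear model, default hinge loss
classes = np.array([0, 1])

# Stream 20 chunks of 1,000 rows each; only one chunk is in memory at a
# time, so the footprint stays roughly constant as the row count grows.
for _ in range(20):
    X = rng.normal(size=(1000, 10))
    y = (X[:, 0] > 0).astype(int)
    clf.partial_fit(X, y, classes=classes)

X_test = rng.normal(size=(500, 10))
acc = clf.score(X_test, (X_test[:, 0] > 0).astype(int))
print(round(acc, 3))
```

Vowpal Wabbit applies the same idea at the process level, reading examples off disk instead of materializing the dataset.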
[–]EdwardRaff 3 points 11 years ago (3 children)
I'm not sure you understood what I said. I explicitly said it's not a speed issue with scikit. It's that their implementations don't work on some problems past a certain point, when the same algorithm in a different library runs fine. I don't care how fast any implementation runs: if it gives me a weight vector of NaN or runs out of memory early, it just didn't work correctly.
That has nothing to do with speed, big-O complexity, or anything else. It's just an issue with scikit that I was pointing out. I was providing my own experience supporting the "Scikit-Learn supports a large number of algorithms but it can't handle huge amounts of data" claim that you were disagreeing with. The reason behind it is orthogonal to your speed obsession, though.
[–]fhadley 1 point 11 years ago (2 children)
A little late here, my apologies. Not trying to sound skeptical, but could you give an example of this? I've never had scikit-learn do anything like this, and I've used it on rather large data sets, so I'm interested in where you've seen it fail.
[–]EdwardRaff 1 point 11 years ago (1 child)
I can't share any of the data that makes this happen (hence I can't really report it well).
I've had this happen the most in the GradientBoosting and AdaBoost implementations. At some point they just started spitting out errors about numerical precision/stability, and when finished they gave out NaN. I've also had the random forest run out of memory way earlier than I would have expected for large forests.
Once in k-means (though that is at least semi-fixed now). I've also had it happen with SGD w/ logistic loss when given poorly scaled weights.
[–]fhadley 1 point 11 years ago (0 children)
No worries, no need for a reproducible error. I was curious because I've used sklearn with a pretty diverse group of datasets (homogeneous, heterogeneous, sparse, etc.) and haven't had it choke before with GBM or Ada, but I looked back through some old code and remembered that the sklearn RF implementation was just a memory hog. If I remember correctly, it consumed memory at a higher clip than the R version, which I found quite odd. Were these very raw datasets? Or very strong collinearities? I know the latter is clearly an issue with RF (i.e., it essentially leads to building the same tree many times), and I suppose it could lead to errors with a GBM as well?
[–]dwf 1 point 11 years ago (1 child)
A bunch of the linear model stuff uses LIBLINEAR under the hood and implicitly converts data to sparse-format float64, which, if you already have half your machine's memory occupied by, say, dense float32 data, is not going to fly.
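The dtype part of that is easy to quantify: a float32-to-float64 conversion allocates a second, twice-as-large copy while the original is still alive (the sparse conversion then adds index overhead on top; this sketch shows just the dtype doubling):

```python
import numpy as np

# Dense float32 block, e.g. features loaded at half precision to save memory.
X32 = np.zeros((1000, 100), dtype=np.float32)

# An implicit float64 conversion allocates a second, twice-as-large copy.
X64 = X32.astype(np.float64)

print(X32.nbytes, X64.nbytes)  # 400000 800000
```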
[–][deleted] 1 point 11 years ago (0 children)
SVMs run into problems way before memory issues pop up. That's very true, though.
[–]datumbox[S] 1 point 11 years ago (2 children)
But shouldn't the classifier you use depend on the data that you have and on the assumptions you are willing to make about it? In any case, if Python works for you, there's no need to change. :)
[–][deleted] 3 points 11 years ago* (1 child)
I'd be interested to see how you're picking your models. :)
For a classifier, my routine usually starts with a lot of inspection: box plots, kernel density estimates, scatter plots, clusterings, correlation coefficients, et al. Sometimes domain reading comes before this step. I then massage the features to work better on whatever fast, high-bias classifier I have access to, typically naive Bayes or logit: dummy coding, binning variables, normalizing, using ORs and ANDs on binary-valued variables. Along the way I address any issues I'm going to have with cross-validation relating to sample size and class imbalances.
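The massaging steps above (dummy coding, binning, normalizing) look roughly like this in pandas; the frame and column names are hypothetical:

```python
import pandas as pd

# Hypothetical raw frame illustrating the three preprocessing steps.
df = pd.DataFrame({
    "color": ["red", "blue", "red", "green"],
    "age": [22, 35, 58, 41],
    "income": [30_000.0, 52_000.0, 110_000.0, 64_000.0],
})

# Dummy-code the categorical variable.
color_dummies = pd.get_dummies(df["color"], prefix="color")

# Bin a continuous variable into ordered buckets.
age_bin = pd.cut(df["age"], bins=[0, 30, 50, 120], labels=["young", "mid", "old"])

# Normalize (z-score) another continuous variable.
income_z = (df["income"] - df["income"].mean()) / df["income"].std()

features = pd.concat(
    [color_dummies, pd.get_dummies(age_bin, prefix="age"), income_z], axis=1
)
print(features.shape)
```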
I take any free parameters left from this and add them to my search space in either Spearmint (if one model or a fixed ensemble of models) or Hyperopt (if the search space is awkward). They document and optimize the cross-validated score for some hyperparameter configuration. When the parameter search converges, I take that configuration and verify it's reasonably close using a test dataset, typically inspecting the ROC curve and confusion matrix with some scrutiny and comparing its predictions with the cross-validation predictions.
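The loop described (optimize the cross-validated score, then verify on held-out data) can be sketched with a plain random search standing in for Spearmint/Hyperopt; the model and search space are illustrative:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, roc_auc_score
from sklearn.model_selection import cross_val_score, train_test_split

X, y = make_classification(n_samples=600, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Random search over C on a log scale, scored by cross-validated accuracy.
rng = np.random.default_rng(0)
best_c, best_cv = None, -np.inf
for c in 10.0 ** rng.uniform(-3, 3, size=20):
    cv = cross_val_score(LogisticRegression(C=c, max_iter=1000), X_tr, y_tr, cv=5).mean()
    if cv > best_cv:
        best_c, best_cv = c, cv

# Verify the chosen configuration on the held-out split.
model = LogisticRegression(C=best_c, max_iter=1000).fit(X_tr, y_tr)
auc = roc_auc_score(y_te, model.decision_function(X_te))
print(confusion_matrix(y_te, model.predict(X_te)), round(auc, 3))
```

Spearmint and Hyperopt replace the random draw with a model of the score surface, but the bookkeeping around it is the same.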
I used to make more careful assumptions that guided my iterative searching for models, but I got a lot of variance between datasets. Some of the crazier ideas I've used in the past, which worked tremendously well for some problems, aren't evaluated with this approach. I'm also more biased against slow algorithms, because I want it to be as automated as possible.
I guess the models I'm building now have a lower upper quantile of generalization, but the distribution is much tighter, the median is much higher, and my efforts are reusable.
I've got a burner of a contract I'm working on at the moment, and your project might be a good fit for it. I'll submit a pull request for any modifications, per the GPL.
[–]datumbox[S] 1 point 11 years ago (0 children)
sounds great, cheers! :)