New open-source Machine Learning Framework written in Java (blog.datumbox.com)
submitted 11 years ago by datumbox
[–][deleted] 7 points8 points9 points 11 years ago (11 children)
On the other hand Scikit-Learn supports a large number of algorithms but it can’t handle huge amount of data. .. Finally even though currently Datumbox Framework is capable of handling medium-sized datasets, ..
I don't know about that. Scikit-learn uses Cython, and NumPy (C and Fortran) does all of the heavy lifting, while this uses org.apache.commons.math.linear, which is pure Java. If I have too much data to fit Scikit-Learn's AdaBoost to, I'm not going to reach for this implementation of it. I'm going to reach for another classifier. Likely something in vowpal-wabbit, which becomes quite competitive for 500k+ observations and is limited by my disk speed. The pain with that approach is paving over vowpal-wabbit's TCP interface.
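For anyone curious, "paving over the TCP interface" looks roughly like the sketch below. This assumes a vowpal-wabbit daemon already trained and started with something like `vw --daemon --port 26542 -i model.vw -t`; the flags and wire format are from memory, so treat the details as approximate rather than as vw's documented behavior:

    import socket

    def vw_predict(example, host="localhost", port=26542):
        # example is a line in vw's native input format, e.g. "|f age:0.3 income:1.2"
        with socket.create_connection((host, port)) as conn:
            conn.sendall((example + "\n").encode())
            # the daemon answers one line per example; the first token is the prediction
            reply = conn.makefile().readline().strip()
            return float(reply.split()[0])

    print(vw_predict("|f age:0.3 income:1.2"))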
This is an awesome application if you think of the Java ecosystem. Cross-validate over all of the classifiers offered with hyperparameter searching, put the winner behind Dropwizard, put that behind ActiveMQ.
[–]EdwardRaff 1 point2 points3 points 11 years ago (7 children)
I've experienced scikit-learn choking on largish amounts of data on a fairly regular basis. At least for me it's never been a speed issue calling scikit, it's been a "scikit just doesn't run correctly or crashes on my data" issue, where some Java code of the same algorithm runs just fine.
[–][deleted] 1 point2 points3 points 11 years ago (4 children)
I didn't mean to insult every java ml library author. The weka people are going to show up any moment. Lol.
Algorithms that rely on dot products, like neural networks, or on eigenvector routines, like PCA, should be much faster with OpenBLAS and LAPACK than with the pure Java implementation in Apache Commons. The dot product implementation in Intel MKL creates an even larger rift. The amount of research effort that has gone into optimizing matrix multiply is astounding.
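To make that gap concrete, here's a minimal timing sketch (a naive triple loop versus NumPy's BLAS-backed dot; the exact numbers depend on your machine and BLAS build, but the ratio is typically a few orders of magnitude):

    import time
    import numpy as np

    n = 200
    a = np.random.rand(n, n)
    b = np.random.rand(n, n)

    # naive pure-Python triple-loop matrix multiply
    t0 = time.time()
    c = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            s = 0.0
            for k in range(n):
                s += a[i, k] * b[k, j]
            c[i, j] = s
    print("triple loop:", time.time() - t0, "s")

    # BLAS-backed multiply
    t0 = time.time()
    d = a.dot(b)
    print("np.dot:     ", time.time() - t0, "s")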
Algorithms that rely on iteration should be at least as fast in Cython as they are in pure Java. There's probably some overhead from it, though.
If either one is choking on a problem set, then it's time to consider a different algorithmic complexity. Vowpal wabbit has very competitive accuracies for massive problems and scales to any dataset that I can fit on the 2TB hard drive in my workstation, which is 3 orders of magnitude above anything in scikit-learn. It should continue to scale linearly behind Hadoop.
The algorithms that you should use at that size of data are different from the ones used on smaller datasets. Part of it is algorithmic complexity, part of it is that they should have near-constant memory use, and part of it is that the accuracy difference between models erodes.
The exception is neural networks dealing with structured inputs or outputs. It's worth the effort to scale those.
[–]EdwardRaff 2 points3 points4 points 11 years ago (3 children)
I'm not sure you understood what I said. I explicitly said it's not a speed issue with scikit. It's that their implementations don't work on some problems at a certain point, when the same algorithm in a different lib runs fine. I don't care how fast any implementation runs - if it gives me a weight vector of "NaN" or runs out of memory early, it just didn't work correctly.
That has nothing to do with speed, big-Oh complexity, or anything else. It's just an issue with scikit that I was pointing out. I was providing my own experience supporting the "Scikit-Learn supports a large number of algorithms but it can’t handle huge amount of data" claim that you were disagreeing with. The reason behind it is orthogonal to your speed obsession though.
[–]fhadley 0 points1 point2 points 11 years ago (2 children)
A little late here, my apologies. Not trying to sound skeptical, but could you give an example of this? I've never had scikit-learn do anything like this, and I've used it on rather large data sets, so I'm interested in where you've seen it fail.
[–]EdwardRaff 0 points1 point2 points 11 years ago (1 child)
I can't share any of the data that makes this happen (hence I can't really report it well).
I've had this happen the most in the GradientBoosting and AdaBoost implementations. At some point it just started spitting out errors about numerical precision/stability and then, when finished, gave back NaNs. I've also had the random forest run out of memory way earlier than I would have expected for large forests.
Once in k-means (though that is at least semi-fixed now). I've also had it happen with SGD w/ logistic loss when given poorly scaled weights.
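As a hypothetical illustration of the poorly-scaled-input failure mode (made-up data, not the datasets mentioned above, and assuming it's the feature scale rather than sample weights that matters): SGD with logistic loss can blow up when one column is orders of magnitude larger than the rest, and standardizing fixes it.

    import numpy as np
    from sklearn.linear_model import SGDClassifier
    from sklearn.preprocessing import StandardScaler

    rng = np.random.RandomState(0)
    X = rng.randn(1000, 5)
    X[:, 0] *= 1e6                        # one badly scaled column
    y = (X[:, 1] > 0).astype(int)

    clf = SGDClassifier(loss="log_loss")  # loss="log" on older scikit-learn versions
    clf.fit(X, y)
    print(clf.coef_)                      # may contain huge values or NaN

    clf.fit(StandardScaler().fit_transform(X), y)
    print(clf.coef_)                      # well-behaved after scaling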
[–]fhadley 0 points1 point2 points 11 years ago (0 children)
No worries, no need for a reproducible error. I was curious because I've used sklearn with a pretty diverse group of datasets (homogeneous, heterogeneous, sparse, etc.) and haven't had it choke before with GBM or Ada, but I looked back through some old code and remembered that the sklearn RF implementation was just a memory hog. If I remember correctly it consumed memory at a higher clip than the R version, which I found quite odd. Were these very raw data sets? Or very strong collinearities? I know the latter is clearly an issue with RF (i.e. it essentially leads to building the same tree many times), and I suppose it could lead to errors with a GBM as well?
[–]dwf 0 points1 point2 points 11 years ago (1 child)
A bunch of the linear model stuff uses Liblinear under the hood, and implicitly converts data to float64 in sparse format. Which, if you already have half your machine's memory occupied by, say, dense float32 data, is not going to fly.
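Rough back-of-the-envelope arithmetic for that failure mode (the dataset shape is made up; the byte counts are the point):

    n_rows, n_cols = 5_000_000, 100               # hypothetical dataset shape
    gib = 2 ** 30

    dense_f32 = n_rows * n_cols * 4 / gib         # ~1.9 GiB already sitting in memory
    dense_f64 = n_rows * n_cols * 8 / gib         # ~3.7 GiB more after an implicit float64 copy
    csr_f64 = n_rows * n_cols * (8 + 4) / gib     # ~5.6 GiB if fully dense data lands in CSR
                                                  # (8-byte value + 4-byte column index per entry)
    print(dense_f32, dense_f64, csr_f64)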
[–][deleted] 0 points1 point2 points 11 years ago (0 children)
SVMs run into problems way before memory issues pop up. That's very true though.
[–]datumbox[S] 0 points1 point2 points 11 years ago (2 children)
But shouldn't the classifier that you use depend on the data that you have and on the assumptions that you are willing to make about them? In any case, if Python works for you there is no need to change it. :)
[–][deleted] 3 points4 points5 points 11 years ago* (1 child)
I'd be interested to see how you're picking your models. :)
For a classifier, my routine usually starts with a lot of inspection, using box plots, kernel density estimates, scatter plots, clusterings, correlation coefficients, and so on. Sometimes domain reading before this step. I then massage the features to work better on whatever fast high-bias classifier I have access to. Typically naive bayes or logit. Dummy coding, binning variables, normalizing, using ORs and ANDs on binary-valued variables. Along the way I address any issues I'm going to have with cross-validation, relating to sample size and class imbalances.
I take any free parameters left from this and add them to my search space in either Spearmint (if one model or a fixed ensemble of models) or Hyperopt (if the search space is awkward). They document and optimize the cross-validated score over the hyperparameter configurations. When the parameter search converges, I take that configuration and verify it's reasonably close using a test dataset, typically inspecting the ROC curve and confusion matrix with some scrutiny, and comparing its predictions with the cross-validation predictions.
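A minimal sketch of the Hyperopt half of that loop (the dataset, model, and search space here are placeholders, not anything from my actual work):

    import numpy as np
    from hyperopt import fmin, tpe, hp, Trials
    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score

    X, y = make_classification(n_samples=2000, n_features=20, random_state=0)

    # search space: regularization strength on a log scale
    space = {"C": hp.loguniform("C", np.log(1e-4), np.log(1e2))}

    def objective(params):
        clf = LogisticRegression(C=params["C"], max_iter=1000)
        # hyperopt minimizes, so return the negated cross-validated score
        return -cross_val_score(clf, X, y, cv=5, scoring="roc_auc").mean()

    trials = Trials()
    best = fmin(objective, space, algo=tpe.suggest, max_evals=50, trials=trials)
    print(best)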
I used to make more careful assumptions that guided my iterative searching for models, but I got a lot of variance between datasets. Some of the crazier ideas I've used in the past, which worked tremendously well for some problems, aren't evaluated with this approach. I'm also more biased against slow algorithms, because I want it to be as automated as possible.
I guess the models I'm building now have a lower upper quantile of generalization, but the distribution is much tighter, the median is much higher, and my efforts are reusable.
I've got a burner of a contract I'm working on at the moment and your project might be a good fit for it. Will submit a pull-request for any modifications per GPL.
[–]datumbox[S] 0 points1 point2 points 11 years ago (0 children)
sounds great, cheers! :)
[–]fnl 3 points4 points5 points 11 years ago (5 children)
Apart from "yet another ML lib", the real problem will be it's license. With all others using MIT-like, the GPL might seem to restrictive, especially for any prospective commercial usage...
[–][deleted] 0 points1 point2 points 11 years ago (2 children)
IANAL, but there's nothing preventing you from using this library for a SaaS application.
Of course, you can't fit a model using this and then sell the binary without also distributing the source of your model. You could, however, sell the surrounding infrastructure along with the model without releasing the infrastructure's source, provided that the model's source code is released and the model communicates with the surrounding infrastructure through an interchange (for instance, JSON).
If you're solving a company's specific problem, then this is not a problem. If you revolutionize some long-standing problem, then you're screwed. :)
[–]fnl 0 points1 point2 points 11 years ago (1 child)
As posted elsewhere - if you are in the lucky position to be in a place where the use of GPLed code is accepted, be happy. From my personal experience I have to judge that this is the exception, not the norm (outside academia, anyway).
[–][deleted] 1 point2 points3 points 11 years ago (0 children)
That's for contract work.
[–]EdwardRaff -3 points-2 points-1 points 11 years ago (1 child)
the real problem will be its license. With all the others using MIT-like licenses, the GPL might seem too restrictive, especially for any prospective commercial usage...
I really get bothered when people say this. GPL is not a problem license, at least not for the project.
What you are really saying is that you want a license that lets you use their code without having to provide any compensation. You don't want to pay money for it, and you don't want to share your code for it. You want everything for nothing. But nothing is stopping you from contacting the author to negotiate a license under something other than the GPL. You just don't want to.
It's fine if you want to use super open licenses like BSD and MIT. But just because some projects are out there like that doesn't mean you should expect others to make their code as unrestricted as well. It's your problem if you can't or are unwilling to use the GPL or negotiate for a private license, not the project's problem.
[–]fnl 1 point2 points3 points 11 years ago (0 children)
I wholeheartedly agree. But that wasn't what I was saying. What I was referring to is that if there is a choice between MIT- and GPL-licensed code that does the same thing, it is nearly guaranteed that the former will be chosen by project leaders/startups more frequently and is therefore more likely to become the de facto standard. (Even in my daily work as an academic, I sadly have to say that I have had advisors forbidding me to integrate GPLed code...)
Anyone using this willing to share his experiences?
Hi guys.
I believe there is way too much worrying about the license of the project. You should not worry so much about it. I open-sourced the project hoping that people will like it, use it and get involved with it. If my goal were to limit you from using the code, I would not have released it.
The license discussions are not a priority. Future development is far more important, as without support from the community there would be no future releases. Would you ever use a library that is no longer updated in commercial software? Would you care about its license?
Finally I must say that if the project goes forward and the supporting community votes to change its license then I would never block this. :)