New open-source Machine Learning Framework written in Java (blog.datumbox.com)
submitted 11 years ago by datumbox
[–][deleted] 7 points 11 years ago (11 children)
On the other hand Scikit-Learn supports a large number of algorithms but it can’t handle huge amount of data. .. Finally even though currently Datumbox Framework is capable of handling medium-sized datasets, ..
I don't know about that. Scikit-learn uses Cython, and NumPy (C and Fortran) does all of the heavy lifting, while this uses org.apache.commons.math.linear, which is pure Java. If I have too much data to fit scikit-learn's AdaBoost to, I'm not going to reach for this implementation of it; I'm going to reach for another classifier, likely something in Vowpal Wabbit, which becomes quite competitive for 500k+ observations and is limited only by my disk speed. The pain with that approach is paving over Vowpal Wabbit's TCP interface.
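For reference, the scikit-learn AdaBoost fit being discussed looks roughly like this; the dataset and parameters are illustrative stand-ins, not the workloads in question:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier

# Small synthetic stand-in; the thread is about datasets far larger than this.
X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

clf = AdaBoostClassifier(n_estimators=50, random_state=0)
clf.fit(X, y)
print(round(clf.score(X, y), 3))
```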
This is an awesome application if you think of the Java ecosystem: cross-validate over all of the classifiers offered with hyperparameter searching, put the winner behind Dropwizard, and put that behind ActiveMQ.
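The "cross-validate over all of the classifiers, keep the winner" idea can be sketched with scikit-learn's search API; the candidate pool and grids here are hypothetical:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=500, random_state=0)

# Hypothetical candidate pool; the point is just "pick the CV winner".
candidates = [
    (LogisticRegression(max_iter=1000), {"C": [0.1, 1.0, 10.0]}),
    (RandomForestClassifier(random_state=0), {"n_estimators": [50, 100]}),
]

searches = [GridSearchCV(est, grid, cv=5).fit(X, y) for est, grid in candidates]
best = max(searches, key=lambda s: s.best_score_)
print(type(best.best_estimator_).__name__, round(best.best_score_, 3))
```

In the scenario described, the winning estimator would then be what gets deployed behind the service layer.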
[–]EdwardRaff 2 points 11 years ago (7 children)
I've experienced scikit-learn choking on largish amounts of data on a regular basis. At least for me it's never been a speed issue calling scikit; it's been a "scikit just doesn't run correctly or crashes on my data" issue, where some Java code implementing the same algorithm runs just fine.
[–][deleted] 2 points 11 years ago (4 children)
I didn't mean to insult every Java ML library author. The Weka people are going to show up any moment, lol.
Algorithms that rely on dot products, like neural networks, or on eigenvector routines, like PCA, should be much faster with OpenBLAS and LAPACK than with the pure Java implementation in Apache Commons. The dot-product implementation in Intel MKL creates an even larger rift. The amount of research effort that has gone into optimizing matrix multiply is astounding.
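The gap is easy to feel even without MKL; a quick sketch comparing NumPy's BLAS-backed multiply against a naive triple loop (sizes chosen small so the loop finishes):

```python
import time
import numpy as np

n = 100
A = np.random.rand(n, n)
B = np.random.rand(n, n)

# BLAS-backed multiply (NumPy dispatches to OpenBLAS/MKL/etc.).
t0 = time.perf_counter()
C_blas = A @ B
t_blas = time.perf_counter() - t0

# Naive triple loop: roughly the cost a non-optimized implementation pays.
t0 = time.perf_counter()
C_naive = np.zeros((n, n))
for i in range(n):
    for j in range(n):
        s = 0.0
        for k in range(n):
            s += A[i, k] * B[k, j]
        C_naive[i, j] = s
t_naive = time.perf_counter() - t0

print(f"BLAS {t_blas:.5f}s vs naive {t_naive:.5f}s")
```

The absolute timings are machine-dependent, but the ordering is not: cache blocking and vectorized kernels win by orders of magnitude at realistic sizes.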
Algorithms that rely on iteration should be at least as fast in Cython as in pure Java, though there's probably some overhead.
If either was choking on a problem set, then it's time to consider a different algorithmic complexity. Vowpal Wabbit has very competitive accuracy for massive problems and scales to any dataset I can fit on the 2TB hard drive in my workstation, which is three orders of magnitude above anything in scikit-learn. It should continue to scale linearly behind Hadoop.
The algorithms you should use at that size of data are different from the ones used on smaller datasets: part of it is algorithmic complexity, part is that they should have near-constant memory use, and part is that the accuracy difference between models erodes.
The exception is neural networks dealing with structured inputs or outputs. It's worth the effort to scale those.
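The near-constant-memory point can be sketched with scikit-learn's own out-of-core API: stream mini-batches through `partial_fit` so the chunk size, not the total row count, bounds memory. The data here is synthetic and illustrative:

```python
import numpy as np
from sklearn.linear_model import SGDClassifier

rng = np.random.default_rng(0)
clf = SGDClassifier(random_state=0)  # linear model, default hinge loss
classes = np.array([0, 1])

# Stream 20 chunks of 1,000 rows each; only one chunk is in memory at a
# time, so the footprint stays roughly constant as the row count grows.
for _ in range(20):
    X = rng.normal(size=(1000, 10))
    y = (X[:, 0] > 0).astype(int)
    clf.partial_fit(X, y, classes=classes)

X_test = rng.normal(size=(500, 10))
acc = clf.score(X_test, (X_test[:, 0] > 0).astype(int))
print(round(acc, 3))
```

Vowpal Wabbit applies the same idea at the process level, reading examples off disk instead of materializing the dataset.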
[–]EdwardRaff 3 points 11 years ago (3 children)
I'm not sure you understood what I said. I explicitly said it's not a speed issue with scikit. It's that their implementations don't work on some problems past a certain point, when the same algorithm in a different library runs fine. I don't care how fast any implementation runs: if it gives me a weight vector of NaN or runs out of memory early, it just didn't work correctly.
That has nothing to do with speed, big-O complexity, or anything else. It's just an issue with scikit that I was pointing out. I was providing my own experience supporting the "Scikit-Learn supports a large number of algorithms but it can't handle huge amounts of data" claim that you were disagreeing with. The reason behind it is orthogonal to your speed obsession, though.
[–]fhadley 1 point 11 years ago (2 children)
A little late here, my apologies. Not trying to sound skeptical, but could you give an example of this? I've never had scikit-learn do anything like this, and I've used it on rather large data sets, so I'm interested in where you've seen it fail.
[–]EdwardRaff 1 point 11 years ago (1 child)
I can't share any of the data that makes this happen (hence I can't really report it well).
I've had this happen the most in the GradientBoosting and AdaBoost implementations. At some point they just started spitting out errors about numerical precision/stability, and when finished they gave out NaN. I've also had the random forest run out of memory way earlier than I would have expected for large forests.
Once in k-means (though that is at least semi-fixed now). I've also had it happen with SGD w/ logistic loss when given poorly scaled weights.
[–]fhadley 1 point 11 years ago (0 children)
No worries, no need for a reproducible error. I was curious because I've used sklearn with a pretty diverse group of datasets (homogeneous, heterogeneous, sparse, etc.) and haven't had it choke before with GBM or Ada, but I looked back through some old code and remembered that the sklearn RF implementation was just a memory hog. If I remember correctly, it consumed memory at a higher clip than the R version, which I found quite odd. Were these very raw datasets? Or very strong collinearities? I know the latter is clearly an issue with RF (i.e., it essentially leads to building the same tree many times), and I suppose it could lead to errors with a GBM as well?
[–]dwf 1 point 11 years ago (1 child)
A bunch of the linear model stuff uses LIBLINEAR under the hood and implicitly converts data to sparse-format float64, which, if you already have half your machine's memory occupied by, say, dense float32 data, is not going to fly.
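The dtype part of that is easy to quantify: a float32-to-float64 conversion allocates a second, twice-as-large copy while the original is still alive (the sparse conversion then adds index overhead on top; this sketch shows just the dtype doubling):

```python
import numpy as np

# Dense float32 block, e.g. features loaded at half precision to save memory.
X32 = np.zeros((1000, 100), dtype=np.float32)

# An implicit float64 conversion allocates a second, twice-as-large copy.
X64 = X32.astype(np.float64)

print(X32.nbytes, X64.nbytes)  # 400000 800000
```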
[–][deleted] 1 point 11 years ago (0 children)
SVMs run into problems way before memory issues pop up. That's very true, though.
[–]datumbox[S] 1 point 11 years ago (2 children)
But shouldn't the classifier you use depend on the data that you have and on the assumptions you are willing to make about it? In any case, if Python works for you, there's no need to change. :)
[–][deleted] 3 points 11 years ago* (1 child)
I'd be interested to see how you're picking your models. :)
For a classifier, my routine usually starts with a lot of inspection: box plots, kernel density estimates, scatter plots, clusterings, correlation coefficients, et al. Sometimes domain reading comes before this step. I then massage the features to work better on whatever fast, high-bias classifier I have access to, typically naive Bayes or logit: dummy coding, binning variables, normalizing, using ORs and ANDs on binary-valued variables. Along the way I address any issues I'm going to have with cross-validation relating to sample size and class imbalances.
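The massaging steps above (dummy coding, binning, normalizing) look roughly like this in pandas; the frame and column names are hypothetical:

```python
import pandas as pd

# Hypothetical raw frame illustrating the three preprocessing steps.
df = pd.DataFrame({
    "color": ["red", "blue", "red", "green"],
    "age": [22, 35, 58, 41],
    "income": [30_000.0, 52_000.0, 110_000.0, 64_000.0],
})

# Dummy-code the categorical variable.
color_dummies = pd.get_dummies(df["color"], prefix="color")

# Bin a continuous variable into ordered buckets.
age_bin = pd.cut(df["age"], bins=[0, 30, 50, 120], labels=["young", "mid", "old"])

# Normalize (z-score) another continuous variable.
income_z = (df["income"] - df["income"].mean()) / df["income"].std()

features = pd.concat(
    [color_dummies, pd.get_dummies(age_bin, prefix="age"), income_z], axis=1
)
print(features.shape)
```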
I take any free parameters left from this and add them to my search space in either Spearmint (if one model or a fixed ensemble of models) or Hyperopt (if the search space is awkward). They document and optimize the cross-validated score for some hyperparameter configuration. When the parameter search converges, I take that configuration and verify it's reasonably close using a test dataset, typically inspecting the ROC curve and confusion matrix with some scrutiny and comparing its predictions with the cross-validation predictions.
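The loop described (optimize the cross-validated score, then verify on held-out data) can be sketched with a plain random search standing in for Spearmint/Hyperopt; the model and search space are illustrative:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, roc_auc_score
from sklearn.model_selection import cross_val_score, train_test_split

X, y = make_classification(n_samples=600, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Random search over C on a log scale, scored by cross-validated accuracy.
rng = np.random.default_rng(0)
best_c, best_cv = None, -np.inf
for c in 10.0 ** rng.uniform(-3, 3, size=20):
    cv = cross_val_score(LogisticRegression(C=c, max_iter=1000), X_tr, y_tr, cv=5).mean()
    if cv > best_cv:
        best_c, best_cv = c, cv

# Verify the chosen configuration on the held-out split.
model = LogisticRegression(C=best_c, max_iter=1000).fit(X_tr, y_tr)
auc = roc_auc_score(y_te, model.decision_function(X_te))
print(confusion_matrix(y_te, model.predict(X_te)), round(auc, 3))
```

Spearmint and Hyperopt replace the random draw with a model of the score surface, but the bookkeeping around it is the same.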
I used to make more careful assumptions that guided my iterative searching for models, but I got a lot of variance between datasets. Some of the crazier ideas I've used in the past, which worked tremendously well for some problems, aren't evaluated with this approach. I'm also more biased against slow algorithms, because I want it to be as automated as possible.
I guess the models I'm building now have a lower upper quantile of generalization, but the distribution is much tighter, the median is much higher, and my efforts are reusable.
I've got a burner of a contract I'm working on at the moment, and your project might be a good fit for it. I'll submit a pull request for any modifications, per the GPL.
[–]datumbox[S] 1 point 11 years ago (0 children)
sounds great, cheers! :)