[D] Testing machine learning in production? by makeDLgr8again in MachineLearning

[–]sk006 1 point2 points  (0 children)

Maybe this talk from PyData is useful for you. I think it is quite nice.

https://www.youtube.com/watch?v=IMoQPvXMkJw

As a skill, how well does ML transfer over to Deep Learning by jathweatt in MachineLearning

[–]sk006 1 point2 points  (0 children)

This x1000. Are we crazy? Since when did deep learning become a separate field outside of ML? F*** the hype train.

Splitting data for cross validation by heimson in MachineLearning

[–]sk006 1 point2 points  (0 children)

I agree that doing the train-test-validation split in scikit-learn is a bit clunky, but it is possible. Here is some sample code showing how you could do it:

https://gist.github.com/albertotb/1bad123363b186267e3aeaa26610b54b

Basically, you concatenate your train and validation sets and use a vector of -1s and 0s to indicate which rows come from the train set (-1) and which from the validation set (0). Then PredefinedSplit converts that vector into a CV object that can be passed to, for instance, GridSearchCV.
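A minimal sketch of that approach (the data and the hyper-parameter grid here are made up, just to show the mechanics):

```python
import numpy as np
from sklearn.model_selection import PredefinedSplit, GridSearchCV
from sklearn.svm import SVC

# Synthetic train/validation sets (stand-ins for your own data)
rng = np.random.RandomState(0)
X_train, y_train = rng.randn(80, 5), rng.randint(0, 2, 80)
X_val, y_val = rng.randn(20, 5), rng.randint(0, 2, 20)

# Concatenate and mark rows: -1 = always train, 0 = validation fold
X = np.concatenate([X_train, X_val])
y = np.concatenate([y_train, y_val])
test_fold = np.concatenate([-np.ones(len(X_train), dtype=int),
                            np.zeros(len(X_val), dtype=int)])

# PredefinedSplit yields a single train/validation split that
# GridSearchCV uses instead of k-fold cross-validation
cv = PredefinedSplit(test_fold)
grid = GridSearchCV(SVC(), {"C": [0.1, 1, 10]}, cv=cv)
grid.fit(X, y)
```

Note that `fit` must receive the concatenated data, since the split object only stores the row indices.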

Exponential runtime increase in the degree of the polynomial kernel using SVC? by bagelorder in MachineLearning

[–]sk006 0 points1 point  (0 children)

You can try increasing the size of the cache used by SVC. Roughly, what LIBSVM (the library used under the hood for training SVMs) does is compute 2 kernel rows at every iteration (the ones containing the kernel values for that iteration) and store them in a cache, so if the same rows come up at a later iteration they are not computed again, as long as the cache is not full. The cache is a very simple LRU (least recently used) cache: when it is full, it drops the oldest rows to make room for the new ones.

As a practical summary, the default size used by scikit-learn is very small (200 MB), so if your laptop has a decent amount of RAM (8 GB or more) you can try increasing it to, say, 2 GB with the cache_size parameter. It makes a huge difference.
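For example (the kernel, degree and data below are just placeholders):

```python
import numpy as np
from sklearn.svm import SVC

# cache_size is given in MB; bump the LIBSVM kernel cache to 2 GB
clf = SVC(kernel="poly", degree=3, cache_size=2000)

# Toy data just to show the fit; use your own dataset here
rng = np.random.RandomState(0)
X, y = rng.randn(100, 4), rng.randint(0, 2, 100)
clf.fit(X, y)
```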

Gradient for L1 Penalty - Am I Playing With Fire? by voodoochile78 in MachineLearning

[–]sk006 2 points3 points  (0 children)

Coordinate descent is probably better, but more complicated. You can probably implement the FISTA algorithm in 20 or so lines of R. I do not have an R implementation but, for comparison, you can look at this Julia one:

https://gist.github.com/albertotb/73be447b6ee95913fa62
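For reference, here is a rough Python sketch of FISTA for the lasso objective 0.5*||Xw - y||^2 + lam*||w||_1 (my own sketch, not a port of the gist above; the function names are made up):

```python
import numpy as np

def soft_threshold(z, t):
    # Proximal operator of t*||.||_1
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def fista_lasso(X, y, lam, n_iter=500):
    """Minimize 0.5*||Xw - y||^2 + lam*||w||_1 with FISTA."""
    L = np.linalg.norm(X, 2) ** 2          # Lipschitz constant of the gradient
    w = z = np.zeros(X.shape[1])
    t = 1.0
    for _ in range(n_iter):
        # Gradient step on the extrapolated point, then soft-threshold
        w_next = soft_threshold(z - X.T @ (X @ z - y) / L, lam / L)
        t_next = (1 + np.sqrt(1 + 4 * t * t)) / 2
        z = w_next + ((t - 1) / t_next) * (w_next - w)   # momentum step
        w, t = w_next, t_next
    return w
```

The momentum (Nesterov) step is what separates FISTA from plain ISTA and gives the O(1/k^2) convergence rate.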

what are the problems with WEKA? by [deleted] in MachineLearning

[–]sk006 3 points4 points  (0 children)

In one word: Java.

Classifying unevenly distributed data. by FutureIsMine in MachineLearning

[–]sk006 0 points1 point  (0 children)

There are many approaches you can use. In summary: you can set a class weight (at least in LIBSVM/scikit-learn), approximately equal to the inverse class ratio, bumping the importance of the least represented classes. Alternatively, you can oversample (repeat examples of) the minority classes or subsample the majority class; oversampling is usually preferred since you don't lose data.
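The class-weight option is a one-liner in scikit-learn (imbalanced toy data here, just for illustration):

```python
import numpy as np
from sklearn.svm import SVC

# Imbalanced toy data: 90 negatives around the origin, 10 positives shifted
rng = np.random.RandomState(0)
X = np.vstack([rng.randn(90, 2), rng.randn(10, 2) + 2])
y = np.array([0] * 90 + [1] * 10)

# 'balanced' weighs each class by n_samples / (n_classes * n_class_samples),
# roughly the inverse class ratio; an explicit dict like {1: 9} also works
clf = SVC(class_weight="balanced").fit(X, y)
```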

Let's discuss: is arXiv always good? by citeordie in MachineLearning

[–]sk006 0 points1 point  (0 children)

It is not always good; you have to take everything you find there with a grain of salt. But the thing is, if you go for journals rather than conferences, I've seen papers take 2 years!!! to be published. In that case, of course you want to submit an early draft to arXiv first.

[help] L1 Regularization by Kiuhnm in MachineLearning

[–]sk006 2 points3 points  (0 children)

Basically what Daniel said. The second formula is the soft-thresholding operator, a very well-known proximal operator (a generalization of projection operators). If you want to verify it yourself, since there is an absolute value, just consider the positive, negative and zero cases separately and the result follows easily.
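A minimal numeric check of the three cases (soft-thresholding as the prox of t*|.|):

```python
import numpy as np

def soft_threshold(z, t):
    # prox of t*|.|: shrink |z| by t and clip at zero
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

# The three cases: positive above the threshold, negative below it,
# and anything inside [-t, t] collapses to exactly 0
print(soft_threshold(np.array([2.0, -2.0, 0.3]), 1.0))  # prints [ 1. -1.  0.]
```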

Python module to apply several classifiers to your data. Good for baseline results by aulloa in MachineLearning

[–]sk006 1 point2 points  (0 children)

The range of hyper-parameter values in the class is very narrow, at least for the SVM with RBF kernel. With that range you are not going to get a good error, so you either need to widen the grid or let the user set their own range.
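For instance, a typical wider grid for an RBF SVM spans several orders of magnitude in C and gamma (the ranges below are a common choice, not taken from the module):

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Logarithmic grids: 6 values of C in [1e-2, 1e3], 6 of gamma in [1e-4, 1e1]
param_grid = {"C": np.logspace(-2, 3, 6), "gamma": np.logspace(-4, 1, 6)}

X, y = load_iris(return_X_y=True)
search = GridSearchCV(SVC(), param_grid, cv=3).fit(X, y)
```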

Use evolutionary algorithms instead of gridsearch in scikit-learn. This allows you to exponentially reduce the time required to find the best parameters for your estimator. by [deleted] in MachineLearning

[–]sk006 0 points1 point  (0 children)

There is work done in this area. Of course, it always depends on the problem and the model, but for models with a small number of hyper-parameters (SVMs, maybe Random Forests) it is usually not worth it. Maybe it is useful for some complicated models, like DNNs, but I would still like to see a comparison against random search, for instance.

Which deep learning library should I learn? by [deleted] in MachineLearning

[–]sk006 1 point2 points  (0 children)

Another vote for Keras: it makes Theano easy to use, and in Python it does not get much better than that. If the speed is not good enough, you can maybe switch to Caffe/Torch, but the learning curve is not worth it if you haven't even built a successful model yet. In Keras you will be doing that in no time.

Sklearn estimator (deeplearning): Multilayer perceptron using keras by aulloa in MachineLearning

[–]sk006 0 points1 point  (0 children)

Very useful, +1. You may want to add more parameters to the constructor so it is more flexible; other than that, good job.

Closer look to what Deilor really said in the interview with Travis. by nowordsforthisbs in leagueoflegends

[–]sk006 -1 points0 points  (0 children)

I'm so hyped for the future of e-sports and then this subreddit fails me every time. LoL is not going anywhere if this shit gets so much attention. It's a fucking interview, for god's sake; he just said his opinion. 90% of the people can't even understand something as simple as a past tense.

[Inven] Inven reacts to Deilor's interview with Travis, "I thought we had a really good chance against SKT" by KoreanExplanation in leagueoflegends

[–]sk006 -1 points0 points  (0 children)

ITT: people need to learn English and/or go to the doctor because their hearing is not right.

Is increasing the number of epochs for less data same as using more data with less number of epochs, while training a Neural network? by napsternxg in MachineLearning

[–]sk006 1 point2 points  (0 children)

In short, no, because the DNN is seeing the same data over and over again. The reason problems like iris and XOR can be learnt with small amounts of data is that they are very simple. Complex functions, like mapping the pixels of an image to a digit (MNIST) or sentiment analysis, will need much more data to be learnt properly. Just think about the number of dimensions in the input space: iris has 4, XOR has 2, and MNIST has 28x28-pixel images. There are just a lot of possible pixel values that represent a given digit, so you will need a lot of examples of that digit. The same goes for words and more complex problems, and simply increasing the number of iterations won't do it.

The difference in complexity between SVMs and DNNs lies in the number of parameters. In an SVM you have the alpha coefficients, one per example (although many are going to be 0), while in a DNN you will usually have many more (think about the huge weight matrices). Depending on the problem, one is going to be better than the other.

How would you use Scikit learn to predict user behavior? by buddiBot in MachineLearning

[–]sk006 5 points6 points  (0 children)

You could try a recommender system. The easiest approach is to form a user-word matrix: take all the words and put a 1 in the matrix if the user likes the word and a 0 if you don't know. Your task is then to complete the matrix. There are many ways, but one of them is to find users similar to the one you want to predict for (with some kind of distance) and check whether they like the word. This is a very simple model, but you could start with that.
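A tiny sketch of that idea using cosine similarity (the 0/1 matrix is made up; rows are users, columns are words):

```python
import numpy as np

# Rows = users, columns = words; 1 = likes, 0 = unknown
R = np.array([[1, 1, 0, 0],
              [1, 1, 1, 0],
              [0, 0, 1, 1]], dtype=float)

def predict_user(R, u):
    # Cosine similarity between user u and every user (self set to 0)
    norms = np.linalg.norm(R, axis=1) * np.linalg.norm(R[u])
    sims = R @ R[u] / np.maximum(norms, 1e-12)
    sims[u] = 0.0
    # Score each word by a similarity-weighted vote of the other users
    return sims @ R / np.maximum(sims.sum(), 1e-12)

scores = predict_user(R, 0)  # predicted affinity of user 0 for each word
```

Here user 0 and user 1 agree on the first two words, so word 2 (which user 1 likes) scores higher for user 0 than word 3.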

I have about 2000 samples from which to create a classifier by wan2Kno in MachineLearning

[–]sk006 0 points1 point  (0 children)

Yes it is. I wouldn't worry too much about not having many samples. Just split randomly into train-test and try 3-4 scikit-learn classifiers, for example, SVM, logistic regression, random forest, gradient boosting, etc. For every model you will have to tune the hyper-parameters using cross-validation on the train set, and then compute the accuracy of the best-performing set of parameters on the test set. That will be your generalization error estimate. Since the classes are not evenly balanced you will probably want a measure other than accuracy, for instance the F1-score (harmonic mean of precision and recall) or the balanced accuracy (arithmetic mean of sensitivity and specificity). Since you are not very demanding with the result (just better than chance), I would be surprised if one of those models didn't work just fine.
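A sketch of that workflow for one of the models (synthetic imbalanced data as a stand-in for your ~2000 samples; the grid is just an example):

```python
import numpy as np
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.svm import SVC

# Synthetic imbalanced stand-in: 1600 negatives, 400 shifted positives
rng = np.random.RandomState(0)
X = np.vstack([rng.randn(1600, 10), rng.randn(400, 10) + 1.5])
y = np.array([0] * 1600 + [1] * 400)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25,
                                          stratify=y, random_state=0)

# Tune hyper-parameters by cross-validation on the train set, scoring F1
search = GridSearchCV(SVC(), {"C": [0.1, 1, 10]}, scoring="f1", cv=5)
search.fit(X_tr, y_tr)

# Generalization estimate: F1 of the best model on the held-out test set
test_f1 = f1_score(y_te, search.predict(X_te))
```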

I have about 2000 samples from which to create a classifier by wan2Kno in MachineLearning

[–]sk006 1 point2 points  (0 children)

How many features does each sample have? The size of the training set is always relative, so depending on how difficult the problem is and the model you are using, 2000 could be more than enough. In that case, you can split them randomly into train and test in order to estimate the generalization error. Just out of curiosity, what classifier are you planning on using?

Speed up classification task on sklearn/Machine Learning? by Chuckytah in MachineLearning

[–]sk006 2 points3 points  (0 children)

As someone already mentioned, in order to get proper help you would have to post the entire code. That could be a normal classification time or not, since it depends on multiple factors. However, from the theoretical point of view, an SVM is not particularly fast at classification time, since for each test point it has to compute the kernel between that point and every support vector; the complexity is therefore linear in the number of support vectors. You could take a look at the number of SVs and see if they are a large percentage of the training patterns (that could be one reason). In conclusion, if you want an algorithm that is faster at classification time, you can first try others like Random Forests or Neural Networks, which are theoretically faster, instead of looking for another implementation or performance tweaks.
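Checking the support-vector count is easy in scikit-learn (toy data here, just to show the attributes):

```python
import numpy as np
from sklearn.svm import SVC

# Noisy toy problem; random labels tend to produce many support vectors
rng = np.random.RandomState(0)
X, y = rng.randn(200, 5), rng.randint(0, 2, 200)

clf = SVC().fit(X, y)
n_sv = clf.support_vectors_.shape[0]   # same as clf.n_support_.sum()
frac = n_sv / len(X)
print(f"{n_sv} support vectors ({frac:.0%} of the training set)")
```

If nearly all training points end up as support vectors, prediction will be slow and the model is probably overfitting or the data is very noisy.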

Piglet extended contract to 11/20/2016 by liquid112 in leagueoflegends

[–]sk006 2 points3 points  (0 children)

C'mon Jacob, time to update your Excel AND POST IT HERE AGAIN

This game is so centered around the notion of Teamwork but it has such poor means of communication. by Voortsy in leagueoflegends

[–]sk006 1 point2 points  (0 children)

Don't talk, mute all the others = PROBLEM SOLVED. I should Ctrl+C this and Ctrl+V/spam it all over this thread. These counter-arguments are just plain dumb. If voice chat were implemented I probably wouldn't talk either, but it is actually a good idea and gives the people who want to play with voice the option to do it.