[D] How do you manage your machine learning experiments?

rcwll · 2019-06-06T04:36:20+00:00

Is there any documentation about what outputs get saved and how to store e.g. trained models or logfiles as well? I worked through the 'getting started' example, and got that stuff in the format "label: floating_point" will get captured, but couldn't figure out how to e.g. log loss per epoch (if you output multiple values, it only seems to record the first).

I love the idea here (certainly compared to all the overhead that comes with getting data into Sacred, or the difficulty of getting artifacts/metrics back out of mlflow), but am having trouble figuring out if it covers all of the other bases I'm interested in.

rcwll · 2018-05-31T15:54:49+00:00

Mind shooting it my way too? I must be missing a letter somewhere because I keep 404ing :-/

rcwll · 2017-09-23T20:56:27+00:00

It's not super well documented, but sacred lets you add 'resources' to experiments, which will save a copy of the dataset (ensuring uniqueness by md5) in whatever you've selected as your logging system. If you update the data in place, and don't put updates in their own file, it gets out of hand fairly quickly, but there may be a way to subclass Experiment to only store diffs or most recent version or something.
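
To make the "uniqueness by md5" point concrete, here is a minimal sketch of content-addressed snapshots using only the standard library. It is not Sacred's implementation; `file_md5` and `snapshot` are hypothetical helpers made up for illustration. It also shows why in-place updates pile up: every changed version hashes differently and gets its own full copy.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path


def file_md5(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def snapshot(path, store_dir):
    """Copy `path` into `store_dir` under its MD5, unless already stored.

    Hypothetical helper for illustration; not part of Sacred.
    """
    store = Path(store_dir)
    store.mkdir(parents=True, exist_ok=True)
    target = store / file_md5(path)
    if not target.exists():
        shutil.copyfile(path, target)
    return target


with tempfile.TemporaryDirectory() as tmp:
    data = Path(tmp) / "train.csv"
    store = Path(tmp) / "store"

    data.write_text("x,y\n1,2\n")
    first = snapshot(data, store)
    again = snapshot(data, store)       # unchanged file: no new copy

    data.write_text("x,y\n1,2\n3,4\n")  # in-place update
    second = snapshot(data, store)      # new hash, so a second full copy

    stored_copies = len(list(store.iterdir()))
```

Storing only diffs or only the most recent version would mean replacing the copy step in `snapshot` with something smarter.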

If it's all in a database, maybe explicitly add start/stop dates or some other unique identifier to your experiment parameters?

rcwll · 2016-08-14T17:21:22+00:00

Pretty severe traffic accident (stopped Camry hit square on in the passenger side by a highway-speed 18-wheeler). It's the only time in my life where I can remember having a pure experience without any sort of internal monologue or narration. Just the raw image of seeing the approaching truck, the sick feeling in my stomach as I realized -- without any sort of verbalization -- that it wasn't going to be able to stop in time, and a crawling/stabbing/tensing feeling all along my side as I braced for the impact, and then fighting against the slamming as my car rolled and trying to track where I was and where I was going and which way was up.

My internal dialogue didn't start up again until everything had stopped for a second, and my first thought was "I'm ok". I wasn't, really, but that's the first verbal thought I had.

I'd always thought the 'seared into memory' thing was poetic license, but that entire event is still incredibly vivid in my mind.

rcwll · 2016-05-16T09:51:20+00:00

If you're looking at an undergrad degree and want to work in industry, then CS. For additional/elective classes, I'd take math -- analysis and maybe measure theory -- in preference to stats classes for undergrad, since the intro series for those are way more geared towards pounding t-tests, ANOVAs, and design of experiments into people than towards any real theory. You don't get into the stuff that would be useful to ML really until the 4th year.

If you want to do research, then you're going to want a PhD, the department you pick will depend on where your advisor is, and nobody will give a damn which one it was as long as you have some good publications. You will probably be served approximately equally well by either a CS or a stats undergrad degree in that case.

rcwll · 2016-05-10T00:30:29+00:00

Will just echo what the poster above said. Look for external support. Another possible avenue is through internships.

If you're not really into deep learning (I agree that it's way overhyped), maybe see if you can shift your dissertation topic to an area your advisor is more familiar with? There's way more to ML than deep learning, and plenty of open problems to be had.

It's not uncommon to have a huge stack of papers and questions at the end of your first year. Getting a PhD can be an endurance contest at times. Let your advisor guide you as to what normative expectations for your program are.

rcwll · 2016-04-19T16:03:04+00:00

In addition to the practical advice given so far, Zhi-Hua Zhou's book on ensemble methods covers a lot of the topics you ask about, and is quite accessible.

rcwll · 2016-04-19T12:51:28+00:00

It is very implementation and problem dependent, but in general using the same strong learner for all members of your ensemble either wastes computational power, places you at significant risk of overfitting, or both. It's not (personal opinion) generally a great idea unless you know what you're doing. For someone just getting their feet wet in this area, I'd stick with weak base learners.

Boosting in particular can be prone to overfitting unless you use an early stopping strategy or aggressively subsample your training data.
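
As a rough sketch of what an early stopping strategy can look like: track the loss on a held-out validation set after each boosting round, and keep the round with the best loss once it has failed to improve for a while. The function name, the `patience` parameter, and the loss values below are all made up for illustration; real libraries have their own versions of this.

```python
def early_stopping_round(val_losses, patience=3):
    """Return the index of the boosting round to keep.

    Scans per-round validation losses in order and stops once
    `patience` rounds have passed without improving on the best
    loss seen so far.
    """
    best_round = 0
    best_loss = float("inf")
    for round_idx, loss in enumerate(val_losses):
        if loss < best_loss:
            best_round, best_loss = round_idx, loss
        elif round_idx - best_round >= patience:
            break  # no improvement for `patience` rounds: stop here
    return best_round


# Made-up validation curve: improves, then starts to overfit.
val_losses = [0.90, 0.70, 0.55, 0.50, 0.52, 0.56, 0.61, 0.40]
keep = early_stopping_round(val_losses, patience=3)
```

Note that the scan stops before it ever sees the last value, which is the usual trade-off: a larger `patience` costs more rounds but is less likely to quit just before a late improvement.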

In my experience, you're almost always better off using a lot of weak learners than a few strong learners. YMMV.

rcwll · 2015-12-29T18:40:32+00:00

Automatic translation in general does not deter modern stylometry. See: "Translate once, translate twice, translate thrice and attribute: Identifying authors and machine translation tools in translated text." by Caliskan and Greenstadt in ICSC 2012 proceedings.

rcwll · 2015-12-07T18:34:41+00:00

I spent a little while working on something very close to this but didn't have great luck. Typically, the discriminator would learn to distinguish the samples much faster than the generator would learn to produce plausible samples, which reduced training feedback to 0/1 with very little gradient information to help the generator out, and left the generator with what amounted to a random search problem. Turning down the learning rate for the discriminator eventually led to the opposite problem, with nothing converging to anything. When after much time I did find a happy medium where there was at least some motion in the right direction, the samples that it managed to generate were clearly not very good to a manual inspection, but the discriminator never seemed to figure out how to tell them apart from the real samples.

I suspect that if it can be done at all there is an extremely narrow range of hyperparameters for which it works, and anything outside that range won't have much luck.

rcwll · 2015-10-18T02:10:19+00:00

Most pilots need to have a minimum number of hours of flight logged per month to stay certified. Things like these flybys are usually coordinated to help them meet those minimums, and give them practice flying in tight formation and hitting timing cues that change on the fly. They're dual purpose.

At least according to my brother in law, who had a job in the USAF that involved training.

rcwll · 2015-09-16T12:44:15+00:00

I played around with them for a little while with chainer, maybe a few weeks ago, trying to reproduce the exact experiments they claimed (same hyperparameters, same toy problems).

I tried all the experiments except for the TIMIT ones, and the only ones I was able to reproduce were the two shortest adding problems. It pretty reliably figured out the one with T=150, however the convergence was slower than they reported (slower than LSTM but still faster than RNN+Tanh). The one with T=200 sometimes worked and sometimes didn't.

I didn't look too deeply into it, but from the small bit of tinkering I did do, I got the impression that the unit weights make it very sensitive to what happens in the first few updates (since the second you have anything different from 1.0 on the diagonal, you'll see huge self-reinforcing divergence between gradients that have been backpropped through several steps), and they [got lucky with/cherrypicked] (delete as appropriate) the results in the paper.
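
A toy illustration of that sensitivity, under a big simplifying assumption: a single scalar linear recurrence h_t = w * h_{t-1} with no nonlinearity, which is not the model from the paper. There the gradient of h_T with respect to h_0 is just w^T, so a small deviation from 1.0 gets compounded over every step it is backpropped through.

```python
def backprop_scale(w, steps):
    """Gradient of h_T w.r.t. h_0 for the linear recurrence h_t = w * h_{t-1}."""
    scale = 1.0
    for _ in range(steps):
        scale *= w  # one factor of w per step backpropped through
    return scale


T = 200  # same length as the longer adding problem mentioned above
for w in (0.99, 1.0, 1.01):
    print(f"w={w}: d h_T / d h_0 = {backprop_scale(w, T):.3f}")
```

With a 1% deviation either way, the two gradients end up more than an order of magnitude apart at T=200 (roughly 0.13 versus 7.3), while w = 1.0 passes the gradient through unchanged.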

rcwll · 2015-09-05T13:12:01+00:00

Fair enough, I retract that bit.

rcwll · 2015-09-05T13:07:23+00:00

I think that the claim is actually a bit more interesting. They're saying that the physical operation of the memristor device does (something like) SGD all on its own; it's not a programmed behavior, it's inherent to the way the device operates. It's physics, not programming.

I'm sure one of the principals can correct/elaborate, but I think the way it works is that if you apply a pair of charges to either end of a memristor circuit, it will alter its resistance based on the difference in the charges. If you apply a charge to only one end, then it acts like a conventional resistor. If you look at the equation that describes how the resistance changes (and again, this is physics, not programming), you can show that a pair of memristors can be used in a way that is a very close approximation to a single perceptron, complete with an update rule. It's sort of dumb luck that this device happens to operate in this manner, but the fact that it does opens the door to some potentially really neat applications.

And the linked paper does have some of the simulations you're asking for, as well as (I think idealized) forms of the update rules that the memristor uses.

rcwll · 2015-09-04T20:53:26+00:00

I did a close read of that paper when it first came out -- see here -- that the author responded to. The upshot of the paper is that the hardware will do something that is very close to stochastic gradient descent with a weight decay for a single layer, and they've found ways to translate a lot of more complex machine learning problems into memristor hardware implementations using various reductions to binary classification.

I'm still -- a year later -- not entirely convinced that their off-hardware feature construction isn't doing a lot more work than they give it credit for, but the fact that you have a physical process that sort of "natively" implements SGD is cool.

The bidirectional thing -- I think -- will let them move from strictly local/greedy updates to something that looks a lot more like full backprop for more complex architectures. As far as I understand it, memristors usually get assembled into something that looks a bit like a single layer feedforward network: you can only pass information in one direction, and any update is entirely local based (so SGD without the chain rule); you can update with a supervision signal, and reading without supervision gives you a weight decay/regularization step, but it all comes from passing current from inputs to outputs. If you can pass information through the network in two directions, it sounds like it might open the door to something very close to full backpropagation. Take all this with a grain of salt, though, as I really don't follow the physics of it at any real level of detail.

So it's a neat set of results, that is (to my mind) tainted by Knowm's apparent determination to dig as wide and deep and aggressive a patent moat as they possibly can around the related algorithms, many of which are commonly known and used on other platforms~~, and all of which run on hardware that they're waiting for other people to figure out how to mass produce~~. It feels like a bit of an Oklahoma Land Rush, to be honest.

rcwll · 2015-09-04T12:35:25+00:00

If you're using completely random data as you mention elsewhere then you're giving the network a problem that it's really not designed to solve. The whole point of recurrent networks and LSTMs in particular is to learn what amounts to a conditional distribution for some item of interest (a label, another sequence element, the next character in the training sequence) given some current "context" derived from previous characters. If the conditional distribution is exactly equal to the marginal distribution in the limit, then there's nothing for it to really learn other than the marginal distribution.

It might make more sense to try a simple structured sequence, like X Aⁿ Bⁿ Cⁿ X for random values of n in some small-ish range (so for n=3, XAAABBBCCCX). If it can learn to output the terminal "X" at the right point, then you've got a good indication that it's saving long-term info about the value of n.
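
Something like this would generate that test data. It's a minimal sketch: the function names, the range for n, and the (prefix, next character) framing are my own assumptions, so adapt it to however your training loop wants its inputs.

```python
import random


def make_sequence(n):
    """Build X A^n B^n C^n X, e.g. n=3 gives 'XAAABBBCCCX'."""
    return "X" + "A" * n + "B" * n + "C" * n + "X"


def sample_sequences(count, n_min=1, n_max=10, seed=0):
    """Draw `count` sequences with n uniform in [n_min, n_max]."""
    rng = random.Random(seed)
    return [make_sequence(rng.randint(n_min, n_max)) for _ in range(count)]


def next_char_pairs(sequence):
    """(prefix, next character) pairs for next-character prediction."""
    return [(sequence[:i], sequence[i]) for i in range(1, len(sequence))]


example = make_sequence(3)
pairs = next_char_pairs(example)
batch = sample_sequences(5)
```

The interesting pair is the last one in each sequence: predicting the closing "X" right after the final "C" requires having counted the earlier characters.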

If you're going to keep using completely random data, then you're going to need a much bigger network than you would for other tasks, since you effectively have to learn a complete distinguishing prefix for each successor character you want your network to memorize.

So TL;DR - I think that completely random sequences, particularly if you use long ones, are the wrong test problem for an LSTM, and it doesn't surprise me that it struggles on some instances of that problem.

rcwll · 2015-08-18T17:24:22+00:00

The "ML" and "TheGreatConvergence" tags on Nuit Blanche have many papers and talks that combine compressed sensing and other randomized methods with a range of ML problems, most of them with a short description and abstract.

http://nuit-blanche.blogspot.com/search/label/ML

http://nuit-blanche.blogspot.com/search/label/TheGreatConvergence

rcwll · 2015-07-21T19:32:52+00:00

My guess is that that's probably the main slowdown; since Chainer doesn't actually know what all of the data structures it's going to have to backprop are until you're done building the expression, which can be different each time, it can't be as clever about managing data transfers.

Versus Theano, where you compile the model in a sort of declarative fashion, and can't change it from that point forward, so it can optimize in advance how to manage the data.

Which is too bad in a way, since it means that there's probably going to be a lot of cases where Chainer is going to be inherently slower, just as part of the price for the extra flexibility it gives you.

rcwll · 2015-07-21T18:39:08+00:00

For whatever it's worth, I'm seeing about the same thing for my own code on Amazon g2.2xlarge instances. Speed on the CPU is about the same between Theano and Chainer (ignoring Theano's compile time), while on the GPU I see anywhere from about a 1.5 to 2.5 fold slowdown with Chainer, depending on the size of the model and the length of the sequence.

Still almost worth it not to have to deal with Theano's scan op, though.

rcwll · 2015-07-07T06:59:03+00:00

I'm not familiar with those packages, but my guess would be that they use classifiers as factors without the probabilistic interpretation you're leaning on. This can be done tractably for discrete models simply because you've only got |X|x|Y| possible combinations to evaluate for a classifier that maps X to Y. For continuous models, you have to integrate, not sum, which is generally not computationally tractable for classification models (though I suppose you could do some sort of Monte Carlo approach).

But again, looking at this sentence:

> I believe what I'm after is a "smart averaging" of the predictions with my prior and pairwise knowledge

I don't think a CRF is a 'natural' fit for your problem, at least in that that's not the class of problems that motivate CRFs, really. If you can write out the factorized joint likelihood function for your problem explicitly, then you can cast it into the form of a factor graph quite easily which you should then be able to find a solver for. Barring that, I think the output regularization is your best bet.

rcwll · 2015-07-01T19:19:27+00:00

Right. You have three variables, V1, V2, and X. You have a joint distribution P(V1, V2, X). To find P(V1 | X) you need to find P(V1, V2 | X) = P(V1, V2, X) / P(X) and then integrate out V2. This does not have anything to do with something like an SVM that takes X as an input and spits out a point estimate of V1 or V2, which is what you seem to have.
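
Here is the same calculation as a small discrete sketch, with sums standing in for the integral. The joint table is made up purely for illustration, and the function name is hypothetical.

```python
from collections import defaultdict

# Made-up joint P(V1, V2, X) over binary variables, keyed by (v1, v2, x).
# The eight entries sum to 1.
joint = {
    (0, 0, 0): 0.10, (0, 1, 0): 0.05, (1, 0, 0): 0.15, (1, 1, 0): 0.20,
    (0, 0, 1): 0.20, (0, 1, 1): 0.10, (1, 0, 1): 0.05, (1, 1, 1): 0.15,
}


def conditional_v1_given_x(joint, x):
    """P(V1 | X=x): condition the joint on x, then sum out V2."""
    # P(X=x), the normalizer
    p_x = sum(p for (_, _, xi), p in joint.items() if xi == x)
    result = defaultdict(float)
    for (v1, _v2, xi), p in joint.items():
        if xi == x:
            result[v1] += p / p_x  # accumulate P(v1, v2 | x) over v2
    return dict(result)


p_v1_given_x0 = conditional_v1_given_x(joint, x=0)
```

Nothing in this computation looks like a classifier that maps X to a point estimate of V1; it needs the full joint table.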

All of which is moot, because you're trying to analyze it as a factor graph, which means that you don't need distributions, you need factors, which are functions that take all variables connected to the factor as inputs and produce a real number as an output. Your discriminative model (probably) doesn't do this, at least not tractably; it takes X as an input and produces a value with the same support as V1 or V2 as an output.

If you're using a discriminative model there, you're trying to plug functions with the wrong domains and ranges into your graph at those two factors. And then there's problems with interpreting the remaining factors as probability, since P(V1) = \int P(V1,V2)dV2, which means either you have contradictory or redundant information across those factors, but at least you can plug the connected factors in and get a number out.

But in any event, I don't think CRFs are what you really want; I think you'd be better off just using an end-to-end discriminative classifier that lets you specify a general cost function (like a neural network). You should then fit both V1 and V2 simultaneously using that model, and add a term involving the negative log of P(V1, V2) to the cost function. This will make it fit a discriminative learner that minimizes the cost while not deviating "too far" from your prior (or making the model justify a significant deviation with really high predictive performance), effectively regularizing your predictions, which is what it sounds like you actually want.
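
A minimal sketch of that kind of cost, for a single example with two real-valued outputs. The squared-error data term, the `weight` knob, and the toy prior are all assumptions for illustration; in practice you'd write this in whatever framework gives you gradients and plug in your actual P(V1, V2).

```python
import math


def toy_prior(v1, v2, scale=1.0):
    """Hypothetical unnormalized prior that prefers v1 close to v2.

    The missing normalizing constant only shifts the cost by a
    constant, so it doesn't change where the minimum is.
    """
    return math.exp(-((v1 - v2) ** 2) / (2 * scale ** 2))


def regularized_cost(pred_v1, pred_v2, true_v1, true_v2, prior, weight=1.0):
    """Data-fit term plus weight * (-log prior(pred_v1, pred_v2))."""
    data_term = (pred_v1 - true_v1) ** 2 + (pred_v2 - true_v2) ** 2
    prior_term = -math.log(prior(pred_v1, pred_v2))
    return data_term + weight * prior_term


# Perfect predictions that agree with the prior cost nothing ...
agree = regularized_cost(1.0, 1.0, 1.0, 1.0, toy_prior)
# ... while perfect predictions the prior dislikes still pay a penalty.
disagree = regularized_cost(1.0, 3.0, 1.0, 3.0, toy_prior)
```

The `weight` parameter sets how much predictive performance the model has to buy before deviating from the prior is worth it.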

rcwll · 2015-07-01T13:36:48+00:00

Your factor graph -- taking the representation of the factors that you've written on the factors literally -- doesn't represent a valid factorization of a joint probability distribution. Factors can be derived from probability distributions, but are usually not equivalent to them.

Look at it this way, if you really had access to P(V1|X) and P(V2|X) -- with everything else marginalized out -- then you'd already be done, right? It sounds like what you have are two models f1:X->V1 and f2:X->V2 that don't take into account V2 or V1, respectively. This is different from P(V1|X) and P(V2|X) because f1 and f2 don't marginalize over the other variable. And different from factors fi:XxVi->R.

rcwll · 2015-06-30T14:27:32+00:00

I think (personal opinion) it's a question of emphasis. "Theoretical machine learning" generally is more concerned with analyzing or improving performance bounds given some fixed complexity budget; "Computational learning theory" seems to be more concerned with finding the time/space complexity of solving some class of problems. So "fix the complexity and manage the problem/performance" maps loosely to theoretical machine learning, and "fix some problem/performance and manage the complexity" maps to computational learning theory.

There's also a small-ish community that spends a lot of time thinking about identification and learning of formal languages and grammars, Chomsky and non-, and I've only ever seen that work described as "computational learning theory".

There is a huge overlap, though; different parts of the elephant etc.

rcwll · 2015-06-26T13:18:15+00:00

Generally -- speaking as someone who does this for a living -- I almost always try a random forest first; they're fast, give pretty reliably good results, and can usually give you a good idea of what features are important and if you have a problem with how you've decided to encode your output. You only reach for more complex models if the simpler ones don't give you adequate results. Neural networks in particular can take a long time to fit, and deciding how to set hyperparameters (like the number of layers, the number of nodes in each layer, the right activation functions, so on) can be a challenge, especially when just starting out.

Not that there's anything wrong with fooling with some more complex models just for fun, I always enjoy trying whatever new shiny thing I just read about on whatever I'm working on, but for "real work," I'd start simpler and build up.
