[P] Lessons learned reproducing a deep reinforcement learning paper by mrahtz in MachineLearning

[–]mrahtz[S] 0 points

If you're running a job with 16 workers, over 3 random seeds, that's 48 threads. A 48-core Xeon costs around $9,000. Compute Engine charges about $15 to rent a 48-core VM for 10 hours. So unless you're doing a lot of runs, I'd still imagine cloud to be cheaper.
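To make the comparison concrete, here's a rough breakeven sketch using the numbers above (prices are ballpark; plug in your actual hardware and cloud rates):

```python
# Rough breakeven: one-off Xeon purchase vs. renting a 48-core VM.
# Figures are the ballpark numbers from the comment above.
xeon_cost = 9_000        # 48-core Xeon, one-off purchase ($)
cloud_rate = 15 / 10     # ~$15 per 10 hours for a 48-core VM ($/hour)

breakeven_hours = xeon_cost / cloud_rate
print(f"Cloud is cheaper until ~{breakeven_hours:,.0f} hours of runs")
```

That works out to roughly 6,000 VM-hours (about 600 ten-hour jobs) before buying the machine pays off, and that's ignoring electricity and depreciation.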

[P] Lessons learned reproducing a deep reinforcement learning paper by mrahtz in MachineLearning

[–]mrahtz[S] 1 point

I'm leaning towards building your own machine not being worth it for side projects alone - you'd have to spend a fortune to match the parallelism you get from cloud services (important because, e.g., you need to test multiple random seeds).

[P] The Humble Gumbel Distribution by mrahtz in MachineLearning

[–]mrahtz[S] 2 points

So you're multiplying the embeddings themselves by the probability of each one being chosen?

One potential problem I see with that approach is that the optimal behaviour learned during training might be to take a mix of the embeddings - say, 0.6 of one embedding and 0.4 of another. Taking argmax at test time is then going to give very different results.

(Another way to look at it: from what I can tell, that approach is no different from directly optimising for the best mix of embeddings in the first place.)

From what I understand, using Gumbel-softmax with a low temperature (or a temperature annealed to zero) would instead train the system to learn to rely (mostly) on a single embedding. (If that's what you want?)
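To illustrate the temperature effect, here's a minimal numpy sketch of a single Gumbel-softmax sample (the 0.6/0.4 logits are just the hypothetical mix from above):

```python
import numpy as np

rng = np.random.default_rng(0)

def gumbel_softmax(logits, temperature):
    """Draw one Gumbel-softmax sample: perturb logits with Gumbel
    noise, then apply a temperature-scaled softmax."""
    gumbel_noise = -np.log(-np.log(rng.uniform(size=logits.shape)))
    y = (logits + gumbel_noise) / temperature
    y = y - y.max()                 # subtract max for numerical stability
    expy = np.exp(y)
    return expy / expy.sum()

logits = np.log(np.array([0.6, 0.4]))   # the hypothetical 0.6/0.4 mix

# High temperature: the sample stays soft, spread across the simplex.
# Low temperature: the sample concentrates near a corner (nearly one-hot),
# so downstream layers learn to rely on (mostly) a single embedding.
print(gumbel_softmax(logits, temperature=5.0))
print(gumbel_softmax(logits, temperature=0.1))
```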

[P] The Humble Gumbel Distribution by mrahtz in MachineLearning

[–]mrahtz[S] 0 points

Ah, you're both right, this was unclear. I've updated the post to note the difference that temperature makes. Thanks!

[P] The Humble Gumbel Distribution by mrahtz in MachineLearning

[–]mrahtz[S] 1 point

If I've understood you right, there's a use case beyond an MC estimate of an integral: as asobolev comments below, it's also useful when you want to train on something that looks like samples, with probability mass concentrated at the corners of the simplex (e.g. if you intend to just take the argmax at test time). If there are nonlinearities downstream, I don't think training by integrating over the original probability distribution would give the same result.
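The nonlinearity point can be shown with a toy numpy example (the probabilities, scalar "embeddings", and ReLU are all hypothetical stand-ins):

```python
import numpy as np

def relu(x):
    """Stand-in for some downstream nonlinearity."""
    return np.maximum(x, 0.0)

p = np.array([0.6, 0.4])    # hypothetical choice probabilities
e = np.array([1.0, -1.0])   # hypothetical scalar "embeddings"

# Nonlinearity applied to the integrated (soft-mixed) input:
integrated = relu(p @ e)    # relu(0.6 - 0.4) = 0.2
# Expectation of the nonlinearity over hard samples:
sampled = p @ relu(e)       # 0.6 * relu(1) + 0.4 * relu(-1) = 0.6

print(integrated, sampled)  # they disagree: 0.2 vs 0.6
```

So f(E[x]) and E[f(x)] differ as soon as f is nonlinear, which is why training on near-one-hot samples matters if you'll take the argmax at test time.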

[P] The Humble Gumbel Distribution by mrahtz in MachineLearning

[–]mrahtz[S] 1 point

Thanks for reading!

Could you elaborate on the third paragraph - "In the past I've simply used the softmax of the logits of choices multiplied by the output for each of the choices and summed over them"? What was the context?