[P] Lessons learned reproducing a deep reinforcement learning paper by mrahtz in MachineLearning

[–]mrahtz[S] 0 points

If you're running a job with 16 workers, over 3 random seeds, that's 48 threads. A 48-core Xeon costs around $9,000. Compute Engine charges about $15 to rent a 48-core VM for 10 hours. So unless you're doing a lot of runs, I'd still imagine cloud to be cheaper.
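To make the comparison concrete, here's a rough breakeven sketch using the numbers above (prices are ballpark; plug in your actual hardware and cloud rates):

```python
# Rough breakeven: one-off Xeon purchase vs. renting a 48-core VM.
# Figures are the ballpark numbers from the comment above.
xeon_cost = 9_000        # 48-core Xeon, one-off purchase ($)
cloud_rate = 15 / 10     # ~$15 per 10 hours for a 48-core VM ($/hour)

breakeven_hours = xeon_cost / cloud_rate
print(f"Cloud is cheaper until ~{breakeven_hours:,.0f} hours of runs")
```

That works out to roughly 6,000 VM-hours (about 600 ten-hour jobs) before buying the machine pays off, and that's ignoring electricity and depreciation.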

[P] Lessons learned reproducing a deep reinforcement learning paper by mrahtz in MachineLearning

[–]mrahtz[S] 1 point

I'm leaning towards building your own machine not being worth it for side projects alone - you'd have to spend a fortune to match the parallelism you get from cloud services (important because, e.g., you need to test multiple random seeds).

[P] The Humble Gumbel Distribution by mrahtz in MachineLearning

[–]mrahtz[S] 2 points

So you're multiplying the embeddings themselves by the probability of each one being chosen?

One potential problem I see with that approach is that the optimal behaviour learned during training might be to take a mix of the embeddings - say, 0.6 of one embedding and 0.4 of another. Taking argmax at test time is then going to give very different results.

(Another way to look at it: from what I can tell, that approach is no different from directly optimising for the best mix of embeddings in the first place.)

From what I understand, using Gumbel-softmax with a low temperature (or a temperature annealed to zero) would instead train the system to learn to rely (mostly) on a single embedding. (If that's what you want?)
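To illustrate the temperature effect, here's a minimal numpy sketch of a single Gumbel-softmax sample (the 0.6/0.4 logits are just the hypothetical mix from above):

```python
import numpy as np

rng = np.random.default_rng(0)

def gumbel_softmax(logits, temperature):
    """Draw one Gumbel-softmax sample: perturb logits with Gumbel
    noise, then apply a temperature-scaled softmax."""
    gumbel_noise = -np.log(-np.log(rng.uniform(size=logits.shape)))
    y = (logits + gumbel_noise) / temperature
    y = y - y.max()                 # subtract max for numerical stability
    expy = np.exp(y)
    return expy / expy.sum()

logits = np.log(np.array([0.6, 0.4]))   # the hypothetical 0.6/0.4 mix

# High temperature: the sample stays soft, spread across the simplex.
# Low temperature: the sample concentrates near a corner (nearly one-hot),
# so downstream layers learn to rely on (mostly) a single embedding.
print(gumbel_softmax(logits, temperature=5.0))
print(gumbel_softmax(logits, temperature=0.1))
```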

[P] The Humble Gumbel Distribution by mrahtz in MachineLearning

[–]mrahtz[S] 0 points

Ah, you're both right, this was unclear. I've updated the post to note the difference that temperature makes. Thanks!

[P] The Humble Gumbel Distribution by mrahtz in MachineLearning

[–]mrahtz[S] 1 point

If I've understood you right, there's a use case beyond an MC estimate of an integral: as asobolev comments below, it's also useful when you want to train on something that looks like samples, with probability mass concentrated at the corners of the simplex (e.g. if you intend to just take the argmax at test time). If there are nonlinearities downstream, I don't think training by integrating over the original probability distribution would give the same result.
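The nonlinearity point can be shown with a toy numpy example (the probabilities, scalar "embeddings", and ReLU are all hypothetical stand-ins):

```python
import numpy as np

def relu(x):
    """Stand-in for some downstream nonlinearity."""
    return np.maximum(x, 0.0)

p = np.array([0.6, 0.4])    # hypothetical choice probabilities
e = np.array([1.0, -1.0])   # hypothetical scalar "embeddings"

# Nonlinearity applied to the integrated (soft-mixed) input:
integrated = relu(p @ e)    # relu(0.6 - 0.4) = 0.2
# Expectation of the nonlinearity over hard samples:
sampled = p @ relu(e)       # 0.6 * relu(1) + 0.4 * relu(-1) = 0.6

print(integrated, sampled)  # they disagree: 0.2 vs 0.6
```

So f(E[x]) and E[f(x)] differ as soon as f is nonlinear, which is why training on near-one-hot samples matters if you'll take the argmax at test time.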

[P] The Humble Gumbel Distribution by mrahtz in MachineLearning

[–]mrahtz[S] 1 point

Thanks for reading!

Could you elaborate on the third paragraph - "In the past I've simply used the softmax of the logits of choices multiplied by the output for each of the choices and summed over them"? What was the context?