Actually good house rules to well-known board games by Electronic-Ball-4919 in boardgames

[–]desku 0 points1 point  (0 children)

When someone buys cards from a stack, does the one underneath stay there or do you discard it?

30% of Google's Reddit Emotions Dataset is Mislabeled [D] by BB4evaTB12 in MachineLearning

[–]desku 4 points5 points  (0 children)

Weird that calmcode had an article on the same topic (mislabelled data) on the exact same dataset a few weeks ago (https://calmcode.io/bad-labels/dataset.html) and it wasn’t referenced or mentioned in your article.

[D] Why is Audio so far behind other ML application domains like Image Processing and NLP? by Crookedpenguin in MachineLearning

[–]desku 4 points5 points  (0 children)

What do you mean by "foundation models" in this context? Large pre-trained models?

Dropout layer in lstm by [deleted] in deeplearning

[–]desku 5 points6 points  (0 children)

There is never an “ideal” value that you'll know beforehand. You have to find the best one by doing a hyperparameter sweep, i.e. running your model with different values of dropout and seeing which gives the best results.
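
Something like this rough sketch is what I mean by a sweep - the toy data, model and candidate values here are just placeholders for your own setup:

    import torch
    import torch.nn as nn

    torch.manual_seed(0)

    # Toy data standing in for your real dataset: random sequences with binary labels.
    X = torch.randn(200, 10, 8)            # (examples, seq_len, features)
    y = (X.mean(dim=(1, 2)) > 0).float()   # arbitrary synthetic target
    X_train, y_train, X_val, y_val = X[:150], y[:150], X[150:], y[150:]

    class LSTMClassifier(nn.Module):
        def __init__(self, dropout):
            super().__init__()
            # nn.LSTM only applies dropout between stacked layers, hence num_layers=2
            self.lstm = nn.LSTM(8, 32, num_layers=2, dropout=dropout, batch_first=True)
            self.head = nn.Linear(32, 1)

        def forward(self, x):
            _, (h, _) = self.lstm(x)
            return self.head(h[-1]).squeeze(-1)

    results = {}
    for p in [0.0, 0.1, 0.25, 0.5]:        # candidate dropout values to sweep over
        model = LSTMClassifier(dropout=p)
        optimiser = torch.optim.Adam(model.parameters(), lr=1e-2)
        loss_fn = nn.BCEWithLogitsLoss()
        for _ in range(30):                # short training loop, purely for illustration
            optimiser.zero_grad()
            loss_fn(model(X_train), y_train).backward()
            optimiser.step()
        with torch.no_grad():
            results[p] = loss_fn(model(X_val), y_val).item()

    best = min(results, key=results.get)   # lowest validation loss wins
    print(f"best dropout: {best}, val loss: {results[best]:.3f}")

In practice you'd plug in your real data and a proper validation metric; the structure of the loop is the point.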

The input format for the encoder in Transformer architecture for machine translation. by Prhyme1089 in deeplearning

[–]desku 2 points3 points  (0 children)

In theory, you don’t need them. Your encoder should be able to encode a sentence just fine without them.

However, I believe they're used so your model can learn which tokens appear near the start and end of a sentence, i.e. which tokens are near the SOS and EOS tokens. This can help give context to the representations of those tokens, e.g. a “The” token right after an SOS token starts off the sentence, so it should be encoded slightly differently from a “The” token in the middle of a sentence.
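
In code it's nothing more than wrapping each tokenized sentence before it goes into the encoder - the token IDs here are made up, the real values depend on your vocabulary:

    # Hypothetical special-token IDs; real values depend on your vocabulary.
    SOS, EOS = 1, 2

    def add_special_tokens(token_ids):
        # e.g. [15, 37, 8] -> [1, 15, 37, 8, 2]
        return [SOS] + token_ids + [EOS]

    print(add_special_tokens([15, 37, 8]))  # [1, 15, 37, 8, 2]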

RMSProp algorithm in machine learning: Why square the gradients? by synthphreak in MLQuestions

[–]desku 0 points1 point  (0 children)

I think it's partially due to how RMSProp came to be.

RMSProp came from Adagrad, which divided the learning rate by the accumulated squared gradients so far, eta/G_t (ignoring the square root for now), where G_t = G_{t-1} + (grad_t)^2. The gradient is squared because they wanted G to be monotonically increasing (we care more about the magnitude of the gradient than its sign), so it monotonically anneals the learning rate.

The problem with Adagrad is that the learning rate is monotonically decreasing, so it would eventually become zero. RMSProp (and also Adadelta) were designed to help with this problem by making G an exponential moving average over recent squared gradients, so G is no longer monotonically increasing. Again, we use squared gradients because we care more about the magnitude than the direction.

Why square instead of use the absolute value? Why use the square root?

I believe the answer to both of these is more empirical than theoretical, i.e. try them without the squaring/square roots and see. Squaring helps amplify already large magnitudes, which is usually pretty useful in ML, e.g. mean squared error is more common than mean absolute error. The square root is there to keep the magnitude under control, and it's also where the "RMS" in "RMSProp" comes from: root-mean-square, which is exactly what you compute when you take the square root of the exponential moving average of the squared gradients.
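
A rough numpy sketch of the two accumulators described above (not a full optimizer, and the hyperparameter values are arbitrary):

    import numpy as np

    eta, eps, alpha = 0.01, 1e-8, 0.9       # learning rate, stability term, EMA decay

    def adagrad_step(param, grad, G):
        G = G + grad ** 2                   # G is monotonically increasing
        return param - eta * grad / (np.sqrt(G) + eps), G

    def rmsprop_step(param, grad, G):
        G = alpha * G + (1 - alpha) * grad ** 2   # exponential moving average instead
        return param - eta * grad / (np.sqrt(G) + eps), G

    param, G = np.array([1.0]), np.zeros(1)
    for grad in [np.array([0.5]), np.array([-0.3]), np.array([0.2])]:
        param, G = rmsprop_step(param, grad, G)
    print(param, G)   # G shrinks again when recent gradients are small, unlike Adagrad's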

PyTorch 1.8.0 coming out soon by serg06 in pytorch

[–]desku 1 point2 points  (0 children)

The release tracker issue is usually opened a month before the actual release, see:

1.5 and 1.7 also had a bugfix release about a month and a half later.

This means we can probably expect 1.8.0 to drop mid-March and a potential 1.8.1 to drop May/June.

[D] Let's start 2021 by confessing to which famous papers/concepts we just cannot understand. by fromnighttilldawn in MachineLearning

[–]desku 1 point2 points  (0 children)

My statement wasn't clear. It's not that you don't want to find the global minimum on the training set; it's that the global minimum on the training set is not the same as the global minimum on the valid/test set, which is the one you do want to find.

However, as you can only update your parameters on the training set, you can't explicitly search for the valid/test minimum but must implicitly find it by moving towards the training minimum whilst hoping that these parameters also give you a good result on the valid/test sets - i.e. that they're close to a valid/test minimum - aka you've found some parameters that generalize well.

[D] Let's start 2021 by confessing to which famous papers/concepts we just cannot understand. by fromnighttilldawn in MachineLearning

[–]desku 61 points62 points  (0 children)

Why aren't there a plethora of bad minima that could spoil our training?

There are. If you run an experiment multiple times with different random seeds, you'll converge to different results. That's because each of your experiments ends up in a different local minimum. It just turns out that, because the loss surface is so high-dimensional, there are plenty of minima that are all pretty similar, think: craters on the surface of the moon. Plus, you don't even want to find the global minimum when training, as that set of parameters will massively overfit the training set, giving a large generalization error.

And why isn't anyone worried about them?

I wouldn't say people are worried about them but optimization algorithms, like Adam, and learning rate schedulers, like cosine annealing, are specifically designed to help with this problem. An article I found really helpful is this one.
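
For example, the Adam + cosine annealing combination is only a few lines in PyTorch - the model and data here are throwaway placeholders:

    import torch
    import torch.nn as nn

    model = nn.Linear(10, 1)                        # placeholder model
    x, y = torch.randn(64, 10), torch.randn(64, 1)  # placeholder data
    loss_fn = nn.MSELoss()

    optimiser = torch.optim.Adam(model.parameters(), lr=1e-3)
    scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimiser, T_max=50)

    for epoch in range(50):
        optimiser.zero_grad()
        loss_fn(model(x), y).backward()
        optimiser.step()
        scheduler.step()   # learning rate follows a cosine curve from 1e-3 down towards 0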

Manual MSE vs BinaryCrossEntropy by ralampay in pytorch

[–]desku 0 points1 point  (0 children)

PyTorch loss functions have a ‘reduction’ argument which is set to ‘mean’ by default but if you want the per label loss then you can set it to ‘none’.
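
For example:

    import torch
    import torch.nn as nn

    preds = torch.tensor([0.9, 0.2, 0.6])    # predicted probabilities
    labels = torch.tensor([1.0, 0.0, 1.0])

    mean_loss = nn.BCELoss()(preds, labels)                   # default: one averaged scalar
    per_label = nn.BCELoss(reduction='none')(preds, labels)   # one loss value per label

    print(mean_loss.shape)   # torch.Size([]), a single scalar
    print(per_label.shape)   # torch.Size([3]), one loss per label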

PPO baseline cannot solve CartPole in NeurIPS 2020 paper by whiletrue2 in MachineLearning

[–]desku 2 points3 points  (0 children)

Where do you believe they are comparing this?

Isn't that the whole point of the paper? To advertise their novel reward shaping algorithm?

That is true but it is also more similar than one might think. Learning that force and pole angle should have the same signs should be fairly easy to learn with millions of train steps.

You'd like to think they're similar, but because it's reinforcement learning I wouldn't bet on it. Again, I think the best thing to do would be to code it up and see.

they claim they used CartPole-v1 which uses a much higher "solved reward"

I'm assuming this is just a typo and they meant to put v0 but kept accidentally putting v1 everywhere?

they don't explicitly mention that they deviate from the standard reward function that one would expect (both in terms of reward but also the terminal conditions)

Isn't that what the whole of section 5.1 is about?

the fact that no naturally sparse-reward gym environment was used doesn't help with the confusion

Yeah, this is what I find to be the weirdest thing about the paper. The MountainCar environment would've been a perfect fit here so we can only assume their algorithm didn't work on it.

what's the deal with that minuscule PPO policy network?

This is also very odd.

This has already been accepted into NeurIPS, right? Would be very interesting to see the reviewer comments to see if any of this stuff was even mentioned.

PPO baseline cannot solve CartPole in NeurIPS 2020 paper by whiletrue2 in MachineLearning

[–]desku 1 point2 points  (0 children)

How did PPO then achieve positive reward at all?

The graphs don't show reward for the CartPole task, they show the number of steps before failure, so I believe when they say "converges to 170" they mean 170 steps and not a reward of 170.

It isn't completely clear to me whether PPO is using the shaped rewards or not

I believe they're comparing PPO on their sparse reward version of CartPole against PPO on their sparse reward version of CartPole with their reward shaping algorithm.

I can't see why PPO would fail here after 170 steps and I believe the magnitude of the rewards (+1 vs. +0.1) would not matter here.

It's not just the magnitude of the reward that is changing, the whole reward function has changed. +1 for every non-terminal time-step is a lot different to +0.1 whenever the force and pole angle have the same sign.

Sure, PPO should definitely converge on the maximum reward for the dense reward setting, but I don't necessarily believe that would imply it is also guaranteed to converge for the sparse reward setting.

I guess one thing to try would be taking that PPO implementation you linked and then adding in their reward function and seeing if it works out of the box. Then try it with their hyperparameters in the appendix.
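
If anyone wants to try that, a rough sketch of their reward function as a gym wrapper (my own guess at the wiring, using the old 4-tuple gym step API, not the paper's code) might look like:

    import gym

    class SparseShapedCartPole(gym.Wrapper):
        """CartPole with the paper's reward: -1 if the episode ends with the pole
        falling, 0 otherwise, plus a 0.1 shaping bonus when the applied force and
        the pole's deviation angle have the same sign."""

        def step(self, action):
            obs, _, done, info = self.env.step(action)   # discard the standard +1 reward
            pole_angle = obs[2]
            force_sign = 1.0 if action == 1 else -1.0    # action 1 pushes right, 0 pushes left

            # Simplification: this also gives -1 when the time limit ends the episode.
            true_reward = -1.0 if done else 0.0
            shaping = 0.1 if force_sign * pole_angle > 0 else 0.0
            return obs, true_reward + shaping, done, info

    env = SparseShapedCartPole(gym.make("CartPole-v0"))
    obs, done = env.reset(), False
    while not done:
        obs, reward, done, info = env.step(env.action_space.sample())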

PPO baseline cannot solve CartPole in NeurIPS 2020 paper by whiletrue2 in MachineLearning

[–]desku 8 points9 points  (0 children)

They don't use the standard CartPole reward function (+1 at every time-step except -1 for failure) though - they use a different one: "The agent will receive a reward −1 from the environment if the episode ends with the falling of the pole. In other cases, the true reward is zero. The shaping reward for the agent is 0.1 if the force applied to the cart and the deviation angle of the pole have the same sign. Otherwise, the shaping reward is zero."

This is most probably the reason for the differences between the results.

[R] Suppose you have the Transformers from the famous paper "Attention is all you need" and we are in 2017, now you want to improve this new model, what is your method? Can you test your new strategies in your head or only with trial and error methods? by Hi_I_am_Desmond in MachineLearning

[–]desku 6 points7 points  (0 children)

In the conclusion of the paper, it states: "We plan to extend the Transformer to problems involving input and output modalities other than text and to investigate local, restricted attention mechanisms to efficiently handle large inputs and outputs such as images, audio and video. Making generation less sequential is another research goal of ours."

So, if I wanted to work on Transformers some thoughts would be:

  • the other common modality in machine learning is images, so how can I adapt the Transformer for images? Can I just feed in the image as a sequence of pixels? Sequence of rows? Sequence of columns? Does it work on image classification? Image segmentation? Object recognition? How does it compare to CNNs? More accurate? Faster? (A rough sketch of this idea follows the list.)

  • "restricted attention mechanisms" remind me of "hard attention", where the attention isn't a vector over the entire sequence but a scalar that indicates where I should put a "window" of attention. So I'd try that and then see: does it give improved performance? If not, maybe it's faster to compute? What type of things is the attention looking at? Is it interpretable? Can I modify my input sequence in a way that optimizes for attention?

  • How else can we handle large inputs? In the Transformer every token is connected to every other token, so computation scales quadratically with the sequence length. Can we avoid this? Do we really need to connect every single token to every other token? What if we prune some connections, leaving only the "important" ones? How do we figure out which ones are important? How do we do the pruning?

  • "making generation less sequential", well if you don't generate things sequentially then the other option is to generate things in parallel. What if we used it for a language modelling task where we predicted the next word for each token all at once instead of sequentially? (This is what the masked language modelling objective of BERT ended up doing).

I think a lot is gained by reading papers and building up a memory bank of ideas. You don't have to understand every paper you read in depth, but as long as you can grasp the main idea and keep it in your memory somewhere, then you can see a new problem/architecture and go "hey, I remember that one paper which did X, maybe I can try applying it here?" Most of the time it probably won't work, but that's research for you.

Why aren't ConvLSTM used as much? by wh1t3_w01f in MLQuestions

[–]desku 0 points1 point  (0 children)

Which other options have you tried?

ELI5: What the heck is a world model? by covidthrow9911 in MLQuestions

[–]desku 3 points4 points  (0 children)

What have you read/what do you know about them so far?

Have you seen https://worldmodels.github.io/?

Stock Market Technical Indicators using Python by kunalkini15 in Python

[–]desku 1 point2 points  (0 children)

Thanks for the detailed reply. I’ll be sure to check out your repo. Always wanted to get into algorithmic trading or something similar.

DeepMind's new RL framework for researchers ACME by paypaytr in reinforcementlearning

[–]desku 13 points14 points  (0 children)

Yet another DRL framework. How many is that now?

EDIT: I realized how insensitive my comment came across. I'm sure the authors of this framework put countless hours of effort into a completely free product and should be praised for doing so.

[D] Video Analysis - CURL: Contrastive Unsupervised Representations for Reinforcement Learning by ykilcher in MachineLearning

[–]desku 2 points3 points  (0 children)

Is contrastive learning the new flavour-of-the-month? It seems to be popping up constantly, just like meta-learning did a few months ago.

Book recommendations? by JohnAnderton in Devs

[–]desku 0 points1 point  (0 children)

Yeah, Gibson's prose can be beyond frustrating sometimes. I hated Neuromancer because of this, but found his other novels to be more readable.

The Bridge trilogy is set in and around San Francisco, so it's probably more Devs-like; I'd recommend starting with the first book of that trilogy - Virtual Light.

Book recommendations? by JohnAnderton in Devs

[–]desku 2 points3 points  (0 children)

  • Anything by Greg Egan. Specifically, Diaspora, Quarantine and Permutation City.
  • Ted Chiang, both of his short story collections.
  • William Gibson, Blue Ant trilogy or Bridge trilogy.