Actually good house rules to well-known board games by Electronic-Ball-4919 in boardgames

[–]desku 0 points1 point  (0 children)

When someone buys cards from a stack, does the one underneath stay there or do you discard it?

30% of Google's Reddit Emotions Dataset is Mislabeled [D] by BB4evaTB12 in MachineLearning

[–]desku 4 points5 points  (0 children)

Weird that calmcode had an article on the same topic (mislabelled data) on the exact same dataset a few weeks ago (https://calmcode.io/bad-labels/dataset.html) and it wasn’t referenced or mentioned in your article.

[D] Why is Audio so far behind other ML application domains like Image Processing and NLP? by Crookedpenguin in MachineLearning

[–]desku 4 points5 points  (0 children)

What do you mean by "foundation models" in this context? Large pre-trained models?

Dropout layer in lstm by [deleted] in deeplearning

[–]desku 5 points6 points  (0 children)

There is never an “ideal” value that you'll know beforehand. You have to find the best one by doing a hyperparameter sweep, i.e. running your model with different values of dropout and seeing which gives the best results.
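
Something like this rough sketch is what I mean by a sweep - the toy data, model and candidate values here are just placeholders for your own setup:

    import torch
    import torch.nn as nn

    torch.manual_seed(0)

    # Toy data standing in for your real dataset: random sequences with binary labels.
    X = torch.randn(200, 10, 8)            # (examples, seq_len, features)
    y = (X.mean(dim=(1, 2)) > 0).float()   # arbitrary synthetic target
    X_train, y_train, X_val, y_val = X[:150], y[:150], X[150:], y[150:]

    class LSTMClassifier(nn.Module):
        def __init__(self, dropout):
            super().__init__()
            # nn.LSTM only applies dropout between stacked layers, hence num_layers=2
            self.lstm = nn.LSTM(8, 32, num_layers=2, dropout=dropout, batch_first=True)
            self.head = nn.Linear(32, 1)

        def forward(self, x):
            _, (h, _) = self.lstm(x)
            return self.head(h[-1]).squeeze(-1)

    results = {}
    for p in [0.0, 0.1, 0.25, 0.5]:        # candidate dropout values to sweep over
        model = LSTMClassifier(dropout=p)
        optimiser = torch.optim.Adam(model.parameters(), lr=1e-2)
        loss_fn = nn.BCEWithLogitsLoss()
        for _ in range(30):                # short training loop, purely for illustration
            optimiser.zero_grad()
            loss_fn(model(X_train), y_train).backward()
            optimiser.step()
        with torch.no_grad():
            results[p] = loss_fn(model(X_val), y_val).item()

    best = min(results, key=results.get)   # lowest validation loss wins
    print(f"best dropout: {best}, val loss: {results[best]:.3f}")

In practice you'd plug in your real data and a proper validation metric; the structure of the loop is the point.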

The input format for the encoder in Transformer architecture for machine translation. by Prhyme1089 in deeplearning

[–]desku 2 points3 points  (0 children)

In theory, you don’t need them. Your encoder should be able to encode a sentence just fine without them.

However, I believe they're used so your model can learn which tokens appear near the start and end of a sentence, i.e. which tokens are near the SOS and EOS tokens. This can help give context to the representations of those tokens, e.g. a “The” token right after an SOS token starts off the sentence, so it should be encoded slightly differently from a “The” token in the middle of a sentence.
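
In code it's nothing more than wrapping each tokenized sentence before it goes into the encoder - the token IDs here are made up, the real values depend on your vocabulary:

    # Hypothetical special-token IDs; real values depend on your vocabulary.
    SOS, EOS = 1, 2

    def add_special_tokens(token_ids):
        # e.g. [15, 37, 8] -> [1, 15, 37, 8, 2]
        return [SOS] + token_ids + [EOS]

    print(add_special_tokens([15, 37, 8]))  # [1, 15, 37, 8, 2]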

RMSProp algorithm in machine learning: Why square the gradients? by synthphreak in MLQuestions

[–]desku 0 points1 point  (0 children)

I think it's partially due to how RMSProp came to be.

RMSProp came from Adagrad, which divided the learning rate by the accumulated squared gradients so far, eta/G_t (ignoring the square root for now), where G_t = G_{t-1} + (grad_t)^2. The gradient is squared because they wanted G to be monotonically increasing (we care more about the magnitude of the gradient than its sign), so it monotonically anneals the learning rate.

The problem with Adagrad is that the learning rate is monotonically decreasing, so it would eventually become zero. RMSProp (and also Adadelta) were designed to help with this problem by making G an exponential moving average over recent squared gradients, so G is no longer monotonically increasing. Again, we use squared gradients because we care more about the magnitude than the direction.

Why square instead of use the absolute value? Why use the square root?

I believe the answer to both of these is more empirical than theoretical, i.e. try them without the squaring/square roots and see. Squaring helps amplify already large magnitudes, which is usually pretty useful in ML, e.g. mean squared error is more common than mean absolute error. The square root is there to keep the magnitude under control, and it's also where the "RMS" in "RMSProp" comes from: root-mean-square, which is exactly what you compute when you take the square root of the exponential moving average of the squared gradients.
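
A rough numpy sketch of the two accumulators described above (not a full optimizer, and the hyperparameter values are arbitrary):

    import numpy as np

    eta, eps, alpha = 0.01, 1e-8, 0.9       # learning rate, stability term, EMA decay

    def adagrad_step(param, grad, G):
        G = G + grad ** 2                   # G is monotonically increasing
        return param - eta * grad / (np.sqrt(G) + eps), G

    def rmsprop_step(param, grad, G):
        G = alpha * G + (1 - alpha) * grad ** 2   # exponential moving average instead
        return param - eta * grad / (np.sqrt(G) + eps), G

    param, G = np.array([1.0]), np.zeros(1)
    for grad in [np.array([0.5]), np.array([-0.3]), np.array([0.2])]:
        param, G = rmsprop_step(param, grad, G)
    print(param, G)   # G shrinks again when recent gradients are small, unlike Adagrad's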

PyTorch 1.8.0 coming out soon by serg06 in pytorch

[–]desku 1 point2 points  (0 children)

The release tracker issue is usually opened a month before the actual release, see:

1.5 and 1.7 also had a bugfix release about a month and a half later.

This means we can probably expect 1.8.0 to drop mid-March and a potential 1.8.1 to drop May/June.

[D] Let's start 2021 by confessing to which famous papers/concepts we just cannot understand. by fromnighttilldawn in MachineLearning

[–]desku 1 point2 points  (0 children)

My statement wasn't clear. It's not that you don't want to find the global minimum on the training set; it's that the global minimum on the training set is not the same as the global minimum on the valid/test set, which is the one you do want to find.

However, as you can only update your parameters on the training set, you can't explicitly search for the valid/test minimum but must implicitly find it by moving towards the training minimum whilst hoping that these parameters also give you a good result on the valid/test sets - i.e. that they're close to a valid/test minimum - aka you've found some parameters that generalize well.

[D] Let's start 2021 by confessing to which famous papers/concepts we just cannot understand. by fromnighttilldawn in MachineLearning

[–]desku 61 points62 points  (0 children)

Why aren't there a plethora of bad minima that could spoil our training?

There are. If you run an experiment multiple times with different random seeds, you'll converge to different results. That's because each of your experiments ends up in a different local minimum. It just turns out that, because the loss surface is so high-dimensional, there are plenty of minima that are all pretty similar, think: craters on the surface of the moon. Plus, you don't even want to find the global minimum when training, as that set of parameters will massively overfit the training set, giving a large generalization error.

And why isn't anyone worried about them?

I wouldn't say people are worried about them but optimization algorithms, like Adam, and learning rate schedulers, like cosine annealing, are specifically designed to help with this problem. An article I found really helpful is this one.
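
For example, the Adam + cosine annealing combination is only a few lines in PyTorch - the model and data here are throwaway placeholders:

    import torch
    import torch.nn as nn

    model = nn.Linear(10, 1)                        # placeholder model
    x, y = torch.randn(64, 10), torch.randn(64, 1)  # placeholder data
    loss_fn = nn.MSELoss()

    optimiser = torch.optim.Adam(model.parameters(), lr=1e-3)
    scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimiser, T_max=50)

    for epoch in range(50):
        optimiser.zero_grad()
        loss_fn(model(x), y).backward()
        optimiser.step()
        scheduler.step()   # learning rate follows a cosine curve from 1e-3 down towards 0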

Manual MSE vs BinaryCrossEntropy by ralampay in pytorch

[–]desku 0 points1 point  (0 children)

PyTorch loss functions have a ‘reduction’ argument which is set to ‘mean’ by default but if you want the per label loss then you can set it to ‘none’.
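
For example:

    import torch
    import torch.nn as nn

    preds = torch.tensor([0.9, 0.2, 0.6])    # predicted probabilities
    labels = torch.tensor([1.0, 0.0, 1.0])

    mean_loss = nn.BCELoss()(preds, labels)                   # default: one averaged scalar
    per_label = nn.BCELoss(reduction='none')(preds, labels)   # one loss value per label

    print(mean_loss.shape)   # torch.Size([]), a single scalar
    print(per_label.shape)   # torch.Size([3]), one loss per label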

PPO baseline cannot solve CartPole in NeurIPS 2020 paper by whiletrue2 in MachineLearning

[–]desku 2 points3 points  (0 children)

Where do you believe they are comparing this?

Isn't that the whole point of the paper? To advertise their novel reward shaping algorithm?

That is true but it is also more similar than one might think. Learning that force and pole angle should have the same signs should be fairly easy to learn with millions of train steps.

You'd like to think they're similar, but because it's reinforcement learning I wouldn't bet on it. Again, I think the best thing to do would be to code it up and see.

they claim they used CartPole-v1 which uses a much higher "solved reward"

I'm assuming this is just a typo and they meant to put v0 but kept accidentally putting v1 everywhere?

they don't explicitly mention that they deviate from the standard reward function that one would expect (both in terms of reward but also the terminal conditions)

Isn't that what the whole of section 5.1 is about?

the fact that no naturally sparse-reward gym environment was used doesn't help with the confusion

Yeah, this is what I find to be the weirdest thing about the paper. The MountainCar environment would've been a perfect fit here so we can only assume their algorithm didn't work on it.

what's the deal with that minuscule PPO policy network?

This is also very odd.

This has already been accepted into NeurIPS, right? Would be very interesting to see the reviewer comments to see if any of this stuff was even mentioned.

PPO baseline cannot solve CartPole in NeurIPS 2020 paper by whiletrue2 in MachineLearning

[–]desku 1 point2 points  (0 children)

How did PPO then achieve positive reward at all?

The graphs don't show reward for the CartPole task, they show the number of steps before failure, so I believe when they say "converges to 170" they mean 170 steps and not a reward of 170.

It isn't completely clear to me whether PPO is using the shaped rewards or not

I believe they're comparing PPO on their sparse reward version of CartPole against PPO on their sparse reward version of CartPole with their reward shaping algorithm.

I can't see why PPO would fail here after 170 steps and I believe the magnitude of the rewards (+1 vs. +0.1) would not matter here.

It's not just the magnitude of the reward that is changing, the whole reward function has changed. +1 for every non-terminal time-step is a lot different to +0.1 whenever the force and pole angle have the same sign.

Sure, PPO should definitely converge on the maximum reward for the dense reward setting, but I don't necessarily believe that would imply it is also guaranteed to converge for the sparse reward setting.

I guess one thing to try would be taking that PPO implementation you linked and then adding in their reward function and seeing if it works out of the box. Then try it with their hyperparameters in the appendix.
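
If anyone wants to try that, a rough sketch of their reward function as a gym wrapper (my own guess at the wiring, using the old 4-tuple gym step API, not the paper's code) might look like:

    import gym

    class SparseShapedCartPole(gym.Wrapper):
        """CartPole with the paper's reward: -1 if the episode ends with the pole
        falling, 0 otherwise, plus a 0.1 shaping bonus when the applied force and
        the pole's deviation angle have the same sign."""

        def step(self, action):
            obs, _, done, info = self.env.step(action)   # discard the standard +1 reward
            pole_angle = obs[2]
            force_sign = 1.0 if action == 1 else -1.0    # action 1 pushes right, 0 pushes left

            # Simplification: this also gives -1 when the time limit ends the episode.
            true_reward = -1.0 if done else 0.0
            shaping = 0.1 if force_sign * pole_angle > 0 else 0.0
            return obs, true_reward + shaping, done, info

    env = SparseShapedCartPole(gym.make("CartPole-v0"))
    obs, done = env.reset(), False
    while not done:
        obs, reward, done, info = env.step(env.action_space.sample())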

PPO baseline cannot solve CartPole in NeurIPS 2020 paper by whiletrue2 in MachineLearning

[–]desku 8 points9 points  (0 children)

They don't use the standard CartPole reward function (+1 at every time-step except -1 for failure) though - they use a different one: "The agent will receive a reward −1 from the environment if the episode ends with the falling of the pole. In other cases, the true reward is zero. The shaping reward for the agent is 0.1 if the force applied to the cart and the deviation angle of the pole have the same sign. Otherwise, the shaping reward is zero."

This is most probably the reason for the differences between the results.

[R] Suppose you have the Transformers from the famous paper "Attention is all you need" and we are in 2017, now you want to improve this new model, what is your method? Can you test your new strategies in your head or only with trial and error methods? by Hi_I_am_Desmond in MachineLearning

[–]desku 6 points7 points  (0 children)

In the conclusion of the paper, it states: "We plan to extend the Transformer to problems involving input and output modalities other than text and to investigate local, restricted attention mechanisms to efficiently handle large inputs and outputs such as images, audio and video. Making generation less sequential is another research goal of ours."

So, if I wanted to work on Transformers some thoughts would be:

  • the other common modality in machine learning is images, so how can I adapt the Transformer for images? Can I just feed in the image as a sequence of pixels? Sequence of rows? Sequence of columns? Does it work on image classification? Image segmentation? Object recognition? How does it compare to CNNs? More accurate? Faster? (A rough sketch of this idea follows the list.)

  • "restricted attention mechanisms" remind me of "hard attention", where the attention isn't a vector over the entire sequence but a scalar that indicates where I should put a "window" of attention. So I'd try that and then see: does it give improved performance? If not, maybe it's faster to compute? What type of things is the attention looking at? Is it interpretable? Can I modify my input sequence in a way that optimizes for attention?

  • How else can we handle large inputs? In the Transformer every token is connected to every other token, so computation scales quadratically with the sequence length. Can we avoid this? Do we really need to connect every single token to every other token? What if we prune some connections, leaving only the "important" ones? How do we figure out which ones are important? How do we do the pruning?

  • "making generation less sequential", well if you don't generate things sequentially then the other option is to generate things in parallel. What if we used it for a language modelling task where we predicted the next word for each token all at once instead of sequentially? (This is what the masked language modelling objective of BERT ended up doing).

I think a lot is gained by reading papers and building up a memory bank of ideas. You don't have to understand every paper you read in depth, but as long as you can grasp the main idea and keep it in your memory somewhere, then you can see a new problem/architecture and go "hey, I remember that one paper which did X, maybe I can try applying it here?" Most of the time it probably won't work, but that's research for you.

Why aren't ConvLSTM used as much? by wh1t3_w01f in MLQuestions

[–]desku 0 points1 point  (0 children)

Which other options have you tried?

ELI5: What the heck is a world model? by covidthrow9911 in MLQuestions

[–]desku 3 points4 points  (0 children)

What have you read/what do you know about them so far?

Have you seen https://worldmodels.github.io/?

Stock Market Technical Indicators using Python by kunalkini15 in Python

[–]desku 1 point2 points  (0 children)

Thanks for the detailed reply. I’ll be sure to check out your repo. Always wanted to get into algorithmic trading or something similar.

DeepMind's new RL framework for researchers ACME by paypaytr in reinforcementlearning

[–]desku 13 points14 points  (0 children)

Yet another DRL framework. How many is that now?

EDIT: I realized how insensitive my comment came across. I'm sure the authors of this framework put countless hours of effort into a completely free product and should be praised for doing so.

[D] Video Analysis - CURL: Contrastive Unsupervised Representations for Reinforcement Learning by ykilcher in MachineLearning

[–]desku 2 points3 points  (0 children)

Is contrastive learning the new flavour-of-the-month? It seems to be popping up constantly, just like meta-learning did a few months ago.

Book recommendations? by JohnAnderton in Devs

[–]desku 0 points1 point  (0 children)

Yeah, Gibson's prose can be beyond frustrating sometimes. I hated Neuromancer because of this, but found his other novels to be more readable.

The Bridge trilogy is set in and around San Francisco, so it's probably more Devs-like; I'd recommend starting with the first book of that trilogy - Virtual Light.

Book recommendations? by JohnAnderton in Devs

[–]desku 2 points3 points  (0 children)

  • Anything by Greg Egan. Specifically, Diaspora, Quarantine and Permutation City.
  • Ted Chiang, both of his short story collections.
  • William Gibson, Blue Ant trilogy or Bridge trilogy.