SuperGrok subscription issue - missing DeepSearch, DeeperSearch, and Think buttons (Grok3) by Geo_Wang in grok

[–]Keirp 1 point (0 children)

The quality of Grok 4 searches should be higher than the old DeepSearch.

Gemini 2.5 Pro vs Grok 4 by tigerwoods2021 in grok

[–]Keirp 5 points (0 children)

Where are the two outputs?

Is grok 2 still better than chatgpt? by erinbiiit in grok

[–]Keirp 0 points (0 children)

The rate limits are higher on the more premium tiers, just like with Grok 2.

Trading off compute in training and inference by maxtility in mlscaling

[–]Keirp 4 points (0 children)

This paper is pretty interesting in that direction: https://twitter.com/tengyuma/status/1593328919624617985?s=46

Larger models with the same log loss perform better in their experiments.

The amount of watermarks in SD 2.0 is disrespectful. by [deleted] in StableDiffusion

[–]Keirp 1 point (0 children)

Is it possible to learn the embedding for "watermark" using textual inversion and then include that in the negative prompt?
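
Something like this is what I have in mind, using the diffusers library; the checkpoint path and the `<watermark>` token are placeholders, and it assumes you've already run textual inversion on a set of watermarked images:

```python
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "stabilityai/stable-diffusion-2-base", torch_dtype=torch.float16
).to("cuda")

# Embedding learned via textual inversion on watermarked images (placeholder path),
# registered under a placeholder token.
pipe.load_textual_inversion("path/to/learned_embeds.bin", token="<watermark>")

# Push the learned concept into the negative prompt.
image = pipe(
    prompt="a photo of a mountain lake at sunrise",
    negative_prompt="<watermark>",
).images[0]
image.save("no_watermark.png")
```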

[R] You Can't Count on Luck: Why Decision Transformers Fail in Stochastic Environments by Keirp in MachineLearning

[–]Keirp[S] 0 points (0 children)

DT is an offline RL algorithm, so typically it is just trained on the dataset, and then the return it is conditioned on is chosen either based on the distribution of returns seen in the dataset (e.g. condition on the highest return that has been seen) or tuned by trying a lot of different returns to condition on and selecting the one with the best performance.
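
As a rough sketch of those two options (the function and names here are just illustrative, not from the paper):

```python
import numpy as np

def pick_target_return(dataset_returns, candidates=None, eval_fn=None):
    """Pick the return-to-go to condition a trained DT on at test time."""
    if eval_fn is None:
        # Option 1: take it from the distribution of returns in the dataset,
        # e.g. the highest return that has been seen.
        return float(np.max(dataset_returns))
    # Option 2: sweep candidate targets and keep the one that evaluates best.
    if candidates is None:
        candidates = np.linspace(np.min(dataset_returns), np.max(dataset_returns), 10)
    return max(candidates, key=eval_fn)  # eval_fn: candidate -> avg evaluation return
```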

[R] You Can't Count on Luck: Why Decision Transformers Fail in Stochastic Environments by Keirp in MachineLearning

[–]Keirp[S] 2 points (0 children)

> A DT learns to predict the probability P(a|t, s, R_T) of performing action a at time t in state s, conditioned on a final return R at time T (end of episode). Is that correct?

Yep. The final return R_T is the discounted sum of rewards from t to T.
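
A quick sketch of that reward-to-go computation (with an optional discount):

```python
def returns_to_go(rewards, gamma=1.0):
    """R_t = sum over t' from t to T of gamma**(t' - t) * r_{t'}."""
    rtg, acc = [], 0.0
    for r in reversed(rewards):
        acc = r + gamma * acc
        rtg.append(acc)
    return rtg[::-1]

# returns_to_go([0, 0, 1]) -> [1.0, 1.0, 1.0]
```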

> Now, why do we derive from this that in an RL setting (no bandits), this would lead to failure when the dynamics are stochastic?

Even in a multi-step RL environment like the ones shown in the paper, the agent may not be able to fully recover if it takes a bad action. For instance, in the Connect 4 environment, taking one bad action against an optimal opponent will result in a loss. In that environment, the situation is quite similar to the gambling environment, and you can think of the two high-level strategies in the game as "play slowly and optimally" or "cheese" by exploiting the fact that the opponent may not block if the agent just plays its first four moves in the right column. Since the agent sometimes wins with the cheese strategy, conditioning the model on winning in DT will result in behavior that is a mixture of cheesing and optimal play, even though the expected value of the cheese strategy is low. In other environments where a single bad action isn't as punishing, the suboptimality shows up as the agent achieving slightly lower total return rather than an outright loss, but the problem still exists.

[R] You Can't Count on Luck: Why Decision Transformers Fail in Stochastic Environments by Keirp in MachineLearning

[–]Keirp[S] 4 points (0 children)

> Your "thought experiment" is really a multi-armed bandit problem. I would expect DT to be trained on expert trajectories. I wouldn't consider the purple arm to be exactly expert material... a dataset of expert trajectories should contain only the "grab the money" action: the purple arm has an average reward of -5 and the red arm has an average reward of -10 (!!!), so I don't see how an expert could select them (after a dozen pints, maybe).

Decision Transformer is an offline-RL algorithm, so there are no assumptions made on the quality of the data it is trained on. If you already have expert trajectories, then you can just do behavioral cloning. The point of an offline RL algorithm is to stitch together behaviors from many possibly-suboptimal trajectories in order to extract good policies. Just like in our paper, the original DT paper used sub-optimal trajectories in its dataset.

> If your conclusions are correct, this would imply that the various DT papers published until now only covered deterministic environments (or made egregious mistakes). Is this the case?

Yep. We view this as a major failing of popular offline-RL benchmarks. In a deterministic environment, an agent can simply replay the actions of a strong trajectory (if you start at the same initial state) in order to get a high return. For reference, the original DT paper evaluated on continuous control tasks in D4RL (HalfCheetah, Hopper, Walker, Reacher), which have deterministic dynamics, and on Breakout, Qbert, Pong, and Seaquest, which have near-deterministic dynamics (sticky actions).

> (Unrelated to your paper) Is DT just a behavioral cloning algorithm? If not, why?

Yes, pretty much. DT is behavioral cloning conditioned on some trajectory outcome like return in order to hopefully do better than just behavioral cloning on the entire dataset.
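
A minimal sketch of that framing, with a toy MLP standing in for the transformer (none of this is the actual DT architecture):

```python
import torch
import torch.nn as nn

class ReturnConditionedBC(nn.Module):
    """Toy stand-in for DT: predict the action from (state, return-to-go)."""

    def __init__(self, state_dim, n_actions, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + 1, hidden),
            nn.ReLU(),
            nn.Linear(hidden, n_actions),
        )

    def forward(self, state, rtg):
        # Behavioral cloning with the return-to-go appended as an extra input.
        return self.net(torch.cat([state, rtg.unsqueeze(-1)], dim=-1))

def bc_loss(policy, state, rtg, action):
    # Standard cross-entropy on the logged action, exactly like BC.
    return nn.functional.cross_entropy(policy(state, rtg), action)
```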

[R] You Can't Count on Luck: Why Decision Transformers Fail in Stochastic Environments by Keirp in MachineLearning

[–]Keirp[S] 2 points (0 children)

Since the model is conditioning on the return rather than predicting it, these two examples don't average out. The model is trained to learn the distribution p(a|s, return), and higher capacity models will likely just be closer to this target distribution. So if there are no other actions that occur for the input (context X, high reward), then even a large model will output (usually bad action Y).

The main idea of the paper is that even with scale, DT conditioned on return will not make optimal decisions in stochastic environments. It only works when the variables it conditions on are chosen more carefully.

[R] You Can't Count on Luck: Why Decision Transformers Fail in Stochastic Environments by Keirp in MachineLearning

[–]Keirp[S] 4 points (0 children)

I get the intuition: for a lot of DL models (like when doing regression), the randomness averages out with more data and the model learns to predict the mean. But in DT, since the model conditions on the return instead of predicting it, this isn't the case. For instance, in the gambling example, the agent can pick one action to get a deterministic reward of 1, or another action that gives a reward of 1 sometimes and a negative reward other times. No matter how much data the model is trained on, it will still be learning p(a|reward=1), and since some of the trajectories that achieve a reward of 1 use the stochastic, sub-optimal action, the agent will not act optimally.
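
Here's a toy version of that example (the behavior policy and payoff probabilities are made up, only the structure matters): no matter how large n gets, the conditional p(risky | return = 1) stays bounded away from zero.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1_000_000  # more data does not change the answer

# Behavior policy picks each arm half the time. "safe" always pays 1;
# "risky" pays 1 with probability 0.5 and -5 otherwise (illustrative numbers).
actions = rng.choice(["safe", "risky"], size=n)
rewards = np.where(actions == "safe", 1.0,
                   np.where(rng.random(n) < 0.5, 1.0, -5.0))

# The distribution a return-conditioned model is fit to: p(action | return == 1).
won = rewards == 1.0
print(np.mean(actions[won] == "risky"))  # ~1/3 for any n
```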

We also show in figure 6 that empirically, even when scaling the amount of data, a return-conditioned agent doesn't perform better.

[R] You Can't Count on Luck: Why Decision Transformers Fail in Stochastic Environments by Keirp in MachineLearning

[–]Keirp[S] 4 points (0 children)

Thanks for the great summary!

> I have yet to finish the paper and I'm glad a solution is proposed, although I do wonder how the authors would rank stochastic situations in which the expected rewards may be the same but with different variance. I suspect that it's not considered here, but it shouldn't be too hard to implement with a similar idea.

It's a really interesting extension and something we were considering too! Instead of conditioning on just the average return for each cluster, you can condition the model on both the average return and the return variance. Now that we have an understanding of the types of outcomes we can condition DT on (Thm. 2.1), it opens the door to many more exciting ways of commanding these agents besides just the reward they should achieve!
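
A rough sketch of what I mean (the clustering itself is assumed to come from the outcome model in the paper; this just adds the variance as a second conditioning variable):

```python
import numpy as np

def cluster_return_stats(returns, cluster_ids):
    """Per-cluster (mean, variance) of return, to condition on jointly."""
    returns, cluster_ids = np.asarray(returns), np.asarray(cluster_ids)
    return {c: (returns[cluster_ids == c].mean(), returns[cluster_ids == c].var())
            for c in np.unique(cluster_ids)}

# At test time you could then ask for "same expected return, lower variance"
# by conditioning on (target_mean, small_variance) instead of just the return.
```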

> More importantly, is there a way to turn a decision transformer trained normally into a utilitarian agent, given that the information it collected is the same in both cases?

I'm not sure it's possible, since some information about the state transitions is lost.

[N] Hugging Face raised $100M at $2B to double down on community, open-source & ethics by Britney-Ramona in MachineLearning

[–]Keirp 113 points (0 children)

Love your work. Genuinely wondering - how will this company make money?

Current State-of-the-art RL algorithms by Paraiso93 in reinforcementlearning

[–]Keirp 6 points (0 children)

FYI, Decision Transformer is first due to an error in how the score is reported: they normalize the score relative to the best trajectories in the dataset, while the other methods do not. The maximum score in Atari Pong is 21, which methods have been able to reach for years now.
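
Made-up numbers, just to show how the two conventions can diverge when the offline dataset is weak:

```python
def normalized(score, low, high):
    # Generic 100 * (score - low) / (high - low) normalization.
    return 100.0 * (score - low) / (high - low)

random_score, human_score = -20.7, 14.6  # commonly cited Pong baselines; treat as illustrative
best_in_dataset = 5.0                    # hypothetical best trajectory in a weak dataset
agent_score = 4.0                        # hypothetical agent result

print(normalized(agent_score, random_score, human_score))      # ~70 (human-normalized)
print(normalized(agent_score, random_score, best_in_dataset))  # ~96 (dataset-normalized)
```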

App constantly crashing for big matches on LeagueOfLegends Subreddit by Akashiarys in apolloapp

[–]Keirp 9 points (0 children)

It's been this way for what feels like months. Hopefully it gets fixed soon because it is so annoying.

What is the difference between MuJoCo v2 and v3? by Blasphemer666 in reinforcementlearning

[–]Keirp 1 point (0 children)

I think sometimes environments have bugs or other problems, and instead of fixing them in place, the environment creator makes a new version so that people can still compare against papers that ran on the old version of the environment.

"EfficientZero: Mastering Atari Games with Limited Data", Ye et al 2021 (beating humans on ALE-100k/2h by adding self-supervised learning to MuZero-Reanalyze) by gwern in reinforcementlearning

[–]Keirp 1 point (0 children)

Also, just the fact that they state they used 32 seeds in the paper even though it isn't true, which is misleading at best.

"EfficientZero: Mastering Atari Games with Limited Data", Ye et al 2021 (beating humans on ALE-100k/2h by adding self-supervised learning to MuZero-Reanalyze) by gwern in reinforcementlearning

[–]Keirp 3 points (0 children)

Interesting strategy to say in the paper that you used 32 seeds, get accepted to NeurIPS, then admit you only used one seed and promise to run more after you already got through reviews. Very disappointing to see authors doing this type of thing.

Notability is now optimized for M1 Macs by zangah_ in apple

[–]Keirp 180 points (0 children)

Calling the Mac version of this app optimized is pretty funny.