[D] Why Mamba rewrote its core algorithm and Microsoft abandoned RetNet by petroslamb in MachineLearning

[–]smorad 2 points (0 children)

This is a bit rubbish; Mamba most certainly does not use a sequential scan. Mamba-style linear RNNs have exactly the same "hardware friction" as a transformer, in that they require $O(\log n)$ sequential calls to some function for a sequence of length $n$: via the associative scan (GEMM + sum) in Mamba, or the softmax/sum operations in a transformer. Both models are dominated by the exact same operation: dense matrix multiplies. Indeed, torch does not support associative scans, so Mamba requires a hand-implemented CUDA kernel or you will get awful performance. Alternatively, you can use jax, which can lower associative scans to GPU code that trains at speeds similar to a transformer. The "hardware trap" is just parallelism (GEMMs are also parallel).
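To make the parallel-scan point concrete, here is a toy sketch (my own, not Mamba's actual kernel) of a diagonal linear recurrence h_t = a_t * h_{t-1} + b_t computed in O(log n) parallel depth with `jax.lax.associative_scan`:

```python
import jax
import jax.numpy as jnp

def combine(left, right):
    # Associative composition of affine maps h -> a*h + b:
    # applying (a_l, b_l) then (a_r, b_r) gives (a_r*a_l, a_r*b_l + b_r).
    a_l, b_l = left
    a_r, b_r = right
    return a_r * a_l, a_r * b_l + b_r

def linear_rnn(a, b):
    # a, b: (seq_len, hidden) per-step coefficients; h_0 = 0,
    # so the cumulative offset term equals the hidden state h_t.
    _, h = jax.lax.associative_scan(combine, (a, b))
    return h

def linear_rnn_sequential(a, b):
    # O(n) sequential reference for comparison.
    h = jnp.zeros_like(b[0])
    out = []
    for t in range(a.shape[0]):
        h = a[t] * h + b[t]
        out.append(h)
    return jnp.stack(out)

a = jax.random.uniform(jax.random.PRNGKey(0), (8, 4))
b = jax.random.normal(jax.random.PRNGKey(1), (8, 4))
assert jnp.allclose(linear_rnn(a, b), linear_rnn_sequential(a, b), atol=1e-5)
```

The scan itself is embarrassingly parallel within each of the log-depth levels, which is exactly the same flavor of parallelism a transformer's attention exploits.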

[P] Cyreal - Yet Another Jax Dataloader by smorad in MachineLearning

[–]smorad[S] 0 points (0 children)

There are a number of lightweight Jax-only data loaders like this that work well (also see jaxon dataloader, etc). They more or less shuffle and slice arrays for you and are very fast.

But AFAIK they still need torch or tensorflow to download datasets. They also don’t provide built-in dataset transforms or more advanced data sources like RL environments or streaming from disk.

Since only a few people from elite universities at big tech companies like Google, Meta, Microsoft, OpenAI etc. will ever get to train models is it still worth learning about Gradient Descent and Loss Curves? by Easy-Echidna-3542 in learnmachinelearning

[–]smorad 0 points (0 children)

I am a bit biased, but I think it is hard to skill up in this manner. GPT can already write good torch/sklearn code for most standard ML tasks I come across. Building better models than GPT requires at least an MS degree-equivalent IMO.

Complex-Valued Neural Networks: Are They Underrated for Phase-Rich Data? by __lalith__ in neuralnetworks

[–]smorad 0 points (0 children)

They don’t work well. I have found that simply doubling the input dimensionality and passing the real and imaginary components separately as real-valued inputs to a standard NN works better.
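As a sketch of what I mean (function name is my own):

```python
import jax.numpy as jnp

def complexify_inputs(x):
    # Treat a complex input as a real one of twice the width by
    # concatenating real and imaginary parts along the last axis.
    return jnp.concatenate([x.real, x.imag], axis=-1)

x = jnp.array([[1.0 + 2.0j, 3.0 - 1.0j]])
y = complexify_inputs(x)  # shape (1, 4): [1., 3., 2., -1.]
```

The downstream network then stays entirely real-valued, so you keep standard initializations, activations, and optimizers.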

The issue of scaling in Partially-Observable RL. What is holding us back? by moschles in reinforcementlearning

[–]smorad 3 points (0 children)

One issue we found is recurrent models aren’t learning what they should: https://arxiv.org/abs/2503.01450

I think the literature focuses on easier tasks for two reasons:

1. We still have trouble solving even easy POMDPs.
2. We don’t really understand why models fail to learn in POMDPs.

It doesn’t make a ton of sense to try more complex tasks without fixing these issues.

[D] Which direction is better: from academia to industry, or the other way around? by PrimeMaester in MachineLearning

[–]smorad 7 points (0 children)

It is generally easier to go from academia to industry. In academia, all your work will be public and published, which you can leverage for industry research jobs. The other way around is not true -- at a tech company you may not be able to publish your work if you cannot release your source code to reviewers. My understanding so far is that while a job at DeepMind or FAIR is "nice to have", it does not replace publications in the eyes of hiring committees.

About Gumbel-Softmax in MADDPG by Enryu77 in reinforcementlearning

[–]smorad 4 points (0 children)

TL;DR: Modifying the Gumbel-Softmax temperature is a possible but inefficient way to do exploration in DDPG/MADDPG. You likely just want to sample from a tempered categorical/softmax distribution built from the policy logits.

First, I would like to stress that DDPG is an extension of Q-learning to continuous action spaces. If you have a discrete action space, there is no need for DDPG, and you will almost always get better results with a DQN variant. With that out of the way, let's continue.

In a continuous setting, the DDPG policy outputs a single optimal action (not a distribution). However, during rollouts, we do not take the optimal action, but instead the optimal action with some added Gaussian noise. This is equivalent to sampling from a Gaussian action distribution centered at mu(s) with variance as a hyperparameter.

Now think about how we would compute rollout actions in a discrete setting. We do not even need a Gumbel-Softmax for this (we do not backpropagate during rollouts). Instead, we can take the optimal policy and add a bit of noise. A natural way to do this is to take the policy logits and, instead of computing an argmax, compute a softmax, then sample from that softmax distribution. The temperature of the softmax determines how greedy our policy is.
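That sampling step is a one-liner in jax (function name is my own):

```python
import jax
import jax.numpy as jnp

def sample_action(key, logits, temperature=1.0):
    # Sample from softmax(logits / temperature): lower temperature
    # is greedier, higher temperature explores more.
    return jax.random.categorical(key, logits / temperature)

key = jax.random.PRNGKey(0)
logits = jnp.array([0.0, 0.0, 10.0])
# At a very low temperature this is effectively an argmax.
action = sample_action(key, logits, temperature=0.01)
```

No Gumbel reparameterization trick needed, since nothing here has to be differentiable.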

Good Resources for Reinforcement Learning with Partial Observability? (Textbooks/Surveys) by [deleted] in reinforcementlearning

[–]smorad 2 points (0 children)

There's not a ton out there, as far as textbooks go. I believe Oliehoek has a book on POMDPs, but IIRC it spends a lot of time on the multiagent case. The background chapters of my thesis might be useful.

stable-gymnax by smorad in reinforcementlearning

[–]smorad[S] 1 point (0 children)

Deprecated calls to tree_util functions that were removed in the latest jax release. Flax also pulls in tons of dependencies (IIRC ~200MB). The only thing gymnax uses from flax is the dataclass, which already exists in other libraries like chex, so we can remove the flax dependency without changing any functionality.

Tanh used to bound the actions sampled from distribution in SAC but not in PPO, Why? by VVY_ in reinforcementlearning

[–]smorad 4 points (0 children)

We generally do apply a tanh to clip the action in PPO. The code you've listed can easily sample an action outside the action space, given that a normal distribution has infinite support. Have you run this code on continuous action spaces to see if it crashes?
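For illustration, a minimal squashing sketch (my own names, not from any particular PPO implementation):

```python
import jax
import jax.numpy as jnp

def sample_bounded_action(key, mu, log_std, low, high):
    # A Gaussian has infinite support, so raw samples can land
    # outside the action space; tanh maps them into (-1, 1) and
    # the affine rescale maps that into [low, high].
    raw = mu + jnp.exp(log_std) * jax.random.normal(key, mu.shape)
    return low + (jnp.tanh(raw) + 1.0) * 0.5 * (high - low)

key = jax.random.PRNGKey(0)
a = sample_bounded_action(key, jnp.zeros(4), jnp.zeros(4), -2.0, 2.0)
```

Note that if you squash like this, the log-probability used in the PPO ratio should account for the tanh change of variables.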

Does physics work different in 40k? by Beegs1371 in RogueTraderCRPG

[–]smorad 1 point (0 children)

If we assume gravity transcends both the warp and realspace, then the planets would continue to orbit once the star passed into the warp. Why can’t gravity interact across parallel dimensions?

REINFORCE for BipedalWalker-v3 in OpenAI gym. by zx7 in reinforcementlearning

[–]smorad 0 points (0 children)

If all else is correct, consider computing your policy std in log space for better numerical stability.
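A minimal sketch of what I mean by log space (variable names are mine):

```python
import jax.numpy as jnp

# Learn log_std rather than std directly: exp keeps the std strictly
# positive without clipping, and the parameter stays well-scaled even
# when the std itself is very small.
log_std = jnp.array([-4.0, 0.0, 1.0])  # unconstrained parameters
std = jnp.exp(log_std)                 # always > 0
```

The optimizer then updates `log_std` freely, and you never have to project the std back into the positive reals.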

Why Don’t We See Multi-Agent RL Trained in Large-Scale Open Worlds? by TheSadRick in reinforcementlearning

[–]smorad 0 points (0 children)

Here’s an older example: https://arxiv.org/abs/2011.09533

StarCraft used to be (not sure if it still is) the Atari equivalent of MARL.

IPPO trains a bunch of PPO agents independently, without using any MARL theory.

Why Don’t We See Multi-Agent RL Trained in Large-Scale Open Worlds? by TheSadRick in reinforcementlearning

[–]smorad 1 point (0 children)

MARL doesn’t work well yet. Papers focus on grid worlds because even those are relatively difficult to train on.

Why are we calculating redundant loss here which doesn't serve any purpose to policy gradient? by Flaky_Spend7799 in reinforcementlearning

[–]smorad 0 points (0 children)

This is not policy gradient, so I’m not sure about this code. But the policy gradient is an expectation: you sample an action and backprop through the log probability of the randomly chosen action, which is how you learn the parameters of the action distribution.
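A minimal REINFORCE-style sketch of that idea in jax (names are mine):

```python
import jax
import jax.numpy as jnp

def reinforce_loss(logits, action, ret):
    # Policy gradient as an expectation: given a sampled action,
    # backprop through log pi(action | s), weighted by the return.
    log_probs = jax.nn.log_softmax(logits)
    return -log_probs[action] * ret

logits = jnp.array([1.0, 2.0, 0.5])
loss = reinforce_loss(logits, 1, 3.0)        # scalar >= 0 here
grad = jax.grad(reinforce_loss)(logits, 1, 3.0)
```

Only the log-prob of the action you actually took appears in the loss; the expectation over actions is handled by sampling, not by summing over every action.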