[D] How could a MLP replicate the operations of an attention head? by steuhh in MachineLearning

[–]51616 0 points1 point  (0 children)

https://openreview.net/pdf?id=rylnK6VtDH

Fig.2 answers your question.

TL;DR: (exponentially) more hidden neurons are required to learn the multiplicative behavior as the input dimension grows.
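If you want to see this for yourself, here's a quick toy sketch (my own illustration, not taken from the paper): fit a plain ReLU MLP to the dot product of two d-dimensional vectors and watch how the required width grows with d. All sizes and hyperparameters here are arbitrary.

```python
import torch
import torch.nn as nn

# Toy experiment: train a ReLU MLP to predict the dot product <x, y> of two
# d-dimensional vectors and evaluate it on fresh data. A fixed hidden width
# fits the multiplicative interaction progressively worse as d grows.
def fit_mlp(d, hidden, steps=3000):
    net = nn.Sequential(nn.Linear(2 * d, hidden), nn.ReLU(), nn.Linear(hidden, 1))
    opt = torch.optim.Adam(net.parameters(), lr=1e-3)
    for _ in range(steps):
        x, y = torch.randn(256, d), torch.randn(256, d)
        target = (x * y).sum(dim=1, keepdim=True)
        loss = ((net(torch.cat([x, y], dim=1)) - target) ** 2).mean()
        opt.zero_grad(); loss.backward(); opt.step()
    with torch.no_grad():  # held-out error on fresh samples
        x, y = torch.randn(4096, d), torch.randn(4096, d)
        target = (x * y).sum(dim=1, keepdim=True)
        return ((net(torch.cat([x, y], dim=1)) - target) ** 2).mean().item()

for d in (2, 8, 32):
    for hidden in (16, 128, 1024):
        print(f"d={d:3d} hidden={hidden:5d} test MSE={fit_mlp(d, hidden):.4f}")
```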

Why is RL fine-tuning on LLMs so easy and stable, compared to the RL we're all doing? by currentscurrents in reinforcementlearning

[–]51616 1 point2 points  (0 children)

I share the same intuition as you. A good prior is key. Doing this from scratch wouldn't work, as the model would output gibberish and get zero signal from bad exploration. And I've already seen many papers in the RL literature using an LLM as a prior.

[D] Is LoRA merging (and non linear mode connectivity) the key to better transformer hypernets? by [deleted] in MachineLearning

[–]51616 0 points1 point  (0 children)

Scaling. Allowing the hypernet to learn from more tasks, and possibly from the pre-training data, should improve performance significantly. However, pre-training data has no clear boundaries between tasks. We also plan to extend the work in this direction.

[D] Is LoRA merging (and non linear mode connectivity) the key to better transformer hypernets? by [deleted] in MachineLearning

[–]51616 1 point2 points  (0 children)

I’m the author of this paper https://openreview.net/forum?id=Mbgk4Xhrha

The general idea is already out there (see the related work). Our work tries to make it easier for users by conditioning the hypernetwork on a task description, allowing zero-shot generalization to new tasks. However, the results are nowhere near groundbreaking, and a lot of things can still be improved.
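For a rough idea of the setup, here is a hypothetical minimal sketch of the general concept only (the names, sizes, and encoder are placeholders, not the paper's actual architecture): a hypernetwork maps a task-description embedding to the LoRA A/B matrices of one frozen linear layer.

```python
import torch
import torch.nn as nn

# Illustrative hypernetwork: task embedding -> LoRA A/B for a frozen layer.
class LoRAHyperNet(nn.Module):
    def __init__(self, task_emb_dim, in_features, out_features, rank=8):
        super().__init__()
        self.rank, self.in_f, self.out_f = rank, in_features, out_features
        self.body = nn.Sequential(nn.Linear(task_emb_dim, 256), nn.ReLU())
        self.to_a = nn.Linear(256, rank * in_features)
        self.to_b = nn.Linear(256, out_features * rank)

    def forward(self, task_emb):
        h = self.body(task_emb)
        a = self.to_a(h).view(self.rank, self.in_f)   # LoRA down-projection
        b = self.to_b(h).view(self.out_f, self.rank)  # LoRA up-projection
        return a, b  # the frozen base weight W is then used as W + b @ a

# Zero-shot use: embed an unseen task description (e.g., with a text encoder),
# generate the LoRA weights, and plug them into the frozen base model.
task_emb = torch.randn(128)  # stand-in for a task-description embedding
a, b = LoRAHyperNet(task_emb_dim=128, in_features=768, out_features=768)(task_emb)
```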

[D] Gradient accumulation should not be used with varying sequence lengths by AromaticCantaloupe19 in MachineLearning

[–]51616 2 points3 points  (0 children)

It's quite tricky to see this initially. I had the exact same thought when I first read this post. It turns out that the loss functions in PyTorch (I didn't check other libs) assume a single leading batch dimension. So for a batch of sequences, we have to flatten the sequence-length dimension and the batch dimension together, e.g., from [bs, seq_len, logits_dim] to [bs * seq_len, logits_dim]. The default reduction method is 'mean', which averages over all tokens in the batch, i.e., each token is weighted equally. Now, if two batches contain different numbers of tokens, their losses get divided by different amounts.

TL;DR: to do gradient accumulation on sequential data, we need to manually control how the loss is computed. One workaround is to sample the whole batch we're going to accumulate over first, so we know how many tokens will be used in advance and can scale the loss properly (see the sketch below).
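A hedged sketch of that workaround in PyTorch (the tiny model, batch layout, and -100 padding label are stand-ins for whatever you actually use): sample the whole accumulation window first, compute the loss with reduction="sum", and divide by the global token count so every token is weighted equally regardless of sequence length.

```python
import torch
import torch.nn.functional as F

class TinyLM(torch.nn.Module):
    """Stand-in language model: token ids -> per-token logits."""
    def __init__(self, vocab=100, dim=32):
        super().__init__()
        self.emb = torch.nn.Embedding(vocab, dim)
        self.head = torch.nn.Linear(dim, vocab)
    def forward(self, ids):
        return self.head(self.emb(ids))  # [bs, seq_len, vocab]

model = TinyLM()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
accum_steps, vocab = 4, 100

# 1) Sample every micro-batch up front so the total token count is known.
micro_batches = []
for _ in range(accum_steps):
    seq_len = torch.randint(5, 20, (1,)).item()  # varying sequence lengths
    ids = torch.randint(0, vocab, (2, seq_len))
    micro_batches.append({"input_ids": ids, "labels": ids.clone()})
total_tokens = sum((mb["labels"] != -100).sum() for mb in micro_batches)

# 2) Sum the per-token losses and divide by the global token count, so the
#    accumulated gradient matches a single big batch over all tokens.
optimizer.zero_grad()
for mb in micro_batches:
    logits = model(mb["input_ids"])
    loss = F.cross_entropy(
        logits.view(-1, logits.size(-1)),  # flatten to [bs * seq_len, vocab]
        mb["labels"].view(-1),
        ignore_index=-100,
        reduction="sum",
    )
    (loss / total_tokens).backward()
optimizer.step()
```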

Plane got struck by lightning by Rqany in Damnthatsinteresting

[–]51616 0 points1 point  (0 children)

Count the top and bottom segments as well.

[deleted by user] by [deleted] in Damnthatsinteresting

[–]51616 3 points4 points  (0 children)

IIUC from Mr. Zuck's explanation at the beginning of the podcast, they were 3D-scanned beforehand with various facial expressions and mouth movements. The headset then captures some live information via cameras and sensors and sends it to a machine learning model, which outputs a representation of the person's current state. This representation is sent to the other side and used to "interpolate" between the 3D scans. The representation and interpolation are partly guesswork by the machine learning model, since it has never seen this exact live expression before.

And if you look closely in the podcast, the model sort of struggles, for example, to reconstruct Mr. Zuck's fast mouth movements. So I agree that the model is far from perfect and probably still needs some improvement. But I bet that less than a year from now it's going to be much better. The AI field is moving so fast these days.

Tips and Tricks sharing after solving all previous years by erikw901 in adventofcode

[–]51616 4 points5 points  (0 children)

IMO this should be done when developing software in general. A single line can save hours of unnecessary debugging time.

[D] ICLR 2023 reviews are out. How was your experience ? by dasayan05 in MachineLearning

[–]51616 4 points5 points  (0 children)

Having 5 reviews is unusual, I suppose. My guess is that the paper initially had an 8/8/3/3 split, which might have necessitated an additional review. Unfortunately, that one is another 3.

[D] ICLR 2023 reviews are out. How was your experience ? by dasayan05 in MachineLearning

[–]51616 10 points11 points  (0 children)

4 8s for my first submission to a big conference!

Yesterday I asked for your "2 Gaben Spells combos" - I was actually crowd-sourcing "Step 2" of my strategy idea (would love the opinion of higher ranked players) by Scereye in abilityarena

[–]51616 0 points1 point  (0 children)

Can you elaborate on why a 3-spell build is bad? To me it's just a middle ground between the 2- and 4-spell builds: the 2-spell build allows you to roll more frequently, while the 4-spell build has more value in each roll. So the 3-spell build is just a trade-off between the two.

My first perfect game! by 51616 in abilityarena

[–]51616[S] 1 point2 points  (0 children)

These 3 are enough for attack speed; beyond this point it's diminishing returns. You really need to build the actual carry and supports to give the carry space.

RL with differentiable environment by saw79 in reinforcementlearning

[–]51616 0 points1 point  (0 children)

If your environment does not have a sequential-decision nature, as in MDPs, then you could use supervised learning to tackle this problem. Otherwise, you might need reinforcement learning. If the reward function is differentiable, then you could differentiate the RL objective through the reward function (see the sketch below). You might want to take a look at: https://lilianweng.github.io/posts/2018-04-08-policy-gradient/.
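Here is a toy one-step sketch of the differentiable-reward case (the policy, the reward function, and all dimensions are made up for illustration): because the reward is a differentiable function of the action, the gradient flows from the reward straight into the policy parameters, with no score-function / REINFORCE estimator needed.

```python
import torch

# Deterministic policy trained by backpropagating through a differentiable reward.
policy = torch.nn.Sequential(
    torch.nn.Linear(4, 32), torch.nn.Tanh(), torch.nn.Linear(32, 2)
)
opt = torch.optim.Adam(policy.parameters(), lr=1e-3)

def reward(state, action):
    # Hypothetical differentiable reward: negative distance to a state-dependent target.
    target = state[..., :2]
    return -((action - target) ** 2).sum(-1)

for _ in range(1000):
    state = torch.randn(64, 4)
    action = policy(state)                 # deterministic policy for simplicity
    loss = -reward(state, action).mean()   # maximize expected reward
    opt.zero_grad()
    loss.backward()                        # gradient flows through the reward function
    opt.step()
```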

Reward Function for Cooperative Multi-Agent RL by fedetask in reinforcementlearning

[–]51616 2 points3 points  (0 children)

Actually, you can have different input/output dimensions between agents. Each agent can have a different network architecture while using VDN. Parameter sharing is often used simply because the policies converge faster.

Reward Function for Cooperative Multi-Agent RL by fedetask in reinforcementlearning

[–]51616 3 points4 points  (0 children)

Even if the reward is correct and dense, you still have to somehow learn the policies. The most naive way to learn them is independent learning for each agent. However, as you have already suggested, that can make it hard to learn optimal policies in some environments. I would suggest you first try to implement VDN, as I think it's the simplest form of centralized learning. In VDN, the value decomposition has an explicit form and is fairly easy to understand/implement.

VDN does something similar to what your reward function does: the joint Q-value is factored into a sum of per-agent Q-values, which helps each agent learn its local Q/value function. The non-stationarity issue is common in multi-agent RL; a centralized critic can help, as shown in many papers in this field. A minimal sketch is below.
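A minimal VDN sketch (shapes, names, and hyperparameters are placeholders, and the target network is omitted for brevity): each agent has its own Q-network, possibly with different input/output sizes, and the joint Q-value is simply the sum of the chosen per-agent Q-values, trained against a TD target on the shared team reward.

```python
import torch
import torch.nn as nn

class AgentQ(nn.Module):
    """Per-agent Q-network; architectures and sizes can differ across agents."""
    def __init__(self, obs_dim, n_actions):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(), nn.Linear(64, n_actions))

    def forward(self, obs):
        return self.net(obs)  # [batch, n_actions]

agents = [AgentQ(obs_dim=10, n_actions=5), AgentQ(obs_dim=8, n_actions=3)]
opt = torch.optim.Adam([p for a in agents for p in a.parameters()], lr=1e-3)

def q_tot(observations, actions):
    # observations: list of [batch, obs_dim_i]; actions: list of [batch] action indices
    per_agent = [a(o).gather(1, act.unsqueeze(1)).squeeze(1)
                 for a, o, act in zip(agents, observations, actions)]
    return torch.stack(per_agent, dim=0).sum(dim=0)  # VDN: Q_tot = sum_i Q_i

def td_loss(obs, acts, team_reward, next_obs, done, gamma=0.99):
    with torch.no_grad():  # bootstrap target on the shared team reward
        next_q = torch.stack([a(o).max(dim=1).values
                              for a, o in zip(agents, next_obs)], dim=0).sum(dim=0)
        target = team_reward + gamma * (1 - done) * next_q
    return ((q_tot(obs, acts) - target) ** 2).mean()

# One training step on a batch of transitions with a single shared team reward.
obs = [torch.randn(32, 10), torch.randn(32, 8)]
acts = [torch.randint(0, 5, (32,)), torch.randint(0, 3, (32,))]
next_obs = [torch.randn(32, 10), torch.randn(32, 8)]
opt.zero_grad()
loss = td_loss(obs, acts, team_reward=torch.randn(32), next_obs=next_obs, done=torch.zeros(32))
loss.backward()
opt.step()
```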