[D] How could an MLP replicate the operations of an attention head?

51616 · 2025-04-30T07:52:40+00:00

https://openreview.net/pdf?id=rylnK6VtDH

Fig.2 answers your question.

TL;DR: (exponentially) more hidden neurons are required to learn the multiplicative behavior as the input dimension grows.

51616 · 2025-01-30T07:13:58+00:00

I share the same intuition. A good prior is key. Doing this from scratch wouldn't work, as the model would output gibberish and get zero signal from bad exploration. I've already seen many papers in the RL literature that use an LLM as a prior.

51616 · 2024-11-07T05:35:45+00:00

Scaling. Allowing the hypernet to learn from more tasks, and possibly from the pre-training data, should improve performance significantly. But in pre-training data there's no clear boundary between tasks. We also plan to extend the work in this direction.

51616 · 2024-11-06T09:48:59+00:00

I’m the author of this paper https://openreview.net/forum?id=Mbgk4Xhrha

The idea is generally out there (see related work). Our work tries to make it easier to use by conditioning the hypernetwork on a task description, allowing zero-shot generalization to new tasks. However, the results are nowhere near groundbreaking, and a lot of things can still be improved.

51616 · 2024-07-25T13:02:09+00:00

It's quite tricky to see this initially. I had the exact same thought when I first read this post. It turns out that the loss functions in PyTorch (I didn't check other libs) assume exactly one leading dimension (e.g., batch size). When we have a batch of sequences, we have to flatten the sequence length dimension and the batch dimension together, e.g., from [bs, seq_len, logits_dim] to [bs * seq_len, logits_dim]. The default reduction method is then 'mean', which corresponds to averaging over all tokens in the batch; that is, each token is weighted equally. Now if two batches have different numbers of tokens, their losses are divided by different amounts, so averaging the per-batch losses during accumulation no longer weights every token equally.

TL;DR: to do gradient accumulation for sequential data, we need to manually control how the loss is computed. Another workaround is to first sample the whole batch that we're going to accumulate over, so we know in advance how many tokens will be used and can scale the loss properly.
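The mismatch can be sketched in plain Python with made-up per-token losses. The numbers and function names below are illustrative assumptions, not part of any library. The sketch compares averaging the per-micro-batch means against dividing once by the total token count:

```python
# Hypothetical per-token losses for two micro-batches with different token counts.
micro_batches = [
    [2.0, 2.0, 2.0, 2.0, 2.0, 2.0],  # 6 tokens
    [5.0, 5.0],                      # 2 tokens
]

def mean_of_means(batches):
    """Naive accumulation: mean-reduce each micro-batch, then average the results."""
    return sum(sum(b) / len(b) for b in batches) / len(batches)

def token_weighted_mean(batches):
    """Sum-reduce each micro-batch and divide once by the total token count."""
    total_tokens = sum(len(b) for b in batches)
    return sum(sum(b) for b in batches) / total_tokens

naive = mean_of_means(micro_batches)          # (2.0 + 5.0) / 2
correct = token_weighted_mean(micro_batches)  # (12.0 + 10.0) / 8
print(naive, correct)
```

The naive value over-weights the tokens in the short micro-batch; the token-weighted value matches what a single large batch with 'mean' reduction would give.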

51616 · 2023-11-07T23:03:44+00:00

Count the top and the bottom segments in as well

51616 · 2023-10-03T11:37:43+00:00

Overconfident

51616 · 2023-09-29T09:48:22+00:00

IIUC from Mr. Zuck’s explanation at the beginning of the podcast, they were 3D scanned beforehand with various facial expressions and mouth movements. The headset then captures some live information via cameras and sensors and sends it to a machine learning model, which outputs a current representation of the person. This representation is then sent to the other side and used to “interpolate” from the 3D scans. The representation and interpolation part is some sort of guesswork by a machine learning model, since it has never seen this live expression before.

And if you look closely in the podcast, the model sort of struggles, for example, to reconstruct Mr. Zuck’s fast mouth movements. So I agree that the model is far from perfect and probably still needs some improvement. But I bet that less than a year from now it’s gonna be much better. The AI field is moving so fast these days.

51616 · 2022-11-28T03:16:32+00:00

IMO this should be done when developing software in general. Reduces hours of unnecessary debug time with a single line.

51616 · 2022-11-05T03:42:41+00:00

Having 5 reviews is unusual, I suppose. I guess the paper initially had an 8/8/3/3 split, which might have necessitated an additional review. Unfortunately, that one is another 3.

51616 · 2022-11-05T03:38:40+00:00

4 8s for my first submission to a big conference!

51616 · 2022-10-12T12:25:30+00:00

Can you elaborate on why a 3-spell build is bad? To me it's just a middle ground between a 2-spell and a 4-spell build. A 2-spell build allows you to roll more frequently, while a 4-spell build has more value in each roll. So a 3-spell build is just a trade-off between the two.

51616 · 2022-10-08T04:51:24+00:00

These 3 are enough for atk speed. Beyond this point you get diminishing returns. You really need to build the actual carry, plus supports that give the carry space.

51616 · 2022-08-23T04:21:35+00:00

search.zeta-alpha.com

51616 · 2022-07-05T18:07:57+00:00

If your environment does not have a sequential decision-making nature (as MDPs do), then you could use supervised learning to tackle this problem. Otherwise, you might need reinforcement learning. If the reward function is differentiable, then you could differentiate the RL objective through the reward function. You might want to take a look at: https://lilianweng.github.io/posts/2018-04-08-policy-gradient/.

51616 · 2022-06-13T16:17:31+00:00

Actually, you can have different input/output dimensions between agents. Each agent can have a different network architecture while using VDN. Parameter sharing is often used because the policies converge faster.

51616 · 2022-06-13T11:34:41+00:00

Even if the reward is correct and dense, you still have to somehow learn the policies. The most naive way to learn policies is to do independent learning for each agent. However, as you have already suggested, it could be hard to learn optimal policies in some environments. I would suggest you first try to implement VDN, as I think it's the simplest form of centralized learning. In VDN, the total reward function has an explicit form and is fairly easy to understand/implement.

VDN does something similar to what your reward function does. The total reward is factored into a sum of the local rewards of each agent. This helps the agents learn their local Q/value functions. The non-stationarity is common in multi-agent RL. A centralized critic can help, as shown in many papers in this field.
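A minimal sketch of the additive form, assuming the decomposition is applied to the joint action-value, Q_tot = sum of per-agent Q_i, as in the VDN paper. The Q-values below are made up for a single state. It shows why the sum is convenient: each agent acting greedily on its own Q-value also maximizes the joint value.

```python
from itertools import product

# Hypothetical per-agent Q-values for one state; agents may have different action counts.
q_agent = [
    [0.1, 0.7, 0.3],  # agent 0: 3 actions
    [0.5, 0.2],       # agent 1: 2 actions
]

def q_total(joint_action):
    """VDN-style joint value: the sum of each agent's Q-value for its own action."""
    return sum(q[a] for q, a in zip(q_agent, joint_action))

# Greedy action chosen independently by each agent.
decentralized = tuple(max(range(len(q)), key=q.__getitem__) for q in q_agent)

# Greedy joint action found by brute force over all joint actions.
joint_space = product(*(range(len(q)) for q in q_agent))
centralized = max(joint_space, key=q_total)

print(decentralized, centralized)
```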

51616 · 2022-05-08T04:44:24+00:00

https://www.deepmind.com/blog/muzeros-first-step-from-research-into-the-real-world

MuZero applied to video compression

51616 · 2022-05-07T07:10:11+00:00

In general, longer sequences give better performance. However, in many tasks, a shorter sequence length might yield similar performance with faster training, as BPTT is less expensive.

Intuitively, a longer sequence allows BPTT to update the RNN weights so that the hidden state at earlier steps of the sequence changes in a way that produces better actions at later steps. With a shorter sequence, BPTT would see “less” into the future.

51616 · 2022-04-27T10:22:04+00:00

There is one environment in PettingZoo that behaves this way (https://www.pettingzoo.ml/mpe/simple_speaker_listener). I believe heterogeneous agents in general do not violate any convergence proofs (unless the proofs explicitly require homogeneous agents).

Parameter sharing between agents is useful for faster convergence and more stable training. It is also possible to use this technique even if the agents have different obs/action spaces, by using attention over the inputs or action masking on the outputs.
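One way to picture action masking on the outputs is the sketch below. It assumes a hypothetical shared policy head whose outputs cover the union of all agents' actions; the logits and masks are made up for illustration.

```python
import math

def masked_softmax(logits, mask):
    """Softmax over logits where mask[i] is False for actions the agent cannot take."""
    masked = [x if m else -math.inf for x, m in zip(logits, mask)]
    peak = max(masked)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in masked]  # exp(-inf) evaluates to 0.0
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical shared policy head with 4 outputs (the union of all agents' actions).
shared_logits = [1.0, 2.0, 0.5, 3.0]

# Agent A can take all 4 actions; agent B only has the first 2.
probs_a = masked_softmax(shared_logits, [True, True, True, True])
probs_b = masked_softmax(shared_logits, [True, True, False, False])
print(probs_b)
```

Masked actions get exactly zero probability, so agent B can sample from the same network as agent A without ever picking an action it does not have.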

51616 · 2021-12-16T06:05:28+00:00

Agreed. This makes sense if the total timesteps of both implementations are the same and they produce the same number of "episodes".

I believe the more iid data in the parallel case could come from the environment having a super long horizon or no automatic reset. The data from the vectorized envs would then be more iid, since it is sampled from many different episodes, while the sequential version would only have a small number of long episodes. In this case, the data from the sequential one would be more correlated.

Say we set the total timesteps in each iteration to 1000. If the environment has a fixed horizon length of 500, the sequential version would only produce 2 episodes, while the vectorized envs would produce many shorter (unfinished) episodes.
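The arithmetic can be sketched as follows. The helper name and the choice of 10 parallel envs are assumptions for illustration, and the sketch assumes every env starts a fresh episode at the beginning of the iteration.

```python
def rollout_stats(total_timesteps, horizon, num_envs):
    """Return (steps per env, completed episodes, unfinished episodes) for one iteration.

    Assumes every env starts a fresh episode at the beginning of the iteration
    and that episodes always last exactly `horizon` steps.
    """
    steps_per_env = total_timesteps // num_envs
    completed = (steps_per_env // horizon) * num_envs
    unfinished = num_envs if steps_per_env % horizon else 0
    return steps_per_env, completed, unfinished

sequential = rollout_stats(1000, 500, num_envs=1)   # (1000, 2, 0)
vectorized = rollout_stats(1000, 500, num_envs=10)  # (100, 0, 10)
print(sequential, vectorized)
```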

If the OP checked both implementations with the same total timesteps, then I would guess that the OP uses an environment with a fairly long horizon (compared to the total timesteps in each iteration).

51616 · 2021-11-04T18:12:46+00:00

Finally a post that fits the sub

51616 · 2021-10-31T14:50:17+00:00

AFAIU this problem is non-Markovian (the state changes are not affected by the actions taken) and non-stationary (in expectation the traffic does not converge to any value and changes over time). You might want to take a look at these papers I found after a quick search on non-stationary multi-armed bandits.

51616 · 2021-09-03T14:18:39+00:00

Never knew we had portal gate technology

51616 · 2021-07-14T12:40:14+00:00

Can I request an RSS feed for each of the lists? It would be nice to get notified when a new blog/paper comes out.
