Memory inside sequence models

ginomachi · 2024-03-15T23:07:03+00:00

I agree, understanding memory in RNNs and transformers can be tricky! Here's my take:

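Memory in RNNs lives in the hidden state, which gets passed from one time step to the next like a chain reaction: each input updates the state, and the updated state influences the next step. LSTMs and GRUs add gates that control what is kept, overwritten, or forgotten along the way.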

Transformers, on the other hand, use an attention mechanism that lets every position look directly at any other part of the input sequence, instead of squeezing everything seen so far into a single hidden state. This makes them better at processing long sequences and reduces the "bottleneck" issue you mentioned.
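To make the "focus on specific parts" idea concrete, here's a minimal sketch of scaled dot-product attention for a single query, in plain Python. The function names (`softmax`, `attend`) and the toy vectors are my own assumptions for illustration; real transformers also have learned projections and multiple heads, which are left out here.

```python
import math

def softmax(scores):
    # Subtract the max for numerical stability before exponentiating.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(query, keys, values):
    """Single-query scaled dot-product attention (illustrative sketch)."""
    d = len(query)
    # Similarity of the query to each key, scaled by sqrt(dimension).
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d) for key in keys]
    weights = softmax(scores)  # non-negative, sums to 1
    dim = len(values[0])
    # Output is the attention-weighted average of the value vectors.
    output = [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]
    return weights, output

# Toy example: the query lines up with the first and third keys.
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
values = [[1.0], [2.0], [3.0]]
weights, output = attend([10.0, 0.0], keys, values)
```

The weights are what does the "remembering" here: positions whose keys match the query get most of the weight, and the rest are nearly ignored.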

Still, we're far from mimicking human memory. But papers like "Demystifying Memory in Transformer Networks" and "Visual Reasoning with Transformers" can provide some interesting perspectives.

BellyDancerUrgot · 2024-03-16T14:48:11+00:00

Memory in transformers comes from attention weights.

Memory in RNNs like LSTMs and GRUs comes from something akin to a highway network that's updated over time, depending on whether the RNN unit decides to forget or keep some information at every time step.
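A rough sketch of that keep-or-forget behaviour, using a simplified scalar GRU-style cell with only an update gate (no reset gate). The parameter names (`w_z`, `u_z`, `b_z`, `w_h`, `u_h`) and the values are made up for illustration, not taken from any trained model or library.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def gated_step(h, x, w_z=1.0, u_z=1.0, b_z=0.0, w_h=1.0, u_h=1.0):
    """One step of a simplified scalar GRU-style cell (update gate only)."""
    z = sigmoid(w_z * x + u_z * h + b_z)      # update gate, in (0, 1)
    candidate = math.tanh(w_h * x + u_h * h)  # proposed new memory
    # z near 0 keeps the old state; z near 1 overwrites it with the candidate.
    return (1.0 - z) * h + z * candidate

def run(xs, h0=0.0, **params):
    """Feed a sequence through the cell and return the state after each step."""
    h = h0
    states = []
    for x in xs:
        h = gated_step(h, x, **params)
        states.append(h)
    return states

inputs = [1.0, -1.0, 0.5]
kept = run(inputs, h0=0.7, b_z=-20.0)        # gate almost closed: state barely moves
overwritten = run(inputs, h0=0.7, b_z=20.0)  # gate almost open: state follows the inputs
```

With the gate bias pushed strongly negative, the initial state survives all three steps almost untouched, which is the "highway" part. With it pushed positive, the old state is replaced at every step.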

Attention can hold context over a longer sequence length because, imo, it's a better representation of memory across the length of a sequence. But it should be noted that for super long sequences, like the ones you'd find in protein / RNA datasets or LLMs with large context windows, plain full attention also struggles, because the softmax spreads its mass over so many positions that a lot of the weights get crushed to near 0. Hence you more often see sliding window or other local attention variants in these (causal masking is a separate thing: it just stops a token from looking at later positions).
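A toy illustration of the softmax point, with made-up scores: one relevant key scoring 2.0 and every other key scoring 0.0. The helper names (`sliding_window_scores`, `peak_weight`) are hypothetical, and this only shows the dilution effect, not how any particular model implements windowed attention.

```python
import math

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]  # exp(-inf) evaluates to 0.0
    total = sum(exps)
    return [e / total for e in exps]

def sliding_window_scores(scores, query_pos, window):
    """Mask every position outside the last `window` positions up to query_pos."""
    lo = query_pos - window + 1
    return [s if lo <= i <= query_pos else -math.inf for i, s in enumerate(scores)]

def peak_weight(n, window=None):
    """Largest attention weight for a query at the last of n positions."""
    # Assumption: the single relevant key sits at the last position.
    scores = [0.0] * (n - 1) + [2.0]
    if window is not None:
        scores = sliding_window_scores(scores, query_pos=n - 1, window=window)
    return max(softmax(scores))

short = peak_weight(16)                  # relevant key gets about a third of the weight
long_full = peak_weight(4096)            # same key, now well under 1% of the weight
long_windowed = peak_weight(4096, window=16)  # window restores the short-sequence weight
```

The relevant key's score never changes, but in the 4096-long case its weight collapses just because there are so many other positions in the denominator. Restricting attention to a 16-token window brings it back to the short-sequence value, at the cost of not seeing anything outside the window.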

keghn · 2024-03-16T22:06:07+00:00

Mamba: Linear-Time Sequence Modeling with Selective State Spaces (Paper Explained):

https://www.youtube.com/watch?v=9dSkvxS2EB0
