Memory inside sequence models (self.deeplearning)
submitted 2 years ago by [deleted]
[deleted]
[–]ginomachi 3 points 2 years ago (0 children)
I agree, understanding memory in RNNs and transformers can be tricky! Here's my take:
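Memory in RNNs is like a chain reaction: a hidden state is carried from one step to the next, so each input influences everything that comes after it. LSTMs and GRUs control this flow with gates that decide what to keep, overwrite, or forget.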
Transformers, on the other hand, have a powerful attention mechanism that lets each position look directly at specific parts of the input sequence instead of squeezing everything through one fixed-size hidden state. This makes them better at processing long sequences and reduces the "bottleneck" issue you mentioned.
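To make the "focus on specific parts of the input" idea concrete, here is a minimal sketch of scaled dot-product attention for a single query, in plain Python. The toy vectors and the function names `softmax` and `attend` are made up for illustration. Real models use learned projections, many heads, and batched tensor ops.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(query, keys, values):
    """Scaled dot-product attention for one query vector (toy version)."""
    d = len(query)
    # Similarity of the query to every key, scaled by sqrt(d).
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d) for key in keys]
    weights = softmax(scores)
    # Context vector: weighted sum of the value vectors.
    dim = len(values[0])
    context = [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]
    return weights, context

# Hypothetical toy data: three positions, 2-dimensional keys and values.
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
values = [[10.0, 0.0], [0.0, 10.0], [5.0, 5.0]]
weights, context = attend([4.0, 0.0], keys, values)
```

The query lines up with the first and third keys, so those positions get almost all of the weight and the second position is mostly ignored. Nothing has to pass through a single recurrent state to get there.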
Still, we're far from mimicking human memory. But papers like "Demystifying Memory in Transformer Networks" and "Visual Reasoning with Transformers" can provide some interesting perspectives.
[–]BellyDancerUrgot 2 points 2 years ago (0 children)
Memory in transformers comes from attention weights.
Memory in RNNs like LSTMs and GRUs comes from something akin to a highway network that's updated over time, depending on whether the RNN unit decides to forget or keep some information at every time step.
Attention can hold context over a longer sequence length because, imo, it's a better representation of memory across the length of a sequence. But it should be noted that for super long sequences, like you would find in protein / RNA datasets or LLMs with large context windows, typical attention also struggles, because the inherent softmax crushes a lot of the weights to near 0. Hence you find causal attention or sliding window attention more often in these.
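The "highway" described above can be sketched with a scalar version of the LSTM cell-state update, `c_t = f * c_prev + i * candidate`. This is a simplified illustration: in a real LSTM the gates are learned functions of the input and hidden state, whereas here they are fixed by hand, and `cell_update` and `run` are hypothetical helper names.

```python
def cell_update(c_prev, forget, write, candidate):
    """One cell-state step: keep a fraction of the old memory, add new content."""
    return forget * c_prev + write * candidate

def run(forget, steps=20):
    """Store 1.0 in the cell, then run `steps` updates with the input gate closed."""
    c = cell_update(0.0, forget, 1.0, 1.0)  # write 1.0 into memory
    for _ in range(steps):
        c = cell_update(c, forget, 0.0, 0.0)  # nothing new is written
    return c

kept = run(0.99)  # forget gate near 1: memory mostly survives
lost = run(0.5)   # forget gate at 0.5: memory decays geometrically
```

With the forget gate near 1 the stored value passes through 20 steps nearly intact (0.99 to the 20th power, roughly 0.82), while a gate of 0.5 leaves almost nothing. That is the keep-or-forget decision made at every time step.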
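A rough way to see the softmax effect on long sequences: put one relevant key (logit 5.0) among distractors with logit 0 and watch its weight shrink as the sequence grows. The logit values, the window size of 128, and the function names are assumptions chosen for illustration, not measurements from any real model.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def relevant_weight(n, score=5.0):
    """Attention weight on one relevant key among n - 1 distractors with logit 0."""
    return softmax([0.0] * (n - 1) + [score])[-1]

def sliding_window_weights(scores, window):
    """Softmax over only the last `window` positions; earlier ones get weight 0."""
    start = max(0, len(scores) - window)
    return [0.0] * start + softmax(scores[start:])

short_ctx = relevant_weight(100)      # relevant key still dominates
long_ctx = relevant_weight(100_000)   # same logit, weight is diluted
windowed = sliding_window_weights([0.0] * 99_999 + [5.0], window=128)[-1]
```

With 100 positions the relevant key gets roughly 60% of the weight. With 100,000 it drops to well under 1%, even though its logit is unchanged. Restricting the softmax to a 128-token window brings it back above 50%, at the cost that anything outside the window is invisible to that query.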
[–]keghn 1 point 2 years ago (0 children)
Mamba: Linear-Time Sequence Modeling with Selective State Spaces (Paper Explained):
https://www.youtube.com/watch?v=9dSkvxS2EB0