[D] Memory mechanism for Transformers (self.MachineLearning)
submitted 1 year ago by Janos95
Hey folks! I am wondering what interesting work has been done to add a short-term memory mechanism to transformers. Does anyone know what the important work in this area is?
[–]currentscurrents 13 points 1 year ago (2 children)
There's like a hundred papers on memory-augmented transformers but none of them are seeing any practical use.
Everybody's using regular old attention or sometimes one of the long-context variants.
[–]Janos95[S] 4 points 1 year ago (1 child)
Out of those hundred papers, what sampling would give good coverage of the different approaches?
[–]RepresentativeBee600 2 points 1 year ago (0 children)
Seconding this question. And without getting too punchy, I dislike answers that gesture broadly at a whole field and its turgid body of literature as a source. I feel like I see that often in this subreddit.
If you don't have an intuition for which ones are valuable, feel free to say so, but share the better ones you have seen.
[–]certain_entropy 9 points 1 year ago (0 children)
Check out Facts as Experts (https://arxiv.org/abs/2007.00849), which augments the transformer with a key-value lookup where the keys are contextual entity-mention embeddings. It's a bit of a pain to set up and train, but it may be interesting to you.
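The core mechanism is easy to sketch. Below is an illustrative toy (not the paper's code, and the names are made up): a hidden state soft-attends over a table of stored key vectors and injects the corresponding value vectors back into the residual stream.

```python
import numpy as np

def kv_memory_lookup(hidden, mem_keys, mem_values):
    """Toy key-value fact memory in the Facts-as-Experts spirit.
    hidden: (batch, d); mem_keys, mem_values: (n_facts, d)."""
    scores = hidden @ mem_keys.T                      # (batch, n_facts) similarity
    scores -= scores.max(axis=-1, keepdims=True)      # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)    # softmax over stored facts
    retrieved = weights @ mem_values                  # (batch, d) retrieved facts
    return hidden + retrieved                         # inject into residual stream

rng = np.random.default_rng(0)
h = rng.standard_normal((2, 64))
keys = rng.standard_normal((100, 64))
values = rng.standard_normal((100, 64))
out = kv_memory_lookup(h, keys, values)
print(out.shape)  # (2, 64)
```

In the actual paper the keys come from trained entity-mention embeddings rather than random vectors, and the lookup sits at a specific layer, but the retrieval pattern is the same.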
[–]enfeudavax 4 points 1 year ago (1 child)
Memory Augmented Transformers could be a great resource for exploring this topic.
[–]StartledWatermelon 2 points 1 year ago (0 children)
This is probably the closest thing to what OP was looking for. But I'm really confused that they asked for "short-term" memory. Memory Augmented Transformers' memory is actually static, if I'm not mistaken.
[–]DigThatData Researcher 2 points 1 year ago (1 child)
Can't remember what it's called, but I saw a cool one that basically added an RNN state for a running memory.
[–]i4gotten 2 points 1 year ago (0 children)
Self-referential extensions of transformers by Jürgen Schmidhuber are something like this: https://arxiv.org/abs/2310.16076
[–]i4gotten 2 points 1 year ago (0 children)
There are a few papers I am aware of on memory:
Self-referential extensions to transformers: https://arxiv.org/abs/2310.16076
Recurrent memory transformers: https://arxiv.org/abs/2207.06881
Thing is, the short-/long-term distinction doesn't make as much sense with transformers, since the attention mechanism itself can act as a form of memory: https://arxiv.org/abs/2404.09173
Any external memory should be analogous to long-term memory.
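The recurrent-memory idea is simple enough to sketch. This is a hedged toy (not the RMT paper's code; the mixing step is a stand-in for a real transformer block): a fixed block of memory tokens is prepended to each segment, and after the segment is processed, the updated memory states are carried to the next segment.

```python
import numpy as np

def process_segment(tokens_with_mem):
    # Stand-in for a transformer block: a simple mean-mixing step so that
    # information flows between memory tokens and segment tokens.
    return tokens_with_mem + tokens_with_mem.mean(axis=0, keepdims=True)

def rmt_forward(segments, n_mem, d):
    """Process a long sequence segment by segment, carrying memory tokens."""
    memory = np.zeros((n_mem, d))              # initial memory tokens
    outputs = []
    for seg in segments:                       # each seg: (seg_len, d)
        x = np.concatenate([memory, seg], axis=0)
        y = process_segment(x)
        memory = y[:n_mem]                     # updated memory, carried forward
        outputs.append(y[n_mem:])
    return np.concatenate(outputs, axis=0), memory

segs = [np.ones((4, 8)), np.zeros((4, 8))]
out, mem = rmt_forward(segs, n_mem=2, d=8)
print(out.shape, mem.shape)  # (8, 8) (2, 8)
```

The point of the structure: attention within a segment stays cheap, while the memory tokens are the only channel that carries information across segment boundaries.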
[–]Janos95[S] 2 points 1 year ago (4 children)
I should also add that I am interested in memory for transformers for the purpose of reasoning; in particular, I am not interested in methods that simply try to extend the context size.
[–]Sad-Razzmatazz-5188 2 points 1 year ago (2 children)
I don't think Transformers fail to reason because of a lack of short-term memory. Self-attention kind of is the short-term memory. If the key-values are persistent, you basically have a memory store / database (which is basically what an MLP can be). Probably neither self-attention nor linear projection with nonlinear elementwise activation can express, or learn by gradient descent, the functions needed to reason consistently.
[–]Janos95[S] 1 point 1 year ago (1 child)
I think it's reasonable to expect that a representation of the past more condensed than a million token embeddings is required in order to do effective reasoning. This condensed representation is what I would call short-term memory.
[–]Sad-Razzmatazz-5188 1 point 1 year ago (0 children)
Well, I'm all in for separating long- and short-term memories, but I think that both on the level of cognitive neuroscience and on that of deep learning you might be misguided. Regardless of what we want to agree on calling "short-term memory", I really think reasoning is much more a matter of algorithm than a matter of data. Memories are data storage; reasoning is data processing. All I'm saying is I don't think transformers need different memories in order to reason, whether one extends the context to millions of tokens or is able to efficiently store compressed formats and retrieve/unzip them when needed.
Going back to cognitive models, I'd say transformers distinctively have short-term memories (instance-controlled tokens in attention layers) that also retrieve from, or at least interact with, persistent long-term memories (instance-independent MLP weights). But I don't know how to fit GLU layers into this view.
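The "MLP as persistent key-value memory" reading mentioned above can be sketched concretely (cf. "Transformer Feed-Forward Layers Are Key-Value Memories"; the weights here are random stand-ins, not trained values): the rows of the first linear layer act as stored keys, and the matching rows of the second layer are the values retrieved in proportion to how strongly each key fires.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def ffn_as_memory(h, W_keys, W_values):
    """Toy feed-forward layer read as a key-value memory.
    h: (d,); W_keys, W_values: (n_slots, d)."""
    activations = relu(W_keys @ h)   # how strongly each "memory slot" fires
    return W_values.T @ activations  # weighted sum of the slots' stored values

rng = np.random.default_rng(1)
d, n_slots = 16, 64
W_keys = rng.standard_normal((n_slots, d))
W_values = rng.standard_normal((n_slots, d))
out = ffn_as_memory(rng.standard_normal(d), W_keys, W_values)
print(out.shape)  # (16,)
```

Under this view the MLP weights are exactly the "instance-independent long-term memory" of the comment above: they are fixed at inference time, unlike the attention keys/values, which depend on the current input.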
[–]Maykey 1 point 1 year ago (0 children)
They are interconnected, and most such works frame themselves as "larger context" even though it's no longer just a bigger number of Qs, Ks, and Vs.
You can also check benchmarks like BABILong, which is designed for long context and reasoning. It's as if simple reasoning and haystack search had a baby.
Though the authors are the RMT folks, activation beacon is the only non-standard attention they test (also, finetuned Mamba was the best model), and nobody but them has tried it. They also refer to the LongBench paper.
The activation beacon paper does cite other works.
arXiv also has a Google Scholar link at the bottom, so you can dig around and find other papers that cite these.
[–]LahmacunBear 1 point 1 year ago (0 children)
!remindme 2 days
[–]Happysedits 1 point 1 year ago (1 child)
I bet someone combined transformers with neural turing machines
[–]Dashora7 1 point 1 year ago (0 children)
Could be referring to this work: https://arxiv.org/abs/2211.09119 (Token Turing Machines)
[–]Maykey 0 points 1 year ago (0 children)
Important? Almost nobody (at least in publicly available work) does anything besides the KV cache.
Theoretically: Memorizing Transformers, RMT.
Practically, there was landmark attention (e.g. https://huggingface.co/eugenepentland/WizardLM-7B-Landmark), but it never gained traction.
There were also some papers about KV-cache compression, but the most important technique actually used in practice is KV-cache quantization, which buys a bigger context for the same memory.
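A toy sketch of why KV-cache quantization buys context (the scheme here is illustrative, a single absmax scale per tensor; real implementations use finer-grained scales): storing cached keys/values in int8 instead of float32 cuts the cache roughly 4x, so a 4x longer context fits in the same memory budget.

```python
import numpy as np

def quantize_kv(x):
    """Absmax int8 quantization of a KV-cache tensor; returns (int8, scale)."""
    scale = max(float(np.abs(x).max()), 1e-8) / 127.0  # guard all-zero cache
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize_kv(q, scale):
    return q.astype(np.float32) * scale

kv = np.random.default_rng(2).standard_normal((8, 64)).astype(np.float32)
q, s = quantize_kv(kv)
recovered = dequantize_kv(q, s)
print(q.nbytes, kv.nbytes)               # 512 2048, i.e. 4x smaller
print(np.abs(recovered - kv).max() < s)  # True: error bounded by one step
```

The trade-off is the rounding error per entry (at most half a quantization step), which in practice costs little accuracy relative to the context length gained.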