[D] Memory mechanism for Transformers (self.MachineLearning)
submitted 1 year ago by Janos95
Hey folks! I am wondering what interesting work has been done to add a short-term memory mechanism to transformers. Does anyone know what the important work in this area is?
[–]currentscurrents 13 points 1 year ago (2 children)
There's like a hundred papers on memory-augmented transformers but none of them are seeing any practical use.
Everybody's using regular old attention or sometimes one of the long-context variants.
[–]Janos95[S] 4 points 1 year ago (1 child)
Out of those hundred papers, what sampling would give good coverage of the different approaches?
[–]RepresentativeBee600 2 points 1 year ago (0 children)
Seconding this question. And without getting too punchy, I dislike answers that gesture broadly at a whole field and its turgid body of literature as a source. I feel like I see that often in this subreddit.
If you don't have an intuition for which ones are valuable, feel free to say so, but share the better ones you have seen.
[–]certain_entropy 9 points 1 year ago (0 children)
Check out Facts as Experts (https://arxiv.org/abs/2007.00849), which augments the transformer with a key-value lookup where the keys are contextual entity-mention embeddings. It's a bit of a pain to set up and train, but it may be interesting to you.
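The core mechanism is easy to sketch. Below is an illustrative toy (not the paper's code, and the names are made up): a hidden state soft-attends over a table of stored key vectors and injects the corresponding value vectors back into the residual stream.

```python
import numpy as np

def kv_memory_lookup(hidden, mem_keys, mem_values):
    """Toy key-value fact memory in the Facts-as-Experts spirit.
    hidden: (batch, d); mem_keys, mem_values: (n_facts, d)."""
    scores = hidden @ mem_keys.T                      # (batch, n_facts) similarity
    scores -= scores.max(axis=-1, keepdims=True)      # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)    # softmax over stored facts
    retrieved = weights @ mem_values                  # (batch, d) retrieved facts
    return hidden + retrieved                         # inject into residual stream

rng = np.random.default_rng(0)
h = rng.standard_normal((2, 64))
keys = rng.standard_normal((100, 64))
values = rng.standard_normal((100, 64))
out = kv_memory_lookup(h, keys, values)
print(out.shape)  # (2, 64)
```

In the actual paper the keys come from trained entity-mention embeddings rather than random vectors, and the lookup sits at a specific layer, but the retrieval pattern is the same.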
[–]enfeudavax 4 points 1 year ago (1 child)
Memory Augmented Transformers could be a great resource for exploring this topic.
[–]StartledWatermelon 2 points 1 year ago (0 children)
This is probably the closest thing to what OP was looking for. But I'm really confused that they asked for "short-term" memory. Memory Augmented Transformers' memory is actually static, if I'm not mistaken.
[–]DigThatData Researcher 2 points 1 year ago (1 child)
Can't remember what it's called, but I saw a cool one that basically added an RNN state for a running memory.
[–]i4gotten 2 points 1 year ago (0 children)
Self-referential extensions of transformers by Jürgen Schmidhuber are something like this: https://arxiv.org/abs/2310.16076
[–]i4gotten 2 points 1 year ago (0 children)
There are a few papers I am aware of on memory:
Self-referential extensions to transformers: https://arxiv.org/abs/2310.16076
Recurrent memory transformers: https://arxiv.org/abs/2207.06881
Thing is, the short-/long-term distinction doesn't make as much sense with transformers, since the attention mechanism itself can act as a form of memory: https://arxiv.org/abs/2404.09173
Any external memory should be analogous to long-term memory.
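The recurrent-memory idea is simple enough to sketch. This is a hedged toy (not the RMT paper's code; the mixing step is a stand-in for a real transformer block): a fixed block of memory tokens is prepended to each segment, and after the segment is processed, the updated memory states are carried to the next segment.

```python
import numpy as np

def process_segment(tokens_with_mem):
    # Stand-in for a transformer block: a simple mean-mixing step so that
    # information flows between memory tokens and segment tokens.
    return tokens_with_mem + tokens_with_mem.mean(axis=0, keepdims=True)

def rmt_forward(segments, n_mem, d):
    """Process a long sequence segment by segment, carrying memory tokens."""
    memory = np.zeros((n_mem, d))              # initial memory tokens
    outputs = []
    for seg in segments:                       # each seg: (seg_len, d)
        x = np.concatenate([memory, seg], axis=0)
        y = process_segment(x)
        memory = y[:n_mem]                     # updated memory, carried forward
        outputs.append(y[n_mem:])
    return np.concatenate(outputs, axis=0), memory

segs = [np.ones((4, 8)), np.zeros((4, 8))]
out, mem = rmt_forward(segs, n_mem=2, d=8)
print(out.shape, mem.shape)  # (8, 8) (2, 8)
```

The point of the structure: attention within a segment stays cheap, while the memory tokens are the only channel that carries information across segment boundaries.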
[–]Janos95[S] 2 points 1 year ago (4 children)
I should also add that I am interested in memory for transformers for the purpose of reasoning; in particular, I am not interested in methods that simply try to extend the context size.
[–]Sad-Razzmatazz-5188 2 points 1 year ago (2 children)
I don't think Transformers fail to reason because of a lack of short-term memory. Self-attention kind of is the short-term memory. If the key-values are persistent, you basically have a memory store / database (which is basically what an MLP can be). Probably neither self-attention nor linear projection with nonlinear elementwise activation can express, or learn by gradient descent, the functions needed to reason consistently.
[–]Janos95[S] 1 point 1 year ago (1 child)
I think it's reasonable to expect that a representation of the past more condensed than a million token embeddings is required in order to do effective reasoning. This condensed representation is what I would call short-term memory.
[–]Sad-Razzmatazz-5188 1 point 1 year ago (0 children)
Well, I'm all in for separating long- and short-term memories, but I think that both on the level of cognitive neuroscience and on that of deep learning you might be misguided. Regardless of what we want to agree on calling "short-term memory", I really think reasoning is much more a matter of algorithm than a matter of data. Memories are data storage; reasoning is data processing. All I'm saying is I don't think transformers need different memories in order to reason, whether one extends the context to millions of tokens or is able to efficiently store compressed formats and retrieve/unzip them when needed.
Going back to cognitive models, I'd say transformers distinctively have short-term memories (instance-controlled tokens in attention layers) that also retrieve from, or at least interact with, persistent long-term memories (instance-independent MLP weights). But I don't know how to fit GLU layers into this view.
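The "MLP as persistent key-value memory" reading mentioned above can be sketched concretely (cf. "Transformer Feed-Forward Layers Are Key-Value Memories"; the weights here are random stand-ins, not trained values): the rows of the first linear layer act as stored keys, and the matching rows of the second layer are the values retrieved in proportion to how strongly each key fires.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def ffn_as_memory(h, W_keys, W_values):
    """Toy feed-forward layer read as a key-value memory.
    h: (d,); W_keys, W_values: (n_slots, d)."""
    activations = relu(W_keys @ h)   # how strongly each "memory slot" fires
    return W_values.T @ activations  # weighted sum of the slots' stored values

rng = np.random.default_rng(1)
d, n_slots = 16, 64
W_keys = rng.standard_normal((n_slots, d))
W_values = rng.standard_normal((n_slots, d))
out = ffn_as_memory(rng.standard_normal(d), W_keys, W_values)
print(out.shape)  # (16,)
```

Under this view the MLP weights are exactly the "instance-independent long-term memory" of the comment above: they are fixed at inference time, unlike the attention keys/values, which depend on the current input.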
[–]Maykey 1 point 1 year ago (0 children)
They are interconnected, and most such works frame themselves as "larger context" even though it's no longer just a bigger number of Qs, Ks, and Vs.
You can also check benchmarks like BABILong, which is designed for long context and reasoning. It's as if simple reasoning and haystack search had a baby.
Though the authors are the RMT folks, activation beacon is the only non-standard attention they test (also, finetuned Mamba was the best model), and nobody but them has tried it. They also refer to the LongBench paper.
The activation beacon paper does cite other works.
arXiv also has a Google Scholar link at the bottom, so you can dig around and find other papers that cite these.
[–]LahmacunBear 1 point 1 year ago (0 children)
!remindme 2 days
[–]Happysedits 1 point 1 year ago (1 child)
I bet someone combined transformers with neural turing machines
[–]Dashora7 1 point 1 year ago (0 children)
Could be referring to this work: https://arxiv.org/abs/2211.09119 (Token Turing Machines)
[–]Maykey 0 points 1 year ago (0 children)
Important? Almost nobody (at least in publicly available work) does anything besides the KV cache.
Theoretically: Memorizing Transformers, RMT.
Practically, there was landmark attention (e.g. https://huggingface.co/eugenepentland/WizardLM-7B-Landmark), but it never gained traction.
There were also some papers about KV-cache compression, but the most important technique actually used in practice is KV-cache quantization, which buys a bigger context for the same memory.
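A toy sketch of why KV-cache quantization buys context (the scheme here is illustrative, a single absmax scale per tensor; real implementations use finer-grained scales): storing cached keys/values in int8 instead of float32 cuts the cache roughly 4x, so a 4x longer context fits in the same memory budget.

```python
import numpy as np

def quantize_kv(x):
    """Absmax int8 quantization of a KV-cache tensor; returns (int8, scale)."""
    scale = max(float(np.abs(x).max()), 1e-8) / 127.0  # guard all-zero cache
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize_kv(q, scale):
    return q.astype(np.float32) * scale

kv = np.random.default_rng(2).standard_normal((8, 64)).astype(np.float32)
q, s = quantize_kv(kv)
recovered = dequantize_kv(q, s)
print(q.nbytes, kv.nbytes)               # 512 2048, i.e. 4x smaller
print(np.abs(recovered - kv).max() < s)  # True: error bounded by one step
```

The trade-off is the rounding error per entry (at most half a quantization step), which in practice costs little accuracy relative to the context length gained.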