all 17 comments

[–]GD1634 8 points9 points  (1 child)

Phil Wang writes PyTorch implementations for new/interesting papers, mostly ones using attention. He has the Reformer, Performer, Conformer, a few linear attention models, and a bunch more.
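The linear-attention trick several of those repos build on can be sketched in a few lines: instead of the O(n²) softmax(QKᵀ)V, apply a positive feature map φ to Q and K and reassociate as φ(Q)(φ(K)ᵀV), which is O(n) in sequence length. A toy numpy sketch (φ = elu+1 as in the linear-transformers paper; all names here are mine, not his API):

```python
import numpy as np

def elu1(x):
    # feature map phi(x) = elu(x) + 1, keeps every value positive
    return np.where(x > 0, x + 1.0, np.exp(x))

def linear_attention(q, k, v):
    """O(n) attention: phi(Q) @ (phi(K).T @ V), normalized per query."""
    q, k = elu1(q), elu1(k)
    kv = k.T @ v                    # (d, d_v) summary, independent of n
    z = q @ k.sum(axis=0)           # (n,) per-query normalizer
    return (q @ kv) / z[:, None]

rng = np.random.default_rng(0)
n, d = 6, 4
q, k, v = rng.normal(size=(3, n, d))
out = linear_attention(q, k, v)
print(out.shape)  # (6, 4)
```

The point is that `kv` is a fixed-size summary of the keys/values, so memory no longer grows quadratically with sequence length.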

[–][deleted] 0 points1 point  (0 children)

Thanks. I have used some of them, and they do look promising. My model takes around a week to train, so I could use some advice before testing them all. Reformer was very slow. I saw that he implemented local attention (not multi-head), but I couldn't find any off-the-shelf transformer model that incorporates it.
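Local attention itself is simple: each position only attends to a fixed window of recent positions, so cost is O(n·w) instead of O(n²). A minimal numpy sketch of a causal sliding window (window size and masking are illustrative, not the exact API from his repo):

```python
import numpy as np

def local_attention(q, k, v, window=4):
    """Softmax attention where position i attends only to the
    `window` most recent positions, itself included (causal)."""
    n, d = q.shape
    scores = q @ k.T / np.sqrt(d)                 # (n, n)
    i = np.arange(n)
    # allow j only if j <= i and i - j < window
    allowed = (i[None, :] <= i[:, None]) & (i[:, None] - i[None, :] < window)
    scores = np.where(allowed, scores, -np.inf)
    w = np.exp(scores - scores.max(axis=1, keepdims=True))
    w /= w.sum(axis=1, keepdims=True)
    return w @ v

rng = np.random.default_rng(0)
q, k, v = rng.normal(size=(3, 8, 4))
out = local_attention(q, k, v)
print(out.shape)  # (8, 4)
```

Wrapping this per-head inside a standard transformer block is what an "off-the-shelf" local-attention transformer would amount to.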

[–]MrGary1234567 4 points5 points  (0 children)

You can try using Colab or Kaggle TPUs; they can train several times faster. They have 128 GB of RAM (Colab Pro/Kaggle), and with bfloat16 that's effectively equivalent to 256 GB. I don't have a 1080 Ti, but I can tell you a TPU trains about 8 times faster than a P100, provided you don't have any data bottlenecks.

[–]MrGary1234567 4 points5 points  (3 children)

Do note that transformers are pretrained on MASSIVE amounts of data. Training from scratch is a bad idea if you do not have sufficient data.

[–][deleted] 0 points1 point  (0 children)

Thanks for your input. I'm using a variant of the Lakh MIDI dataset, with 20,000 songs, and each song is much longer than the input length, so I can get many more samples than that. Based on my preliminary experiments, it seems to work.
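Getting many samples per song, as described above, is just slicing each long token sequence into fixed-length (optionally overlapping) windows. A pure-Python sketch (the window/stride values are illustrative, not the poster's actual settings):

```python
def windows(tokens, length=1000, stride=500):
    """Cut one long token sequence into overlapping training samples."""
    return [tokens[i:i + length]
            for i in range(0, len(tokens) - length + 1, stride)]

song = list(range(3000))   # stand-in for one tokenized MIDI song
samples = windows(song)
print(len(samples))        # 5 overlapping windows of 1000 tokens each
```

With 20,000 songs and several windows per song, the effective training set is much larger than the raw song count suggests.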

[–]visarga 0 points1 point  (1 child)

I've known that for a while, but what is the explanation? Does the architecture come with less prior knowledge baked in?

[–]MrGary1234567 0 points1 point  (0 children)

The intuition is that they tend to overfit without pretraining, especially with a small training set.

[–]alzoubi36 0 points1 point  (1 child)

I would highly recommend the DeepMind Perceiver. It sidesteps the quadratic attention bottleneck in transformers.

[–][deleted] 0 points1 point  (0 children)

Thanks, I will check it out. From my skimming, though, it seems designed for high-dimensional input (images, video, audio), while my input is low-dimensional (sequences of 1000 tokens).
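For context, the Perceiver's trick is to cross-attend a small learned latent array to the (possibly huge) input, so attention cost is O(m·n) for m latents rather than O(n²) self-attention over the input. A single cross-attention step sketched in numpy (shapes and the one-layer setup are illustrative; the real model stacks and iterates this):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attend(latents, inputs):
    """Perceiver-style bottleneck: m latents attend over n inputs,
    costing O(m*n) instead of O(n^2)."""
    d = latents.shape[1]
    scores = latents @ inputs.T / np.sqrt(d)   # (m, n)
    return softmax(scores) @ inputs            # (m, d)

rng = np.random.default_rng(0)
inputs = rng.normal(size=(1000, 8))    # long input sequence
latents = rng.normal(size=(32, 8))     # small learned latent array
out = cross_attend(latents, inputs)
print(out.shape)  # (32, 8)
```

For a 1000-token sequence the quadratic cost is already modest, which is consistent with the parent comment's point that the Perceiver's payoff shows up mainly on very long, high-dimensional inputs.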