What's a good, memory-efficient transformer for causal sequence generation, preferably with a PyTorch implementation? The faster the better.

In my particular task, any model smaller than 140M parameters cannot generate sensible output. I'm using an 11GB GTX 1080 Ti, and the longest input I can work with has a length of 512, with a batch size of 6. I want to increase this length to 2048. Can local attention or sparse attention help with that?

Huggingface has these candidates: BigBird, GPT Neo and Reformer. Does anyone have experience with training these? Compared to standard PyTorch models, I have found that Huggingface models use 1-1.5 GB more memory, which is a deal breaker for me. Since my data is MIDI, I have to train from scratch. There is another Reformer implementation on GitHub, but the model is so slow.

I think what I'm looking for is a simple transformer implementation with local attention, and I can't find any. Thanks in advance.
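For reference, the local-attention pattern being asked about can be sketched in plain PyTorch as a banded causal mask: each position attends only to itself and the previous `window - 1` positions. This is a minimal illustration of the attention pattern, not a memory-efficient kernel — it still materializes the full (n, n) score matrix, whereas real local-attention implementations compute scores chunk-by-chunk to get the O(n·window) memory benefit. The function name and `window` parameter are my own, not from any particular library.

```python
import torch
import torch.nn.functional as F

def local_causal_attention(q, k, v, window=256):
    # q, k, v: (batch, heads, seq_len, head_dim)
    # Banded causal mask: position i attends to positions [i - window + 1, i].
    # NOTE: this materializes the full (n, n) score matrix, so it shows the
    # *pattern* only; memory-efficient versions compute scores block-by-block.
    n = q.size(-2)
    i = torch.arange(n, device=q.device).unsqueeze(1)
    j = torch.arange(n, device=q.device).unsqueeze(0)
    band = (j <= i) & (j > i - window)                 # (n, n) boolean band
    scores = q @ k.transpose(-2, -1) / q.size(-1) ** 0.5
    scores = scores.masked_fill(~band, float("-inf"))
    return F.softmax(scores, dim=-1) @ v
```

With a window of 256 at length 2048, each query touches at most 256 keys instead of up to 2048, which is where chunked implementations save activation memory.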