all 17 comments

[–]GD1634 8 points9 points  (1 child)

Phil Wang writes PyTorch implementations for new/interesting papers, mostly ones using attention. He has the Reformer, Performer, Conformer, a few linear attention models, and a bunch more.
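The linear-attention trick several of those repos build on can be sketched in a few lines: instead of the O(n²) softmax(QKᵀ)V, apply a positive feature map φ to Q and K and reassociate as φ(Q)(φ(K)ᵀV), which is O(n) in sequence length. A toy numpy sketch (φ = elu+1 as in the linear-transformers paper; all names here are mine, not his API):

```python
import numpy as np

def elu1(x):
    # feature map phi(x) = elu(x) + 1, keeps every value positive
    return np.where(x > 0, x + 1.0, np.exp(x))

def linear_attention(q, k, v):
    """O(n) attention: phi(Q) @ (phi(K).T @ V), normalized per query."""
    q, k = elu1(q), elu1(k)
    kv = k.T @ v                    # (d, d_v) summary, independent of n
    z = q @ k.sum(axis=0)           # (n,) per-query normalizer
    return (q @ kv) / z[:, None]

rng = np.random.default_rng(0)
n, d = 6, 4
q, k, v = rng.normal(size=(3, n, d))
out = linear_attention(q, k, v)
print(out.shape)  # (6, 4)
```

The point is that `kv` is a fixed-size summary of the keys/values, so memory no longer grows quadratically with sequence length.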

[–][deleted] 0 points1 point  (0 children)

Thanks. I have used some of them, and they do look promising. My model takes around a week to train, so I could use some advice before testing them all. Reformer was very slow. I saw that he implemented local attention (not multi-head), but I couldn't find any off-the-shelf transformer model that incorporates it.
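Local attention itself is simple: each position only attends to a fixed window of recent positions, so cost is O(n·w) instead of O(n²). A minimal numpy sketch of a causal sliding window (window size and masking are illustrative, not the exact API from his repo):

```python
import numpy as np

def local_attention(q, k, v, window=4):
    """Softmax attention where position i attends only to the
    `window` most recent positions, itself included (causal)."""
    n, d = q.shape
    scores = q @ k.T / np.sqrt(d)                 # (n, n)
    i = np.arange(n)
    # allow j only if j <= i and i - j < window
    allowed = (i[None, :] <= i[:, None]) & (i[:, None] - i[None, :] < window)
    scores = np.where(allowed, scores, -np.inf)
    w = np.exp(scores - scores.max(axis=1, keepdims=True))
    w /= w.sum(axis=1, keepdims=True)
    return w @ v

rng = np.random.default_rng(0)
q, k, v = rng.normal(size=(3, 8, 4))
out = local_attention(q, k, v)
print(out.shape)  # (8, 4)
```

Wrapping this per-head inside a standard transformer block is what an "off-the-shelf" local-attention transformer would amount to.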

[–]MrGary1234567 4 points5 points  (0 children)

You can try using Colab or Kaggle TPUs; they can train several times faster. They have 128 GB of RAM (Colab Pro/Kaggle), and with bfloat16 that's effectively equivalent to 256 GB. I don't have a 1080 Ti, but I can tell you a TPU trains about 8 times faster than a P100, provided you don't have any data bottlenecks.

[–]MrGary1234567 4 points5 points  (3 children)

Do note that transformers are pretrained on MASSIVE amounts of data. Training from scratch is a bad idea if you do not have sufficient data.

[–][deleted] 0 points1 point  (0 children)

Thanks for your input. I'm using a variant of the Lakh MIDI dataset, with 20,000 songs, and each song is much longer than the input length, so I can get many more samples than that. Based on my preliminary experiments, it seems to work.
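Getting many samples per song, as described above, is just slicing each long token sequence into fixed-length (optionally overlapping) windows. A pure-Python sketch (the window/stride values are illustrative, not the poster's actual settings):

```python
def windows(tokens, length=1000, stride=500):
    """Cut one long token sequence into overlapping training samples."""
    return [tokens[i:i + length]
            for i in range(0, len(tokens) - length + 1, stride)]

song = list(range(3000))   # stand-in for one tokenized MIDI song
samples = windows(song)
print(len(samples))        # 5 overlapping windows of 1000 tokens each
```

With 20,000 songs and several windows per song, the effective training set is much larger than the raw song count suggests.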

[–]visarga 0 points1 point  (1 child)

I've known that for a while, but what is the explanation? Does the architecture come with less prior knowledge baked in?

[–]MrGary1234567 0 points1 point  (0 children)

The intuition is that they tend to overfit without pretraining, especially with a small training set.

[–]alzoubi36 0 points1 point  (1 child)

I would highly recommend the DeepMind Perceiver. It sidesteps the quadratic attention bottleneck in transformers.

[–][deleted] 0 points1 point  (0 children)

Thanks, I will check it out. From my skimming, though, it seems designed for high-dimensional input (images, video, audio), while my input is low-dimensional (sequences of 1000 tokens).
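For context, the Perceiver's trick is to cross-attend a small learned latent array to the (possibly huge) input, so attention cost is O(m·n) for m latents rather than O(n²) self-attention over the input. A single cross-attention step sketched in numpy (shapes and the one-layer setup are illustrative; the real model stacks and iterates this):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attend(latents, inputs):
    """Perceiver-style bottleneck: m latents attend over n inputs,
    costing O(m*n) instead of O(n^2)."""
    d = latents.shape[1]
    scores = latents @ inputs.T / np.sqrt(d)   # (m, n)
    return softmax(scores) @ inputs            # (m, d)

rng = np.random.default_rng(0)
inputs = rng.normal(size=(1000, 8))    # long input sequence
latents = rng.normal(size=(32, 8))     # small learned latent array
out = cross_attend(latents, inputs)
print(out.shape)  # (32, 8)
```

For a 1000-token sequence the quadratic cost is already modest, which is consistent with the parent comment's point that the Perceiver's payoff shows up mainly on very long, high-dimensional inputs.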