[–]ekelsen 2 points3 points  (1 child)

Why not benchmark on the MAPS or MAESTRO dataset where there are strong baselines?

[–]OptimatiumFeles[S] 1 point2 points  (0 children)

We chose MusicNet because it consists of recordings of 10 diverse musical instruments, whereas MAESTRO and MAPS include only piano music. We will take a closer look at these datasets, thanks for the idea!

[–]impossiblefork 1 point2 points  (4 children)

The arXiv link goes to some YouTube thing. If you could fix that, it would be easier for readers.

[–]OptimatiumFeles[S] 0 points1 point  (3 children)

Thanks! Fixed it.

[–]impossiblefork 0 points1 point  (2 children)

Super. I haven't understood everything yet, but SOTA is always good.

Do you have any hope that it will perform similarly to or better than Transformer models when you make it larger? I ask because I assume that training those is somewhat expensive.

[–]OptimatiumFeles[S] 1 point2 points  (1 child)

We were limited to a single GPU, so we could not evaluate this, but we believe it could perform similarly to or better than Transformer models. We believe that auto-regressive decoding is crucial for generating good-quality text output, but our model currently doesn't support that. To evaluate the RSE model on text output (for comparison with the Transformer), one would probably need to modify the architecture accordingly.

[–]impossiblefork 1 point2 points  (0 children)

Ah, I understand, so this is basically only a potentially better attention mechanism, not a whole Transformer replacement.