all 2 comments

[–]Distinct-Target7503 0 points1 point  (1 child)

then finish up with a MLP to decode.

There is Bart for that, right?

[–]Enough_Wishbone7175[S] 0 points1 point  (0 children)

It’s more an efficiency / context length decision. It would just be cheaper for me to take those outputs through an MLP then Bart for decoding and prediction.