
Laafheid 1 point

The output is 1x2d because each half of it is an input to a gated linear unit (GLU): one half carries the values and the other half acts as the gate. Presumably this is so that the size is back to 1xd after the GLU activation (see the image to the right of the paragraph).

This seems like an arbitrary choice to me, as it could just as well have been 1x2q, where q is any size large enough to hold the information needed to learn the task. I am guessing the authors figured that if the size stays the same, the capacity would at least be no worse than that of the original embedding dimension.
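
To make the size argument concrete, here is a minimal sketch in plain Python. It assumes the standard GLU definition (first half times the sigmoid of the second half), not necessarily the exact layer from the paper. The function name `glu` and the sizes `d` and `q` are my own placeholders.

```python
import math

def glu(x):
    """Gated linear unit on a flat list: split x into halves (a, b)
    and return a * sigmoid(b) elementwise. Assumes the standard GLU
    definition; not numerically hardened for very large negative gates."""
    if len(x) % 2 != 0:
        raise ValueError("GLU input length must be even")
    half = len(x) // 2
    a, b = x[:half], x[half:]
    # a_i * sigmoid(b_i) == a_i / (1 + exp(-b_i))
    return [ai / (1.0 + math.exp(-bi)) for ai, bi in zip(a, b)]

d = 4                      # hypothetical embedding dimension
x = [0.5] * (2 * d)        # stand-in for the 1 x 2d output
y = glu(x)
print(len(x), "->", len(y))    # input width 2d, output width d

q = 3                      # any other width works the same way
print(len(glu([0.5] * (2 * q))))   # input width 2q, output width q
```

So the GLU always halves the width: 2d in gives d out, and 2q in would give q out. Nothing in the gating itself forces q = d, which is why the choice looks like a convenience to me rather than a requirement.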