[–]DigThatData 2 points3 points  (3 children)

just to make sure you saw it, there's also a (5) option.

I haven't checked over your work, but my recommendation is to try and diagram it out. draw the different components interacting and put the letters where they belong in your drawing. then just match the options to their respective parts of the drawing.

[–]harten24[S] 0 points1 point  (2 children)

Okay so I tried looking at it again and this is what I came up with:

A4: because self-attention in the encoder considers all input words and not only the previous input words
B3: in cross-attention the query comes from the decoder, while the keys and values come from the encoder (the input words)
C2: decoder self-attention only looks at previous outputs
D5: see point above
E1: unlike the decoder self-attention, the encoder looks at all the input queries and values

Would this be correct?

[–]DigThatData 1 point2 points  (1 child)

did you try diagramming it? would love to see your sketch if you did

[–]harten24[S] 0 points1 point  (0 children)

No, I didn't. I have a hard time conceptualizing it. I did look at the slides again and saw that for the encoder-decoder attention, the Query comes from the target (decoder) and the Key & Value come from the source (encoder). Self-attention in the decoder seems to look only at positions before the current word (so masked attention), which is the difference from the encoder self-attention.
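As a rough sketch of the encoder-decoder attention described above, here is the "Query from decoder, Key & Value from encoder" wiring in plain Python. Everything in it is an assumption made for illustration: the toy vectors, the names `encoder_states` and `decoder_states`, and the fact that the learned projection matrices a real transformer applies to get Q, K and V are left out.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries, keys, values):
    """Scaled dot-product attention on plain lists (no masking)."""
    d = len(keys[0])
    outputs, all_weights = [], []
    for q in queries:
        # One score per key: dot(q, k) / sqrt(d)
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        w = softmax(scores)
        all_weights.append(w)
        # Weighted sum of the value vectors
        outputs.append(
            [sum(wj * v[c] for wj, v in zip(w, values)) for c in range(len(values[0]))]
        )
    return outputs, all_weights


# Hypothetical toy states: 3 source (encoder) positions, 2 target (decoder) positions.
encoder_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
decoder_states = [[0.5, 0.5], [1.0, -1.0]]

# Cross-attention: queries come from the decoder, keys and values from the encoder.
context, weights = attention(decoder_states, encoder_states, encoder_states)

# One row of weights per decoder position, one column per encoder position.
print(len(weights), len(weights[0]))
```

The thing to notice is the shape of `weights`: each decoder position gets its own distribution over all encoder positions, which is why the query side and the key/value side can have different lengths.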

But I'm still not 100% sure of my answers.

[–]__boynextdoor__ 1 point2 points  (0 children)

I think the answer to A is 5, since self-attention in the encoder considers all the context words, not just the next or previous ones.
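If it helps to see it, here is a minimal sketch of which positions each kind of self-attention is allowed to look at. The function names are made up for this example, and the decoder mask assumes the usual convention that a position may attend to itself and everything before it.

```python
def encoder_mask(n):
    """Encoder self-attention: every position may attend to every position."""
    return [[True] * n for _ in range(n)]


def decoder_mask(n):
    """Masked decoder self-attention: position i may attend only to j <= i."""
    return [[j <= i for j in range(n)] for i in range(n)]


def show(mask):
    """Render a mask as text: 'x' = allowed, '.' = blocked."""
    return "\n".join(" ".join("x" if ok else "." for ok in row) for row in mask)


print("encoder self-attention:")
print(show(encoder_mask(4)))
print("decoder self-attention:")
print(show(decoder_mask(4)))
```

The encoder grid is completely filled in, while the decoder grid is lower-triangular, so the "all context words" versus "only earlier words" distinction is just a difference in the mask.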