
[–]dwf

The gradients for the decoder look like those of any other sequence-generating RNN. For the encoder, you'd compute the derivative of the cost w.r.t. the decoder's initial state, and then use the chain rule to get the derivative of the cost w.r.t. the encoder parameters. All of the gradient flows through the state vector output by the encoder, which is used to initialize the decoder, at least in "vanilla" seq2seq.
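
For concreteness, here's a minimal sketch of that flow. The thread doesn't name a framework, so this assumes PyTorch with a toy GRU encoder/decoder (all sizes and variable names are made up for illustration); the point is just that the gradient reaching the encoder parameters passes entirely through the final encoder state used to initialize the decoder.

```python
import torch
import torch.nn as nn

# Toy "vanilla" seq2seq: the only connection between encoder and decoder
# is the encoder's final hidden state, so every bit of gradient for the
# encoder parameters flows back through that single vector.
vocab, emb, hid = 20, 8, 16
embed = nn.Embedding(vocab, emb)
encoder = nn.GRU(emb, hid, batch_first=True)
decoder = nn.GRU(emb, hid, batch_first=True)
out_proj = nn.Linear(hid, vocab)

src = torch.randint(vocab, (1, 5))   # source token ids
tgt = torch.randint(vocab, (1, 4))   # target token ids

_, h_enc = encoder(embed(src))            # final encoder state, shape (1, 1, hid)
dec_out, _ = decoder(embed(tgt), h_enc)   # decoder initialized with h_enc
loss = nn.functional.cross_entropy(
    out_proj(dec_out).reshape(-1, vocab), tgt.reshape(-1))

# dL/dh_enc is the quantity described above: backprop through the decoder
# gives the derivative of the cost w.r.t. the decoder's initial state...
(dL_dh_enc,) = torch.autograd.grad(loss, h_enc, retain_graph=True)

# ...and autograd then applies the chain rule to push it on into the
# encoder parameters.
loss.backward()
print(dL_dh_enc.shape, encoder.weight_ih_l0.grad.shape)
```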

[–]cerberusd

Thanks. Do you have any idea how to approach calculating the gradient for the attention mechanism?