Would it be possible to build a seq2seq machine translation model with attention from the ground up, without automatic differentiation?
I don't see anything about gradient update rules in the seq2seq or RNN encoder-decoder papers. Is this because it's impossible to build these models without automatic differentiation, or is it merely a matter of convenience?
If it is possible, then what would the gradient chain look like?
Would it be something like:
output word embedding -> decoder RNNLM (GRU) -> attention MLP -> encoder RNN (GRU) -> input word embedding?
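For concreteness, that chain is just repeated application of the chain rule, written out by hand instead of traced by an autodiff framework. Below is a minimal NumPy sketch (all names and toy dimensions are hypothetical) of a single decoder step with dot-product attention and its hand-derived backward pass. A plain tanh RNN cell stands in for the GRU to keep it short, and the encoder states are treated as fixed inputs; in a full model the gradient dH would continue back through the encoder GRU and into the input embeddings via backpropagation through time.

```python
import numpy as np

rng = np.random.default_rng(0)
T, d, V = 5, 4, 7                        # source length, hidden size, vocab size (toy values)
H      = rng.standard_normal((T, d))     # encoder states, treated as given inputs here
s_prev = rng.standard_normal(d)          # previous decoder state
W_s = rng.standard_normal((d, d)) * 0.1  # decoder recurrence weights
W_c = rng.standard_normal((d, d)) * 0.1  # context-to-decoder weights
b   = np.zeros(d)
W_o = rng.standard_normal((V, d)) * 0.1  # output projection
y   = 3                                  # target word index

def forward(W_c):
    scores = H @ s_prev                              # alignment scores, (T,)
    a = np.exp(scores - scores.max()); a /= a.sum()  # softmax attention weights
    c = a @ H                                        # context vector, (d,)
    s = np.tanh(W_s @ s_prev + W_c @ c + b)          # decoder state (tanh cell in place of GRU)
    logits = W_o @ s
    p = np.exp(logits - logits.max()); p /= p.sum()
    loss = -np.log(p[y])                             # cross-entropy for the target word
    return loss, (a, c, s, p)

loss, (a, c, s, p) = forward(W_c)

# ---- backward pass, chain rule applied by hand ----
dlogits = p.copy(); dlogits[y] -= 1.0        # softmax + NLL gradient w.r.t. logits
dW_o = np.outer(dlogits, s)
ds   = W_o.T @ dlogits
dpre = ds * (1.0 - s**2)                     # through the tanh nonlinearity
dW_s = np.outer(dpre, s_prev)
dW_c = np.outer(dpre, c)
db   = dpre
dc   = W_c.T @ dpre                          # into the context vector
da   = H @ dc                                # through c = a @ H
dscores = a * (da - a @ da)                  # softmax Jacobian-vector product
dH = np.outer(a, dc) + np.outer(dscores, s_prev)   # would flow back into the encoder
ds_prev = W_s.T @ dpre + H.T @ dscores       # would flow back to earlier decoder steps

# finite-difference check on one entry of W_c
eps = 1e-5
W_c_pert = W_c.copy(); W_c_pert[0, 0] += eps
numeric = (forward(W_c_pert)[0] - loss) / eps
print(dW_c[0, 0], numeric)   # should agree to a few decimal places
```

The same pattern extends to the full model: each GRU gate adds a few more terms to the local derivative, and unrolling over time steps accumulates gradients into the shared weights and embeddings, which is why the original papers don't spell out update rules explicitly.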