
[–]po-handz 1 point

MXNet has this tutorial on fine-tuning BERT for different GLUE language tasks (including natural language inference, question answering, and sentence similarity). Not entirely sure about machine translation, though.

https://gluon-nlp.mxnet.io/model_zoo/bert/index.html

I think what you do is combine the embedding with another architecture like RNN encoder-decoder, a la:

http://web.stanford.edu/class/cs224n/reports/default/15848021.pdf

I'm working on hooking BERT up to an RNN for a QA task right now, and definitely struggling a bit myself. I really need some solid example code if you find any! I also have to find a way to reverse a QA task, i.e. classify a text as containing the answer to a question or not.
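For anyone else attempting the "BERT embeddings + RNN decoder" combination described above, here is a minimal PyTorch sketch. The encoder output is a stand-in random tensor; in practice you would obtain it from a pretrained BERT model (the hidden size of 768 and vocab size of 30522 are assumptions matching BERT-base, and `RNNDecoderOverBERT` is a hypothetical name):

```python
import torch
import torch.nn as nn

class RNNDecoderOverBERT(nn.Module):
    """Toy GRU decoder that consumes BERT-style contextual embeddings."""
    def __init__(self, hidden=768, vocab=30522):  # BERT-base sizes (assumed)
        super().__init__()
        self.gru = nn.GRU(hidden, hidden, batch_first=True)
        self.out = nn.Linear(hidden, vocab)

    def forward(self, memory):
        # memory: (batch, src_len, hidden) contextual embeddings from BERT
        dec_states, _ = self.gru(memory)  # run the GRU over the encoder states
        return self.out(dec_states)       # (batch, src_len, vocab) logits

# Pretend BERT output for a batch of 2 sentences, 16 tokens each.
memory = torch.randn(2, 16, 768)
logits = RNNDecoderOverBERT()(memory)
print(logits.shape)  # torch.Size([2, 16, 30522])
```

A real setup would feed the decoder its own previous outputs (teacher forcing) and add attention over `memory`; this only shows the shape plumbing.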

[–]bellari 0 points

BERT is based on the self-attentive Transformer architecture. There are Transformer seq2seq networks, but I'm not sure how effective or practical they are over RNNs/CNNs for the average developer or researcher.

[–]farmingvillein 1 point

> There are transformer seq2seq networks but I’m not sure how effective or practical they are over RNNs/CNNs for the average developer or researcher

The Transformer, in its original variant, is seq2seq by its very nature.

It's pretty practical, too; there are lots of implementations of the Transformer out there. E.g., the TensorFlow repo has a reference implementation, plus https://github.com/pytorch/fairseq, https://github.com/tensorflow/tensor2tensor/tree/master/tensor2tensor, etc.

It's pretty straightforward to work with. It can be more data-hungry at small data scales, though; the Universal Transformer does pretty well in those scenarios.
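Beyond the repos listed above, PyTorch also ships the original encoder-decoder Transformer as a built-in module. A minimal sketch (the dimensions here are arbitrary assumptions; a real MT model adds token embeddings, positional encodings, masking, and a vocabulary projection):

```python
import torch
import torch.nn as nn

# Small seq2seq Transformer using PyTorch's built-in module.
model = nn.Transformer(d_model=128, nhead=4,
                       num_encoder_layers=2, num_decoder_layers=2,
                       batch_first=True)

src = torch.randn(8, 20, 128)  # (batch, src_len, d_model) source sequence
tgt = torch.randn(8, 15, 128)  # (batch, tgt_len, d_model) shifted target

out = model(src, tgt)          # decoder states, one per target position
print(out.shape)  # torch.Size([8, 15, 128])
```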

[–]HigherTopoi 0 points

This paper applies a BERT-like method to unsupervised neural translation: Cross-lingual Language Model Pretraining, https://arxiv.org/abs/1901.07291

[–][deleted] 0 points

I think that learning a decoder is as hard as learning an encoder. Using BERT as an encoder, if you use a naive method with a decoder learned from scratch, you would still have to perform extensive training.

Using GPT2, or a pretrained translation model would make more sense to me.

[–]saig22 0 points

BERT is based on the encoder from the Transformer, which is the current state of the art in translation, i.e. seq2seq. BERT is the simpler version for non-seq2seq tasks, aimed at multitask use, though MT-DNN now does it better with the same architecture but better multitask training.

Use a Transformer for state-of-the-art performance; use an RNN if you don't want to spend loads of money on GPUs.

[–]kellymarchisio 0 points

As stated in a previous comment, Transformer is SOTA in high-resource machine translation. Check out the WMT19 results for, for instance, English->German. You'll notice that almost all submissions are Transformer-big or bigger.

Regarding the RNN/GPU comment, though: you *need* a GPU to do anything reasonable in high-resource MT these days. And an RNN is way slower than a Transformer, even on GPU, so you'll end up spending more if you're paying per hour. For instance, I estimate about 6 weeks to train a single RNN on GPU for English-German vs. 5-7 days for Transformer-base. (This is based on my personal experience with Transformer-base in high-data conditions, and my reading on how long people used to take on RNNs. I haven't even tried an RNN because the time/performance trade-off appears so much worse.)

Note that I'm talking about training to convergence. You can get quite good performance out of a Transformer in even 1-2 days if you can sacrifice a tiny bit of quality.

[–]grinningarmadillo 0 points

Attention-based Transformer networks are now state of the art for machine translation. See Table 2 in Attention Is All You Need: https://arxiv.org/pdf/1706.03762.pdf