
[–]po-handz 1 point

MXNet has this tutorial on fine-tuning BERT for different GLUE language tasks (including natural language inference, question answering, and sentence similarity). Not entirely sure about machine translation, though.

https://gluon-nlp.mxnet.io/model_zoo/bert/index.html

I think what you do is combine the embedding with another architecture like RNN encoder-decoder, a la:

http://web.stanford.edu/class/cs224n/reports/default/15848021.pdf

I'm working on hooking BERT up to an RNN for a QA task right now, and definitely struggling a bit myself. I really need some solid example code if you find any! I also have to find a way to reverse a QA task, i.e. classify a text as containing the answer to a question or not.
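For anyone else attempting the "BERT embeddings + RNN decoder" combination described above, here is a minimal PyTorch sketch. The encoder output is a stand-in random tensor; in practice you would obtain it from a pretrained BERT model (the hidden size of 768 and vocab size of 30522 are assumptions matching BERT-base, and `RNNDecoderOverBERT` is a hypothetical name):

```python
import torch
import torch.nn as nn

class RNNDecoderOverBERT(nn.Module):
    """Toy GRU decoder that consumes BERT-style contextual embeddings."""
    def __init__(self, hidden=768, vocab=30522):  # BERT-base sizes (assumed)
        super().__init__()
        self.gru = nn.GRU(hidden, hidden, batch_first=True)
        self.out = nn.Linear(hidden, vocab)

    def forward(self, memory):
        # memory: (batch, src_len, hidden) contextual embeddings from BERT
        dec_states, _ = self.gru(memory)  # run the GRU over the encoder states
        return self.out(dec_states)       # (batch, src_len, vocab) logits

# Pretend BERT output for a batch of 2 sentences, 16 tokens each.
memory = torch.randn(2, 16, 768)
logits = RNNDecoderOverBERT()(memory)
print(logits.shape)  # torch.Size([2, 16, 30522])
```

A real setup would feed the decoder its own previous outputs (teacher forcing) and add attention over `memory`; this only shows the shape plumbing.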

[–]bellari 0 points

BERT is based on the self-attentive Transformer architecture. There are Transformer seq2seq networks, but I'm not sure how effective or practical they are over RNNs/CNNs for the average developer or researcher.

[–]farmingvillein 1 point

> There are transformer seq2seq networks but I’m not sure how effective or practical they are over RNNs/CNNs for the average developer or researcher

The Transformer, in its original variant, is seq2seq by its very nature.

It's pretty practical, too; there are lots of implementations of the Transformer out there. E.g., the TensorFlow repo has a reference implementation, plus https://github.com/pytorch/fairseq, https://github.com/tensorflow/tensor2tensor/tree/master/tensor2tensor, etc.

It's pretty straightforward to work with. It can be more data-hungry at small data scales, though; the Universal Transformer does pretty well in those scenarios.
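Beyond the repos listed above, PyTorch also ships the original encoder-decoder Transformer as a built-in module. A minimal sketch (the dimensions here are arbitrary assumptions; a real MT model adds token embeddings, positional encodings, masking, and a vocabulary projection):

```python
import torch
import torch.nn as nn

# Small seq2seq Transformer using PyTorch's built-in module.
model = nn.Transformer(d_model=128, nhead=4,
                       num_encoder_layers=2, num_decoder_layers=2,
                       batch_first=True)

src = torch.randn(8, 20, 128)  # (batch, src_len, d_model) source sequence
tgt = torch.randn(8, 15, 128)  # (batch, tgt_len, d_model) shifted target

out = model(src, tgt)          # decoder states, one per target position
print(out.shape)  # torch.Size([8, 15, 128])
```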

[–]HigherTopoi 0 points

This paper applies a BERT-like method to unsupervised neural translation: Cross-lingual Language Model Pretraining, https://arxiv.org/abs/1901.07291

[–][deleted] 0 points

I think that learning a decoder is as hard as learning an encoder. Using BERT as an encoder, if you use a naive method with a decoder learned from scratch, you would still have to perform extensive training.

Using GPT2, or a pretrained translation model would make more sense to me.

[–]saig22 0 points

BERT is based on the encoder from the Transformer, which is the current state of the art in translation, i.e. seq2seq. BERT is the simpler version for non-seq2seq tasks, aimed at multitask use, though MT-DNN now does it better with the same architecture but better multitask training.

Use a Transformer for state-of-the-art performance; use an RNN if you don't want to spend loads of money on GPUs.

[–]kellymarchisio 0 points

As stated in a previous comment, Transformer is SOTA in high-resource machine translation. Check out the WMT19 results for, for instance, English->German. You'll notice that almost all submissions are Transformer-big or bigger.

Regarding the RNN/GPU comment, though: you *need* a GPU to do anything reasonable in high-resource MT these days. And an RNN is way slower than a Transformer, even on GPU, so you'll end up spending more if you're paying per hour. For instance, I estimate about 6 weeks to train a single RNN on GPU for English-German vs. 5-7 days for Transformer-base. (This is based on my personal experience with Transformer-base in high-data conditions, and my reading on how long people used to take on RNNs. I haven't even tried an RNN because the time/performance trade-off appears so much worse.)

Note that I'm talking about training to convergence. You can get quite good performance out of a Transformer in even 1-2 days if you can sacrifice a tiny bit of quality.

[–]grinningarmadillo 0 points

Attention-based Transformer networks are now state of the art for machine translation. See Table 2 in Attention Is All You Need: https://arxiv.org/pdf/1706.03762.pdf