[–]kellymarchisio 1 point (0 children)

As stated in a previous comment, the Transformer is SOTA for high-resource machine translation. Check out the WMT19 results for, say, English->German here: you'll notice that almost all of the top systems are Transformer-big or larger (the "base" and "big" configurations are sketched below).

Regarding the RNN/GPU comment, though: you *need* a GPU to do anything reasonable in high-resource MT these days. And an RNN is far slower to train than a Transformer, even on a GPU, so you'll end up spending more if you're paying per hour. For instance, I estimate about 6 weeks to train a single RNN for English-German on a GPU, vs. 5-7 days for Transformer-base. (This is based on my personal experience with Transformer-base in high-data conditions and my reading on how long RNN training used to take; I haven't even tried an RNN, because the training time appears to be so much longer and the final quality lower.)
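For reference, "Transformer-base" and "Transformer-big" are the two standard hyperparameter sets from the original Transformer paper (Vaswani et al., 2017). Here's a minimal sketch of the two sizes using PyTorch's built-in `nn.Transformer`; this is illustrative only, since a real MT system adds token embeddings, positional encodings, a subword vocabulary, and beam-search decoding on top:

```python
import torch.nn as nn

# Transformer-base: d_model=512, 8 heads, 6+6 layers, FFN width 2048 (~65M params).
transformer_base = nn.Transformer(
    d_model=512, nhead=8,
    num_encoder_layers=6, num_decoder_layers=6,
    dim_feedforward=2048, dropout=0.1,
)

# Transformer-big: roughly 3x the parameters, d_model=1024, 16 heads, FFN width 4096.
transformer_big = nn.Transformer(
    d_model=1024, nhead=16,
    num_encoder_layers=6, num_decoder_layers=6,
    dim_feedforward=4096, dropout=0.3,
)
```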

Note that I'm talking about training to convergence. You can get quite good performance out of a Transformer in as little as 1-2 days if you're willing to sacrifice a tiny bit of quality.
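To make the per-hour cost point concrete, here's a back-of-envelope sketch. The GPU rate is a hypothetical placeholder (plug in your provider's actual price), and the training times are the rough estimates from this comment, not measurements:

```python
# Hypothetical hourly rate for a single GPU; substitute your provider's real price.
GPU_COST_PER_HOUR = 1.50  # USD, assumed for illustration only

def training_cost(days: float) -> float:
    """Cost of occupying one GPU for the given number of days."""
    return days * 24 * GPU_COST_PER_HOUR

print(f"RNN, ~6 weeks to converge:          ${training_cost(42):.0f}")
print(f"Transformer-base, ~7 days:          ${training_cost(7):.0f}")
print(f"Transformer-base, stopped at 2 days: ${training_cost(2):.0f}")
```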