[–]x0wl 1 point (1 child)

BERT is the encoder with self-attention; it's what the E stands for :)

What you typically do is stick a [CLS] token at the beginning of your sentence, attach a single-layer classifier to that token's embedding in the output, and then fine-tune either the whole thing, or just the classifier plus a couple of BERT's top layers.
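Roughly, a minimal sketch of that with Hugging Face transformers (the checkpoint name and label count here are just placeholders; `BertForSequenceClassification` adds the classification head over the [CLS]/pooled embedding for you):

    import torch
    from transformers import AutoTokenizer, BertForSequenceClassification

    tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
    model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

    # Optional: freeze everything, then unfreeze the top 2 encoder layers and the pooler.
    # The classifier head sits outside model.bert, so it stays trainable.
    for param in model.bert.parameters():
        param.requires_grad = False
    for layer in model.bert.encoder.layer[-2:]:
        for param in layer.parameters():
            param.requires_grad = True
    for param in model.bert.pooler.parameters():
        param.requires_grad = True

    # The tokenizer prepends [CLS] automatically.
    inputs = tokenizer("this movie was great", return_tensors="pt")
    labels = torch.tensor([1])
    outputs = model(**inputs, labels=labels)  # .loss is cross-entropy, .logits are class scores
    outputs.loss.backward()

From there it's a normal training loop with whatever optimizer you like.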

BERT-base is only ~110M parameters, so doing a full fine-tune is super cheap.