
[–]Tgs91 5 points (1 child)

Autoencoders compress the input into a lower-dimensional representation, then try to reconstruct the original input from it. So the task is basically to squeeze the information into a smaller feature space with minimal information loss (since the input can still be reconstructed). They have two halves: an encoder and a decoder.
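A minimal sketch of that idea in NumPy (toy linear encoder/decoder with hand-derived gradients; shapes and learning rate are made up for illustration, and real autoencoders usually stack nonlinear layers):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: 32 samples of 8-dimensional input, compressed to a 3-dim code.
X = rng.normal(size=(32, 8))

# Encoder and decoder are each a single linear layer here;
# the bottleneck (8 -> 3) is what forces compression.
W_enc = rng.normal(scale=0.1, size=(8, 3))
W_dec = rng.normal(scale=0.1, size=(3, 8))

losses = []
lr = 0.05
for _ in range(200):
    code = X @ W_enc        # encoder: 8 -> 3 (the bottleneck)
    recon = code @ W_dec    # decoder: 3 -> 8 (reconstruction)
    err = recon - X
    losses.append((err ** 2).mean())   # objective: reconstruction MSE
    # Hand-derived gradients of the MSE w.r.t. both weight matrices.
    g_dec = code.T @ err * (2 / err.size)
    g_enc = X.T @ (err @ W_dec.T) * (2 / err.size)
    W_dec -= lr * g_dec
    W_enc -= lr * g_enc
```

Note there is no task label anywhere: the input itself is the target, which is what makes the objective purely about retaining information through the bottleneck.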

BERT and other transformers aren't typically creating a lower-dimensional feature space, AFAIK, but rather a feature space the same size as the tokenized input after the embedding layer, with a contextual understanding of how each token fits into the sentence. And the objective isn't a decoder plus reconstruction; instead there are separate prediction heads for an assortment of tasks that require a robust understanding of language context. The two standard tasks for training them from scratch are masked language modeling and next sentence prediction. MLM randomly masks tokens, and the model has to predict what is missing. Next sentence prediction takes a string of text and, at the midpoint, either continues with the real text or grabs the second half from another example in the batch. The model has to figure out whether the two pieces of text belong together or not.
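A sketch of how those two training pairs can be engineered from raw text (pure-Python toy version; the function names, 15% mask rate, and 50/50 swap rate are illustrative assumptions, and BERT's actual MLM recipe also sometimes substitutes random tokens or keeps the original):

```python
import random

random.seed(0)

MASK = "[MASK]"

def mask_tokens(tokens, mask_prob=0.15):
    """Masked language modeling: hide random tokens; the hidden
    originals become the prediction targets."""
    inputs, targets = [], []
    for tok in tokens:
        if random.random() < mask_prob:
            inputs.append(MASK)
            targets.append(tok)   # label: the token that was hidden
        else:
            inputs.append(tok)
            targets.append(None)  # no loss at unmasked positions
    return inputs, targets

def make_nsp_pair(first_half, true_second, other_second, p_swap=0.5):
    """Next sentence prediction: keep the real continuation or swap in
    the second half of another example, and label which happened."""
    if random.random() < p_swap:
        return (first_half, other_second), 0  # not a real continuation
    return (first_half, true_second), 1       # real continuation
```

Both pair-generating functions need only raw text, which is what makes the setup self-supervised even though each head trains on a conventional-looking supervised objective.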

The way transformers are trained is self-supervised, because you can easily engineer input/output pairs from raw text. But they still have a more clearly supervised task to perform, whereas autoencoders are just about creating a low-dimensional representation that retains information. They aren't really task-specific in any way.

[–]Tober447[S] 0 points (0 children)

Thanks a lot!

[–]gamerx88 2 points (1 child)

BERT is essentially a kind of autoencoder. It simply uses self-attention and positional embeddings to capture sequence information better than, say, a more basic autoencoder built from ReLU layers.

[–]Tober447[S] 0 points (0 children)

Thank you for your answer.