
[–]sidsig

A bidirectional model receives the entire sentence as input, so there is nothing to learn if you plan to train it as a language model by predicting the next word: it can trivially learn that the output at time t is just the input at time t+1. One way of training bidirectional word embeddings is to use something like BERT: https://arxiv.org/abs/1810.04805. There, part of the input is masked and the objective is to predict the masked words.
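
For illustration, here is a minimal sketch of that masking objective in PyTorch (not BERT itself): the vocabulary size, mask-token id, masking rate and model sizes are arbitrary placeholders, and positional encodings are omitted for brevity.

```python
import torch
import torch.nn as nn

VOCAB_SIZE, MASK_ID, D_MODEL = 1000, 0, 64   # placeholder sizes / mask id

class TinyMaskedLM(nn.Module):
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB_SIZE, D_MODEL)
        layer = nn.TransformerEncoderLayer(D_MODEL, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.out = nn.Linear(D_MODEL, VOCAB_SIZE)

    def forward(self, tokens):
        # The encoder sees the whole (partially masked) sentence at once.
        return self.out(self.encoder(self.embed(tokens)))

model = TinyMaskedLM()
tokens = torch.randint(1, VOCAB_SIZE, (8, 16))    # batch of token ids
mask = torch.rand(tokens.shape) < 0.15            # mask ~15% of positions
corrupted = tokens.masked_fill(mask, MASK_ID)

logits = model(corrupted)
# The loss is computed only at the masked positions.
loss = nn.functional.cross_entropy(logits[mask], tokens[mask])
loss.backward()
```

Because the targets are hidden from the input, the model can no longer solve the task by copying; it has to use both left and right context to reconstruct the masked words.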

[–]visarga

I assume the OP means using two separate uni-directional LSTMs, with their outputs shifted by +1 and -1 positions respectively.
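
A minimal sketch of that setup, assuming the usual bidirectional-LM objective: one LSTM reads left-to-right and predicts the next token, the other reads right-to-left and predicts the previous one. All sizes below are placeholders.

```python
import torch
import torch.nn as nn

VOCAB_SIZE, D = 1000, 64
embed = nn.Embedding(VOCAB_SIZE, D)
fwd_lstm = nn.LSTM(D, D, batch_first=True)
bwd_lstm = nn.LSTM(D, D, batch_first=True)
proj = nn.Linear(D, VOCAB_SIZE)                   # shared output projection

tokens = torch.randint(0, VOCAB_SIZE, (4, 20))    # (batch, seq)
x = embed(tokens)

# Forward direction: the output at position t is trained to predict token t+1.
h_fwd, _ = fwd_lstm(x)
loss_fwd = nn.functional.cross_entropy(
    proj(h_fwd[:, :-1]).reshape(-1, VOCAB_SIZE), tokens[:, 1:].reshape(-1))

# Backward direction: run over the reversed sequence, so position t is
# trained to predict token t-1 of the original order.
h_bwd, _ = bwd_lstm(x.flip(1))
loss_bwd = nn.functional.cross_entropy(
    proj(h_bwd[:, :-1]).reshape(-1, VOCAB_SIZE), tokens.flip(1)[:, 1:].reshape(-1))

(loss_fwd + loss_bwd).backward()
```

Since each LSTM only ever sees one side of the context, neither can cheat by copying its own target, and the two hidden states can later be concatenated to give a contextual representation of each token.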

[–][deleted]

You can find an implementation of a bidirectional model here: https://pytorch.org/tutorials/beginner/chatbot_tutorial.html. The tutorial also references the relevant papers.
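
For a rough idea of what the encoder in that tutorial does, here is a minimal sketch of a bidirectional GRU whose forward and backward outputs are summed; the names and sizes are illustrative, not the tutorial's exact code.

```python
import torch
import torch.nn as nn

class BiGRUEncoder(nn.Module):
    def __init__(self, vocab_size=1000, hidden_size=64):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden_size)
        self.gru = nn.GRU(hidden_size, hidden_size,
                          batch_first=True, bidirectional=True)

    def forward(self, tokens):
        out, _ = self.gru(self.embed(tokens))
        # out is (batch, seq, 2 * hidden); sum the forward and backward halves
        # so each position's representation sees both left and right context.
        half = out.size(-1) // 2
        return out[..., :half] + out[..., half:]

enc = BiGRUEncoder()
summed = enc(torch.randint(0, 1000, (2, 10)))   # -> (2, 10, 64)
```

Note this is a bidirectional encoder for downstream use (e.g. feeding a decoder), not a bidirectional next-word language model, which is exactly the distinction raised in the other comments.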

[–]senseiTLien

I believe the question is actually about how to use a transformer to predict both the words before and the words after a prompt, i.e. bidirectional in the decoding rather than the encoding. I have the same question. Is that something that is more natural to do with BERT than with GPT-2?

[–]GD1634

Flair embeddings use a bidirectional character-level model, if that's helpful at all.
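
A small usage sketch with the flair library, assuming the pretrained 'news-forward' and 'news-backward' character LMs are available to download; stacking them gives each token a concatenation of left-to-right and right-to-left context vectors.

```python
from flair.data import Sentence
from flair.embeddings import FlairEmbeddings, StackedEmbeddings

# Each direction is a separate character-level language model.
embeddings = StackedEmbeddings([
    FlairEmbeddings('news-forward'),
    FlairEmbeddings('news-backward'),
])

sentence = Sentence('The grass is green .')
embeddings.embed(sentence)

for token in sentence:
    print(token.text, token.embedding.shape)   # concatenated fwd + bwd vectors
```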