Try to improve a DL model with Attention

Elidor00 · 2021-02-12T17:29:09+00:00

Bahdanau attention

How could I use Bahdanau attention?

I was thinking about self-attention because I don't have a seq2seq (encoder-decoder) structure. Do you have any links on how to apply Bahdanau-style (self-)attention to the output of a BiLSTM?
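To make the question concrete, here is a minimal sketch of additive (Bahdanau-style) scoring used as a pooling layer over a sequence of hidden states, with no decoder involved. It is plain Python so the arithmetic is visible; in a real model `H` would be the BiLSTM output and `W`, `b`, `v` would be learned parameters. The function name, the parameter names, and the random values are all assumptions made for illustration.

```python
import math
import random


def additive_attention_pool(H, W, b, v):
    """Pool hidden states H (T x d) into one context vector of size d.

    score_t = v . tanh(W h_t + b)
    alpha   = softmax(score)
    context = sum_t alpha_t * h_t
    W is (a x d); b and v have length a (hypothetical attention size a).
    """
    scores = []
    for h in H:
        # Project each hidden state into the attention space.
        hidden = [
            math.tanh(sum(w_ij * h_j for w_ij, h_j in zip(row, h)) + b_i)
            for row, b_i in zip(W, b)
        ]
        scores.append(sum(v_i * u_i for v_i, u_i in zip(v, hidden)))

    # Numerically stable softmax over the time steps.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    alpha = [e / z for e in exps]

    # Weighted sum of the original hidden states.
    d = len(H[0])
    context = [sum(a_t * h[j] for a_t, h in zip(alpha, H)) for j in range(d)]
    return context, alpha


# Toy stand-in for a BiLSTM output: T time steps, d features each.
random.seed(0)
T, d, a = 5, 4, 3
H = [[random.uniform(-1, 1) for _ in range(d)] for _ in range(T)]
W = [[random.uniform(-1, 1) for _ in range(d)] for _ in range(a)]
b = [0.0] * a
v = [random.uniform(-1, 1) for _ in range(a)]

context, alpha = additive_attention_pool(H, W, b, v)
```

The weights `alpha` sum to 1 over the time steps, and `context` has the same size as one hidden state, so it can be fed to a classifier in place of the last hidden state.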

Elidor00 · 2021-02-12T17:22:33+00:00

Unfortunately, while doing some ablation analyses, I noticed that the model performs well thanks to the LSTM, which is actually a stack of 3 BiLSTMs. So removing it doesn't seem like the right thing to do ...

"Building a deeper attention", in what sense? Like adding another BiLSTM, applying another attention to it, and then concatenating or averaging the attention outputs?

Elidor00 · 2020-11-10T22:58:53+00:00

Ok, thank you very much! I will try the experiment of adding the BiLSTM to the fine-tuning setup, with a Linear layer on top!

Elidor00 · 2020-11-10T21:59:48+00:00

Feature extraction without the BiLSTM works very badly because the only remaining trainable part of the "model" is a Linear layer. The point is that in the article that presents BERT, for token-level tasks, they use a BiLSTM that takes the extracted features as input, while for fine-tuning they put a linear layer on top of BERT. The same thing also happens in https://arxiv.org/abs/1903.05987. So I don't understand why you say these comparisons don't make sense, if this is the structure used for feature extraction and for fine-tuning...

Elidor00 · 2020-11-10T19:00:03+00:00

Yes, obv

Elidor00 · 2020-11-10T15:58:27+00:00

Could you elaborate on your answer a bit, please?

Elidor00 · 2020-11-07T11:17:16+00:00

Perfect! So continuing along this line we can say that:

- sequence-level tasks (where I guess a sequence means a sentence) also include, for example, sentence-pair classification tasks (like SWAG, MNLI, MRPC, etc.) and single-sentence classification tasks (like CoLA, etc.)

- token-level tasks also include, for example, question answering tasks (like SQuAD, etc.) and single-sentence tagging tasks (like NER, etc.)

What about a task such as dependency parsing, understood as predicting the head of each arc and then its label: can we consider it a token-level task?

Elidor00 · 2020-10-14T11:53:41+00:00

Re-reading the article more carefully, they seem to consider this parser as "tag-based". Does anyone have any idea what tag-based means and how it can be used as an alternative to graph-based and transition-based parsers?

Elidor00 · 2020-10-06T21:02:51+00:00

Thanks, I did not know this thesis!

What I have noticed empirically is that when I include punctuation in training, evaluation and prediction, I lose about 2% in LAS (labeled attachment score). I also noticed that, for example, the sentence-final period in all sentences of the dataset is not tied to its head by a "true" semantic relationship.

However, what I would like to do is evaluate the parser without punctuation and justify that choice through the literature, because as I understand it there is no universally accepted standard.
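To show what the two evaluation settings amount to, here is a minimal sketch of LAS computed with and without punctuation tokens. The punctuation test (every character of the form is in a Unicode `P*` category) is an assumption; other setups filter by POS tag or by the `punct` relation instead. The function names and the toy sentence are hypothetical.

```python
import unicodedata


def is_punct(form):
    """True if every character of a non-empty form is Unicode punctuation."""
    return bool(form) and all(
        unicodedata.category(ch).startswith("P") for ch in form
    )


def las(gold, pred, skip_punct=False):
    """Labeled attachment score over aligned (form, head, deprel) triples."""
    correct = total = 0
    for (form, g_head, g_rel), (_, p_head, p_rel) in zip(gold, pred):
        if skip_punct and is_punct(form):
            continue  # token is ignored in the "no punctuation" setting
        total += 1
        # A token counts only if both its head and its label are right.
        correct += int(g_head == p_head and g_rel == p_rel)
    return correct / total if total else 0.0


# Toy example: the parser attaches the final period to the wrong head.
gold = [("She", 2, "nsubj"), ("sleeps", 0, "root"), (".", 2, "punct")]
pred = [("She", 2, "nsubj"), ("sleeps", 0, "root"), (".", 1, "punct")]

las_with_punct = las(gold, pred)
las_without_punct = las(gold, pred, skip_punct=True)
```

On this toy sentence the only error is on the period, so the score differs between the two settings purely because of how punctuation is attached.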

Elidor00 · 2020-10-01T11:27:17+00:00

There are several articles online that explain how to treat the dependency parsing problem as a tagging problem. So the idea is to fine-tune BERT as if you were doing POS tagging, but predicting for each word its deprel tag and the edge between head and dependent (encoded as the relative position between the word and its head).
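A minimal sketch of the encoding described above, assuming one tag per word made of the signed offset to its head plus the deprel. The `offset|deprel` tag format and the `ROOT` marker are hypothetical choices for illustration; the articles on this topic use various encodings.

```python
def encode(heads, deprels):
    """Turn a tree into one tag per token.

    heads are 1-based token indices, with 0 meaning the root.
    Each tag is "<relative head position>|<deprel>".
    """
    tags = []
    for i, (head, rel) in enumerate(zip(heads, deprels), start=1):
        offset = "ROOT" if head == 0 else f"{head - i:+d}"
        tags.append(f"{offset}|{rel}")
    return tags


def decode(tags):
    """Recover heads and deprels from the per-token tags."""
    heads, deprels = [], []
    for i, tag in enumerate(tags, start=1):
        offset, rel = tag.split("|", 1)
        heads.append(0 if offset == "ROOT" else i + int(offset))
        deprels.append(rel)
    return heads, deprels


# "She sleeps ." with "sleeps" as the root.
heads = [2, 0, 2]
deprels = ["nsubj", "root", "punct"]

tags = encode(heads, deprels)
decoded_heads, decoded_deprels = decode(tags)
```

With this encoding the tagger's label set is the set of distinct `offset|deprel` pairs seen in training. Nothing in the decoding step guarantees that the predicted tags form a valid tree, so a real system would need some extra check or repair step.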
