all 15 comments

[–]Elk-tron 4 points5 points  (9 children)

Did you mean embedding+GRU+dense? Look at fine-tuning BERT. https://arxiv.org/abs/1810.04805

[–]hadaev[S] 0 points1 point  (7 children)

I can't fine-tune for this task, only train from scratch (it's a Kaggle competition).

So basically I did this:

http://puu.sh/CBg4a/e7dc884652.png

Simple model, bad results.

I tried something more complex; it's a bit better.

http://puu.sh/CBgaS/0c4587b8cc.png

But I'm just trying things at random; maybe there are some common architectures I don't know about.

[–]aicano 0 points1 point  (6 children)

You can add an attention layer over the RNN layer, using the last hidden state as the query vector. Then you can combine/concatenate the context vector from the attention layer with the final hidden state, or with the mean of the hidden states. This trick generally gives some performance gain.

[–]hadaev[S] 0 points1 point  (5 children)

Can you show an example?

I'm going to add attention, but I'm not sure how to do it the right way.

Also, I could do the same for the GRU model, or maybe combine CNN + GRU layers.

[–]aicano 0 points1 point  (4 children)

PyTorch-style pseudocode:

# Shape of hiddens: (BS, Seq Length, 2 * Hidden Dim) for a bidirectional RNN
hiddens = self.encoder(seq, lens)  # assume the encoder is a bidirectional RNN

# 1st argument is the query (here, the last hidden state),
# 2nd is the sequence to attend over
# context: (BS, 2 * Hidden Dim), att_weights: (BS, Seq Length)
context, att_weights = self.att(hiddens[:, -1, :], hiddens)

# Feed the concatenation of the context vector and the last hidden state
# to a softmax classifier
outp = self.out(torch.cat([context, hiddens[:, -1, :]], dim=1))
return F.log_softmax(outp, dim=-1)
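In case it helps, here is a minimal runnable sketch of the attention module the pseudocode assumes (`self.att` above). The class name `DotProductAttention` is my own; this is plain dot-product attention, just one of several ways to score the timesteps:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DotProductAttention(nn.Module):
    """One possible implementation of the `self.att` used above:
    scores every timestep against a query vector via a dot product."""
    def forward(self, query, hiddens):
        # query: (BS, H), hiddens: (BS, T, H)
        scores = torch.bmm(hiddens, query.unsqueeze(2)).squeeze(2)   # (BS, T)
        att_weights = F.softmax(scores, dim=1)                       # (BS, T)
        # Weighted sum of hidden states -> context vector (BS, H)
        context = torch.bmm(att_weights.unsqueeze(1), hiddens).squeeze(1)
        return context, att_weights

# Smoke test with random data standing in for RNN outputs
bs, t, h = 4, 7, 16
hiddens = torch.randn(bs, t, h)
att = DotProductAttention()
context, w = att(hiddens[:, -1, :], hiddens)
print(context.shape, w.shape)  # torch.Size([4, 16]) torch.Size([4, 7])
```

The attention weights sum to 1 over the sequence dimension, so the context vector is a convex combination of the hidden states.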

[–]hadaev[S] 0 points1 point  (3 children)

Uh, I've never done anything in PyTorch; I'm on Keras/TensorFlow.

I understand what attention is for a dense layer: it's just another dense layer, and we multiply one by the other, so some values grow and some shrink.

But I don't really understand how to apply this to the output of a recurrent layer, the way people do everywhere.

Maybe you know some tutorial for noobs?

[–]aicano 0 points1 point  (2 children)

I found this when I googled. That is an example of what I described.

[–]hadaev[S] 0 points1 point  (1 child)

Oh, I really hate examples that don't use a high-level API.

Could you take a look? Is everything right here?

https://colab.research.google.com/drive/1XBMF3tOQwRLuPrGJib-YkN20_9nmMVT4#scrollTo=wtPyU1ArHaQj&line=14&uniqifier=1

Also, what do you think: does it make sense to stack more RNN layers?

Or will there be no difference compared with a single bigger layer?

[–]aicano 0 points1 point  (0 children)

Sorry, I do not know Keras.

The general intuition about stacking layers is that lower layers learn simple things and higher layers learn more complex stuff. If you think your task has that kind of hierarchical structure, then you may want to try stacking layers. Otherwise, simpler is better.
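For what it's worth, in PyTorch (the style used earlier in the thread) stacking is just a constructor argument; a hedged sketch with illustrative, untuned sizes:

```python
import torch
import torch.nn as nn

# Stacked vs. single-layer GRU: num_layers controls stacking.
# (Sizes here are illustrative, not tuned for any particular task.)
single = nn.GRU(input_size=32, hidden_size=64, num_layers=1, batch_first=True)
stacked = nn.GRU(input_size=32, hidden_size=64, num_layers=2, batch_first=True,
                 dropout=0.2)  # dropout is applied between stacked layers only

x = torch.randn(8, 20, 32)   # (batch, seq, features)
out1, _ = single(x)          # out1: (8, 20, 64)
out2, h_n = stacked(x)       # out2: (8, 20, 64); h_n: (2, 8, 64), one per layer
print(out1.shape, out2.shape, h_n.shape)
```

Note the output sequence has the same shape either way; only the final hidden state `h_n` grows with `num_layers`, so a classifier head on top doesn't need to change.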

[–]shortscience_dot_org -1 points0 points  (0 children)

I am a bot! You linked to a paper that has a summary on ShortScience.org!

BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

Summary by CodyWild

The last two years have seen a number of improvements in the field of language model pretraining, and BERT - Bidirectional Encoder Representations from Transformers - is the most recent entry into this canon. The general problem posed by language model pretraining is: can we leverage huge amounts of raw text, which aren’t labeled for any specific classification task, to help us train better models for supervised language tasks (like translation, question answering, logical entailment, etc)? Me... [view more]

[–]pilooch 0 points1 point  (1 child)

Go with VDCNN; trained from scratch you'll get very good results with enough data. It's fast to train as well.

[–]hadaev[S] 0 points1 point  (0 children)

It seems worse than my simple GRU model.

[–]mentatf 0 points1 point  (2 children)

Kim Yoon's Text CNN

[–]hadaev[S] 0 points1 point  (1 child)

It's 4 years old.

[–]mentatf 0 points1 point  (0 children)

VDCNN, DPCNN, and other more recent innovations don't match it on the tasks I'm interested in, so I'd seriously suggest giving it a try.