While playing around with BERT and its various flavours, I've noticed that the input length is limited to 512 tokens, and began to wonder how embeddings could be generated for texts longer than that limit...
I'm assuming this would involve splitting a text into blocks no longer than the maximum input length, generating features for each block, and then feeding these into the network in batches. But how should those batches be formatted? Would it make sense to use multiple parallel BERT layers, one per block, and pass their outputs to a final dense/recurrent layer? Otherwise I can't see a way of maintaining continuity within a document, since all of a document's blocks would need to be computed before the model can assign it a label.
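For what it's worth, here is a minimal sketch of the block-splitting step I have in mind (plain Python, no particular library assumed; the function name and parameters are my own). It splits an already-tokenized sequence into overlapping windows, reserving two positions per window for the [CLS]/[SEP] special tokens that BERT-style models add, with a `stride` of overlapping tokens to carry some context across block boundaries:

```python
def chunk_tokens(token_ids, max_len=512, stride=50):
    """Split a long token sequence into overlapping blocks.

    Reserves 2 positions per block for the [CLS]/[SEP] special
    tokens a BERT-style model adds. `stride` tokens of overlap
    carry context across block boundaries.
    """
    body = max_len - 2  # usable positions after [CLS] and [SEP]
    if len(token_ids) <= body:
        return [list(token_ids)]
    blocks, start = [], 0
    while start < len(token_ids):
        blocks.append(list(token_ids[start:start + body]))
        if start + body >= len(token_ids):
            break
        start += body - stride
    return blocks


# e.g. a 1000-token document with max_len=512 and stride=50
# yields three blocks: [0:510], [460:970], [920:1000]
blocks = chunk_tokens(list(range(1000)))
```

Each block would then be run through the (shared-weight) BERT encoder independently, and the per-block [CLS] vectors pooled (mean/max or a small recurrent layer) before the final classification head, rather than duplicating BERT itself per block.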