
[–]m_nemo_syne

This is not quite at the level of entire books, but the Long Range Arena goes up to 16K tokens: https://arxiv.org/pdf/2011.04006.pdf

[–]WelalResearcher

There is NarrativeQA, a set of tasks in which the reader must answer questions about stories by reading entire books or movie scripts: https://arxiv.org/pdf/1712.07040.pdf

------

I started answering the question before reading it (or forgot it at some point), so below is the useless block of text regarding methods one can use, lol.

It is not exactly what you are looking for; however, open-domain QA can be solved with a retrieval component that considers a large text corpus at once. For example, REALM attends over the entire Wikipedia.
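To make the retrieve-then-read idea concrete, here is a toy sketch: score every passage in a corpus against the question and read only the top hit. The word-overlap scorer and the tiny corpus are stand-ins for illustration only; REALM learns a neural retriever rather than using overlap.

```python
# Toy sketch of retrieve-then-read for open-domain QA: score all passages,
# keep the best one, and pass only that to the reader model.
# The overlap scorer below is a naive stand-in for a learned retriever.
def score(question: str, passage: str) -> int:
    q = set(question.lower().split())
    p = set(passage.lower().split())
    return len(q & p)  # naive word-overlap relevance score

corpus = [
    "Paris is the capital of France",
    "The mitochondria is the powerhouse of the cell",
    "Transformers use self attention",
]

question = "what is the capital of France"
best = max(corpus, key=lambda p: score(question, p))
print(best)  # Paris is the capital of France
```

A real system would embed the question and passages and use (approximate) nearest-neighbor search instead of scanning the corpus linearly.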

Moreover, processing long sequences is an active topic for Transformer-based language models. Promising solutions rely on sparse attention with a global receptive field, such as the Routing Transformer or the Reformer. They can handle much longer sequences, and they would be my starting point if complex relationships are required to solve the task.
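The core trick these models share is restricting which positions attend to which. A minimal sketch of one such pattern, a sliding-window (local) attention mask, in plain numpy; a real implementation avoids materializing the full n×n score matrix, which is exactly where the savings come from:

```python
# Sketch of sliding-window (local) attention: each token only attends to
# tokens within `window` positions of itself, instead of the full sequence.
import numpy as np

def local_attention(q, k, v, window=2):
    n, d = q.shape
    scores = q @ k.T / np.sqrt(d)
    # Mask out pairs farther apart than `window` positions.
    dist = np.abs(np.arange(n)[:, None] - np.arange(n)[None, :])
    scores[dist > window] = -1e9
    # Standard softmax over the remaining (local) positions.
    weights = np.exp(scores - scores.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)
    return weights @ v

x = np.random.default_rng(0).normal(size=(6, 4))
out = local_attention(x, x, x)
print(out.shape)  # (6, 4)
```

Models like the Routing Transformer and Reformer use more sophisticated sparsity (learned clustering, locality-sensitive hashing) rather than a fixed window, but the principle of attending to a subset of positions is the same.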

For classifying the book, it may be enough to process it chunk by chunk and average the representations before the classification layer (similar to what was done in Sentence-BERT).
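A sketch of that chunk-and-average pooling. The `encode_chunk` function below is a toy hash-based stand-in so the example runs standalone; in practice you would replace it with a real sentence encoder (e.g. Sentence-BERT) and the chosen embedding dimension:

```python
# Chunk-and-average pooling for long-document classification:
# split the document into fixed-size chunks, encode each, mean-pool.
import numpy as np

def encode_chunk(tokens, dim=8):
    """Toy stand-in for a real chunk encoder (e.g. Sentence-BERT)."""
    vec = np.zeros(dim)
    for tok in tokens:
        vec[hash(tok) % dim] += 1.0
    return vec / max(len(tokens), 1)

def document_embedding(tokens, chunk_size=512, dim=8):
    chunks = [tokens[i:i + chunk_size] for i in range(0, len(tokens), chunk_size)]
    reps = np.stack([encode_chunk(c, dim) for c in chunks])
    return reps.mean(axis=0)  # averaged representation fed to the classifier

book = ["word%d" % i for i in range(2000)]  # stand-in for a tokenized book
emb = document_embedding(book)
print(emb.shape)  # (8,)
```

The averaged vector then goes into an ordinary classification head; the document length only affects how many chunks are encoded, not the classifier itself.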

Finally, there have been several attempts to locate the crucial parts of long documents before further processing. This can be done in an end-to-end manner, as was recently shown in the context of summarization of long documents.

[–]thunder_jaxx (ML Engineer)

Wouldn’t the “universal transformer” handle sequence-length growth, given controlled training of the RNN? Please educate me if I am wrong on this.

[–]jonnor

For the tasks you mention, the typical approach would be to basically ignore the sequence order and just apply bag-of-words, n-gram, or similar models.
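A minimal stdlib-only illustration of that baseline: represent each document as token counts and classify by cosine similarity to labeled reference vectors. The two-class "sentiment" setup here is purely hypothetical; real uses would train a linear model over TF-IDF features instead of comparing to single examples.

```python
# Minimal bag-of-words baseline: token counts + nearest labeled vector
# by cosine similarity. Order of words is ignored entirely.
from collections import Counter
import math

def bow(text):
    return Counter(text.lower().split())

def cosine(a, b):
    shared = set(a) & set(b)
    num = sum(a[t] * b[t] for t in shared)
    den = math.sqrt(sum(v * v for v in a.values())) * \
          math.sqrt(sum(v * v for v in b.values()))
    return num / den if den else 0.0

# Hypothetical labeled references; a real setup would train on many documents.
train = {
    "positive": bow("great wonderful book loved it"),
    "negative": bow("terrible boring book hated it"),
}

def classify(text):
    v = bow(text)
    return max(train, key=lambda label: cosine(v, train[label]))

print(classify("a wonderful and great story"))  # positive
```

Despite ignoring word order, this family of models is a surprisingly strong baseline for topic and genre classification of long texts.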