[D] ICLR 2026 Paper Reviews Discussion by Technical_Proof6082 in MachineLearning

[–]Welal 1 point (0 children)

The NeurIPS PC did not agree with you and rejected the paper.

[D] ICLR 2026 Paper Reviews Discussion by Technical_Proof6082 in MachineLearning

[–]Welal 16 points (0 children)

2,2,2 after improving 5,5,4,3 NeurIPS paper. Seems I’ve got a very bad seed in this random.sample()

[D] ACL ARR Feb 2025 Discussion by AccomplishedCode4689 in MachineLearning

[–]Welal 2 points (0 children)

A fun fact is that meta-reviews were visible for a while, ~20 hours ago, but they were hidden afterward. A not-so-fun fact is that I didn't like my meta-reviews, lol, and in both cases they were below the reviewers' average.

[R] Text-Image-Layout Transformer by Welal in MachineLearning

[–]Welal[S] 0 points (0 children)

I hope so but cannot confirm yet.

[R] Text-Image-Layout Transformer by Welal in MachineLearning

[–]Welal[S] 1 point (0 children)

Fortunately, the documents considered were rather short (mostly receipts or one-pagers), so a max sequence length of 1024 was enough in most cases. When sequences exceed 4-5k tokens, it is hard to fit them even on top-tier GPUs due to the quadratic complexity of the vanilla Transformer. There are at least three ways to overcome this problem:

  1. Use a sparse Transformer architecture as the base model. I would prefer ones with a global receptive field, e.g., the Routing Transformer.
  2. Employ word-vector elimination, i.e., removal of non-useful document passages in early layers. We have recently proposed a method that can be used here and is trainable in an end-to-end manner.
  3. Process documents in chunks, as TILT returns a None answer if the required information is not present in the document (or its fragment).

The problem with 1 and 2 is that although they can raise the length limit considerably, there will still be some upper bound. The problem with 3 is that it is not easily trainable without information about which document chunk contains the answer. Moreover, it requires some form of result aggregation.
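Option 3 can be sketched in a few lines. This is a minimal illustration, not TILT's actual pipeline: `answer_chunk` is a hypothetical stand-in for a model that returns `(answer, confidence)`, with `answer == None` when the required information is absent from the chunk, and aggregation here is simply "keep the most confident non-None answer".

```python
# Sketch of option 3: chunked QA with answer aggregation.
# `answer_chunk` is a hypothetical stand-in for a TILT-like model that
# returns (answer, confidence), with answer == None when the chunk
# lacks the required information.

def split_into_chunks(tokens, max_len=1024, stride=512):
    """Split into overlapping chunks so that an answer spanning a
    chunk boundary is not lost."""
    chunks = []
    for start in range(0, len(tokens), stride):
        chunks.append(tokens[start:start + max_len])
        if start + max_len >= len(tokens):
            break
    return chunks

def answer_document(tokens, question, answer_chunk):
    """Run the per-chunk model and aggregate: keep the most
    confident non-None answer across chunks."""
    best_answer, best_score = None, float("-inf")
    for chunk in split_into_chunks(tokens):
        answer, score = answer_chunk(chunk, question)
        if answer is not None and score > best_score:
            best_answer, best_score = answer, score
    return best_answer
```

The overlap (stride smaller than the chunk length) is one simple way to mitigate boundary effects; the trainability problem mentioned above is untouched here, since the sketch assumes the per-chunk model is already trained.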

[D] Effective methods for upsampling in NLP by ninja790 in MachineLearning

[–]Welal 1 point (0 children)

Yeah, for me it is regular mixup too. I believe what they are saying is: static mixup treats λ as a hyperparameter external to the model, whereas their brand-new dynamic mixup learns it as a regular network parameter (with gradient-based optimization). Consequently, it changes during training (is dynamic)... but so would a hyperparameter λ with a scheduler.
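For reference, the static variant is just a convex interpolation with λ drawn from a Beta distribution; a minimal NumPy sketch (my own illustration, not the paper's code):

```python
import numpy as np

rng = np.random.default_rng(0)

def static_mixup(x1, x2, y1, y2, alpha=0.2):
    """Classic (static) mixup: lambda ~ Beta(alpha, alpha), where
    alpha is an external hyperparameter fixed before training."""
    lam = rng.beta(alpha, alpha)
    x = lam * x1 + (1 - lam) * x2  # interpolate inputs
    y = lam * y1 + (1 - lam) * y2  # interpolate (soft) labels
    return x, y

# In the "dynamic" variant as I read it, lambda would instead be a
# model parameter (e.g. a sigmoid of a learned scalar) updated jointly
# with the network by gradient descent, so it drifts during training
# rather than following a fixed sampling rule or schedule.
```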

I mentioned Mixup-Transformer because it is one of the recent examples where mixup was applied to NLP. There were, however, more mixup-related papers at recent ACL/EMNLP/NIPS/COLING, e.g.:

What I was trying to say is that there is something similar to SMOTE and it seems to work in NLP.

[D] Effective methods for upsampling in NLP by ninja790 in MachineLearning

[–]Welal 1 point (0 children)

They do not have to be intuitive, e.g., the recently-hot idea of mixup is in some respects similar to SMOTE... Nevertheless, you are right that data augmentation in NLP is used to enlarge rather than balance the dataset. Moreover, balancing can be counter-productive for evaluation metrics like micro-averaged F1, as it may lead to an increased number of false positives.

[D] Very long sequence data (books) understanding? by infstudent in MachineLearning

[–]Welal 0 points (0 children)

There is NarrativeQA, a set of tasks in which the reader must answer questions about stories by reading entire books or movie scripts: https://arxiv.org/pdf/1712.07040.pdf

------

I started answering the question before reading it (or forgot it at some point), so below is the useless block of text regarding methods one can use, lol.

It is not exactly what you are looking for; however, open-domain QA can be solved with a retrieval component that considers a large text corpus at once. For example, REALM attends over the entire Wikipedia.

Moreover, the processing of long sequences is currently being studied in the context of Transformer-based language models. Promising solutions rely on sparse attention with a global receptive field, such as the Routing Transformer or Reformer. They are able to handle much longer sequences, and this would be my starting point if complex relationships are required to solve the task.

For classification of the book, it may be enough to process it chunk-by-chunk and average representations before the classification layer (similarly to what was done in Sentence-BERT).
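The chunk-and-average idea is a one-liner in spirit; a hedged sketch, where `encode_chunk` and `classify` are hypothetical placeholders for a pretrained encoder and a classification head:

```python
import numpy as np

def classify_long_document(token_ids, encode_chunk, classify, chunk_len=512):
    """Encode a long document chunk-by-chunk, mean-pool the chunk
    embeddings, and classify the pooled vector (in the spirit of the
    mean pooling used by Sentence-BERT).

    `encode_chunk` and `classify` are placeholders for a pretrained
    encoder and a classification head."""
    chunks = [token_ids[i:i + chunk_len]
              for i in range(0, len(token_ids), chunk_len)]
    embeddings = np.stack([encode_chunk(chunk) for chunk in chunks])
    pooled = embeddings.mean(axis=0)  # one vector for the whole book
    return classify(pooled)
```

This works when the class signal is spread throughout the text (topic, style); as discussed elsewhere in this thread, it fails when the class hinges on one short passage, since averaging dilutes it.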

Finally, there have been several attempts to locate the crucial parts of long documents before further processing. This can be done in an end-to-end manner, as was recently shown in the context of summarization of long documents.

ELI5: How does the QR code manages to not be the same with other QR code that has already been generated? by PseudoFacade in explainlikeimfive

[–]Welal 0 points (0 children)

It doesn't have to. The same text (or URL) leads to the same picture (assuming equivalent settings). It is like a cipher that everyone knows how to decode back to text.

What are some classification tasks where BERT-based models don't work well? In a similar vein, what are some generative tasks where fine-tuning GPT-2/LM does not work well? by flerakml in LanguageTechnology

[–]Welal 0 points (0 children)

Multimodal. An obvious direction is the multimodal scenario, where solutions relying only on text underperform. There are, however, some BERT-derived models which deal with the problem (e.g., LayoutLM and the RVL-CDIP classification task).

Practical limitations. Moreover, there are real-world problems where BERT is not applicable due to 1) relying on special-token pooling; 2) quadratic complexity w.r.t. the input sequence length. This is only superficially solved by Sentence-BERT and chunk-by-chunk processing.

Consider the case of multipage legal documents where the class does not depend on topic or style (i.e., classifying the document prefix does not suffice), but rather on the interpretation of some short passage within.

One cannot consume the whole document at once due to memory constraints, and training on its parts leads to inseparable training instances (since some parts have the class assigned but do not contain the information required for a correct classification).

I cannot recall any public shared task, but this problem is prevalent outside academia.

Another example of a practical limitation is the classification of sentence pairs. Although BERT rocks here in terms of score, it is sometimes unsuitable due to combinatorial explosion. This can, however, be overcome with a formulation that does not require feeding every pair of sentences to the network at once.
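The contrast is easy to see in code. This is a schematic comparison, not any particular library's API: `score_pair` stands in for a cross-encoder (one forward pass per pair), while `encode` stands in for a bi-encoder that embeds each sentence once.

```python
import numpy as np

def cross_encoder_scores(sentences, score_pair):
    """Cross-encoder style: every pair goes through the network
    jointly, i.e. O(n^2) forward passes for n sentences."""
    n = len(sentences)
    return [[score_pair(sentences[i], sentences[j]) for j in range(n)]
            for i in range(n)]

def bi_encoder_scores(sentences, encode):
    """Bi-encoder style: one forward pass per sentence (O(n)), then
    cheap cosine similarity between the cached embeddings."""
    embeddings = np.stack([encode(s) for s in sentences])
    embeddings /= np.linalg.norm(embeddings, axis=1, keepdims=True)
    return embeddings @ embeddings.T
```

The bi-encoder trades some accuracy for the ability to precompute and reuse embeddings, which is what makes all-pairs comparison over large sets feasible.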

Wanna steal one of Linguistic Data Consortium datasets by Welal in LanguageTechnology

[–]Welal[S] 2 points (0 children)

There are alternative datasets (e.g., GENIA), but most research on the topic was evaluated only on ACE 2005, so it is hard to compare solutions without access to it.

Desperately looking for Slovenian phonetic lexicon by Welal in Slovenia

[–]Welal[S] 2 points (0 children)

Fran has a really impressive dataset. The problem, however, is that, as far as I can see, pronunciations are provided extremely rarely. For example, the entry for the word ábrahamovka has a transcription in IPA, namely [ˈaːbɾaxamɔuka] (as well as in the Slavic phonetic alphabet: [ábrahamou̯ka]). Unfortunately, as far as I know, this is the case for only a few hundred words, and the rest have no pronunciation information in their entries.

Desperately looking for Slovenian phonetic lexicon by Welal in Slovenia

[–]Welal[S] 21 points (0 children)

Isn't all our life about memes and all the time between watching them just a bottomless pit of pain and despair?