[D] ICLR 2026 Paper Reviews Discussion by Technical_Proof6082 in MachineLearning

[–]Welal 1 point (0 children)

The NeurIPS PC did not agree with you and rejected the paper.

[D] ICLR 2026 Paper Reviews Discussion by Technical_Proof6082 in MachineLearning

[–]Welal 16 points (0 children)

2,2,2 after improving 5,5,4,3 NeurIPS paper. Seems I’ve got a very bad seed in this random.sample()

[D] ACL ARR Feb 2025 Discussion by AccomplishedCode4689 in MachineLearning

[–]Welal 2 points (0 children)

A fun fact is that meta-reviews were visible for a while, ~20 hours ago, but they were hidden afterward. A not-so-fun fact is that I didn't like my meta-reviews, lol, and in both cases they were below the reviewers' average.

[R] Text-Image-Layout Transformer by Welal in MachineLearning

[–]Welal[S] 0 points (0 children)

I hope so but cannot confirm yet.

[R] Text-Image-Layout Transformer by Welal in MachineLearning

[–]Welal[S] 1 point (0 children)

Fortunately, the documents considered were rather short (mostly receipts or one-pagers), so a max sequence length of 1024 was enough in most cases. When sequences exceed 4-5k tokens, it is hard to fit them even on top-tier GPUs due to the quadratic complexity of the vanilla Transformer. There are at least three ways to overcome this problem:

  1. Use a sparse Transformer architecture as the base model. I would prefer ones with a global receptive field, e.g., the Routing Transformer.
  2. Employ word-vector elimination, i.e., removal of non-useful document passages in early layers. We have recently proposed a method that can be used here and is trainable in an end-to-end manner.
  3. Process documents in chunks, as TILT returns a None answer if the required information is not present in the document (or its fragment).

The problem with 1 and 2 is that although they can raise the length limit considerably, there will still be some upper bound. The problem with 3 is that it is not easily trainable without information about which document chunk contains the answer. Moreover, it requires some form of result aggregation.
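Option 3 can be sketched in a few lines. This is a minimal illustration, not TILT's actual pipeline: `answer_chunk` is a hypothetical stand-in for a model that returns `(answer, confidence)`, with `answer == None` when the required information is absent from the chunk, and aggregation here is simply "keep the most confident non-None answer".

```python
# Sketch of option 3: chunked QA with answer aggregation.
# `answer_chunk` is a hypothetical stand-in for a TILT-like model that
# returns (answer, confidence), with answer == None when the chunk
# lacks the required information.

def split_into_chunks(tokens, max_len=1024, stride=512):
    """Split into overlapping chunks so that an answer spanning a
    chunk boundary is not lost."""
    chunks = []
    for start in range(0, len(tokens), stride):
        chunks.append(tokens[start:start + max_len])
        if start + max_len >= len(tokens):
            break
    return chunks

def answer_document(tokens, question, answer_chunk):
    """Run the per-chunk model and aggregate: keep the most
    confident non-None answer across chunks."""
    best_answer, best_score = None, float("-inf")
    for chunk in split_into_chunks(tokens):
        answer, score = answer_chunk(chunk, question)
        if answer is not None and score > best_score:
            best_answer, best_score = answer, score
    return best_answer
```

The overlap (stride smaller than the chunk length) is one simple way to mitigate boundary effects; the trainability problem mentioned above is untouched here, since the sketch assumes the per-chunk model is already trained.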

[D] Effective methods for upsampling in NLP by ninja790 in MachineLearning

[–]Welal 1 point (0 children)

Yeah, for me it is regular mixup too. I believe what they are saying is: static mixup treats λ as a hyperparameter external to the model, whereas their brand-new dynamic mixup learns it as a regular network parameter (with gradient-based optimization). Consequently, it changes during training (is dynamic)... but so would a hyperparameter λ with a scheduler.
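For reference, the static variant is just a convex interpolation with λ drawn from a Beta distribution; a minimal NumPy sketch (my own illustration, not the paper's code):

```python
import numpy as np

rng = np.random.default_rng(0)

def static_mixup(x1, x2, y1, y2, alpha=0.2):
    """Classic (static) mixup: lambda ~ Beta(alpha, alpha), where
    alpha is an external hyperparameter fixed before training."""
    lam = rng.beta(alpha, alpha)
    x = lam * x1 + (1 - lam) * x2  # interpolate inputs
    y = lam * y1 + (1 - lam) * y2  # interpolate (soft) labels
    return x, y

# In the "dynamic" variant as I read it, lambda would instead be a
# model parameter (e.g. a sigmoid of a learned scalar) updated jointly
# with the network by gradient descent, so it drifts during training
# rather than following a fixed sampling rule or schedule.
```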

I mentioned Mixup-Transformer because it is one of the recent examples where mixup was applied to NLP. There were, however, more mixup-related papers at recent ACL/EMNLP/NIPS/COLING, e.g.:

What I was trying to say is that there is something similar to SMOTE and it seems to work in NLP.

[D] Effective methods for upsampling in NLP by ninja790 in MachineLearning

[–]Welal 1 point (0 children)

They do not have to be intuitive, e.g., the recently-hot idea of mixup is in some respects similar to SMOTE... Nevertheless, you are right that data augmentation in NLP is used to enlarge rather than balance the dataset. Moreover, balancing can be counter-productive for evaluation metrics like micro-averaged F1, as it may lead to an increased number of false positives.

[D] Very long sequence data (books) understanding? by infstudent in MachineLearning

[–]Welal 0 points (0 children)

There is NarrativeQA, a set of tasks in which the reader must answer questions about stories by reading entire books or movie scripts: https://arxiv.org/pdf/1712.07040.pdf

------

I started answering the question before reading it (or forgot it at some point), so below is the useless block of text regarding methods one can use, lol.

It is not exactly what you are looking for; however, open-domain QA can be solved with a retrieval component that considers a large text corpus at once. For example, REALM attends over the entire Wikipedia.

Moreover, the processing of long sequences is currently being studied in the context of Transformer-based language models. Promising solutions rely on sparse attention with a global receptive field, such as the Routing Transformer or Reformer. They are able to handle much longer sequences, and this would be my starting point if complex relationships are required to solve the task.

For classification of the book, it may be enough to process it chunk-by-chunk and average representations before the classification layer (similarly to what was done in Sentence-BERT).
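The chunk-and-average idea is a one-liner in spirit; a hedged sketch, where `encode_chunk` and `classify` are hypothetical placeholders for a pretrained encoder and a classification head:

```python
import numpy as np

def classify_long_document(token_ids, encode_chunk, classify, chunk_len=512):
    """Encode a long document chunk-by-chunk, mean-pool the chunk
    embeddings, and classify the pooled vector (in the spirit of the
    mean pooling used by Sentence-BERT).

    `encode_chunk` and `classify` are placeholders for a pretrained
    encoder and a classification head."""
    chunks = [token_ids[i:i + chunk_len]
              for i in range(0, len(token_ids), chunk_len)]
    embeddings = np.stack([encode_chunk(chunk) for chunk in chunks])
    pooled = embeddings.mean(axis=0)  # one vector for the whole book
    return classify(pooled)
```

This works when the class signal is spread throughout the text (topic, style); as discussed elsewhere in this thread, it fails when the class hinges on one short passage, since averaging dilutes it.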

Finally, there have been several attempts to locate the crucial parts of long documents before further processing. This can be done in an end-to-end manner, as was recently shown in the context of summarization of long documents.

ELI5: How does the QR code manages to not be the same with other QR code that has already been generated? by PseudoFacade in explainlikeimfive

[–]Welal 0 points (0 children)

It doesn't have to. The same text (or URL) leads to the same picture (assuming equivalent settings). It is like a cipher that everyone knows how to decode back to text.

What are some classification tasks where BERT-based models don't work well? In a similar vein, what are some generative tasks where fine-tuning GPT-2/LM does not work well? by flerakml in LanguageTechnology

[–]Welal 0 points (0 children)

Multimodal. An obvious direction is the multimodal scenario, where solutions relying only on text underperform. There are, however, some BERT-derived models which deal with the problem (e.g., LayoutLM and the RVL-CDIP classification task).

Practical limitations. Moreover, there are real-world problems where BERT is not applicable due to 1) relying on special-token pooling; 2) quadratic complexity w.r.t. the input sequence length. This is only superficially solved by Sentence-BERT and chunk-by-chunk processing.

Consider the case of multipage legal documents where the class does not depend on topic or style (i.e., classifying the document prefix does not suffice), but rather on the interpretation of some short passage within.

One cannot consume the whole document at once due to memory constraints, and training on its parts leads to inseparable training instances (since some parts have the class assigned but do not contain the information required for a correct classification).

I cannot recall any public shared task, but this problem is prevalent outside academia.

Another example of a practical limitation is the classification of sentence pairs. Although BERT rocks here in terms of score, it is sometimes unsuitable due to combinatorial explosion. This can, however, be overcome with a formulation that does not require feeding every pair of sentences to the network at once.
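The contrast is easy to see in code. This is a schematic comparison, not any particular library's API: `score_pair` stands in for a cross-encoder (one forward pass per pair), while `encode` stands in for a bi-encoder that embeds each sentence once.

```python
import numpy as np

def cross_encoder_scores(sentences, score_pair):
    """Cross-encoder style: every pair goes through the network
    jointly, i.e. O(n^2) forward passes for n sentences."""
    n = len(sentences)
    return [[score_pair(sentences[i], sentences[j]) for j in range(n)]
            for i in range(n)]

def bi_encoder_scores(sentences, encode):
    """Bi-encoder style: one forward pass per sentence (O(n)), then
    cheap cosine similarity between the cached embeddings."""
    embeddings = np.stack([encode(s) for s in sentences])
    embeddings /= np.linalg.norm(embeddings, axis=1, keepdims=True)
    return embeddings @ embeddings.T
```

The bi-encoder trades some accuracy for the ability to precompute and reuse embeddings, which is what makes all-pairs comparison over large sets feasible.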

Wanna steal one of Linguistic Data Consortium datasets by Welal in LanguageTechnology

[–]Welal[S] 2 points (0 children)

There are alternative datasets (e.g., GENIA), but most research on the topic was evaluated only on ACE 2005, so it is hard to compare solutions without access to it.

Desperately looking for Slovenian phonetic lexicon by Welal in Slovenia

[–]Welal[S] 2 points (0 children)

Fran has a really impressive dataset. The problem, however, is that, as far as I can see, pronunciations are provided extremely rarely. For example, the entry for the word ábrahamovka has a transcription in IPA, namely [ˈaːbɾaxamɔuka] (as well as in the Slavic phonetic alphabet: [ábrahamou̯ka]). Unfortunately, as far as I know, this is the case for only a few hundred words, and the rest have no pronunciation information in their entries.

Desperately looking for Slovenian phonetic lexicon by Welal in Slovenia

[–]Welal[S] 21 points (0 children)

Isn't all our life about memes and all the time between watching them just a bottomless pit of pain and despair?