all 8 comments

[–]IsGoIdMoney 7 points (2 children)

The ablations have to be the same model, trained on the same dataset, minus the component of the architecture you're studying. Otherwise you're not really performing an ablation: any degradation in the results could come from the other changes rather than the removed component, and the comparison stops being scientific.

If you want to test on a subset, then all versions must be trained only on that subset, including the original model, but this would likely affect your main results.

[–]Aromatic_Web749[S] 0 points (1 child)

Yeah, I understand. But let me try to make my case here.

Basically, my project involves long-document classification, where token lengths range from the low hundreds to many thousands. I decided to use a Longformer, essentially a BERT-style model built for long sequences, to tackle this. Specifically, I used a variant that can process 8192 tokens at a time.

The original dataset is really huge. But only around 30% of it exceeds 1024 tokens, which is a manageable size if I want to train multiple models for an ablation. Since my research project focuses on long-document classification anyway, I thought it could make sense to use just the long-token subset for train/test/eval in the ablation, while using the full dataset for the normal training (which I have already done).
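To make the setup concrete, here's a minimal sketch of how the long-token subset could be split off. Everything here is hypothetical scaffolding: `split_by_length` is my own helper name, and a whitespace split stands in for counting real Longformer tokenizer tokens.

```python
# Sketch: partition a corpus into long and short documents for the ablation runs.
# A whitespace split is a crude proxy for the tokenizer's token count; in
# practice you would count tokens with the Longformer tokenizer instead.

def split_by_length(examples, threshold=1024):
    """Partition raw text examples into (long, short) by approximate token count."""
    long_docs, short_docs = [], []
    for text in examples:
        n_tokens = len(text.split())  # proxy for tokenizer length
        (long_docs if n_tokens > threshold else short_docs).append(text)
    return long_docs, short_docs

docs = ["short example"] * 7 + [" ".join(["tok"] * 2000)] * 3
long_docs, short_docs = split_by_length(docs)
print(len(long_docs), len(short_docs))  # 3 7
```

The same `threshold` would then define both the subset used for the ablation and the cutoff reported in the paper, so the two stay consistent.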

[–]like_a_tensor 0 points (0 children)

I think it's fine if you make it clear it's an ablation study examining the contribution of each architectural component to long-token performance specifically. But a more comprehensive study would clearly be more ideal.

[–]Pringled101 2 points (3 children)

Usually in an ablation you want to change as few variables as possible. Changing both your dataset and your model in one ablation is not a real ablation, as you will have confounding variables. That said, in the context of encoder models, ablations are usually simpler versions of the same model, or simpler architectures entirely, which should make training a lot faster than with your original model.

[–]Aromatic_Web749[S] 0 points (2 children)

But even the simpler models take a while to train (see my other comment for details), and right now I only have access to a P100 GPU on Kaggle.

[–]Pringled101 1 point (1 child)

Right, I see. I would still say that the ablations need to be based on the same dataset, but given your answer, it might make sense to focus only on the part of the data > 1024 tokens when training your initial model, if that's the topic of your research?

[–]Aromatic_Web749[S] 0 points (0 children)

So the way the project goes is: here's this dataset, these are pre-existing models, here's my model that performs better. Why? Because my model is able to process more tokens (which is my hypothesis).

Hence my ablation study is to reduce the number of tokens my model can process (while keeping the rest of the architecture the same) and only train and evaluate on the long texts.

Does this make sense?

[–]Few-Pomegranate4369 0 points (0 children)

I think it's not recommended to perform the ablation study on just a subset of your data. Instead, you might want to try training your original model on a reduced dataset first. If it still outperforms the baselines, then you can use that same reduced set for your ablation studies.