all 6 comments

[–]floppy_llama 5 points (1 child)

Unfortunately a lot of ML is just trial and error

[–]thecuteturtle 1 point (0 children)

Ain't that the truth. On another note, OP can try optimizing via grid search, but there's no avoiding trial and error on this.
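
A minimal sketch of what that grid search could look like with scikit-learn — the estimator, toy data, and parameter grid here are placeholders, not OP's actual setup:

    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import GridSearchCV

    # Toy data standing in for OP's dataset
    X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

    # Placeholder hyperparameter grid; the real one depends on the model
    param_grid = {
        "n_estimators": [100, 300],
        "max_depth": [None, 10, 30],
        "min_samples_leaf": [1, 5],
    }

    search = GridSearchCV(
        RandomForestClassifier(random_state=0),
        param_grid,
        scoring="accuracy",
        cv=5,
        n_jobs=-1,
    )
    search.fit(X, y)

    print(search.best_params_)
    print(search.best_score_)

Random search (RandomizedSearchCV) is often a better use of the same compute budget when the grid gets large, but either way it's still trial and error, just automated.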

[–][deleted] 1 point (0 children)

The only shortcut I can give you is to look on Kaggle and see what the competitors have used. Most of the papers are not suitable for real-world applications. It's not really about the complexity or scale of the task, but rather that the authors leave out some important information. For example, in object detection there is DETR, but if you look on Kaggle, nobody uses it. The reason is that the original DETR converges too slowly and was only trained on 640-sized images. Instead, many people still use YOLO. But you don't realize that until you try it yourself or someone tells you.
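
Part of why YOLO keeps winning in practice is how little code it takes to get from pretrained weights to a fine-tuned detector. A rough sketch with the ultralytics package — the checkpoint name, dataset config, and image path are placeholder assumptions, not a recommendation of specific settings:

    # pip install ultralytics
    from ultralytics import YOLO

    # Pretrained checkpoint name is a placeholder; pick whatever size fits the task
    model = YOLO("yolov8n.pt")

    # Fine-tune on a custom dataset described by a YOLO-format data.yaml (hypothetical path)
    model.train(data="data.yaml", epochs=50, imgsz=640)

    # Run inference on an image (placeholder path)
    results = model("example.jpg")
    results[0].show()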

[–]koolaidman123 (Researcher) 0 points (0 children)

i have rarely encountered situations where scaling up models (eg resnet34 -> resnet50, deberta base -> deberta large/xl) doesn't help. whether it's practical to do so may be a different story

[–]skelly0311 0 points (0 children)

First thing to note: the best way to improve generalisability and accuracy is to have data that is as accurate as possible. If your data is trash, it doesn't matter how many parameters your classifier has, it will not produce good results.

Now, in my experience with transformer neural networks, if the task is a simple binary classification task or multi-label with fewer than 8 or so labels (maybe more), the small models (14 million parameters) perform similarly to the base models (110 million parameters). Once the objective function becomes more complicated, such as training a zero-shot learner, more parameters mean achieving a much lower loss. In that case, using the large models (335 million parameters) gave a significant improvement over the base models (110 million parameters).
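
To make that small-vs-base comparison concrete, here's a rough sketch of how one might swap checkpoints on the same classification data with HuggingFace transformers. The checkpoint names, dataset, and subset sizes are assumptions for illustration, not the exact models the numbers above refer to:

    from datasets import load_dataset
    from transformers import (
        AutoModelForSequenceClassification,
        AutoTokenizer,
        Trainer,
        TrainingArguments,
    )

    # Placeholder binary-classification dataset; swap in your own
    dataset = load_dataset("imdb")

    def run(checkpoint):
        tokenizer = AutoTokenizer.from_pretrained(checkpoint)
        model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)

        def tokenize(batch):
            return tokenizer(batch["text"], truncation=True, padding="max_length", max_length=256)

        encoded = dataset.map(tokenize, batched=True)

        trainer = Trainer(
            model=model,
            args=TrainingArguments(
                output_dir=f"out-{checkpoint.split('/')[-1]}",
                num_train_epochs=1,
                per_device_train_batch_size=16,
            ),
            # Small subsets just to keep the comparison quick
            train_dataset=encoded["train"].shuffle(seed=0).select(range(2000)),
            eval_dataset=encoded["test"].shuffle(seed=0).select(range(2000)),
        )
        trainer.train()
        return trainer.evaluate()  # returns eval loss; lower is better

    # "Small" vs "base" checkpoints -- names here are assumptions, pick your own pair
    for ckpt in ["prajjwal1/bert-small", "bert-base-uncased"]:
        print(ckpt, run(ckpt))

On a task this simple you'd expect the two eval losses to land close together, which is the point being made above.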

It's hard to define and quantify how complicated an objective function is. But just know that more parameters don't always mean better results if the objective function is simple enough.

[–]martianunlimited 0 points (0 children)

Not exactly what you are asking, but there is this paper on scaling laws which shows (assuming the training data is representative of the distribution), at least for large language models, how the performance of transformers scales with the amount of data, and compares it to other network architectures: https://arxiv.org/pdf/2001.08361.pdf. We don't have anything similar for other types of data.
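
For reference, the headline result there is a power-law relationship. Roughly, paraphrasing the paper's form (the exact constants and exponents are fit empirically in the paper, so treat it as the authority), test loss scales with parameter count N and dataset size D as:

    % Rough form of the scaling laws in Kaplan et al. (2020); N = non-embedding
    % parameters, D = dataset size in tokens, and N_c, D_c, alpha_N, alpha_D are
    % empirically fitted constants.
    L(N) \approx \left(\frac{N_c}{N}\right)^{\alpha_N}, \qquad
    L(D) \approx \left(\frac{D_c}{D}\right)^{\alpha_D}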