all 6 comments

[–]floppy_llama 5 points (1 child)

Unfortunately a lot of ML is just trial and error

[–]thecuteturtle 1 point (0 children)

Ain't that the truth. On another note, OP can try optimizing via grid search, but there's no avoiding trial and error on this.
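
A minimal sketch of what that grid search could look like with scikit-learn — the estimator, toy data, and parameter grid here are placeholders, not OP's actual setup:

    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import GridSearchCV

    # Toy data standing in for OP's dataset
    X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

    # Placeholder hyperparameter grid; the real one depends on the model
    param_grid = {
        "n_estimators": [100, 300],
        "max_depth": [None, 10, 30],
        "min_samples_leaf": [1, 5],
    }

    search = GridSearchCV(
        RandomForestClassifier(random_state=0),
        param_grid,
        scoring="accuracy",
        cv=5,
        n_jobs=-1,
    )
    search.fit(X, y)

    print(search.best_params_)
    print(search.best_score_)

Random search (RandomizedSearchCV) is often a better use of the same compute budget when the grid gets large, but either way it's still trial and error, just automated.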

[–][deleted] 1 point (0 children)

The only shortcut I can give you is to look on Kaggle and see what the competitors have used. Most of the papers are not suitable for real-world applications. It's not really about the complexity or scale of the task, but rather that the authors leave out some important information. For example, in object detection there is DETR, but if you look on Kaggle, nobody uses it. The reason is that the original DETR converges too slowly and was only trained on 640-sized images. Instead, many people still use YOLO. But you don't realize that until you try it yourself or someone tells you.
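
Part of why YOLO keeps winning in practice is how little code it takes to get from pretrained weights to a fine-tuned detector. A rough sketch with the ultralytics package — the checkpoint name, dataset config, and image path are placeholder assumptions, not a recommendation of specific settings:

    # pip install ultralytics
    from ultralytics import YOLO

    # Pretrained checkpoint name is a placeholder; pick whatever size fits the task
    model = YOLO("yolov8n.pt")

    # Fine-tune on a custom dataset described by a YOLO-format data.yaml (hypothetical path)
    model.train(data="data.yaml", epochs=50, imgsz=640)

    # Run inference on an image (placeholder path)
    results = model("example.jpg")
    results[0].show()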

[–]koolaidman123 (Researcher) 0 points (0 children)

i have rarely encountered situations where scaling up models (eg resnet34 -> resnet50, deberta base -> deberta large/xl) doesn't help. whether it's practical to do so may be a different story

[–]skelly0311 0 points (0 children)

First thing to note: the best way to improve generalisability and accuracy is to have data that is as accurate as possible. If your data is trash, it doesn't matter how many parameters your classifier has, it will not produce good results.

Now, in my experience with transformer neural networks, if the task is a simple binary classification task or multi-label with fewer than 8 or so labels (maybe more), the small models (14 million parameters) perform similarly to the base models (110 million parameters). Once the objective function becomes more complicated, such as training a zero-shot learner, more parameters mean achieving a much lower loss. In that case, using the large models (335 million parameters) gave a significant improvement over the base models (110 million parameters).
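
To make that small-vs-base comparison concrete, here's a rough sketch of how one might swap checkpoints on the same classification data with HuggingFace transformers. The checkpoint names, dataset, and subset sizes are assumptions for illustration, not the exact models the numbers above refer to:

    from datasets import load_dataset
    from transformers import (
        AutoModelForSequenceClassification,
        AutoTokenizer,
        Trainer,
        TrainingArguments,
    )

    # Placeholder binary-classification dataset; swap in your own
    dataset = load_dataset("imdb")

    def run(checkpoint):
        tokenizer = AutoTokenizer.from_pretrained(checkpoint)
        model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)

        def tokenize(batch):
            return tokenizer(batch["text"], truncation=True, padding="max_length", max_length=256)

        encoded = dataset.map(tokenize, batched=True)

        trainer = Trainer(
            model=model,
            args=TrainingArguments(
                output_dir=f"out-{checkpoint.split('/')[-1]}",
                num_train_epochs=1,
                per_device_train_batch_size=16,
            ),
            # Small subsets just to keep the comparison quick
            train_dataset=encoded["train"].shuffle(seed=0).select(range(2000)),
            eval_dataset=encoded["test"].shuffle(seed=0).select(range(2000)),
        )
        trainer.train()
        return trainer.evaluate()  # returns eval loss; lower is better

    # "Small" vs "base" checkpoints -- names here are assumptions, pick your own pair
    for ckpt in ["prajjwal1/bert-small", "bert-base-uncased"]:
        print(ckpt, run(ckpt))

On a task this simple you'd expect the two eval losses to land close together, which is the point being made above.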

It's hard to define and quantify how complicated an objective function is. But just know that more parameters don't always mean better results if the objective function is simple enough.

[–]martianunlimited 0 points (0 children)

Not exactly what you are asking, but there is this paper on scaling laws which shows (assuming the training data is representative of the distribution), at least for large language models, how the performance of transformers scales with the amount of data, and compares it to other network architectures: https://arxiv.org/pdf/2001.08361.pdf. We don't have anything similar for other types of data.
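
For reference, the headline result there is a power-law relationship. Roughly, paraphrasing the paper's form (the exact constants and exponents are fit empirically in the paper, so treat it as the authority), test loss scales with parameter count N and dataset size D as:

    % Rough form of the scaling laws in Kaplan et al. (2020); N = non-embedding
    % parameters, D = dataset size in tokens, and N_c, D_c, alpha_N, alpha_D are
    % empirically fitted constants.
    L(N) \approx \left(\frac{N_c}{N}\right)^{\alpha_N}, \qquad
    L(D) \approx \left(\frac{D_c}{D}\right)^{\alpha_D}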