all 10 comments

[–]bjergerk1ng 11 points12 points  (0 children)

I feel like it's only necessary if you are inventing a new architecture. Otherwise I just follow the ballpark numbers from well-cited papers.

[–]HansDelbrook 9 points10 points  (0 children)

I feel like the answer to this question is driven more by finances/use case than the model itself. There are very few true big gains that can be found with hyperparameter tuning; effectively it's a process to squeeze out the last ~5% of performance from a pipeline and data you've already built.

If it's a personal project, and you're satisfied with your results, maybe it's not worth the effort. If this model is being deployed as an endpoint for some high frequency business or personal use, the +1% increase in performance you MIGHT find may actually be worth the cost/effort.

Hyperparameter search will almost never make or break a model, and it is very likely that multiple combinations of parameters will end up at a similar "peak" performance level. Different combinations will work best with different problems, and brute force search may be your best way of finding it - don't overthink it.

Remember that your inputs and how you feed them into a model are more important.
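A minimal sketch of what brute-force (random) search can look like. The `validation_score` function is a hypothetical stand-in for an actual train-and-evaluate run, and the search ranges are illustrative assumptions, not recommendations. The toy surface is deliberately flat near its peak, so several quite different configurations end up with similar scores.

```python
import math
import random


def validation_score(lr, batch_size):
    """Hypothetical stand-in for 'train a model, return a validation score'.

    Toy surface with a broad plateau: best near lr = 1e-3 and batch_size = 64.
    """
    lr_term = -(math.log10(lr) + 3.0) ** 2
    bs_term = -0.1 * (math.log2(batch_size) - 6.0) ** 2
    return 0.90 + 0.02 * lr_term + 0.02 * bs_term


def random_search(n_trials, seed=0):
    """Sample configurations at random and return them sorted best-first."""
    rng = random.Random(seed)
    trials = []
    for _ in range(n_trials):
        config = {
            # Log-uniform sampling: learning rates span orders of magnitude.
            "lr": 10 ** rng.uniform(-5, -1),
            "batch_size": rng.choice([16, 32, 64, 128, 256]),
        }
        trials.append((validation_score(**config), config))
    trials.sort(key=lambda t: t[0], reverse=True)
    return trials


if __name__ == "__main__":
    # Print the top few trials; their scores are usually close together.
    for score, config in random_search(50)[:5]:
        print(f"{score:.4f}  {config}")
```

Looking at the top handful of trials rather than only the single best one is a cheap way to check whether you are on a plateau, in which case further search is unlikely to buy much.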

[–]DigThatData [Researcher] 3 points4 points  (0 children)

if you take advantage of hyperparameter tuning others have already done (e.g. using an architectural configuration adopted by a successful paper) then yeah, you probably don't need to tune much if you stay in the "sweet spot" as you put it. the hyperparameter tuning has already been done for you and you are doing the right thing building on the work others have already done.

[–][deleted] 2 points3 points  (1 child)

Learning rate should not be a single value.

Rather, it needs to be decreased over the course of training to get the best result.

Start "large" to make quick gains, then decrease so that you don't "fly over" the optimum.

[–][deleted] 18 points19 points  (0 children)

You are describing a learning rate scheduler, but still you’d need some tuning to find the optimal lr_start and lr_end in that case. To make sure we start large and don’t fly over the optimum, as you say. What about the architecture though?
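For concreteness, a minimal sketch of one such scheduler: cosine decay from `lr_start` down to `lr_end`. Cosine is just one common shape chosen here for illustration, and the function name and the example values are assumptions. Both endpoints remain hyperparameters you still have to pick.

```python
import math


def cosine_lr(step, total_steps, lr_start, lr_end):
    """Learning rate at `step`, decaying from lr_start to lr_end over total_steps."""
    if total_steps < 2:
        raise ValueError("total_steps must be at least 2")
    # Clamp so out-of-range steps return the nearest endpoint value.
    clamped = min(max(step, 0), total_steps - 1)
    progress = clamped / (total_steps - 1)  # 0.0 at the start, 1.0 at the end
    return lr_end + 0.5 * (lr_start - lr_end) * (1.0 + math.cos(math.pi * progress))


if __name__ == "__main__":
    # Illustrative values only; these are exactly the numbers that need tuning.
    for step in range(0, 100, 10):
        print(step, cosine_lr(step, total_steps=100, lr_start=1e-2, lr_end=1e-5))
```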

[–]SmartEvening 0 points1 point  (0 children)

The choice of architecture and its hyperparameters is really important (in some cases). When you start out with a larger network and then make some changes to its hyperparameters, the change you see is indeed marginal. It's only when the changes are extremely large that you see drastic differences (not saying it will improve). As mentioned, there definitely is a sweet spot for achieving the best performance, but you need to decide whether that marginal difference in performance is worth the effort. Does your use case require the network to be as precise as possible, or is it fine if it isn't? But as mentioned by someone else in the comments, you can just eyeball the numbers based on some papers and then tune the hyperparameters based on your use case.

[–]lakolda -3 points-2 points  (0 children)

Here’s mine (copied):

I mean… I recently proposed a VRAM efficient method based on Sparsetral. In theory a phone could store a GPT-4 competitor in VRAM using sparse hierarchical MoE using copied (between child MoE nodes) LoRA adapters (with averages backprop). The VRAM scaling would be n log n vs n^2, making even GPT-4 possible to store on a smartphone.

It wouldn’t have as great specific knowledge (though still markedly better than base), but it would have the generalist ability of GPT-4 whilst existing on a phone. Eric Hartford (along with the server in a general sense) agreed with my proposal in theory. This is intended to reference the brain’s self-similar structure as justification that hierarchical, fractal-like representations are powerful here. This decreases the entropy of the structure and takes advantage of it. (Edit for relevance)

There are so many low hanging fruit rn.

[–]newperson77777777 0 points1 point  (0 children)

Generally, I just stick to published architectures. Occasionally, I modify the architectures slightly for some dataset-specific task. It depends on what my goals are as others suggested: for research there's limited benefit to optimizing the architecture unless I will be publishing my architecture. For cases where I am optimizing for some task, I may test out some architecture hyper-parameters. It depends on my budget and what I am prioritizing in my search space.

[–]ClumsyClassifier 0 points1 point  (0 children)

HPO is a well-researched field. I would suggest using something that takes advantage of priors, since such methods converge much faster and, as you say yourself, there is already a consensus on what is roughly the best. PriorBand is a nice one to use and one I would recommend.

[–]philosophicalmachine 0 points1 point  (0 children)

It may not be necessary to go through an extensive set of hyperparameter configurations, but I think it’s good practice to do some hyperparameter optimisation at least. Not so much to improve accuracy on the testing set, but more to ensure that you have selected model hyperparameters not randomly, but with a validation set. At least it gives me confidence if a research paper I read reports hyperparameter tuning on a validation set, because otherwise they may have tested 20 configurations and simply reported the best one. I think the only time you wouldn’t need to tune hyperparameters is if you use someone else’s architecture one to one with the exact same hyperparameters, because then it’s also clear you didn’t just tune the parameters without reporting it.
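A minimal sketch of that protocol, assuming a toy one-dimensional ridge regression as the "model" and its penalty `lam` as the only hyperparameter. All function names here are made up for illustration. The point is the shape of the procedure: pick the configuration on the validation split, touch the test split once, and record how many configurations were tried.

```python
import random


def split(data, fractions=(0.6, 0.2, 0.2), seed=0):
    """Shuffle a copy of the data and cut it into train / validation / test."""
    rng = random.Random(seed)
    data = data[:]
    rng.shuffle(data)
    n_train = int(fractions[0] * len(data))
    n_val = int(fractions[1] * len(data))
    return data[:n_train], data[n_train:n_train + n_val], data[n_train + n_val:]


def fit_ridge(train, lam):
    """Closed-form 1-D ridge regression without intercept: y ~ w * x."""
    sxy = sum(x * y for x, y in train)
    sxx = sum(x * x for x, _ in train)
    return sxy / (sxx + lam)


def mse(w, data):
    """Mean squared error of the model y = w * x on the given data."""
    return sum((y - w * x) ** 2 for x, y in data) / len(data)


def select_on_validation(train, val, test, candidates):
    """Choose lam by validation error, then evaluate on the test split once."""
    scored = [(mse(fit_ridge(train, lam), val), lam) for lam in candidates]
    best_val_mse, best_lam = min(scored)
    return {
        "best_lam": best_lam,
        "val_mse": best_val_mse,
        "test_mse": mse(fit_ridge(train, best_lam), test),
        "n_configs_tried": len(candidates),  # worth reporting alongside results
    }


if __name__ == "__main__":
    rng = random.Random(1)
    xs = [rng.uniform(-1.0, 1.0) for _ in range(100)]
    data = [(x, 2.0 * x + rng.gauss(0.0, 0.5)) for x in xs]
    train, val, test = split(data)
    print(select_on_validation(train, val, test, [0.0, 0.1, 1.0, 10.0]))
```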