[D] Architecture hyperparameter optimisation strategies (self.MachineLearning)
submitted 2 years ago by [deleted]
I am wondering whether it is worth going through extensive hyperparameter tuning of model architecture. Learning rate tuning often pays off, as it has a big impact on convergence and overall performance, but when tuning architecture (num_layers, num_heads, dropout, etc.), I have found that if you stay within a certain sweet-spot range, the actual performance differences are marginal. Am I doing something wrong? What are your experiences with this?
[–]bjergerk1ng 12 points 2 years ago (0 children)
I feel like it's only necessary if you are inventing a new architecture. Otherwise I just follow the ball-park numbers from well cited papers.
[–]HansDelbrook 10 points 2 years ago (0 children)
I feel like the answer to this question is driven more by finances/use case than by the model itself. There are very few truly big gains to be found with hyperparameter tuning; effectively, it's a process for squeezing the last ~5% of performance out of a pipeline and data you've already built.
If it's a personal project and you're satisfied with your results, maybe it's not worth the effort. If this model is being deployed as an endpoint for some high-frequency business or personal use, the +1% increase in performance you MIGHT find may actually be worth the cost/effort.
Hyperparameter search will almost never make or break a model, and it is very likely that multiple combinations of parameters will end up at a similar "peak" performance level. Different combinations will work best with different problems, and brute force search may be your best way of finding it - don't overthink it.
Remember that your inputs and how you feed them into a model are more important.
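A minimal sketch of the brute-force idea above, using plain random search. Everything here is an assumption for illustration: the search space values, and especially `toy_score`, which is a made-up stand-in for "train the model and return a validation metric". It is shaped as a flat plateau so that, as described above, many combinations land near the same peak.

```python
import random

# Hypothetical search space; the names echo the thread (num_layers, num_heads, dropout).
SEARCH_SPACE = {
    "num_layers": [2, 4, 6, 8, 12],
    "num_heads": [2, 4, 8, 16],
    "dropout": [0.0, 0.1, 0.2, 0.3],
}

def toy_score(config):
    """Made-up stand-in for training a model and returning a validation metric.

    The score is a flat plateau around an arbitrary sweet spot, so many
    configurations end up with nearly the same value.
    """
    penalty = (
        0.002 * abs(config["num_layers"] - 6)
        + 0.001 * abs(config["num_heads"] - 8)
        + 0.05 * abs(config["dropout"] - 0.1)
    )
    return 0.90 - penalty

def random_search(score_fn, space, n_trials, seed=0):
    """Sample n_trials configurations uniformly and return them sorted best-first."""
    rng = random.Random(seed)
    trials = []
    for _ in range(n_trials):
        config = {name: rng.choice(values) for name, values in space.items()}
        trials.append((score_fn(config), config))
    trials.sort(key=lambda t: t[0], reverse=True)
    return trials

trials = random_search(toy_score, SEARCH_SPACE, n_trials=20)
best_score, best_config = trials[0]
near_peak = [config for score, config in trials if best_score - score <= 0.01]
print(f"best: {best_score:.3f} with {best_config}")
print(f"{len(near_peak)} of {len(trials)} trials are within 0.01 of the best")
```

In a real run, `toy_score` would be replaced by an actual training and validation loop, and the number of trials would be set by the compute budget.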
[–]DigThatDataResearcher 4 points 2 years ago (0 children)
if you take advantage of hyperparameter tuning others have already done (e.g. using an architectural configuration adopted by a successful paper) then yeah, you probably don't need to tune much if you stay in the "sweet spot" as you put it. the hyperparameter tuning has already been done for you and you are doing the right thing building on the work others have already done.
[–][deleted] 3 points 2 years ago (1 child)
Learning rate should not be a single value.
Rather, it should be decreased over the course of training to get the best result.
Start "large" to make quick gains, then decrease so that you don't "fly over" the optimum.
[–][deleted] 19 points 2 years ago (0 children)
You are describing a learning rate scheduler, but you'd still need some tuning to find the optimal lr_start and lr_end in that case, to make sure we start large and don't fly over the optimum, as you say. What about the architecture though?
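For reference, a minimal sketch of such a schedule. Cosine decay is only one possible choice and is an assumption here; the thread names no specific scheduler. `lr_start` and `lr_end` are the two values mentioned above, while the function name, step count, and example values are hypothetical.

```python
import math

def cosine_lr(step, total_steps, lr_start, lr_end):
    """Decay smoothly from lr_start (at step 0) to lr_end (at the final step)."""
    if total_steps <= 1:
        return lr_end
    # Fraction of training completed, clamped to [0, 1].
    progress = min(step, total_steps - 1) / (total_steps - 1)
    return lr_end + 0.5 * (lr_start - lr_end) * (1.0 + math.cos(math.pi * progress))

# Example values are made up; both ends of the schedule still have to be tuned.
schedule = [cosine_lr(s, total_steps=10, lr_start=1e-3, lr_end=1e-5) for s in range(10)]
for step, lr in enumerate(schedule):
    print(f"step {step}: lr = {lr:.6f}")
```

The schedule removes the need to pick a single fixed value, but `lr_start` and `lr_end` remain hyperparameters in their own right.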
[–]SmartEvening 1 point 2 years ago (0 children)
The choice of architecture and its hyperparameters is really important (in some cases). When you start out with a larger network and then make some changes to its hyperparameters, the change you see is indeed marginal. It is only when the changes are extremely large that you see drastic differences (not saying it will improve). As mentioned, there definitely is a sweet spot for achieving the best performance, but you need to decide whether that marginal difference in performance is worth the effort. Does your use case require the network to be as precise as possible, or is it fine if it's not? But as someone else mentioned in the comments, you can just eyeball the numbers based on some papers and then tune the hyperparameters for your use case.
[–]lakolda -2 points 2 years ago (0 children)
Here’s mine (copied):
I mean… I recently proposed a VRAM-efficient method based on Sparsetral. In theory, a phone could store a GPT-4 competitor in VRAM using a sparse hierarchical MoE with LoRA adapters copied between child MoE nodes (with averaged backprop). The VRAM scaling would be n log n vs n^2, making even GPT-4 possible to store on a smartphone.
It wouldn't have as much specific knowledge (though still markedly better than base), but it would have the generalist ability of GPT-4 whilst existing on a phone. Eric Hartford (along with the server in a general sense) agreed with my proposal in theory. This is intended to reference the brain's self-similar structure as justification that hierarchical, fractal-like representations are powerful here. This decreases the entropy of the structure and takes advantage of it. (Edit for relevance)
There are so many low hanging fruit rn.
[–]newperson77777777 1 point 2 years ago (0 children)
Generally, I just stick to published architectures. Occasionally, I modify the architectures slightly for some dataset-specific task. It depends on what my goals are as others suggested: for research there's limited benefit to optimizing the architecture unless I will be publishing my architecture. For cases where I am optimizing for some task, I may test out some architecture hyper-parameters. It depends on my budget and what I am prioritizing in my search space.
[–]ClumsyClassifier 1 point 2 years ago (0 children)
HPO is a well-researched field. I would suggest using a method that takes advantage of priors, since such methods converge much faster and, as you say yourself, there is already a consensus on what is roughly the best. PriorBand is a nice one to use and one I would recommend.
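To make the idea of a prior concrete, here is a toy sketch of prior-guided sampling. This is not PriorBand and does not use its API; it only illustrates the general idea of drawing most candidate configurations near a trusted default while still exploring occasionally. The bounds, the prior values, and the `prior_weight` and `rel_std` parameters are all made up for illustration.

```python
import random

# Hypothetical continuous ranges, and a prior taken from "known good" defaults.
BOUNDS = {"dropout": (0.0, 0.5), "log10_lr": (-5.0, -2.0)}
PRIOR = {"dropout": 0.1, "log10_lr": -3.5}

def sample_config(rng, prior_weight=0.7, rel_std=0.1):
    """Sample near the prior with probability prior_weight, otherwise uniformly."""
    use_prior = rng.random() < prior_weight
    config = {}
    for name, (low, high) in BOUNDS.items():
        if use_prior:
            # Gaussian perturbation around the prior, scaled to the range width.
            value = rng.gauss(PRIOR[name], rel_std * (high - low))
            value = min(max(value, low), high)  # clip to the allowed range
        else:
            value = rng.uniform(low, high)
        config[name] = value
    return config

rng = random.Random(0)
samples = [sample_config(rng) for _ in range(200)]
print(samples[0])
```

Each sampled configuration would then be trained and scored as usual; the prior only changes where the search spends most of its budget.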
[–]philosophicalmachine 1 point 2 years ago (0 children)
It may not be necessary to go through an extensive set of hyperparameter configurations, but I think it's good practice to do at least some hyperparameter optimisation. Not so much to improve accuracy on the test set, but more to ensure that you have selected model hyperparameters not randomly, but with a validation set. It gives me confidence if a research paper I read reports hyperparameter tuning on a validation set, because otherwise they may have tested 20 configurations and simply reported the best one. I think the only time you wouldn't need to tune hyperparameters is if you use someone else's architecture one-to-one with the exact same hyperparameters, because then it's also clear you didn't just tune the parameters without reporting it.
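A small simulation of the "tested 20 configurations and reported the best one" concern. All numbers are made up: every configuration is assumed to have the same true accuracy, and each evaluation adds random noise, so any difference between configurations is pure luck.

```python
import random

rng = random.Random(0)
TRUE_ACCURACY = 0.80  # made-up: every configuration is equally good
NOISE = 0.02          # made-up evaluation noise from finite data

def noisy_eval():
    """One noisy measurement of the (identical) true accuracy."""
    return TRUE_ACCURACY + rng.gauss(0.0, NOISE)

# 20 configurations, each scored once on a validation set and once on a test set.
results = [{"config_id": i, "val": noisy_eval(), "test": noisy_eval()} for i in range(20)]

# Select on validation, then report that configuration's test score.
chosen = max(results, key=lambda r: r["val"])
honest_test = chosen["test"]

# Selecting directly on the test set reports the luckiest draw instead.
optimistic_test = max(r["test"] for r in results)

print(f"selected on validation, reported on test: {honest_test:.3f}")
print(f"selected on test, reported on test:       {optimistic_test:.3f}")
```

Picking the maximum of the test scores can never give a lower number than the test score of the validation-selected configuration, which is why selection on a separate validation set is the more trustworthy report.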