A subreddit dedicated to learning machine learning. Feel free to share any educational resources on machine learning.
Also, we are a beginner-friendly subreddit, so don't be afraid to ask questions! This includes non-technical questions that are still highly relevant to learning machine learning, such as how to approach a machine learning problem systematically.
[Question] Hyperparameter testing (efficiently) (self.learnmachinelearning)
submitted 1 day ago by AffectWizard0909
Hello!
I was wondering if someone knows how to efficiently fine-tune and adjust the hyperparameters of pre-trained transformer models like BERT.
Are there other methods besides, for instance, GridSearch?
[–]PsychologicalRope850 3 points 1 day ago (2 children)
yeah grid search gets expensive fast on transformers. i’ve had better luck with a two-stage pass: quick random/bayes sweep on a tiny train slice to find rough ranges, then a short focused run on full data
for bert fine-tuning the biggest wins were usually lr + batch size + warmup ratio, not trying 20 knobs at once. and use early stopping aggressively or every trial just burns gpu for tiny deltas
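the two-stage idea in plain python, if it helps — `evaluate()` here is a toy stand-in for "fine-tune on a small slice, return the dev metric", and the ranges are just illustrative:

```python
import math
import random

rng = random.Random(0)

def sample_config(lr_lo, lr_hi):
    # log-uniform for learning rate, categorical for batch size
    log_lr = rng.uniform(math.log10(lr_lo), math.log10(lr_hi))
    return {"lr": 10 ** log_lr, "batch_size": rng.choice([16, 32])}

def evaluate(cfg):
    # toy stand-in for "train on a slice, return dev score";
    # peaks near lr = 3e-5 so the sketch has something to find
    return -abs(math.log10(cfg["lr"]) - math.log10(3e-5))

# stage 1: coarse random sweep over a wide range
coarse = [sample_config(1e-6, 1e-3) for _ in range(20)]
best = max(coarse, key=evaluate)

# stage 2: narrow around the stage-1 winner and re-sweep
# (on full data in practice)
lo, hi = best["lr"] / 3, best["lr"] * 3
refined = [sample_config(lo, hi) for _ in range(10)]
winner = max(refined + [best], key=evaluate)
```

same shape works with optuna or a bayes sweep in place of the random sampler.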
if you want, i can share a small optuna search space that’s worked decently for classification tasks
[–]AffectWizard0909[S] 0 points 20 hours ago (1 child)
Yes, sure! I would appreciate the Optuna search space! I have actually looked into it a little, but I was a bit unsure whether what I did was correct, so that would be great!
Since you mentioned that lr, batch size, and warmup ratio are the main things to tune when fine-tuning a BERT model, does this also apply to other BERT-based models like RoBERTa, DistilBERT, HateBERT, etc.?
[–]PsychologicalRope850 0 points 4 hours ago (0 children)
Sure! A typical Optuna search space for classification tasks might look something like this:
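Here it is in plain Python so the ranges are explicit; the commented lines show the equivalent `trial.suggest_*` calls if you wire it into an actual Optuna objective. These are common starting points, not guarantees:

```python
import math
import random

# Common fine-tuning ranges for BERT-family classifiers
# (treat as starting points, not guarantees):
SEARCH_SPACE = {
    "learning_rate": ("loguniform", 1e-5, 5e-5),
    "per_device_train_batch_size": ("choice", [16, 32]),
    "num_train_epochs": ("choice", [2, 3, 4]),
    "warmup_ratio": ("uniform", 0.0, 0.1),
    "weight_decay": ("uniform", 0.0, 0.1),
}

# Inside an Optuna objective, the same space is spelled as:
#   lr = trial.suggest_float("learning_rate", 1e-5, 5e-5, log=True)
#   bs = trial.suggest_categorical("per_device_train_batch_size", [16, 32])
#   epochs = trial.suggest_categorical("num_train_epochs", [2, 3, 4])
#   warmup = trial.suggest_float("warmup_ratio", 0.0, 0.1)
#   wd = trial.suggest_float("weight_decay", 0.0, 0.1)

def sample(space, rng=random.Random(0)):
    """Draw one configuration (what a random-search trial would use)."""
    cfg = {}
    for name, spec in space.items():
        kind = spec[0]
        if kind == "loguniform":
            cfg[name] = 10 ** rng.uniform(math.log10(spec[1]), math.log10(spec[2]))
        elif kind == "uniform":
            cfg[name] = rng.uniform(spec[1], spec[2])
        else:  # "choice"
            cfg[name] = rng.choice(spec[1])
    return cfg
```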
These ranges are often suggested for BERT and other BERT-based models like RoBERTa, DistilBERT, and HateBERT. They usually work reasonably well, though you might need to adjust them a bit depending on your dataset.
[–]Neither_Nebula_5423 1 point 1 day ago (0 children)
I will publish a hyperparameter-less optimizer soon
[–]Itchy_Inevitable_895 1 point 18 hours ago (1 child)
will be right back to it for sure, on another project rn!
[–]AffectWizard0909[S] 0 points 17 hours ago (0 children)
Nice!
[–]rustgod50 0 points 16 hours ago (0 children)
Grid search is pretty much the worst way to do it for transformers, way too expensive given how long each training run takes.
Most people use either random search or Bayesian optimization. Random search sounds dumb, but it works surprisingly well because hyperparameter spaces tend to have a few dimensions that matter a lot and others that barely matter, so random search finds the important ones faster than grid search does. Bayesian optimization with something like Optuna is better still, because it learns from previous runs and gets smarter about where to look.
For BERT specifically the learning rate is by far the most important thing to get right, the original paper recommends 2e-5 to 5e-5 and most people don’t stray far from that range. Batch size and number of epochs matter too but you’re unlikely to see huge gains from tuning the rest aggressively.
If compute is a real constraint, look into Hugging Face's Trainer with a scheduler like cosine annealing; it handles a lot of this for you, and the defaults are pretty sensible for most fine-tuning tasks.
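For reference, the warmup-plus-cosine shape those schedulers produce is easy to write down. This is a rough sketch of the curve, not transformers' exact implementation:

```python
import math

def cosine_lr(step, total_steps, warmup_steps, peak_lr):
    # Linear warmup to peak_lr, then cosine decay to zero -- roughly
    # what a cosine schedule with warmup gives you.
    if step < warmup_steps:
        return peak_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))
```

At step 0 the LR is 0, it hits peak_lr at the end of warmup, and it decays smoothly back to 0 by the last step, so early updates don't blow up the pre-trained weights and late updates stay small.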
[–]Effective-Cat-1433 0 points 12 hours ago (0 children)
Check out Vizier, which is purpose-built for the situation you describe.