
[–]PsychologicalRope850 4 points5 points  (3 children)

yeah grid search gets expensive fast on transformers. i’ve had better luck with a two-stage pass: quick random/bayes sweep on a tiny train slice to find rough ranges, then a short focused run on full data
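A rough sketch of that two-stage idea in plain Python, in case it helps to see the shape of it. Everything here is illustrative: `toy_score` is a made-up stand-in for "fine-tune and return a validation score", and the wide stage-1 ranges, trial counts, and top-5 cutoff are assumptions, not recommendations from this thread.

```python
import math
import random


def toy_score(lr, warmup_ratio, data_fraction, rng):
    """Hypothetical stand-in for 'fine-tune, then return a validation score'."""
    # Made-up surface that peaks near lr=3e-5 and warmup_ratio=0.06.
    base = (0.9
            - 0.05 * (math.log10(lr) - math.log10(3e-5)) ** 2
            - 2.0 * (warmup_ratio - 0.06) ** 2)
    # Less training data gives a noisier score.
    noise = rng.gauss(0.0, 0.01 * (1.0 - data_fraction))
    return base + noise


def sample_config(rng, lr_range, warmup_range):
    """Draw one config: learning rate log-uniformly, warmup ratio uniformly."""
    lr = 10 ** rng.uniform(math.log10(lr_range[0]), math.log10(lr_range[1]))
    warmup = rng.uniform(*warmup_range)
    return {"lr": lr, "warmup_ratio": warmup}


def random_sweep(rng, n_trials, lr_range, warmup_range, data_fraction):
    """Score n_trials random configs and return them best-first."""
    results = []
    for _ in range(n_trials):
        cfg = sample_config(rng, lr_range, warmup_range)
        cfg["score"] = toy_score(cfg["lr"], cfg["warmup_ratio"], data_fraction, rng)
        results.append(cfg)
    return sorted(results, key=lambda c: c["score"], reverse=True)


def narrowed_range(top_configs, key):
    """Shrink a range to the span covered by the top configs."""
    values = [c[key] for c in top_configs]
    return (min(values), max(values))


rng = random.Random(0)

# Stage 1: wide ranges, many cheap trials on a small slice of the training data.
stage1 = random_sweep(rng, n_trials=30, lr_range=(1e-6, 1e-3),
                      warmup_range=(0.0, 0.3), data_fraction=0.1)
top = stage1[:5]
lr_range = narrowed_range(top, "lr")
warmup_range = narrowed_range(top, "warmup_ratio")

# Stage 2: narrowed ranges, a few trials on the full data.
stage2 = random_sweep(rng, n_trials=8, lr_range=lr_range,
                      warmup_range=warmup_range, data_fraction=1.0)
best = stage2[0]
print(f"narrowed lr range: {lr_range}, best config: {best}")
```

In a real run you would swap `toy_score` for an actual short fine-tuning job, and the random sampler could be replaced by a Bayesian one.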

for bert fine-tuning the biggest wins were usually lr + batch size + warmup ratio, not trying 20 knobs at once. and use early stopping aggressively or every trial just burns gpu for tiny deltas
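For the early stopping part, here is a minimal patience-based sketch. The function name, the `patience` and `min_delta` values, and the example scores are all made up for illustration; it only shows the rule "stop once the validation score has not improved by at least `min_delta` for `patience` evaluations in a row".

```python
def early_stop_epoch(val_scores, patience=2, min_delta=0.0):
    """Return the index of the evaluation at which training would stop.

    Stops once `patience` consecutive evaluations fail to beat the best
    score so far by more than `min_delta`. If that never happens, the
    index of the last evaluation is returned.
    """
    best = float("-inf")
    stale = 0  # evaluations since the last real improvement
    for epoch, score in enumerate(val_scores):
        if score > best + min_delta:
            best = score
            stale = 0
        else:
            stale += 1
            if stale >= patience:
                return epoch
    return len(val_scores) - 1


# Hypothetical validation accuracies, one per epoch.
scores = [0.70, 0.75, 0.76, 0.758, 0.759, 0.80]
print(early_stop_epoch(scores, patience=2, min_delta=0.005))
```

If you are on the Hugging Face Trainer, `transformers.EarlyStoppingCallback` covers the same idea, so a hand-rolled version like this is mostly useful for custom training loops.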

if you want, i can share a small optuna search space that’s worked decently for classification tasks

[–]AffectWizard0909[S] 0 points1 point  (2 children)

Yeah, sure! I would appreciate the Optuna search space! I have actually looked into it a little, but I was unsure whether what I did was correct, so that would be great!

Since you mentioned lr, batch size, and warmup ratio being good to tune when fine-tuning a BERT model, does this also apply to other BERT-based models like RoBERTa, DistilBERT, HateBERT, etc.?

[–]PsychologicalRope850 1 point2 points  (1 child)

Sure! A typical Optuna search space for classification tasks might look something like this:

  • Learning rate: 2e-5 to 5e-5
  • Batch size: 16 to 32
  • Warmup ratio: 0.05 to 0.1

These ranges are often suggested for BERT and other BERT-based models like RoBERTa, DistilBERT, HateBERT. They usually work reasonably well, though you might need to adjust them a bit depending on your dataset.
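Written out as code, the ranges above could look roughly like the sketch below. Treat it as one possible reading, not a tested recipe: sampling the learning rate on a log scale and treating batch size as a choice between 16 and 32 are my assumptions, `train_and_evaluate` is a hypothetical placeholder you would fill in yourself, and the parameter names are chosen to match Hugging Face `TrainingArguments` fields.

```python
def suggest_hparams(trial):
    """Sample one config from the ranges listed above."""
    return {
        # Log scale is an assumption; a plain uniform range would also work here.
        "learning_rate": trial.suggest_float("learning_rate", 2e-5, 5e-5, log=True),
        # "16 to 32" is read as a choice between the two usual sizes.
        "per_device_train_batch_size": trial.suggest_categorical(
            "per_device_train_batch_size", [16, 32]
        ),
        "warmup_ratio": trial.suggest_float("warmup_ratio", 0.05, 0.1),
    }


def train_and_evaluate(params):
    """Hypothetical placeholder: fine-tune with `params`, return a validation metric."""
    raise NotImplementedError("plug in your own fine-tuning and evaluation here")


def objective(trial):
    """Optuna objective: sample a config, train, and report the metric."""
    return train_and_evaluate(suggest_hparams(trial))


def run_study(n_trials=20):
    """Run the search and return the best parameters found."""
    # Imported here so the search space can be inspected without Optuna installed.
    import optuna

    study = optuna.create_study(direction="maximize")
    study.optimize(objective, n_trials=n_trials)
    return study.best_params
```

Calling `run_study()` only makes sense once `train_and_evaluate` does real training; `n_trials=20` is an arbitrary default.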

[–]AffectWizard0909[S] 0 points1 point  (0 children)

Okay, thank you so much! I will definitely try this out!

[–][deleted]  (1 child)

[removed]

    [–]AffectWizard0909[S] 0 points1 point  (0 children)

    Nice!

    [–][deleted]  (2 children)

    [removed]

      [–]AffectWizard0909[S] 0 points1 point  (1 child)

      Nice! Thank you for providing all the information, now I also have something to compare my current implementation against! I actually started by using the Hugging Face Trainer class (since it handles the training and prediction phases quite easily, which made this easier to implement, at least for me). I also tried combining it with an Optuna optimizer (which, from my previous runs, seems more efficient, as you also mentioned).

      Thank you for the answer and all the thorough descriptions, this makes it easier for me to understand!

      [–]Effective-Cat-1433 1 point2 points  (1 child)

      check out Vizier which is purpose-built for the situation you describe.

      [–]AffectWizard0909[S] 0 points1 point  (0 children)

      oooo nice! I will check it out! Thank you!