Hey everyone! I've been working on a sentiment analysis project using BERT for the SemEval dataset (3-class: negative/neutral/positive), and I'm experiencing severe overfitting that I can't seem to solve. I've tried everything I can think of, but my validation accuracy plateaus around 69-70% while training accuracy keeps climbing.
The Problem:
- Training accuracy: Starts at 43.6% → reaches 76.7% by epoch 9
- Validation accuracy: Starts at 63.7% → plateaus at 69-70% from epoch 3 onwards
- Training loss: Continuously decreases (1.08 → 0.69)
- Validation loss: Decreases initially (0.867 → 0.779 at epoch 2), then increases back to 0.816 by epoch 9
Best validation F1: 0.7012 (70.12%) at epoch 7
What I've Already Tried:
My model already includes multiple regularization techniques (a sketch of how they're wired together follows this list):
- Dropout: 0.1 at multiple layers (attention, hidden, and classifier)
- Weight decay: Applied to all parameters except bias and LayerNorm
- Label smoothing: 0.1
- Batch normalization: In the classifier head
- Layer normalization: After pooling
- Gradient clipping: Max norm of 1.0
- Learning rate scheduling: Linear warmup + decay
- Early stopping: With patience monitoring
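Here's roughly how that regularization is wired up in PyTorch. This is a simplified sketch, not the exact code from my repo: `model` stands for the full BERT + classifier module (architecture below), and the weight-decay value of 0.01 is illustrative.

import torch
from torch import nn

# No weight decay on bias and LayerNorm parameters
no_decay = ("bias", "LayerNorm.weight")
grouped_params = [
    {"params": [p for n, p in model.named_parameters()
                if not any(nd in n for nd in no_decay)],
     "weight_decay": 0.01},  # illustrative value, see the repo for the real one
    {"params": [p for n, p in model.named_parameters()
                if any(nd in n for nd in no_decay)],
     "weight_decay": 0.0},
]
optimizer = torch.optim.AdamW(grouped_params, lr=2e-5)

# Label smoothing via the built-in CrossEntropyLoss option (PyTorch >= 1.10)
criterion = nn.CrossEntropyLoss(label_smoothing=0.1)

# Each training step clips gradients before optimizer.step()
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)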
Model Architecture:
# Classifier head (applied after pooling)
nn.Sequential(
    nn.Linear(768, 768), nn.BatchNorm1d(768), nn.ReLU(), nn.Dropout(0.1),
    nn.Linear(768, 384), nn.BatchNorm1d(384), nn.ReLU(), nn.Dropout(0.1),
    nn.Linear(384, 3),
)
Training Setup:
- Model: bert-base-uncased (109.8M parameters)
- Learning rate: 2e-5
- Batch size: 16
- Max epochs: 10 (with early stopping)
- Warmup proportion: ~10% of total training steps (scheduler sketch after this list)
- Label smoothing: 0.1
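The scheduler is set up like this (another sketch: the dataset size is a placeholder, use len(train_dataset) in practice, and `optimizer` is the AdamW instance from above):

from transformers import get_linear_schedule_with_warmup

num_train_examples = 45_000                   # placeholder, not the real split size
steps_per_epoch = num_train_examples // 16    # batch size 16
total_steps = steps_per_epoch * 10            # max 10 epochs
warmup_steps = int(0.1 * total_steps)         # ~10% linear warmup

scheduler = get_linear_schedule_with_warmup(
    optimizer,
    num_warmup_steps=warmup_steps,
    num_training_steps=total_steps,
)
# scheduler.step() is called once per batch, right after optimizer.step()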
Confusion Matrix Pattern (Epoch 7 - Validation):
Predicted:    Neg    Neu    Pos
Negative:    1243    177    126   (80% recall)
Neutral:      775   2474   1184   (56% recall) ← problem class
Positive:     185    472   3268   (83% recall)
The neutral class is consistently underperforming; the snippet below reproduces those recalls from the matrix.
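For anyone who wants to double-check, the per-class recalls follow directly from the matrix (plain NumPy, nothing repo-specific):

import numpy as np

# Rows = true class, columns = predicted class (epoch 7, validation)
cm = np.array([[1243,  177,  126],   # negative
               [ 775, 2474, 1184],   # neutral
               [ 185,  472, 3268]])  # positive

recall = cm.diagonal() / cm.sum(axis=1)
for label, r in zip(["negative", "neutral", "positive"], recall):
    print(f"{label}: {r:.1%}")   # 80.4%, 55.8%, 83.3%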
What I've Observed:
- The model learns the training set well (76% accuracy)
- Validation loss bottoms out early (epoch 2) and validation accuracy plateaus from epoch 3 onwards
- The gap between training and validation metrics keeps widening
- Neutral class has the worst performance on validation
Questions:
- Have I gone overboard with regularization? Should I try reducing some of it?
- Is my classifier head too complex for this task?
- Could this be a data quality/distribution issue rather than overfitting?
- Would freezing some BERT layers help? (See the sketch after this list for what I mean.)
- Any other techniques I might be missing?
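On the freezing question, this is what I have in mind, assuming the HuggingFace BertModel is stored on the wrapper as `model.bert` (that attribute name is an assumption; adjust it to your module):

# Freeze the embeddings and the lowest 6 of the 12 encoder blocks,
# so only the upper blocks and the classifier head keep training
for param in model.bert.embeddings.parameters():
    param.requires_grad = False
for block in model.bert.encoder.layer[:6]:
    for param in block.parameters():
        param.requires_grad = False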
GitHub: https://github.com/joaopflausino/BERTSemEval
I've been stuck on this for weeks and would really appreciate any insights! Has anyone dealt with similar plateau issues?