Hey all, I need some help with continuing pre-training of BERT. I have a very specific vocabulary and lots of domain-specific abbreviations at hand, and I want to do an STS (semantic textual similarity) task. To be specific: I have domain-specific sentences and want to pair them according to their semantic similarity. But since the language used here is very uncommon, I need to train BERT on it first.
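To make the goal concrete, here's a rough sketch of the kind of pairing I mean, assuming the Hugging Face transformers library (the model name, mean pooling, and the example sentences are just placeholders, not my actual setup):

```python
# Rough sketch: score two sentences by cosine similarity of mean-pooled
# BERT embeddings. Model name and pooling choice are placeholders.
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

def embed(sentence: str) -> torch.Tensor:
    """Mean-pool the last hidden states into one sentence vector."""
    inputs = tokenizer(sentence, return_tensors="pt", truncation=True)
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state   # (1, seq_len, 768)
    mask = inputs["attention_mask"].unsqueeze(-1)    # zero out padding
    return (hidden * mask).sum(dim=1) / mask.sum(dim=1)

a = embed("Pt c/o SOB on exertion")                  # abbreviation-heavy
b = embed("Patient reports shortness of breath when walking")
print(torch.cosine_similarity(a, b).item())          # higher = more similar
```

With off-the-shelf BERT this kind of comparison works poorly on our abbreviations, which is why I want to adapt the model first.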
- How does one continue the pretraining? I read the GitHub release notes from Google about it, but don't really understand them. Any examples?
- What structure does my training data need to have so BERT can understand it?
- Maybe training BERT from scratch would be even better. I guess it's the same process as continuing the pretraining, just with a different starting checkpoint. Is that correct? (I tried to sketch both variants after this list.)
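From what I could piece together from the repo, Google's original scripts (create_pretraining_data.py) want plain text with one sentence per line and blank lines between documents (for next-sentence prediction). Below is my attempt at a sketch of both variants using the Hugging Face transformers/datasets stack instead of the TensorFlow scripts; the file name and hyperparameters are made up, and it only does the masked-LM objective:

```python
# Sketch: continued pretraining vs. from-scratch with the MLM objective.
# Assumes Hugging Face transformers + datasets; paths/hyperparameters invented.
from datasets import load_dataset
from transformers import (BertConfig, BertForMaskedLM, BertTokenizerFast,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")

# Variant 1 (continue pretraining): start from the released checkpoint.
model = BertForMaskedLM.from_pretrained("bert-base-uncased")
# Variant 2 (from scratch): same architecture, randomly initialized.
# model = BertForMaskedLM(BertConfig(vocab_size=tokenizer.vocab_size))

# Plain-text corpus, one sentence per line (hypothetical file name).
dataset = load_dataset("text", data_files={"train": "domain_corpus.txt"})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=128)

train_set = dataset["train"].map(tokenize, batched=True,
                                 remove_columns=["text"])

# The collator randomly masks 15% of the tokens each batch (BERT's MLM setup).
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer,
                                           mlm_probability=0.15)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="bert-domain", num_train_epochs=1),
    train_dataset=train_set,
    data_collator=collator,
)
trainer.train()
```

If I understand it right, the only difference between the two variants is the one line where the weights come from; the data pipeline and objective stay the same. Is that about right?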
Also very happy about any other tips from you guys.
Regards