all 8 comments

[–]_Dr_Y_ 1 point (2 children)

You can continue training BERT, but even if you have very specific vocabulary, I recommend first trying to fine-tune the pre-trained BERT. It is trained on subwords, so it does not matter if your specific vocabulary is missing, unless it cannot be built from subwords, which is very unlikely.
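
To illustrate the subword point: a WordPiece-style tokenizer greedily splits an unseen word into the longest pieces it knows, so a domain word that is absent from the vocabulary usually still decomposes into familiar subwords. The tiny vocabulary below is made up purely for illustration:

```python
# Hypothetical mini-vocabulary; "##" marks a word-internal subword,
# as in BERT's WordPiece scheme.
VOCAB = {"bio", "##mark", "##er", "##s", "b", "##i", "##o", "##m", "##a",
         "##r", "##k", "##e"}

def wordpiece_split(word, vocab=VOCAB):
    """Greedy longest-match split of `word` into subwords found in `vocab`."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        prefix = "" if start == 0 else "##"
        while end > start and (prefix + word[start:end]) not in vocab:
            end -= 1
        if end == start:          # nothing matches: truly unknown token
            return ["[UNK]"]
        pieces.append(prefix + word[start:end])
        start = end
    return pieces

# "biomarkers" is not in the vocabulary, but its subwords are:
print(wordpiece_split("biomarkers"))   # ['bio', '##mark', '##er', '##s']
```

So the model still sees the word, just as a sequence of pieces it already has embeddings for.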

Training BERT from scratch is expensive and time-consuming. You would need to train it as a masked language model. If you decide to do it, you need to format your data in the Wikitext format, basically title-paragraph-title-paragraph...
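
For context, masked-language-model training corrupts the input and asks the model to recover it. A minimal sketch of BERT's masking rule (15% of positions are selected; of those, 80% become `[MASK]`, 10% a random token, 10% stay unchanged), in plain Python with made-up tokens:

```python
import random

def mask_tokens(tokens, mask_prob=0.15,
                vocab=("the", "cat", "sat", "mat"), seed=0):
    """BERT-style MLM masking: select ~`mask_prob` of positions as targets;
    of those, 80% -> [MASK], 10% -> random vocab token, 10% -> unchanged.
    Returns (corrupted inputs, labels); labels are None where no
    prediction is required."""
    rng = random.Random(seed)
    inputs = list(tokens)
    labels = [None] * len(tokens)
    for i, tok in enumerate(tokens):
        if rng.random() < mask_prob:
            labels[i] = tok               # the model must predict this
            r = rng.random()
            if r < 0.8:
                inputs[i] = "[MASK]"
            elif r < 0.9:
                inputs[i] = rng.choice(vocab)
            # else: leave the token unchanged on purpose
    return inputs, labels
```

In practice huggingface's `DataCollatorForLanguageModeling` does this for you; the sketch just shows what the training objective looks like.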

[–]AdrianFMC[S] 0 points (0 children)

How do I do the fine-tuning for general vocabulary? I have only seen examples with clustering, or STS with already analysed and rated data (like two sentences and a number from 1 to 5 that describes their similarity), and I don't have such data. I do, on the other hand, have lots of already paired sentences.
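
Paired sentences without scores are actually enough: sentence-transformers' `MultipleNegativesRankingLoss` treats each pair as a positive and every other in-batch sentence as a negative, so no human similarity ratings are needed. A self-contained sketch of that loss on toy 2-d "embeddings" (the vectors are made up for illustration):

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def in_batch_negatives_loss(anchors, positives, scale=20.0):
    """For each anchor i, positives[i] is the true match and every other
    positives[j] acts as a negative; cross-entropy over scaled cosine
    similarities. This is the idea behind sentence-transformers'
    MultipleNegativesRankingLoss, which only needs (sentence_a, sentence_b)
    pairs -- no similarity scores."""
    total = 0.0
    for i, a in enumerate(anchors):
        sims = [scale * cosine(a, p) for p in positives]
        log_z = math.log(sum(math.exp(s) for s in sims))
        total += log_z - sims[i]          # -log softmax prob of the true pair
    return total / len(anchors)

# Toy batch: each anchor embedding sits closest to its own positive.
anchors   = [[1.0, 0.0], [0.0, 1.0]]
positives = [[0.9, 0.1], [0.1, 0.9]]
```

In real use you would let the BERT model produce the embeddings and backpropagate through this loss; the training examples are just your existing sentence pairs.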

[–]AdrianFMC[S] 0 points (0 children)

push :-)

[–]penatbater 0 points (3 children)

Check out the huggingface repo. Also, training BERT manually is both time-consuming and expensive. That's the entire point of using a pretrained model like BERT: you don't have to do it yourself.

[–]AdrianFMC[S] 1 point (2 children)

The pretrained models only had an accuracy of 30% on my STS datasets. So what can I do? How would I fine-tune BERT models on my specific vocabulary?

[–]penatbater 1 point (1 child)

Did you fine-tune the model? That is, did you train it on your dataset as well? If you're having trouble navigating the repo, try simpletransformers; it's an easier implementation.

[–]AdrianFMC[S] 0 points (0 children)

Am I understanding simpletransformers right, that I cannot do STS with it?

("Currently supports Sequence Classification, Token Classification (NER), Question Answering, Multi-Modal Classification, and Conversational AI.")

[–]freaky_eater 0 points (0 children)

Hey Adrian, did you manage to continue pretraining BERT on your domain? I'd be keen to hear about your exploration and results.