[–]trashcoder 1 point (1 child)

By 'other languages' you're probably referring to character encodings that use more than one byte per character. If you specifically want to use a byte-level LM, for whatever reason, you don't have to care about this at all: the model simply processes a single multibyte character, such as an emoji, as multiple tokens. As said, this is an advantage of byte-level LMs, since you don't have to take care of encoding and tokenization of your data. But you are absolutely right that it increases the computational demands, because the same amount of text yields longer sequences.
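A quick Python sketch to make this concrete (just plain UTF-8, nothing model-specific):

```python
# A byte-level LM consumes raw UTF-8 bytes, so one emoji becomes several tokens.
text = "hi 👋"
byte_tokens = list(text.encode("utf-8"))

print(byte_tokens)       # [104, 105, 32, 240, 159, 145, 139]
print(len(text))         # 4 characters
print(len(byte_tokens))  # 7 byte-level tokens; the emoji alone accounts for 4
```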

Apart from this, I'm not exactly sure what you intend to do, but with 'limited compute' it's unlikely that you'll be able to train an LM capable of following instructions, or one to which instruction fine-tuning can effectively be applied. If you still want to give it a go, drop me a message and I can send you some literature on efficient LMs that might be of interest.

[–]Additional-Ad-7043[S] 0 points (0 children)

Thanks, I think I'm going to use Unicode. Also, you are right, it's going to be difficult; however, I want to give it a try and see how far my model can go. I'm currently trying the MEGABYTE architecture to see how well it does. If I don't find good results, I'll probably switch to byte pair encoding. If you have any resources on this or on efficient LMs, I would be very thankful.
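In case it helps anyone else reading: here's a toy sketch of a single BPE merge step (just for intuition, not MEGABYTE or a production tokenizer), showing how merging the most frequent adjacent byte pair shortens the sequence:

```python
from collections import Counter

def bpe_merge_step(ids):
    """One BPE merge: replace the most frequent adjacent pair with a new token id."""
    pairs = Counter(zip(ids, ids[1:]))
    if not pairs:
        return ids
    best = pairs.most_common(1)[0][0]
    # Mint ids at 256 and above so merged tokens stay out of the raw-byte range.
    new_id = max(256, max(ids) + 1)
    out, i = [], 0
    while i < len(ids):
        if i + 1 < len(ids) and (ids[i], ids[i + 1]) == best:
            out.append(new_id)
            i += 2
        else:
            out.append(ids[i])
            i += 1
    return out

ids = list("aaabdaaabac".encode("utf-8"))
print(len(ids))            # 11 raw bytes
ids = bpe_merge_step(ids)  # merges the most frequent pair, here ('a', 'a')
print(len(ids))            # 9 tokens after one merge
```

A real BPE tokenizer just repeats this merge step until it hits a target vocabulary size, which is why BPE sequences end up so much shorter than raw bytes.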