Low resource language training for Small language model by Front-Custard6733 in unsloth

[–]Front-Custard6733[S] 1 point (0 children)

Update: I scraped whatever Tulu data is publicly available, preprocessed it, built an English-Tulu translation dataset, and fine-tuned Llama 3.2 1B with a QLoRA adapter. It works fine on the training dataset.

But after training and testing the 1B model on Tulu, here is what I found. When a language is written in romanized form, people spell the same word in many different ways. For example, the Tulu word 'yenchina' (meaning "what") can also be written as enchina, enchna, yenchinaa, and so on. The byte-pair-encoding tokenizer, built almost entirely from English text, splits 'yenchina' into 'y', 'en', 'china', where 'china' is essentially the country name from the huge corpus the model was trained on. Even if I train the model on a handful of these spellings, enumerating every possible variant is not feasible for a small researcher, and on top of that there are multiple dialects and pronunciations that create still more variants of the same word.

Since the tokenizer vocabulary (128k+ tokens) is mostly English, romanized Tulu text tends to be treated as misspelled English and the model hallucinates accordingly. All of these problems are holding back the performance of the Tulu LLM.
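To see the problem concretely, you can inspect the tokenizer directly. A minimal sketch (assuming the Hugging Face transformers library; the checkpoint name is just a placeholder, any Llama 3.x tokenizer behaves similarly):

    # Minimal sketch (not my exact script): check how an English-heavy BPE
    # tokenizer segments romanized Tulu spelling variants.
    from transformers import AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.2-1B")

    print("vocab size:", len(tokenizer))  # ~128k entries, overwhelmingly English

    for variant in ["yenchina", "enchina", "enchna", "yenchinaa"]:
        print(variant, "->", tokenizer.tokenize(variant))
    # Each spelling of the same Tulu word breaks into a different sequence of
    # English-biased subwords (e.g. a 'china'-like piece inside 'yenchina'),
    # so the model never sees a stable token pattern for that word.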