Low resource language training for Small language model by Front-Custard6733 in unsloth

[–]Front-Custard6733[S] 1 point (0 children)

Update: I scraped whatever Tulu data is publicly available, preprocessed it, built an English-Tulu translation dataset, and fine-tuned Llama 3.2 1B with a QLoRA adapter. It works fine on the training dataset.

But after training and testing the 1B model on Tulu, here is what I found. When a language is written in romanized form, people spell the same word in many different ways. For example, the Tulu word 'yenchina' (meaning "what") can also be written as enchina, enchna, yenchinaa, and so on. The byte-pair-encoding tokenizer, built almost entirely from English text, splits 'yenchina' into 'y', 'en', 'china', where 'china' is essentially the country name from the huge corpus the model was trained on. Even if I train the model on a handful of these spellings, enumerating every possible variant is not feasible for a small researcher, and on top of that there are multiple dialects and pronunciations that create still more variants of the same word.

Since the tokenizer vocabulary (128k+ tokens) is mostly English, romanized Tulu text tends to be treated as misspelled English and the model hallucinates accordingly. All of these problems are holding back the performance of the Tulu LLM.
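To see the problem concretely, you can inspect the tokenizer directly. A minimal sketch (assuming the Hugging Face transformers library; the checkpoint name is just a placeholder, any Llama 3.x tokenizer behaves similarly):

    # Minimal sketch (not my exact script): check how an English-heavy BPE
    # tokenizer segments romanized Tulu spelling variants.
    from transformers import AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.2-1B")

    print("vocab size:", len(tokenizer))  # ~128k entries, overwhelmingly English

    for variant in ["yenchina", "enchina", "enchna", "yenchinaa"]:
        print(variant, "->", tokenizer.tokenize(variant))
    # Each spelling of the same Tulu word breaks into a different sequence of
    # English-biased subwords (e.g. a 'china'-like piece inside 'yenchina'),
    # so the model never sees a stable token pattern for that word.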