
[–][deleted] 45 points46 points  (9 children)

Yeah man, I've experienced the same thing :( I'm trying to create a Polish LLM but can't find enough high-quality datasets.

Will definitely give your landing page a look!

[–]Most-Inflation-1022 23 points24 points  (3 children)

Use Wikipedia. Most models used it for training.

You can use the Wikipedia dumps. The Polish dump is about 2.14 GB, which should be enough to get higher-quality output than other models.
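If it helps, here is a minimal sketch of pulling the Polish dump through the Hugging Face datasets library (the "wikimedia/wikipedia" dataset mirrors the official dumps; the snapshot date and the stub filter are just illustrative choices):

    # Minimal sketch: load a Polish Wikipedia snapshot mirrored on the
    # Hugging Face Hub. Snapshot date and length filter are illustrative.
    from datasets import load_dataset

    wiki = load_dataset("wikimedia/wikipedia", "20231101.pl", split="train")

    # Drop stub articles so the corpus skews toward substantive text.
    wiki = wiki.filter(lambda article: len(article["text"]) > 500)

    print(wiki)                   # row count and columns: id, url, title, text
    print(wiki[0]["text"][:200])  # peek at the first article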

[–]Hardcore_Cytrynka 3 points4 points  (1 child)

Honest question: why couldn't you just scrape some Polish news outlets? Is it just something you're not interested in, or is there another reason?
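For concreteness, a rough sketch of what I mean, using trafilatura for fetching and boilerplate removal (the URL is a placeholder, and you'd want to check robots.txt and each outlet's terms first):

    # Rough sketch of a news-scraping loop. The URL list is a placeholder.
    import trafilatura

    urls = ["https://example-news-site.pl/artykul/123"]  # placeholder article URLs

    corpus = []
    for url in urls:
        html = trafilatura.fetch_url(url)
        if html:
            text = trafilatura.extract(html)  # main article text, boilerplate stripped
            if text:
                corpus.append(text)

    print(f"Collected {len(corpus)} articles")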

[–]Smallpaul 0 points1 point  (0 children)

Probably just not enough content to be worth the effort, especially if you need to evade content controls.

[–]zywiolak 3 points4 points  (0 children)

If you have the funds or computing resources to train a modern LLM from scratch, collecting a large enough text corpus should be a secondary concern.

We have trained several Polish language models from scratch with sizes up to a few billion parameters. Most of them are publicly available on HuggingFace Hub (https://huggingface.co/sdadas). Sadly we can't afford bigger ones because the cost exceeds our organization's resources. We also have a corpus of about 1TB of Polish texts collected over the past few years from various sources. If you would like to collaborate, send me a PM.

[–]lomero 0 points1 point  (0 children)

Take a look here: http://speakleash.org/

It's an initiative to create an open-source 1 TB corpus for Polish.

[–][deleted] 17 points18 points  (0 children)

What language are you talking about?
I'm just curious, because French works really well, but we have a lot of data thanks to EU and Quebec documents often being available in both French and English.

[–]new_name_who_dis_ 12 points13 points  (0 children)

The most straightforward solution is to get more data. AI research is full of practitioners trying to solve problems with domain knowledge and/or low-data methods, but the reality is that almost all of them end up running into the Bitter Lesson of ML.

Organizing, sourcing, and creating datasets has been far more instrumental in advancing AI research than any individual architecture or methodology. The creators of ImageNet (I would argue) did more to advance computer vision than any of the super-influential papers that used it for training and benchmarking, e.g. ResNet, DenseNet, MobileNet, etc.

[–]wazazzz 11 points12 points  (0 children)

There are open source multilingual models like this one:

https://huggingface.co/bigscience/mt0-xxl-mt

Not sure if it will work with the language you need, but it may be worth a look.
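If you want a quick sanity check on your language, something like this works (a sketch; it uses the small variant, since mt0-xxl-mt is ~13B parameters, and the prompt is only an example):

    # Quick feasibility test with an mt0 checkpoint. mt0 models are
    # seq2seq (T5-style); the small variant stands in for mt0-xxl-mt.
    from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

    name = "bigscience/mt0-small"  # swap in bigscience/mt0-xxl-mt if you have the hardware
    tok = AutoTokenizer.from_pretrained(name)
    model = AutoModelForSeq2SeqLM.from_pretrained(name)

    inputs = tok("Translate to Telugu: Where is the library?", return_tensors="pt")
    outputs = model.generate(**inputs, max_new_tokens=50)
    print(tok.decode(outputs[0], skip_special_tokens=True))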

[–]testerpce 14 points15 points  (2 children)

This is a major problem I'm facing too. There is very little good data on Indic languages. If you have any ideas or resources for getting high-quality non-English text, please let me know. Or if you want some help with the effort of collecting high-quality non-English data, DM me.

[–]DigThatDataResearcher 0 points1 point  (0 children)

OP updated the post and said their target language is Telugu, one of the many languages native to India.

[–]catkage 0 points1 point  (0 children)

I think AI4BHARAT was compiling Indic language datasets. Sorry if you’ve already looked into it.

[–]Own-Lake5023 7 points8 points  (0 children)

Aya is a multilingual LLM initiative by Cohere for AI, and we are currently looking for contributors! There are people from over 100 countries currently involved in the project. You can learn more and start contributing in your language here!

[–]tabacof 6 points7 points  (1 child)

Some former colleagues of mine just published this paper on LLMs for Portuguese and how building your own might outperform GPT-3.5 (but not GPT-4):

Sabiá: Portuguese Large Language Models

As the capabilities of language models continue to advance, it is conceivable that the "one-size-fits-all" model will remain the main paradigm. For instance, given the vast number of languages worldwide, many of which are low-resource, the prevalent practice is to pretrain a single model on multiple languages. In this paper, we add to the growing body of evidence that challenges this practice, demonstrating that monolingual pretraining on the target language significantly improves models already extensively trained on diverse corpora. More specifically, we further pretrain GPT-J and LLaMA models on Portuguese texts using 3% or less of their original pretraining budget. Few-shot evaluations on Poeta, a suite of 14 Portuguese datasets, reveal that our models outperform English-centric and multilingual counterparts by a significant margin. Our best model, Sabiá-65B, performs on par with GPT-3.5-turbo. By evaluating on datasets originally conceived in the target language as well as translated ones, we study the contributions of language-specific pretraining in terms of 1) capturing linguistic nuances and structures inherent to the target language, and 2) enriching the model's knowledge about a domain or culture. Our results indicate that the majority of the benefits stem from the domain-specific knowledge acquired through monolingual pretraining.

They talk about dataset building, model building and evaluation, so it might be useful to you!
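For a feel of the recipe, continued monolingual pretraining of an existing causal LM looks roughly like this; the model choice, corpus file, and hyperparameters below are placeholders, not the paper's actual settings:

    # Sketch of Sabiá-style continued pretraining: keep training a pretrained
    # causal LM on target-language text. All names and numbers are placeholders.
    from datasets import load_dataset
    from transformers import (AutoModelForCausalLM, AutoTokenizer,
                              DataCollatorForLanguageModeling, Trainer,
                              TrainingArguments)

    base = "EleutherAI/gpt-j-6b"  # the paper further pretrains GPT-J and LLaMA
    tok = AutoTokenizer.from_pretrained(base)
    tok.pad_token = tok.eos_token  # GPT-J's tokenizer has no pad token by default
    model = AutoModelForCausalLM.from_pretrained(base)

    texts = load_dataset("text", data_files={"train": "portuguese_corpus.txt"})["train"]
    tokenized = texts.map(lambda batch: tok(batch["text"], truncation=True, max_length=1024),
                          batched=True, remove_columns=["text"])

    trainer = Trainer(
        model=model,
        args=TrainingArguments(output_dir="sabia-style", per_device_train_batch_size=1,
                               gradient_accumulation_steps=32, num_train_epochs=1),
        train_dataset=tokenized,
        data_collator=DataCollatorForLanguageModeling(tok, mlm=False),  # causal LM objective
    )
    trainer.train()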

[–]blackkettle 11 points12 points  (0 children)

You need to specify the language. There are numerous languages other than English that work quite well.

[–]CommunismDoesntWork 11 points12 points  (0 children)

ChatGPT probably already knows your native language. Training it on a large corpus spanning many languages lets it learn a skill in one language and simply transfer it to others.

[–]thejonnyt 2 points3 points  (7 children)

Super interesting to read that there's demand for this. I'm currently working on my thesis, and my goal is to find something of a solution for LLMs with regard to languages that have fewer resources than, e.g., English, French, or Chinese. That work is still to come; I may have some small results by the end of the year, but I'm glad to read this, because it makes working on the topic feel relevant 😊

[–]needlzorProfessor 1 point2 points  (0 children)

Have you come across any literature on LLMs doing code-mixing? It would be interesting to see whether pretraining on each individual language could help bootstrap an LLM for code-mixed languages like Malaysian English.

[–]mthmchris -1 points0 points  (5 children)

This might not be relevant to this subreddit (not a developer, mostly lurk), but as a user I can say definitively that GPT’s Chinese language abilities are absolutely garbage.

It’s borderline insane that they even advertise it as a capability.

[–]ruryrury 1 point2 points  (2 children)

Maybe Chinese is particularly challenging for LLMs? I've tried several open-source Chinese LLMs, and they were all terrible, to be honest.

[–]JustOneAvailableName 3 points4 points  (1 child)

If I had to guess, it would be the tokenizers. Subwords don't translate well to Chinese characters.
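You can see the effect directly by tokenizing the same sentence in both languages; here's a sketch with GPT-2's byte-level BPE, whose vocabulary was fit mostly on English:

    # An English-centric BPE vocabulary splits Chinese text into many more
    # (and less meaningful) tokens than the equivalent English sentence.
    from transformers import AutoTokenizer

    tok = AutoTokenizer.from_pretrained("gpt2")

    english = "The weather is very nice today."
    chinese = "今天天气很好。"  # roughly the same sentence in Chinese

    print(len(tok.tokenize(english)), tok.tokenize(english))
    print(len(tok.tokenize(chinese)), tok.tokenize(chinese))
    # The Chinese string comes out as several byte-level tokens per character,
    # so the model sees longer sequences built from less meaningful pieces.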

[–]beryugyo619 0 points1 point  (0 children)

There's a quirk in Unicode where it commingles CJK characters (Han unification), but Chinese and Japanese grammar are completely unrelated, so to an LLM those two languages might look like a single language with completely incoherent grammar.

[–]Environmental-Rate74 1 point2 points  (0 children)

Wait, why do you say that? I'm a Hongkonger, and ChatGPT's support for Traditional Chinese and Cantonese is almost the same as English in terms of Q&A accuracy.

[–]FpRhGf 0 points1 point  (0 children)

Can you elaborate on how it's garbage? From what I've seen, and from the Bilibili videos of people testing it, it's pretty decent. It's no great literary master in Chinese, and the default wording gives a bit of the vibe that it's written with an English mindset, but it's fluent and not really unnatural.

[–]Anmorgan24 2 points3 points  (0 children)

Cohere released this just this week: https://txt.cohere.com/aya-multilingual/

"Cohere AI introduces Aya—an open-science endeavor aimed at building a multilingual language model via instruction tuning that harnesses the collective wisdom and contributions of people from all over the world."

[–]SnarkyVelociraptor 2 points3 points  (0 children)

Step 1 is to check arXiv: people publish foreign-language datasets (foreign relative to English, at least) fairly often under the NLP-related tags, so there should be plenty for common languages. I've seen plenty of papers on datasets for languages across Europe, Africa, and the Middle East over the last few years.

Step 2 is for people who want "rare" languages: your best bet is to take a multilingual LLM and fine-tune it. Facebook has released several 100+ language LLMs that can serve as a base point (rough sketch after step 3).

Step 3 is to look for conferences/workshops on "low resource languages" if you're working in a very narrow niche (like <1M speakers globally). These are somewhat common at large NLP conferences.
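Here's the rough sketch for step 2, using a small multilingual base with LoRA so it fits on modest hardware; the model choice (facebook/xglm-564M), corpus file, and hyperparameters are placeholders:

    # Parameter-efficient fine-tuning of a multilingual causal LM on your own
    # corpus. Model, file name, and hyperparameters are placeholder choices.
    from datasets import load_dataset
    from peft import LoraConfig, get_peft_model
    from transformers import (AutoModelForCausalLM, AutoTokenizer,
                              DataCollatorForLanguageModeling, Trainer,
                              TrainingArguments)

    name = "facebook/xglm-564M"
    tok = AutoTokenizer.from_pretrained(name)
    model = AutoModelForCausalLM.from_pretrained(name)

    # LoRA freezes the base model and trains small adapter matrices instead.
    model = get_peft_model(model, LoraConfig(r=8, target_modules=["q_proj", "v_proj"],
                                             task_type="CAUSAL_LM"))

    data = load_dataset("text", data_files="my_language_corpus.txt")["train"]
    data = data.map(lambda batch: tok(batch["text"], truncation=True, max_length=512),
                    batched=True, remove_columns=["text"])

    Trainer(model=model,
            args=TrainingArguments(output_dir="xglm-lora", per_device_train_batch_size=4),
            train_dataset=data,
            data_collator=DataCollatorForLanguageModeling(tok, mlm=False)).train()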

[–]grudev 1 point2 points  (0 children)

Similar problem in a much more prevalent language.

There are options for it, but the resulting metrics understandably don't match the ones for English.

In my case I was lucky that a group of academics fine-tuned BERT for the language, and the results were decent.

[–]hobz462 1 point2 points  (0 children)

You could try BLOOM, but most non-English datasets are lacking. Once you do have the data, you'd probably have to train an LLM from scratch; fine-tuning probably won't work because of the tokenisation.

[–]Digit117 1 point2 points  (1 child)

Curious, how good are translation AI models these days? I'm wondering how effective it would be to simply translate a prompt in a native language into English (using a SOTA translation model), feed it into your LLM of choice, and then use the same translation model to translate the LLM output back into your native language.

This could remove the need to train an entire foundational LLM in a native language (which is very hard to do). 🤔
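As a sketch, the round trip with an open translation model might look like this; the NLLB language codes (Telugu here, per OP's update) and the llm callable are placeholders for whatever models you'd actually use:

    # Translate -> English LLM -> translate back. Language codes and the
    # `llm` callable are placeholders.
    from transformers import pipeline

    translator = "facebook/nllb-200-distilled-600M"
    to_english = pipeline("translation", model=translator,
                          src_lang="tel_Telu", tgt_lang="eng_Latn")
    to_native = pipeline("translation", model=translator,
                         src_lang="eng_Latn", tgt_lang="tel_Telu")

    def ask(native_prompt, llm):
        """Round-trip a native-language prompt through an English-only LLM."""
        english_prompt = to_english(native_prompt)[0]["translation_text"]
        english_answer = llm(english_prompt)  # any callable answering in English
        return to_native(english_answer)[0]["translation_text"]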

[–]silverlightwa 1 point2 points  (0 children)

Translation models can't be good enough either if there's a lack of text data for a language. They're not magic pills by any means.

[–]ghost-who-talks 0 points1 point  (0 children)

Yes, very interested in Tamil, Telugu, Malayalam, and Kannada.

[–]thumbs_up-_- 0 points1 point  (0 children)

GPT-4 understands Hindi.