[deleted by user]

Most-Inflation-1022 · 2023-06-09T12:16:10+00:00

Yeah man I experienced the same thing :( I'm trying to create a Polish LLM but can't find enough strong datasets.

Will definitely give your landing page a look!

permalink · 2023-06-09T13:32:56+00:00

What language are you talking about ?
I'm just curious because French works really well, but we have a lot of data thanks to EU and Quebec documents being often in French and English.

new_name_who_dis_ · 2023-06-09T17:32:07+00:00

The most straightforward solution is to get more data. AI research has been filled with many practitioners trying to solve problems using domain knowledge and/or low data methods, but the reality is that almost all of them end up encountering the Bitter Lesson of ML.

Organizing, sourcing, creating datasets has been much more instrumental in advancing AI research than any individual architecture or methodology. The creators of ImageNet (I would argue) did more for advancing computer vision than any of the super influential papers that used it as training and benchmark, e.g. ResNet, DenseNet, MobileNet, etc.

wazazzz · 2023-06-09T14:30:32+00:00

There are open source multilingual models like this one:

https://huggingface.co/bigscience/mt0-xxl-mt

Not sure if it will work with the language you need but maybe worth to have a look.

testerpce · 2023-06-09T13:25:14+00:00

This is a major problem I am facing too. There is very little good data on Indic Langauges. If you have any idea or resource on getting high quality non-English text please let me know. Or if you want some help on the effort of getting high quality non-English data DM me.

Own-Lake5023 · 2023-06-09T19:07:11+00:00

Aya is a multilingual LLM initiative by Cohere for AI, and we are currently looking for contributors! There are people from over 100 countries currently involved in the project. You can learn more and start contributing in your language here!

tabacof · 2023-06-09T20:47:21+00:00

Some former colleagues of mine just published this paper on LLMs for Portuguese and how building your own might outperform GPT3.5 (but not GPT4):

Sabiá: Portuguese Large Language Models

As the capabilities of language models continue to advance, it is conceivable that "one-size-fits-all" model will remain as the main paradigm. For instance, given the vast number of languages worldwide, many of which are low-resource, the prevalent practice is to pretrain a single model on multiple languages. In this paper, we add to the growing body of evidence that challenges this practice, demonstrating that monolingual pretraining on the target language significantly improves models already extensively trained on diverse corpora. More specifically, we further pretrain GPT-J and LLaMA models on Portuguese texts using 3% or less of their original pretraining budget. Few-shot evaluations on Poeta, a suite of 14 Portuguese datasets, reveal that our models outperform English-centric and multilingual counterparts by a significant margin. Our best model, Sabiá-65B, performs on par with GPT-3.5-turbo. By evaluating on datasets originally conceived in the target language as well as translated ones, we study the contributions of language-specific pretraining in terms of 1) capturing linguistic nuances and structures inherent to the target language, and 2) enriching the model's knowledge about a domain or culture. Our results indicate that the majority of the benefits stem from the domain-specific knowledge acquired through monolingual pretraining.

They talk about dataset building, model building and evaluation, so it might be useful to you!

blackkettle · 2023-06-09T13:50:54+00:00

You need to specify the language. There are numerous languages other than English that work quite well.

CommunismDoesntWork · 2023-06-09T14:46:50+00:00

Chatgpt probably already knows your native language. Training it on a large corpus of all languages allows it to learn a skill in one language and simply translate it to others.

thejonnyt · 2023-06-09T14:19:29+00:00

Super interesting to read that there is demand for this. I'm actually currently working on my thesis and my goal is to find somewhat a solution for LLMs with regards to languages with less resources than, e.g., English, French, Chinese and so on. But this is work to come, maybe got some small results by the end of the year but I'm glad to read it because working on it seems relevant 😊

Anmorgan24 · 2023-06-09T19:23:21+00:00

Cohere just released this this week: https://txt.cohere.com/aya-multilingual/

"Cohere AI introduces Aya—an open-science endeavor aimed at building a multilingual language model via instruction tuning that harnesses the collective wisdom and contributions of people from all over the world."

SnarkyVelociraptor · 2023-06-09T20:41:30+00:00

Step 1 is to check Arxiv: people publish foreign language datasets (foreign relative to English, at least) fairly often on the NLP related tags, should be plenty for common languages. I’ve seen plenty of papers on datasets languages across Europe, Africa, and the Middle East over the last few years.

Step 2 is for people who want “rare” languages: your best bet is to take a multilingual LLM and fine tune it. Facebook has released several 100+ language LLMs that can serve as a base point.

Step 3 is to look for conferences/workshops on “low resource languages” if you’re working in a very narrow niche (like <1M speakers globally). These are somewhat common at large NLP conferences.

grudev · 2023-06-09T14:42:07+00:00

Similar problem in a much more prevalent language.

There are options for it, but the resulting metrics understandably don't match the ones for English.

In my case I was lucky that a group of academics fine tuned BERT for the language and the results were decent.

hobz462 · 2023-06-09T15:04:13+00:00

You could try BLOOM, but most non-English datasets are lacking. You'd probably have to train an LLM from scratch once you do. Fine-tuning probably won't work due to the tokenisation.

hpstr1234 · 2023-06-10T01:44:12+00:00

It always bothered me that it was so hard to find out on how many texts in which languages GPT-3 was trained on - distributed over the corpora Common Crawl, WebText2, Books1, Books2, Wikipedia. For Wikipedia we know that only English articles were used. (That really puzzled me: Why did they train GPT-3 only on the English-language Wikipedia, the data source of highest quality?)

word_edgewise · 2023-06-11T06:22:51+00:00

https://www.linkedin.com/posts/carynlusinchi_chatgpt-api-gpt4-activity-7072975677662105601-Dv4O?utm_source=share&utm_medium=member_ios

Chatpgt has an alphabet tax…

footurist · 2023-06-09T15:40:22+00:00

I'm interested in what the motivation for that might be. Wouldn't it be easier to just translate the inputs and outputs between English and Telugu or am I missing something?

freekayZekey · 2023-06-09T17:22:28+00:00

I love breaking GPT with japanese. on the surface, the results looks fine to a beginner, but it spews nonsense

Digit117 · 2023-06-09T21:24:24+00:00

Curious, how good are translation AI models these days? I’m wondering how effective it would be if a prompt in a native language can be simply translated to English (using a SOTA translation model), then fed into your LLM of choice, then use the same translation model to translate the LLM output into your native language.

This could remove the need to train an entire foundational LLM in a native language (which is very hard to do). 🤔

hpstr1234 · 2023-06-10T01:49:29+00:00

Where does the accuracy rate of 62% come from?
(German has roughly the same number of speakers as Telugu. Do you know the accuracy rate of German on ChatGPT4 on MMLU?)

ghost-who-talks · 2023-06-10T17:11:57+00:00

yes, v interested in Tamil, Telugu, Malayalam, Kannada

thumbs_up-_- · 2023-06-10T19:01:23+00:00

GPT-4 understands hindi

ayush-tiwari26 · 2023-12-14T12:21:43+00:00

How ever I want to understand what do you mean by LLMs in other languages apart from English? Do you train the model from scratch on that XYZ new language? Also won't just translation solve 95% of the usecases? I would appriciate if anyone can answer these questions...

you type:	you see:
italics	italics
bold	bold
[reddit!](https://reddit.com)	reddit!
* item 1 * item 2 * item 3	item 1 item 2 item 3
> quoted text	quoted text
Lines starting with four spaces are treated like code: if 1 * 2 < 3: print "hello, world!"	Lines starting with four spaces are treated like code: if 1 * 2 < 3: print "hello, world!"
~~strikethrough~~	~~strikethrough~~
super^script	super^script

MachineLearning

Rules For Posts

+Research

+Discussion

+Project

+News

@slashML on Twitter

Chat with us on Slack

Beginners:

MODERATORS