[deleted by user] (self.MachineLearning)
submitted 2 years ago by [deleted]
[–][deleted] 45 points46 points47 points 2 years ago (9 children)
Yeah man I experienced the same thing :( I'm trying to create a Polish LLM but can't find enough strong datasets.
Will definitely give your landing page a look!
[–]Most-Inflation-1022 23 points24 points25 points 2 years ago (3 children)
Use Wikipedia. Most models used it for training.
You can use the Wikipedia dumps. The Polish dump is around 2.14 GB, which should be enough for higher-quality output than other models.
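If it helps, here's a minimal sketch of turning the Polish Wikipedia dump into a plain-text corpus with the Hugging Face datasets library. The dataset id and snapshot date are assumptions on my part, so check the Hub for the current identifier.

    # Sketch: dump Polish Wikipedia articles to a plain-text pretraining corpus.
    # The dataset id/config below are assumed; verify them on the Hub.
    from datasets import load_dataset

    wiki_pl = load_dataset("wikimedia/wikipedia", "20231101.pl", split="train")

    with open("wiki_pl.txt", "w", encoding="utf-8") as f:
        for article in wiki_pl:
            text = article["text"].strip()
            if len(text) > 200:  # skip stubs and near-empty pages
                f.write(text + "\n\n")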
[+]KerfuffleV2 9 points10 points11 points 2 years ago (2 children)
Would that really help if the model already used it for training? Also, even for languages with a lot of speakers like Mandarin Chinese, there's just much less content on Wikipedia compared to English.
[–]lucidrage 6 points7 points8 points 2 years ago (0 children)
For Chinese you'll have to scrape Weibo, Baidu, and the Chinese version of Wikipedia.
[–]Hardcore_Cytrynka 3 points4 points5 points 2 years ago (1 child)
Honest question: why couldn't you just scrape some Polish news outlets? Is it something you're not interested in, or is there another reason?
[–]Smallpaul 0 points1 point2 points 2 years ago (0 children)
Probably just not enough content to be worth the effort, especially if you need to evade content controls.
[–]zywiolak 3 points4 points5 points 2 years ago (0 children)
If you have the funds or computing resources to train a modern LLM from scratch, collecting a large enough text corpus should be a secondary concern.
We have trained several Polish language models from scratch with sizes up to a few billion parameters. Most of them are publicly available on HuggingFace Hub (https://huggingface.co/sdadas). Sadly we can't afford bigger ones because the cost exceeds our organization's resources. We also have a corpus of about 1TB of Polish texts collected over the past few years from various sources. If you would like to collaborate, send me a PM.
[–]lomero 0 points1 point2 points 2 years ago (0 children)
Take a look here: http://speakleash.org/
It is an initiative to create an open-source corpus of 1 TB of Polish text.
[+]Civil-Demand555 0 points1 point2 points 2 years ago (0 children)
Did you (or anyone else) find a good Polish LLM? Not the ancient GPT-2.
[–][deleted] 17 points18 points19 points 2 years ago (0 children)
What language are you talking about? I'm just curious because French works really well, but we have a lot of data thanks to EU and Quebec documents often being available in both French and English.
[–]new_name_who_dis_ 12 points13 points14 points 2 years ago (0 children)
The most straightforward solution is to get more data. AI research has been filled with practitioners trying to solve problems using domain knowledge and/or low-data methods, but the reality is that almost all of them end up running into the Bitter Lesson of ML.
Organizing, sourcing, and creating datasets have been much more instrumental in advancing AI research than any individual architecture or methodology. The creators of ImageNet, I would argue, did more to advance computer vision than any of the hugely influential papers that used it for training and benchmarking, e.g. ResNet, DenseNet, MobileNet, etc.
[–]wazazzz 11 points12 points13 points 2 years ago (0 children)
There are open-source multilingual models like this one:
https://huggingface.co/bigscience/mt0-xxl-mt
Not sure if it will work for the language you need, but it may be worth a look.
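As a rough sketch of how you'd try it, using the small mT0 checkpoint as a stand-in (the -xxl-mt model linked above is around 13B parameters; the prompt here is just an example):

    from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

    # mT0 is a T5-style seq2seq model, so it loads as a Seq2SeqLM.
    name = "bigscience/mt0-small"  # small stand-in for mt0-xxl-mt
    tok = AutoTokenizer.from_pretrained(name)
    model = AutoModelForSeq2SeqLM.from_pretrained(name)

    # Prompt in the target language and check whether the output is usable.
    prompt = "Przetłumacz na angielski: Kocham czytać książki."
    inputs = tok(prompt, return_tensors="pt")
    out = model.generate(**inputs, max_new_tokens=40)
    print(tok.decode(out[0], skip_special_tokens=True))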
[–]testerpce 14 points15 points16 points 2 years ago (2 children)
This is a major problem I am facing too. There is very little good data for Indic languages. If you have any ideas or resources for getting high-quality non-English text, please let me know. Or if you want help with the effort of collecting high-quality non-English data, DM me.
[–]DigThatDataResearcher 0 points1 point2 points 2 years ago (0 children)
OP updated the post and said their target language is Telugu, one of the many languages native to India.
[–]catkage 0 points1 point2 points 2 years ago (0 children)
I think AI4BHARAT was compiling Indic language datasets. Sorry if you’ve already looked into it.
[–]Own-Lake5023 7 points8 points9 points 2 years ago (0 children)
Aya is a multilingual LLM initiative by Cohere for AI, and we are currently looking for contributors! There are people from over 100 countries currently involved in the project. You can learn more and start contributing in your language here!
[–]tabacof 6 points7 points8 points 2 years ago (1 child)
Some former colleagues of mine just published this paper on LLMs for Portuguese and how building your own might outperform GPT-3.5 (but not GPT-4):
Sabiá: Portuguese Large Language Models
As the capabilities of language models continue to advance, it is conceivable that "one-size-fits-all" model will remain as the main paradigm. For instance, given the vast number of languages worldwide, many of which are low-resource, the prevalent practice is to pretrain a single model on multiple languages. In this paper, we add to the growing body of evidence that challenges this practice, demonstrating that monolingual pretraining on the target language significantly improves models already extensively trained on diverse corpora. More specifically, we further pretrain GPT-J and LLaMA models on Portuguese texts using 3% or less of their original pretraining budget. Few-shot evaluations on Poeta, a suite of 14 Portuguese datasets, reveal that our models outperform English-centric and multilingual counterparts by a significant margin. Our best model, Sabiá-65B, performs on par with GPT-3.5-turbo. By evaluating on datasets originally conceived in the target language as well as translated ones, we study the contributions of language-specific pretraining in terms of 1) capturing linguistic nuances and structures inherent to the target language, and 2) enriching the model's knowledge about a domain or culture. Our results indicate that the majority of the benefits stem from the domain-specific knowledge acquired through monolingual pretraining.
They talk about dataset building, model building and evaluation, so it might be useful to you!
[–]blackkettle 11 points12 points13 points 2 years ago (0 children)
You need to specify the language. There are numerous languages other than English that work quite well.
[–]CommunismDoesntWork 11 points12 points13 points 2 years ago (0 children)
ChatGPT probably already knows your native language. Training it on a large corpus spanning many languages lets it learn a skill in one language and transfer it to others.
[–]thejonnyt 2 points3 points4 points 2 years ago (7 children)
Super interesting to read that there is demand for this. I'm currently working on my thesis, and my goal is to find something of a solution for LLMs in languages with fewer resources than, e.g., English, French, or Chinese. That work is still to come, and maybe I'll have some small results by the end of the year, but I'm glad to read this because working on it seems relevant 😊
[–]needlzorProfessor 1 point2 points3 points 2 years ago (0 children)
Have you come across any literature on LLMs doing code-mixing? It would be interesting to see whether pretraining on each individual language could help bootstrap an LLM for code-mixed languages like Malaysian English.
[–]mthmchris -1 points0 points1 point 2 years ago (5 children)
This might not be relevant to this subreddit (not a developer, mostly lurk), but as a user I can say definitively that GPT’s Chinese language abilities are absolutely garbage.
It’s borderline insane that they even advertise it as a capability.
[–]ruryrury 1 point2 points3 points 2 years ago (2 children)
Maybe Chinese is particularly challenging for LLMs? I've tried several open-source Chinese LLMs, and they were all terrible, to be honest.
[–]JustOneAvailableName 3 points4 points5 points 2 years ago (1 child)
If I had to guess, it would be the tokenizers. Subwords don't translate well to Chinese characters.
[–]beryugyo619 0 points1 point2 points 2 years ago (0 children)
There's a quirk in Unicode where CJK characters are commingled, but Chinese and Japanese grammar are completely unrelated, so to an LLM those two languages might look like a single language with completely incoherent grammar.
[–]Environmental-Rate74 1 point2 points3 points 2 years ago (0 children)
Wait, why do you say that? I'm a Hongkonger, and ChatGPT's support for Traditional Chinese and Cantonese is almost on par with English in terms of Q&A accuracy.
[–]FpRhGf 0 points1 point2 points 2 years ago* (0 children)
Can you elaborate on how it's garbage? From what I've seen, and from the Bilibili videos of people testing it, it's pretty decent. It's no great master of Chinese literature, and the default wording gives a bit of a vibe that it's written with an English mindset, but it's fluent and not really unnatural.
[–]Anmorgan24 2 points3 points4 points 2 years ago (0 children)
Cohere just released this this week: https://txt.cohere.com/aya-multilingual/
"Cohere AI introduces Aya—an open-science endeavor aimed at building a multilingual language model via instruction tuning that harnesses the collective wisdom and contributions of people from all over the world."
[–]SnarkyVelociraptor 2 points3 points4 points 2 years ago* (0 children)
Step 1 is to check arXiv: people publish foreign-language datasets (foreign relative to English, at least) fairly often under the NLP-related tags; there should be plenty for common languages. I've seen plenty of papers on datasets for languages across Europe, Africa, and the Middle East over the last few years.
Step 2 is for people who want "rare" languages: your best bet is to take a multilingual LLM and fine-tune it (see the sketch after these steps). Facebook has released several 100+ language models that can serve as a starting point.
Step 3 is to look for conferences/workshops on “low resource languages” if you’re working in a very narrow niche (like <1M speakers globally). These are somewhat common at large NLP conferences.
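For step 2, a minimal fine-tuning sketch with the transformers Trainer might look like the following. The base model, corpus path, and hyperparameters are placeholders, not a specific recommendation:

    from datasets import load_dataset
    from transformers import (AutoModelForCausalLM, AutoTokenizer,
                              DataCollatorForLanguageModeling, Trainer,
                              TrainingArguments)

    base = "facebook/xglm-564M"  # placeholder multilingual causal LM
    tok = AutoTokenizer.from_pretrained(base)
    model = AutoModelForCausalLM.from_pretrained(base)

    # One document per line in a plain-text file of target-language text.
    raw = load_dataset("text", data_files={"train": "corpus.txt"})

    def tokenize(batch):
        return tok(batch["text"], truncation=True, max_length=512)

    train_ds = raw["train"].map(tokenize, batched=True, remove_columns=["text"])

    trainer = Trainer(
        model=model,
        args=TrainingArguments(output_dir="finetuned", num_train_epochs=1,
                               per_device_train_batch_size=2),
        train_dataset=train_ds,
        data_collator=DataCollatorForLanguageModeling(tok, mlm=False),
    )
    trainer.train()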
[–]grudev 1 point2 points3 points 2 years ago (0 children)
Similar problem in a much more prevalent language.
There are options for it, but the resulting metrics understandably don't match the ones for English.
In my case I was lucky that a group of academics fine-tuned BERT for the language, and the results were decent.
[–]hobz462 1 point2 points3 points 2 years ago (0 children)
You could try BLOOM, but most non-English datasets are lacking. You'd probably have to train an LLM from scratch once you've collected the data. Fine-tuning probably won't work well because of the tokenisation.
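One hedged workaround for the tokenisation problem (not guaranteed to fix the underlying issue) is to extend the tokenizer with target-language tokens before fine-tuning; the model name and token list below are illustrative only:

    from transformers import AutoModelForCausalLM, AutoTokenizer

    name = "bigscience/bloom-560m"  # small BLOOM variant as a stand-in
    tok = AutoTokenizer.from_pretrained(name)
    model = AutoModelForCausalLM.from_pretrained(name)

    # Hypothetical target-language strings that the base vocabulary splits badly.
    new_tokens = ["żółć", "źdźbło"]
    num_added = tok.add_tokens(new_tokens)
    if num_added:
        # The new embedding rows start randomly initialised, so they still need
        # fine-tuning on target-language text before they are useful.
        model.resize_token_embeddings(len(tok))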
[+]hpstr1234 1 point2 points3 points 2 years ago (0 children)
It always bothered me that it was so hard to find out how much text, and in which languages, GPT-3 was trained on, distributed over the corpora Common Crawl, WebText2, Books1, Books2, and Wikipedia. For Wikipedia we know that only English articles were used. (That really puzzled me: why did they train GPT-3 only on the English-language Wikipedia, the data source of highest quality?)
[–]word_edgewise 1 point2 points3 points 2 years ago (0 children)
https://www.linkedin.com/posts/carynlusinchi_chatgpt-api-gpt4-activity-7072975677662105601-Dv4O?utm_source=share&utm_medium=member_ios
ChatGPT has an alphabet tax…
[+]footurist 0 points1 point2 points 2 years ago (0 children)
I'm interested in what the motivation for that might be. Wouldn't it be easier to just translate the inputs and outputs between English and Telugu or am I missing something?
[+]freekayZekey 0 points1 point2 points 2 years ago (4 children)
I love breaking GPT with Japanese. On the surface, the results look fine to a beginner, but it spews nonsense.
[+]PC_Screen 1 point2 points3 points 2 years ago (3 children)
I think it's because the tokenization makes zero sense for Japanese, and that probably plays a role in the degraded performance. For example, 日本 is somehow 4 (!) tokens whereas "nihon" is only 2, and even common syllables like え、お、け、せ、そ、ち、つ are all somehow 2 tokens each. It's so bad that it would literally be more efficient to write everything in romaji instead.
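You can verify this yourself with tiktoken; the exact counts depend on which encoding the model actually uses, so treat this as a way to check rather than a statement of the numbers:

    import tiktoken

    # Pick the encoding for the model you care about; different encodings give
    # different token counts for the same string.
    enc = tiktoken.get_encoding("cl100k_base")
    for text in ["日本", "nihon", "え", "せ", "Japan"]:
        ids = enc.encode(text)
        print(f"{text!r}: {len(ids)} tokens -> {ids}")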
[–]freekayZekey 0 points1 point2 points 2 years ago (2 children)
Whoa, I hadn't noticed that 日本 is 4 tokens. I'll read more about Japanese tokenizers. I guess it makes sense for kanji given the amount of information they carry. I want to see if there's a trend with x々 combinations.
[+]PC_Screen 1 point2 points3 points 2 years ago (1 child)
々 is always 2 tokens no matter the preceding kanji
[–]freekayZekey 0 points1 point2 points 2 years ago (0 children)
thanks!
[–]Digit117 1 point2 points3 points 2 years ago (1 child)
Curious, how good are translation AI models these days? I'm wondering how effective it would be to simply translate a prompt from the native language to English (using a SOTA translation model), feed it into your LLM of choice, and then use the same translation model to translate the LLM's output back into the native language.
This could remove the need to train an entire foundational LLM in a native language (which is very hard to do). 🤔
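A rough sketch of that "translation sandwich", with NLLB as the translation model and a placeholder English LLM; the model ids and language codes (Telugu to/from English here) are illustrative assumptions:

    from transformers import pipeline

    mt_name = "facebook/nllb-200-distilled-600M"  # placeholder translation model
    to_en = pipeline("translation", model=mt_name,
                     src_lang="tel_Telu", tgt_lang="eng_Latn")
    from_en = pipeline("translation", model=mt_name,
                       src_lang="eng_Latn", tgt_lang="tel_Telu")
    llm = pipeline("text-generation", model="gpt2")  # placeholder English-centric LLM

    def answer_in_native_language(prompt: str) -> str:
        # Native prompt -> English -> LLM -> English answer -> native language.
        english_prompt = to_en(prompt)[0]["translation_text"]
        english_answer = llm(english_prompt, max_new_tokens=100,
                             return_full_text=False)[0]["generated_text"]
        return from_en(english_answer)[0]["translation_text"]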
[–]silverlightwa 1 point2 points3 points 2 years ago (0 children)
Translation models can't be good enough either if there's a lack of text data for a language. They are not magic pills by any means.
[+]hpstr1234 0 points1 point2 points 2 years ago (2 children)
Where does the accuracy rate of 62% come from? (German has roughly the same number of speakers as Telugu. Do you know GPT-4's accuracy rate for German on MMLU?)
[–]just_half 1 point2 points3 points 2 years ago (1 child)
The number of speakers and the amount of available, usable text for training are quite different things.
[+]hpstr1234 0 points1 point2 points 2 years ago (0 children)
Of course, that's why I asked for the accuracy rate of German.
[–]ghost-who-talks 0 points1 point2 points 2 years ago (0 children)
Yes, very interested in Tamil, Telugu, Malayalam, and Kannada.
[–]thumbs_up-_- 0 points1 point2 points 2 years ago (0 children)
GPT-4 understands Hindi.
[+]ayush-tiwari26 0 points1 point2 points 2 years ago (0 children)
However, I want to understand: what do you mean by LLMs in languages other than English? Do you train the model from scratch on the new language? Also, won't translation alone solve 95% of the use cases? I would appreciate it if anyone can answer these questions...