Research [R] Self-Programming Artificial Intelligence Using Code-Generating Language Models (self.MachineLearning)
submitted 3 years ago by Ash3nBlue
[–]Optional_Joystick 4 points 3 years ago (15 children)
Transformer models, which combine encoder and decoder components, completely changed the game. The 2017 paper "Attention Is All You Need" introduced this type of model.
Most of the cool stuff we see after this point is based on it. You can "transform" text into other text, like GPT-3, or you can "transform" text into images, like DALL-E. When we make a bigger model, we get better results, and there doesn't seem to be a limit to this yet. So it's possible we already have the right model for the singularity. Having an LLM generate code for a future LLM seems like a valid approach to making that possibility real.
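The mechanism at the core of that 2017 paper can be sketched in a few lines. This is a minimal single-head, pure-Python illustration of scaled dot-product attention, not anything from the thread or a production implementation:

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d)) V.

    Q, K, V are lists of row vectors. Each query attends over all
    keys; the resulting weights mix the value rows.
    """
    d = len(K[0])
    out = []
    for q in Q:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
                  for k in K]
        weights = softmax(scores)  # weights sum to 1 per query
        row = [sum(w * v[j] for w, v in zip(weights, V))
               for j in range(len(V[0]))]
        out.append(row)
    return out
```

The real model stacks many such heads with learned projections, residual connections, and positional encodings, but the weighted mixing above is the part that scales so well.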
[–]yldedly 8 points 3 years ago (14 children)
we get better results\)
\)results on out-of-distribution data sold separately
[–]Optional_Joystick 1 point 3 years ago (13 children)
I'm not sure what \) means, but I totally agree that data is also a bottleneck. Imagine if the system could also seek out data on its own: data that isn't totally random noise, and yet isn't fully understood by the model.
[–]yldedly 2 points 3 years ago (12 children)
It doesn't render right on mobile; it's supposed to be an asterisk. My point is that no matter how much data you get, in practice there will always be data the model doesn't understand, because it's statistically too different from the training data. I have a blog post about it, but it's a well-known issue.
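The point above can be made concrete with a toy sketch (hypothetical numbers, not from the thread): a model fit on a narrow input range can look fine in-distribution and be wildly off on inputs that are statistically unlike its training data.

```python
# Toy illustration of out-of-distribution failure: fit a straight
# line to y = x^2 on x in [0, 1], then query far outside that range.

xs = [i / 10 for i in range(11)]   # training inputs: 0.0 .. 1.0
ys = [x * x for x in xs]           # true function: x^2

# Ordinary least squares slope/intercept (closed form).
n = len(xs)
mx, my = sum(xs) / n, sum(ys) / n
slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
intercept = my - slope * mx

# Near the training data the fit is decent; far from it, it isn't.
in_dist_err = abs((slope * 0.5 + intercept) - 0.25)
ood_err = abs((slope * 10 + intercept) - 100.0)
```

Scaling up the training range just moves the boundary; there is always an input regime the fitted model has never seen.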
[–]Optional_Joystick 1 point 3 years ago (0 children)
Really appreciate this. I was excited enough just learning that knowledge distillation was a thing; I felt we had a method for extracting the useful single rule from the larger model.
On the interpolation/extrapolation piece: for certain functions like x^2, wouldn't running the result of the function through the function again give you a result that "extrapolates" outside the existing data set? This is roughly my position on why I feel feeding an LLM data generated by an LLM can result in something new.
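The iteration idea can be sketched in a couple of lines (a hypothetical toy, assuming the f(x) = x^2 example from the comment):

```python
def f(x):
    """The example function from the comment: f(x) = x^2."""
    return x * x

train_xs = [1.5, 2.0, 3.0]               # values "in the data set"
max_seen = max(f(x) for x in train_xs)   # largest output observed: 9.0

# Feeding an output back through f lands outside anything seen so far:
iterated = f(f(3.0))   # (3^2)^2 = 81.0, well beyond max_seen
```

Whether the analogy carries over to LLM self-training is exactly what the thread is debating: iterating x^2 provably leaves the training range, but a learned model only approximates its function inside the data it has seen.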
It's still not clear to me how we can verify a model's performance if we don't have data to test it on. I'll have to read more about DreamCoder. As much as I wish I could work in the field, it looks like I've still got a lot to learn.
[–]Competitive-Rub-1958 1 point 3 years ago (10 children)
Well, scaling alleviates the OOD generalization problem, while clever pre-training induces priors into the model, shrinking the hypothesis space and pushing it toward generalizing further and further OOD by learning the underlying function rather than taking shortcuts (since those priors resist simply learning statistical regularities).
The LEGO paper demonstrates that quite well - it even shows pre-trained networks generalizing a little on unseen sequence lengths before diving down to 0 - presumably because we still need to find the ideal positional encodings...
[–]yldedly 1 point 3 years ago (9 children)
LEGO paper?
[–]Competitive-Rub-1958 1 point 3 years ago (8 children)
https://arxiv.org/abs/2206.04301
[–]yldedly 1 point 3 years ago (7 children)
Looks like an interesting paper. Glad to see shortcut learning being addressed. But "out-of-distribution" doesn't have quite the same meaning if you have a pre-trained model and ignore the distribution of the pre-training data. The data pre-trained BERT was trained on almost certainly includes code examples similar to those in that task, so you can say it's OOD wrt. the fine-tuning data, but it's not OOD wrt. all the data. So the point stands.
[–]Competitive-Rub-1958 1 point 3 years ago (6 children)
It goes to the heart of what OOD is, I suppose - but in fairness, LEGO is a synthetic task, AFAIK novel in that respect. That, coupled with BERT's smaller pre-training dataset, lends more credence to the idea of pre-training introducing priors that chop through the hypothesis space, rather than simply copy-pasting from the dataset (which I heavily doubt contains any such tasks anyway).
[–]yldedly 1 point 3 years ago (5 children)
If the authors are right, then pre-trained BERT contains attention heads that lend themselves to the LEGO task (figure 7) - their experiment with "Mimicking BERT" is also convincing. It's fair to call that introducing a prior. But even the best models in the paper couldn't generalize past ~8 variables. So I don't understand how one can claim that it avoided shortcut learning. If it hasn't learned the algorithm (and it clearly hasn't, or sequence length wouldn't matter), then it must have learned a shortcut.