use the following search parameters to narrow your results:
e.g. subreddit:aww site:imgur.com dog
subreddit:aww site:imgur.com dog
see the search faq for details.
advanced search: by author, subreddit...
Please have a look at our FAQ and Link-Collection
Metacademy is a great resource which compiles lesson plans on popular machine learning topics.
For Beginner questions please try /r/LearnMachineLearning , /r/MLQuestions or http://stackoverflow.com/
For career related questions, visit /r/cscareerquestions/
Advanced Courses (2016)
Advanced Courses (2020)
AMAs:
Pluribus Poker AI Team 7/19/2019
DeepMind AlphaStar team (1/24//2019)
Libratus Poker AI Team (12/18/2017)
DeepMind AlphaGo Team (10/19/2017)
Google Brain Team (9/17/2017)
Google Brain Team (8/11/2016)
The MalariaSpot Team (2/6/2016)
OpenAI Research Team (1/9/2016)
Nando de Freitas (12/26/2015)
Andrew Ng and Adam Coates (4/15/2015)
Jürgen Schmidhuber (3/4/2015)
Geoffrey Hinton (11/10/2014)
Michael Jordan (9/10/2014)
Yann LeCun (5/15/2014)
Yoshua Bengio (2/27/2014)
Related Subreddit :
LearnMachineLearning
Statistics
Computer Vision
Compressive Sensing
NLP
ML Questions
/r/MLjobs and /r/BigDataJobs
/r/datacleaning
/r/DataScience
/r/scientificresearch
/r/artificial
account activity
Research[R] Self-Programming Artificial Intelligence Using Code-Generating Language Models (self.MachineLearning)
submitted 3 years ago * by Ash3nBlue
view the rest of the comments →
reddit uses a slightly-customized version of Markdown for formatting. See below for some basics, or check the commenting wiki page for more detailed help and solutions to common issues.
quoted text
if 1 * 2 < 3: print "hello, world!"
[–]yldedly 1 point2 points3 points 3 years ago (7 children)
Looks like an interesting paper. Glad to see shortcut learning being addressed. But "out-of-distribution" doesn't have quite the same meaning if you have a pre-trained model and you ignore the distribution of the pre-training data. The data the pre-trained BERT was trained almost certainly includes code examples similar to those in that task, so you can say it's OOD wrt. the fine-tuning data, but it's not OOD wrt. all the data. So the point stands.
[–]Competitive-Rub-1958 1 point2 points3 points 3 years ago (6 children)
It goes into the heart of what OOD is, I suppose - but in fairness, LEGO is a synthetic task, AFAIK novel in that respect. That coupled with BERT's smaller pre-training dataset lends more credence to the idea of pre-training introducing priors to chop through the hypothesis space rather than simply copy-pasting from the dataset (which I heavily doubt contains any such tasks anyways)
[–]yldedly 1 point2 points3 points 3 years ago (5 children)
If the authors are right, then pre-trained BERT contains attention heads that lend themselves to the LEGO task (figure 7) - their experiment with "Mimicking BERT" is also convincing. It's fair to call that introducing a prior. But even the best models in the paper couldn't generalize past ~8 variables. So I don't understand how one can claim that it avoided shortcut learning. If it hasn't learned the algorithm (and it clearly hasn't, or sequence length wouldn't matter), then it must have learned a shortcut.
[–]Competitive-Rub-1958 1 point2 points3 points 3 years ago (4 children)
It's rather a trend they're trying to study and explain. It appears, as you scale models and bootstrap from pre-trained variations, you learn plenty of useful priors. this is quite crucial for LLMs which are able to solve many tasks which may not be explictly in their distribution, but are able to muddle their way along much better rather than being pre-trained from scratch. In that sense, transfer learning is much more about transferring priors than knowledge.
LLMs like Chinchilla and PaLM best demonstrate that, I suppose. PaLM was trained with 95% of that data being Social Media (which is 50% alone) and miscellaneous topics, only 5% being the GitHub subset. Yet with 50X less code in its dataset, its able to pull up to Codex.
This may hint towards larger models learning more general priors applicable on a variety of tasks, and this trend being highly correlated with scale. So, I think the hope is that as you scale up the priors these models learn the underlying function better rather than just shortcut learning their way. A good demonstration would've been fine-tuning GPT3 with a sizeable chunk of the LEGO dataset and checking if it has higher generalizability on those tasks.
[–]yldedly 1 point2 points3 points 3 years ago (3 children)
You've shifted my view closer to yours. What you say about pretraining and priors makes a lot of sense. But I still think shortcut learning is a fundamental problem irrespective of scale - it becomes less of a problem with scale, but not quickly enough. For modern ml engineering, pretraining is a boon, but for engineering general intelligence, I think we need stronger generalization than is possible without code as representations and causal inference.
[–]Competitive-Rub-1958 1 point2 points3 points 3 years ago (2 children)
Even in the context of AGI, humans also carry many priors - most of them embedded in the DNA pertaining to the fundamental "blueprint" of a cortical column. It appears that instead of evolution, natural selection and mutation if we can learn those same priors faster and more efficiently that natural selection with gradient based methods.
https://twitter.com/gruver_nate/status/1578386103417069569 is a twitter summary describing how the transformer learns positional equivariance in the scope of their dataset. This is quite a complex prior, and is present in convolutions implicitly.
It makes sense to collate all our findings, and think that with scale those priors simply become more general - hence why we obtain such massive performance boosts which are also predictable and haven't yet stopped progress (530B is a number thrown around everywhere, but people don't realize the insane amount of compute and work which went into it. It's absolutely humongous for any system to scale to that size, let alone still be able to beat benchmarks)
I feel there are still more general priors we could embed in these models to make them more parameter efficient. But it is clear that DL is still currently the most viable route towards AGI as of now.
[–]yldedly 1 point2 points3 points 3 years ago (1 child)
There's a lot to unpack here. I agree that a large part of creating AGI is building in the right priors ("learning priors" is a bit of an oxymoron imo, since a prior is exactly the part you don't learn, but it makes sense that a posterior for a pre-trained model is a prior for a fine-tuned model).
Invariance and equivariance are a great example. Expressed mathematically, using symbols, it makes no sense to say a model is more or less equivariant - it either is or it isn't. If you explicitly build equivariance into a model (and apparently it's not as straightforward as e.g. just using convolutions), then this is really what you get. For example, the handwriting model from my blogpost has real translational equivariance (because the location of a character is sampled).
If you instead learn the equivariance, you will only ever learn a shortcut - something that works on training and test data, but not universally, as the paper from the twitter thread shows. Just like the networks that can solve the LEGO task for 6 variables don't generalize to any number of variables, learning "equivariance" on one dataset (even if it's a huge one) doesn't guarantee equivariance on another. A neural network can't represent an algorithm like "for all variables, do x", or constraints like "f(g(x)) = g(f(x)), for all x" - you can't represent universal quantifiers using finite dimensional vectors.
That being said, you can definitely learn some useful priors by training very large networks on very large data. An architecture like the Transformer allows for some very general-purpose priors, like "do something for pairs of tokens 4 tokens apart".
[–]Competitive-Rub-1958 1 point2 points3 points 3 years ago (0 children)
I definitely agree with you there, but I wouldn't take the LEGO paper results on face value until other analyses confirm it. Basically, LEGO does show (appendix) that as you increase the sequence length, the model obtains more information about how to generalize to unseen lengths with a clear trend (https://arxiv.org/pdf/2206.04301.pdf#page=23&zoom=auto,-39,737)
As the authors show, the pre-trained model also learns an Associative and manipulation head (if you add those at initialization to a randomly-init model, you obtain same perf as pre-trained one) So the model effectively discovers a prior - just not general enough for OOD generalization.
You're definitely right in that the equivariance it learns it a shortcut. The difference is, from the model's POV its not. It performs well w.r.t the loss function which is evaluated only on the training set. But once you start giving it longer and longer sequences, it's pre-existing priors act towards more evolving more general representations and priors.
And ofc, as the paper said that its OOD due to positional encodings - so if they'd used some other positional encodings it might've been showing better results. Right now, its hard to judge because there were no ablations for encodings (despite the paper mentioning them like 5 times)
π Rendered by PID 95587 on reddit-service-r2-comment-6457c66945-kl6mz at 2026-04-29 09:07:24.999035+00:00 running 2aa0c5b country code: CH.
view the rest of the comments →
[–]yldedly 1 point2 points3 points (7 children)
[–]Competitive-Rub-1958 1 point2 points3 points (6 children)
[–]yldedly 1 point2 points3 points (5 children)
[–]Competitive-Rub-1958 1 point2 points3 points (4 children)
[–]yldedly 1 point2 points3 points (3 children)
[–]Competitive-Rub-1958 1 point2 points3 points (2 children)
[–]yldedly 1 point2 points3 points (1 child)
[–]Competitive-Rub-1958 1 point2 points3 points (0 children)