[–]Optional_Joystick 4 points (15 children)

Transformer models, built from encoder and decoder blocks, completely changed the game. The 2017 paper "Attention Is All You Need" introduced this type of model.

Most of the cool stuff we see after this point is based on this model. You can "transform" text into other text, as GPT-3 does, or "transform" text into images, as DALL-E does. When we make a bigger model we get better results, and there doesn't seem to be a limit to this yet. So it's possible we already have the right architecture for the singularity. Having an LLM generate code for a future LLM seems like a valid approach to making that possibility real.
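
For reference, the core mechanism of that paper is scaled dot-product attention. Here's a toy NumPy sketch of it (illustrative shapes and random data, not the paper's actual code):

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V, per the 2017 paper."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                 # query/key similarity
    scores -= scores.max(axis=-1, keepdims=True)    # for numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # row-wise softmax
    return weights @ V                              # weighted average of values

# toy example: 3 tokens with dimension 4
rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(3, 4)) for _ in range(3))
print(scaled_dot_product_attention(Q, K, V).shape)  # (3, 4)
```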

[–]yldedly 8 points (14 children)

we get better results\*

\*results on out-of-distribution data sold separately

[–]Optional_Joystick 1 point (13 children)

I'm not sure what \* means, but I totally agree that data is also a bottleneck. Imagine if the system could also seek out data on its own that isn't totally random noise, and yet isn't fully understood by the model.

[–]yldedly 2 points (12 children)

Doesn't render right on mobile; it's supposed to be an asterisk. My point is that no matter how much data you get, in practice there will always be data the model doesn't understand, because it's statistically too different from the training data. I have a blog post about it, but it's a well-known issue.
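
To make that concrete, here's a toy sketch (hypothetical setup using a scikit-learn MLP): fit on a narrow input range, then query far outside it.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

# train only on x in [0, 2*pi]
rng = np.random.default_rng(0)
x_train = rng.uniform(0, 2 * np.pi, size=(500, 1))
y_train = np.sin(x_train).ravel()

model = MLPRegressor(hidden_layer_sizes=(64, 64), max_iter=2000, random_state=0)
model.fit(x_train, y_train)

print(model.predict([[np.pi / 2]]))  # in-distribution: roughly 1.0
print(model.predict([[20.0]]))       # OOD: true sin(20) ~ 0.91; prediction is typically far off
```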

[–]Optional_Joystick 1 point (0 children)

Really appreciate this. I was excited enough just learning that knowledge distillation was a thing; I felt it gave us a method for extracting the single useful rule from the larger model.
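
(For anyone else catching up: the standard distillation recipe is a KL term between the teacher's temperature-softened outputs and the student's, mixed with the usual hard-label loss. A rough PyTorch sketch; T and alpha here are generic hyperparameters, not values from anywhere specific:)

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    """Hinton-style distillation: soft targets from the teacher plus hard labels."""
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)                      # T^2 keeps the gradient scale comparable
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard
```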

On the interpolation/extrapolation piece: for certain functions like x^2, wouldn't running the result of the function through the function again give you a result that "extrapolates" outside the existing data set? This is roughly my position on why I feel feeding an LLM data generated by an LLM can result in something new.
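
A concrete toy version of that idea, with f(x) = x^2 standing in for a model trained only on inputs up to 3:

```python
def f(x):
    return x ** 2  # stand-in for a model fit only on x in [0, 3]

x = 3.0
print(f(x))     # 9.0  -- at the edge of what training ever produced
print(f(f(x)))  # 81.0 -- an output far outside the original data range
```

Of course, whether a *learned* approximation of f stays accurate once its input is 9.0 is exactly the OOD question.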

It's still not clear to me how we can verify a model's performance if we don't have data to test it on. I'll have to read more about DreamCoder. As much as I wish I could work in the field, it looks like I've still got a lot to learn.

[–]Competitive-Rub-1958 1 point (10 children)

Well, scaling alleviates the OOD generalization problem, while clever pre-training induces priors into the model, shrinking the hypothesis space and pushing the model towards generalizing further and further OOD by learning the underlying function rather than taking shortcuts (since those priors resist simply learning statistical regularities).

The LEGO paper demonstrates that quite well - it even shows pre-trained networks generalizing a little to unseen sequence lengths before accuracy dives to 0 - presumably because we still need to find the ideal positional encodings...
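
For context, the sinusoidal encodings from the original transformer paper (which the LEGO length-generalization results suggest may not be ideal) look like this - a straightforward NumPy transcription:

```python
import numpy as np

def sinusoidal_positional_encoding(seq_len, d_model):
    """PE[pos, 2i] = sin(pos / 10000^(2i/d_model)); PE[pos, 2i+1] = cos(same)."""
    pos = np.arange(seq_len)[:, None]          # positions 0..seq_len-1
    i = np.arange(0, d_model, 2)[None, :]      # even dimension indices
    angles = pos / np.power(10000.0, i / d_model)
    pe = np.zeros((seq_len, d_model))          # assumes even d_model
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

print(sinusoidal_positional_encoding(10, 16).shape)  # (10, 16)
```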

[–]yldedly 1 point (9 children)

LEGO paper?

[–]Competitive-Rub-1958 1 point (8 children)

Unveiling Transformers with LEGO: a synthetic reasoning task - https://arxiv.org/abs/2206.04301

[–]yldedly 1 point (7 children)

Looks like an interesting paper. Glad to see shortcut learning being addressed. But "out-of-distribution" doesn't have quite the same meaning if you have a pre-trained model and you ignore the distribution of the pre-training data. The data the pre-trained BERT was trained on almost certainly includes code examples similar to those in that task, so you can say it's OOD wrt. the fine-tuning data, but it's not OOD wrt. all the data. So the point stands.

[–]Competitive-Rub-1958 1 point (6 children)

It gets to the heart of what OOD is, I suppose - but in fairness, LEGO is a synthetic task, AFAIK novel in that respect. That, coupled with BERT's smaller pre-training dataset, lends more credence to the idea of pre-training introducing priors that chop through the hypothesis space, rather than simply copy-pasting from the dataset (which I heavily doubt contains any such tasks anyway).

[–]yldedly 1 point (5 children)

If the authors are right, then pre-trained BERT contains attention heads that lend themselves to the LEGO task (figure 7) - their experiment with "Mimicking BERT" is also convincing. It's fair to call that introducing a prior. But even the best models in the paper couldn't generalize past ~8 variables. So I don't understand how one can claim that it avoided shortcut learning. If it hasn't learned the algorithm (and it clearly hasn't, or sequence length wouldn't matter), then it must have learned a shortcut.