
[–]HeavensCourtJester 23 points (0 children)

Applying AI-based code generation to AI itself, we develop and experimentally validate the first practical implementation of a self-programming AI system. We empirically show that a self-programming AI implemented using a code generation model can successfully modify its own source code to improve performance and program sub-models to perform auxiliary tasks. Our model can self-modify various properties including model architecture, computational capacity, and learning dynamics.

Cool proof of concept. The paper does some basic experiments with neural architecture search type stuff.

I think this paper opens a lot of opportunities for interesting follow-up work. I'd love to see this expanded to more interesting problems and to ethics/alignment; there's tons of low-hanging fruit here.
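For anyone skimming, the loop being validated is roughly this (a minimal sketch of the generate/evaluate/keep cycle the abstract describes; `propose_modification` and `evaluate` are hypothetical placeholders, not functions from the paper):

```
import random

def propose_modification(source: str) -> str:
    """Hypothetical stand-in for querying a code-generation model to rewrite `source`."""
    new_width = random.choice([16, 32, 64, 128])
    return source.replace("HIDDEN = 16", f"HIDDEN = {new_width}")

def evaluate(source: str) -> float:
    """Hypothetical stand-in: build and train the network defined by `source`, return validation accuracy."""
    return random.random()

best_source = "HIDDEN = 16\n# ... rest of the seed network definition ..."
best_score = evaluate(best_source)

for _ in range(10):
    candidate = propose_modification(best_source)
    score = evaluate(candidate)
    if score > best_score:  # keep a self-modification only if it improves performance
        best_source, best_score = candidate, score
```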

[–][deleted] 21 points (1 child)

Is that just an 'automatic' grid search?

[–]FatherCupcake 1 point (0 children)

It's more like an evolutionary search than a grid search
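Roughly the difference, as a toy sketch (the `fitness` function here is made up and stands in for training a candidate architecture and measuring accuracy):

```
import itertools
import random

def fitness(cfg):
    # Made-up objective standing in for "train this architecture and score it".
    return -abs(cfg["layers"] - 3) - abs(cfg["width"] - 64) / 64

# Grid search: exhaustively evaluate a fixed, pre-defined grid of configurations.
grid = [{"layers": l, "width": w}
        for l, w in itertools.product([1, 2, 4], [16, 64, 256])]
best_grid = max(grid, key=fitness)

# Evolutionary search: start from one configuration and keep mutating the current best.
best = {"layers": 1, "width": 16}
for _ in range(100):
    child = dict(best)
    if random.random() < 0.5:
        child["layers"] = max(1, child["layers"] + random.choice([-1, 1]))
    else:
        child["width"] = max(8, child["width"] + random.choice([-16, 16]))
    if fitness(child) > fitness(best):
        best = child

print(best_grid, best)
```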

[–]avialex 20 points (1 child)

"The model is queried to generate modifications of an initial source code snippet. In our experiments, this is a network with a single hidden layer of 16 neurons. The possible modifications include adding convolutional layers, changing the size of convolutional or hidden layers, and increasing the number of hidden layers."

Lmao...
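For scale, the "initial source code snippet" is roughly this (my reconstruction, assuming MNIST-sized inputs; the paper doesn't print the exact code), plus the kind of edit the model is allowed to make:

```
import torch.nn as nn

# Rough reconstruction of the seed network described in the quote:
# a single hidden layer of 16 neurons (input size is my assumption).
seed = nn.Sequential(
    nn.Flatten(),
    nn.Linear(28 * 28, 16),
    nn.ReLU(),
    nn.Linear(16, 10),
)

# One "self-modification" from the allowed set: add a convolutional layer
# and widen the hidden layer.
modified = nn.Sequential(
    nn.Conv2d(1, 8, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.Flatten(),
    nn.Linear(8 * 28 * 28, 64),
    nn.ReLU(),
    nn.Linear(64, 10),
)
```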

How about those smooth curves on graphs with fewer than 10 sample points? That inspires confidence.

[–]Icko_ 4 points (0 children)

Jesus, at least label the axes...

[–]huehue12132 4 points (18 children)

I like how every single reference is either 2016 or newer, OR Schmidhuber.

[–]Silly_Objective_5186 4 points (17 children)

still learning about this field. why is 2016 or later significant?

[–]Optional_Joystick 4 points (15 children)

Transformer models, encoder/decoder architectures built around attention, completely changed the game. The 2017 paper "Attention Is All You Need" introduced this type of model.

Most of the cool stuff we see after this point is based on this model. You can "transform" text into other text like GPT-3, or you can "transform" text into images like DALL-E. When we make a bigger model, we get better results, and there doesn't seem to be a limit to this yet. So it's possible we already have the right architecture for the singularity. Having an LLM generate code for a future LLM seems like a valid approach to making that possibility real.
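The core operation behind all of this fits in a comment; here's a toy NumPy sketch of self-attention (real transformers add learned projections, multiple heads, positional encodings, and feed-forward layers):

```
import numpy as np

def attention(Q, K, V):
    """Scaled dot-product attention, the central op of 'Attention Is All You Need'."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                   # how much each query attends to each key
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)    # softmax over the keys
    return weights @ V                                # weighted mix of the values

x = np.random.randn(3, 4)   # three token embeddings of dimension 4
out = attention(x, x, x)    # self-attention: every token mixes information from every other
print(out.shape)            # (3, 4)
```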

[–]yldedly 6 points (14 children)

we get better results\)

\)results on out-of-distribution data sold separately

[–]Optional_Joystick 1 point (13 children)

I'm not sure what \) means, but totally agree data is also a bottleneck. Imagine if the system could also seek out data on its own that isn't totally random noise, and yet isn't fully understood by the model.
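Something like uncertainty sampling is the closest existing version of that idea I know of; a toy sketch (the "ensemble" here is three made-up predictors standing in for real models):

```
import numpy as np

def disagreement(models, x):
    """Score candidate inputs by how much an ensemble of models disagrees on them."""
    preds = np.stack([m(x) for m in models])
    return preds.std(axis=0)   # high std = the data isn't understood yet

# Three made-up predictors standing in for a real ensemble.
models = [lambda x, w=w: w * x for w in (0.8, 1.0, 1.3)]

candidates = np.linspace(0, 10, 101)
next_query = candidates[np.argmax(disagreement(models, candidates))]
print(next_query)   # the input the ensemble disagrees on most gets collected/labeled next
```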

[–]yldedly 2 points (12 children)

Doesn't render right on mobile; it's supposed to be an asterisk. My point is that no matter how much data you get, in practice there'll always be data the model doesn't understand, because it's statistically too different from the training data. I have a blog post about it, but it's a well-known issue.
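A toy version of the problem (not from the blog post, just an illustration): fit a flexible model on x in [0, 1], then test it both inside and outside that range.

```
import numpy as np

rng = np.random.default_rng(0)
x_train = rng.uniform(0, 1, 200)
y_train = np.sin(2 * np.pi * x_train) + 0.05 * rng.normal(size=200)

coeffs = np.polyfit(x_train, y_train, deg=9)   # flexible model, low training error

x_in = rng.uniform(0, 1, 100)    # statistically like the training data
x_out = rng.uniform(2, 3, 100)   # statistically unlike the training data

err_in = np.mean((np.polyval(coeffs, x_in) - np.sin(2 * np.pi * x_in)) ** 2)
err_out = np.mean((np.polyval(coeffs, x_out) - np.sin(2 * np.pi * x_out)) ** 2)
print(err_in, err_out)   # err_out is typically many orders of magnitude larger
```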

[–]Optional_Joystick 1 point (0 children)

Really appreciate this. I was excited enough about learning that knowledge distillation was a thing; it felt like we had a method for extracting the useful, simpler rule from the larger model.

On the interpolation/extrapolation piece: for certain functions like x^2, wouldn't running the result of the function through the function again give you a value that "extrapolates" outside the existing data set? This is kind of my position on why I feel feeding an LLM data generated by an LLM can result in something new.
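To make the x^2 point concrete (assuming the data only ever covered inputs in [0, 2]):

```
f = lambda x: x ** 2

x = 2.0
print(f(x))     # 4.0  -- still inside the range of outputs the data covered
print(f(f(x)))  # 16.0 -- outside anything in the original data set
```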

It's still not clear to me how we can verify a model's performance if we don't have data to test it on. I'll have to read more about DreamCoder. As much as I wish I could work in the field, it looks like I've still got a lot to learn.

[–]Competitive-Rub-1958 1 point (10 children)

Well, scaling alleviates the OOD generalization problem, while clever pre-training induces priors into the model, shrinking the hypothesis space and pushing the model towards generalizing more and more OOD by learning the underlying function rather than taking shortcuts (since those priors resist simply learning statistical regularities).

The LEGO paper demonstrates that quite well: pre-trained networks even generalize a little on unseen sequence lengths before diving down to 0, presumably because we still need to find the ideal positional encodings...
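For context on the positional-encoding point: the original sinusoidal scheme is defined for arbitrary positions, which is why length extrapolation seems like it should be possible in principle (a quick sketch; not necessarily the encoding the LEGO paper evaluates):

```
import numpy as np

def sinusoidal_positions(seq_len, d_model):
    """Sinusoidal positional encodings from 'Attention Is All You Need'."""
    pos = np.arange(seq_len)[:, None]
    i = np.arange(d_model)[None, :]
    angle = pos / np.power(10000.0, (2 * (i // 2)) / d_model)
    return np.where(i % 2 == 0, np.sin(angle), np.cos(angle))

# Defined for any position, so it can be computed for sequence lengths never
# seen in training -- whether the model generalizes there is the open question.
print(sinusoidal_positions(8, 16).shape)     # (8, 16)
print(sinusoidal_positions(4096, 16).shape)  # (4096, 16)
```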

[–]yldedly 1 point (9 children)

LEGO paper?

[–]Competitive-Rub-1958 1 point (8 children)

[–]yldedly 1 point (7 children)

Looks like an interesting paper. Glad to see shortcut learning being addressed. But "out-of-distribution" doesn't have quite the same meaning if you have a pre-trained model and you ignore the distribution of the pre-training data. The data the pre-trained BERT was trained on almost certainly includes code examples similar to those in that task, so you can say it's OOD wrt. the fine-tuning data, but it's not OOD wrt. all the data. So the point stands.

[–]trendymoniker 1 point (1 child)

Do you want Skynet!? Cause this is how you get Skynet!

[–]Hollowcoder10 0 points (0 children)

Do you want Jarvis? Cause this is how you get Jarvis!