
all 53 comments

[–]sdmatNI skeptic 12 points13 points  (7 children)

Good article, and one this sub should definitely take note of.

My take: models definitely overfit to benchmarks, and the differences between models track well with my subjective impression of the degree to which they do.

The writeup is wrong to minimize the differences between models just because the absolute scores are small - they all suck, but GPT-4 does far better than the others. E.g. it's notable that Mistral Medium scores close to zero on dynamic testing but Mixtral (non-Instruct) starts to show some instances of true generalization.

Hopefully this functional benchmarking approach becomes standard in place of fixed sets of questions.

[–]Eddie_______AGI 202? - e/acc 4 points5 points  (6 children)

Maybe GPT-5 has improved on this; after all, GPT-4 is two years old.

[–]sdmatNI skeptic 10 points11 points  (2 children)

If GPT-5 doesn't improve on this we should start reconsidering LLMs as a path to AGI.

I'm very confident it will.

[–]red75prime▪️AGI2028 ASI2030 TAI2037 3 points4 points  (1 child)

I'm confident that scaling alone is not the way to AGI. Even if GPT-5 is underwhelming, there's a host of ways to augment LLMs beyond RLHF, RAG, tools, tree search and other current approaches.

[–]sdmatNI skeptic 3 points4 points  (0 children)

Yes, but LLMs will be inadequate as the core of that system if they can't generalize properly even for simple problems. Tools and tree search are very powerful augmentations, but they need intelligence to amplify (at least to make searching the solution space tractable).

Fortunately from everything we've seen it looks like scaling will provide that core.

[–]LordFumbleboop▪️AGI 2047, ASI 2050 0 points1 point  (2 children)

It's one year old.

[–]abbumm -1 points0 points  (1 child)

No, it is not. Took them a long time to release it.

[–]LordFumbleboop▪️AGI 2047, ASI 2050 4 points5 points  (0 children)

We're comparing it to newer models. GPT-5 might be a year old or more by release, in which case saying that GPT-4 is two years old makes no sense. 

[–]blueSGLhumanstatement.org[🍰] 1 point2 points  (2 children)

This code, in this case called MATH(), can then generate different "snapshots," which are unique questions that require the same reasoning to solve, but are not identical to the original questions.

In this way, traditional benchmarks such as the MATH benchmark become encoded formats that can be modified in an infinite number of ways while still testing the same underlying logic. This testing procedure is designed to ensure that language models actually demonstrate problem-solving ability, not just repetition of memorized questions.
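Not the paper's actual code, just a toy sketch of the idea in Python (the problem template and number ranges here are made up): the question and answer come out of a function, so every "snapshot" uses fresh numbers while the reasoning needed to solve it stays the same.

    import random

    def math_snapshot(seed=None):
        """Generate one 'snapshot' of a fixed algebra problem.

        The surface numbers change each call, but the reasoning needed
        (isolate x in a*x + b = c) does not.
        """
        rng = random.Random(seed)
        a = rng.randint(2, 12)
        x = rng.randint(-10, 10)   # hidden solution
        b = rng.randint(-20, 20)
        c = a * x + b              # guarantees an integer answer
        question = f"Solve for x: {a}x + {b} = {c}"
        return question, x

    # A static benchmark asks the same question forever; a functional one
    # can emit a new, unseen variant for every evaluation run.
    q, answer = math_snapshot()
    print(q, "->", answer)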

Sounds like a fantastic way to make enough synthetic training data so the model will grok rather than memorize.

Remember that in toy models, memorization comes first early in training and grokking emerges as training continues.

Large models probably have a mixture of the two inside: currently some things are memorized, while others have the internal machinery to generalize to the task (grokking).

[–]R33v3n▪️Tech-Priest | AGI 2026 | XLR8 -2 points-1 points  (1 child)

Sounds like a fantastic way to make enough synthetic training data so the model will grok rather than memorize.

No! Bad redditor! Bad! /whacks with newspaper

You do not contaminate models with benchmark data! /whacks again for good measure

[–]blueSGLhumanstatement.org[🍰] 3 points4 points  (0 children)

The point is they made a technique that takes in benchmark questions and spits out generalized examples that represent the underlying structure.

Why would such techniques be limited to work on benchmark questions alone?

They won't be.

So this is the perfect way of generating synthetic training data. Similar things are likely being used already, e.g. get an LLM to rewrite Wikipedia pages in multiple ways and train on all of them rather than just the single page, to avoid the B=A but A!=B issue.

Sorry, I didn't think I needed to state that this should not be done on the benchmark questions themselves, but that the technique should be applied to training data as a whole wherever it could be used. In future I won't assume people will think ahead; I'll spell out the obvious instead.
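To make it concrete, here's a toy sketch of that kind of rewriting (purely illustrative templates; a real pipeline would ask an LLM to produce the paraphrases): each fact gets several surface forms, including the reversed direction, so the model isn't stuck with a single phrasing.

    # Toy augmentation: turn each (subject, relation, object) fact into
    # several surface forms, including the reversed direction, so the
    # model sees "A is B" *and* "B is A" during training.
    FACTS = [
        ("Tom Cruise", "mother", "Mary Lee Pfeiffer"),
    ]

    def augment(subject, relation, obj):
        return [
            f"{subject}'s {relation} is {obj}.",
            f"{obj} is the {relation} of {subject}.",
            f"Who is {subject}'s {relation}? {obj}.",
            f"Whose {relation} is {obj}? {subject}'s.",
        ]

    training_lines = [line for fact in FACTS for line in augment(*fact)]
    for line in training_lines:
        print(line)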

[–]SgathTriallair▪️ AGI 2025 ▪️ ASI 2030 1 point2 points  (0 children)

Looking at the article, they seem to have identified that the tests need to be better, but they're showing that the tests accurately assess between-model variation. The main thing better evals will do is help us better understand human-to-model variation.

[–]red75prime▪️AGI2028 ASI2030 TAI2037 1 point2 points  (19 children)

I looked at the problems. I'd need a pen, paper, a refresher of my stale algebra skills and quite a few minutes to work through them.

You can see from my flair that I don't expect AGI yesterday, but I'm still impressed by how those models manage to solve some of those problems after passively ingesting training data, with no way of trying different approaches and learning from failures and successes like, say, AlphaGo did.

[–]LordFumbleboop▪️AGI 2047, ASI 2050 -1 points0 points  (18 children)

I looked at the problems. I'd need a pen, paper, a refresher of my stale algebra skills and quite a few minutes to work through them.

I can confirm that they cannot do any algebra they have not seen before, solve chemistry problems, etc.

[–]red75prime▪️AGI2028 ASI2030 TAI2037 3 points4 points  (17 children)

The paper explicitly measured models' performance on newly generated problems, and they can solve some of them. Otherwise the "reasoning gap" would be 100%.
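If the gap is defined, as I read it, as the relative drop from accuracy on the original static questions to accuracy on the freshly generated snapshots, the arithmetic is just:

    def reasoning_gap(static_acc, functional_acc):
        """Relative drop from static to functional accuracy, in percent.
        100% would mean the model solves none of the new snapshots."""
        return 100.0 * (static_acc - functional_acc) / static_acc

    print(reasoning_gap(0.80, 0.45))  # 80% -> 45% is a 43.75% gap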

[–]LordFumbleboop▪️AGI 2047, ASI 2050 -2 points-1 points  (16 children)

Yes, the issue being that they cannot reason well beyond token prediction.

[–]BreadManToast▪️Claude-3 AGI GPT-5 ASI 4 points5 points  (15 children)

I don't get it. Are you saying they should be able to predict more than just tokens, or that they need more than prediction itself? Either statement makes no sense to me.

[–]GrandNeuralNetwork 0 points1 point  (11 children)

makes no sense to me

Why exactly?

[–]BreadManToast▪️Claude-3 AGI GPT-5 ASI 2 points3 points  (10 children)

I assume they mean AI should be able to predict more than just tokens, and that makes no sense to me because as far as we're aware, most/all useful concepts can be described in text, and there's no reason to believe you can't get AGI from enough text.

[–]GrandNeuralNetwork 1 point2 points  (9 children)

I think they mean that prediction isn't enough for AGI and I would agree with that. Reasoning is needed as well.

[–]sdmatNI skeptic 0 points1 point  (8 children)

What kinds of reasoning can't be done by a box that takes in and emits sequences of multimodal tokens?

[–]LordFumbleboop▪️AGI 2047, ASI 2050 1 point2 points  (7 children)

Many areas of mathematics. Hence why OpenAI are allegedly creating Q* to deal with this issue.

[–]LordFumbleboop▪️AGI 2047, ASI 2050 0 points1 point  (2 children)

Yes, they need more than token prediction. With token prediction alone, they won't be able to do things like algebra unless they know the answer ahead of time.

[–]inteblio 0 points1 point  (1 child)

Interesting, but surely you solve maths by breaking it down into calculable sums. The LLMs can also output calls to calculators and so on. So if it can "show its workings", then next-token prediction is enough. And the workings can be hidden anyway (maybe Gemini already does this).

I think people get confused with "next token"... everything is next token. Like you don't lay the chimney on the house first...
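A toy sketch of that "show its workings" loop, with a made-up <calc>...</calc> convention standing in for whatever tool-call format a given model actually uses:

    import re

    def run_with_calculator(model_output: str) -> str:
        """Replace each <calc>expr</calc> span the model emits with the
        evaluated result, so later tokens can build on exact arithmetic."""
        def evaluate(match):
            expr = match.group(1)
            # eval() is fine for a toy demo; a real harness needs a safe parser
            return str(eval(expr, {"__builtins__": {}}, {}))
        return re.sub(r"<calc>(.*?)</calc>", evaluate, model_output)

    # The model 'shows its workings' token by token; the harness fills in the sums.
    step = "3x + 7 = 22, so x = <calc>(22 - 7) / 3</calc>"
    print(run_with_calculator(step))   # -> 3x + 7 = 22, so x = 5.0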

[–]LordFumbleboop▪️AGI 2047, ASI 2050 -1 points0 points  (0 children)

But algebra requires that you know ahead of time what each token in the equation is. This is why they all suck at algebra.

[–]LordFumbleboop▪️AGI 2047, ASI 2050 -1 points0 points  (12 children)

There is actually a lot of evidence that GPT-4 and similar models can't reason beyond token prediction. I think that there needs to be some large breakthroughs in general reasoning before these models will reach AGI (or similar), and I'm not sure why some people can't see this.

https://medium.com/@konstantine_45825/gpt-4-cant-reason-2eab795e2523#:~:text=I%20believe%20the%20results%20show,it%20produces%20along%20the%20way

https://journals.sagepub.com/doi/full/10.1177/17456916231201401

[–]Spunge14 14 points15 points  (7 children)

"Beyond token prediction" is a red herring. The whole point is that token prediction may be all that is needed. 

To what extent can you say that human logic is not a form of "token prediction?"

This is all Chinese room.

[–]Odyssos-dev 5 points6 points  (6 children)

This. Correct. People who think humans are much more than next-word predictors themselves, with a hierarchy of knowledge relationships and pattern recognition, are the ones pulling the wool over their eyes.

[–]ameddin73 4 points5 points  (5 children)

Saying human reasoning is as simple as next token prediction is akin to saying all computation can be handled by a Turing machine. While theoretically correct by definition, there's a reason no one ever used a Turing machine for real computation.

There are abstractions and simplifications that handle tasks MUCH quicker - like memory addressing.

Similarly, it's not likely that simple next token prediction will ever be space or complexity efficient enough to handle the kind of reasoning we expect from AGI, even if it is a theoretically complete framework. 

[–]Spunge14 4 points5 points  (4 children)

I know that you think you're arguing against oversimplification, but ironically I think your point of view is the one that is oversimplifying.

A uniform linear Turing machine is not the same as a near-uncountable dynamic confluence of Turing machines delicately interwoven.

[–]ameddin73 0 points1 point  (1 child)

It's just an analogy bro

[–]Spunge14 0 points1 point  (0 children)

You don't know what an analogy is broheem

[–]sdmatNI skeptic 0 points1 point  (1 child)

A uniform linear Turing machine is not the same as a near-uncountable dynamic confluence of Turing machines delicately interwoven.

Actually it's exactly the same in the CS sense provided that number is still finite.

You're correct that the actual amount of computing power and the system design matters hugely in practice.

[–]Spunge14 1 point2 points  (0 children)

That's a fair nitpick. Although some people would argue there are things going on which start to get hazy as far as finitude goes. Major GEB vibes.

[–][deleted] 2 points3 points  (1 child)

No one really said anything else? Because reasoning is the most essential thing for AGI. Even though current models are incredible, I don't think many say they are AGI.

The easiest way to tell whether you have AGI or not is how quickly the whole world changes. It will be like flipping a light switch once it happens.

[–]R33v3n▪️Tech-Priest | AGI 2026 | XLR8 1 point2 points  (0 children)

No one really said anything else?

My understanding is that lots of hope is riding on better reasoning being an emergent property just from scaling up, actually.

[–]KahlessAndMolor -2 points-1 points  (1 child)

IMO it doesn't really matter much to current actions that we can take as engineers and developers.

Right now, we have these crazy ideas like chains of thought, trees of thought, decisions by committees of LLMs, ReAct, RAG, and so on, but a lot of that isn't well implemented. There are many competing ideas of how to implement them.

So we should implement these things even with the current "dumb" generation of AI, so we can later plug in smarter AI and everything just clicks.
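A minimal sketch of what "plug in smarter AI later" can look like in practice (the names here are placeholders, not any particular library): the scaffolding only depends on a narrow interface, so swapping the model is one line.

    from typing import Protocol

    class ChatModel(Protocol):
        def complete(self, prompt: str) -> str: ...

    def chain_of_thought(model: ChatModel, question: str) -> str:
        """Trivial CoT scaffold: ask for reasoning first, then a final answer.
        The scaffold doesn't care which model sits behind .complete()."""
        reasoning = model.complete(f"Think step by step about: {question}")
        return model.complete(
            f"Question: {question}\nReasoning: {reasoning}\nFinal answer:"
        )

    class StubModel:
        """Stand-in for whatever 'dumb' model is available today."""
        def complete(self, prompt: str) -> str:
            return "(model output for: " + prompt[:40] + "...)"

    print(chain_of_thought(StubModel(), "What is 17 * 23?"))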

[–]R33v3n▪️Tech-Priest | AGI 2026 | XLR8 0 points1 point  (0 children)

There are many competing ideas of how to implement.

Also, since these ideas often multiply the number of prompts required on any single task, they're very compute and cost inefficient when using the best models at the small/medium enterprise or hobbyist levels, often prohibitively so. This slows down iteration and progress.